commit 56c50ef3c5

@ -0,0 +1,81 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*

# Runtime data
pids
*.pid
*.seed
*.pid.lock

# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov

# Coverage directory used by tools like istanbul
coverage

# nyc test coverage
.nyc_output

# Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
.grunt

# Bower dependency directory (https://bower.io/)
bower_components

# node-waf configuration
.lock-wscript

# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release

# Dependency directories
node_modules/
jspm_packages/

# TypeScript v1 declaration files
typings/

# Optional npm cache directory
.npm

# Optional eslint cache
.eslintcache

# Optional REPL history
.node_repl_history

# Output of 'npm pack'
*.tgz

# Yarn Integrity file
.yarn-integrity

# dotenv environment variables file
.env
.env.test

# parcel-bundler cache (https://parceljs.org/)
.cache

# next.js build output
.next

# nuxt.js build / generate output
.nuxt

# vuepress build output
.vuepress/dist

# Serverless directories
.serverless/

# FuseBox cache
.fusebox/

# DynamoDB Local files
.dynamodb/

@ -0,0 +1,99 @@
# Lightweight Browser-based NLP with Hugging Face Transformers

This project uses Hugging Face's Transformers.js library in a browser environment to perform Natural Language Processing (NLP) tasks. Specifically, we use it to embed text and find similar sentences.

## Webpack Configuration

We use Webpack to bundle our JavaScript code, including the Transformers.js library, into a single file that can run in the browser. Our Webpack configuration includes settings for handling JavaScript files and other assets.
To build the project, run the following command:

```bash
npm run build
```

This command uses Webpack to bundle the code according to the configuration in `webpack.config.js`.
Here's a basic overview of our Webpack configuration:

```javascript
const path = require('path');

module.exports = {
  entry: './workers/embeddingModel.js',
  output: {
    filename: 'worker.bundle.js',
    path: path.resolve(__dirname, 'dist'),
  },
  module: {
    rules: [
      {
        test: /\.js$/,
        exclude: /node_modules/,
        use: {
          loader: 'babel-loader',
          options: {
            presets: ['@babel/preset-env'],
          },
        },
      },
    ],
  },
  target: 'webworker',
};
```

This configuration tells Webpack to start bundling from `workers/embeddingModel.js`, to output the bundled file as `dist/worker.bundle.js`, and to use Babel to transpile our JavaScript code; the `webworker` target makes the bundle suitable for running inside a Web Worker.
## Using Transformers.js for Text Embedding

We use the `pipeline` function from the Transformers.js library to generate embeddings for text. An embedding represents text as a vector in a high-dimensional space that captures semantic meaning, which makes it useful for comparing texts in NLP tasks.

Here's an example of how we use Transformers.js to generate embeddings:
```javascript
const pipe = await pipeline('feature-extraction', 'Supabase/gte-small');

// Generate an embedding for each sentence
const embeddings = await Promise.all(
  sentences.map((sentence) =>
    pipe(sentence, {
      pooling: 'mean',
      normalize: true,
    })
  )
);
```
## Storing Text and Embeddings

We use IndexedDB, a low-level browser API for client-side storage of significant amounts of structured data, to store each sentence together with its embedding. A custom `VectorStorage` class handles saving and retrieving these vectors in IndexedDB.
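Conceptually, each record stored by `VectorStorage` has the shape `{ text, vector, source }`, keyed by `text`, and a sentence that is already stored is ignored rather than overwritten. Here is a minimal in-memory stand-in illustrating that behavior; the real class, in `vectorDB.js`, persists to IndexedDB via the `idb` package, and `InMemoryVectorStorage` is a name introduced only for this sketch:

```javascript
// In-memory stand-in for VectorStorage; records are keyed by text,
// mirroring the IndexedDB object store created with { keyPath: 'text' }
class InMemoryVectorStorage {
  constructor() {
    this.vectors = new Map();
  }

  addVector(text, vector, source) {
    // Same dedupe rule as the real class: ignore duplicate text keys
    if (this.vectors.has(text)) return;
    this.vectors.set(text, { text, vector, source });
  }

  getVectorByText(text) {
    return this.vectors.get(text);
  }

  getAllVectors() {
    return [...this.vectors.values()];
  }
}
```

Keying by the sentence text means re-submitting the same document does not create duplicate records.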
## Finding Similar Sentences

Once we have the embeddings, we can use them to find the sentences most similar to a given query. We do this by calculating the cosine similarity between the query's embedding and each sentence's embedding; the sentences with the highest similarity scores are considered the most similar.
```javascript
// Generate an embedding for the query string
const queryEmbedding = Array.from(
  (
    await pipe(event.data.query, {
      pooling: 'mean',
      normalize: true,
    })
  ).data
);

// Find the embedding that's most similar to the query embedding
const index = self.embeddings.reduce((bestIndex, embedding, index) => {
  const similarity = cosineSimilarity(embedding, queryEmbedding);
  return similarity > cosineSimilarity(self.embeddings[bestIndex], queryEmbedding)
    ? index
    : bestIndex;
}, 0);

// Return the corresponding sentence
postMessage(self.sentences[index]);
```
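The `cosineSimilarity` helper used above is defined in the worker, and in practice the worker returns the top five matches rather than a single best sentence. Both pieces can be sketched in plain JavaScript (`topSimilarIndices` is a name introduced here for illustration; the worker inlines this logic):

```javascript
// Cosine similarity between two equal-length numeric vectors
function cosineSimilarity(a, b) {
  const dotProduct = a.reduce((sum, a_i, i) => sum + a_i * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, a_i) => sum + a_i * a_i, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, b_i) => sum + b_i * b_i, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

// Rank candidate embeddings by similarity to a query embedding
// and return the indices of the top n matches
function topSimilarIndices(embeddings, queryEmbedding, n) {
  const similarities = embeddings.map((e) => cosineSimilarity(e, queryEmbedding));
  return Array.from({ length: similarities.length }, (_, i) => i)
    .sort((a, b) => similarities[b] - similarities[a])
    .slice(0, n);
}
```

Since the embeddings are generated with `normalize: true`, they are unit vectors, and cosine similarity reduces to a plain dot product.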
This project demonstrates how a powerful NLP library like Hugging Face's Transformers.js can be used in a lightweight, browser-based application.

## Source

https://huggingface.co/Supabase/gte-small
@ -0,0 +1,80 @@

<!-- https://huggingface.co/Supabase/gte-small -->
<!-- This just runs the model without a web worker. Not recommended, since the page will freeze while running inference. -->
<!-- <script type="module">
  import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.0';

  const pipe = await pipeline(
    'feature-extraction',
    'Supabase/gte-small',
  );

  // Generate the embedding from text
  const output = await pipe('Hello world', {
    pooling: 'mean',
    normalize: true,
  });

  // Extract the embedding output
  const embedding = Array.from(output.data);

  console.log(embedding);
</script>
-->
<html>
<head>
  <title>Web Worker Example</title>
</head>
<body>
  <h1>Web Worker Example</h1>
  <form id="text-form">
    <textarea id="text-input" required></textarea>
    <input type="file" id="file-input" accept=".txt">
    <button type="submit">Submit</button>
  </form>

  <form id="query-form" style="display: none;">
    <input type="text" id="query-input" required>
    <button type="submit">Query</button>
  </form>

  <div id="most-similar-sentences"></div>

  <script type="module">
    const worker = new Worker('./dist/worker.bundle.js');
    const mostSimDiv = document.getElementById('most-similar-sentences');

    document.getElementById('text-form').addEventListener('submit', async event => {
      event.preventDefault();
      mostSimDiv.innerText = '';

      let text;
      const file = document.getElementById('file-input').files[0];
      if (file) {
        text = await file.text();
      } else {
        text = document.getElementById('text-input').value;
      }

      worker.postMessage({ type: 'text', text });
      document.getElementById('query-form').style.display = 'block';
    });

    document.getElementById('query-form').addEventListener('submit', event => {
      event.preventDefault();
      mostSimDiv.innerText = 'thinking...';

      const query = document.getElementById('query-input').value;
      worker.postMessage({ type: 'query', query });
    });

    worker.onmessage = function (event) {
      console.log(event.data);
      mostSimDiv.innerText = event.data.join('\n');
    };
  </script>
</body>
</html>
@ -0,0 +1,24 @@

{
  "name": "usingwebpack",
  "version": "1.0.0",
  "description": "",
  "main": "./workers/embeddingModel.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "build": "webpack"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "@babel/core": "^7.23.7",
    "@babel/preset-env": "^7.23.7",
    "babel-loader": "^9.1.3",
    "webpack": "^5.89.0",
    "webpack-cli": "^5.1.4"
  },
  "dependencies": {
    "@xenova/transformers": "^2.13.4",
    "idb": "^8.0.0"
  }
}
@ -0,0 +1,26 @@

// webpack.config.js
const path = require('path');

module.exports = {
  mode: 'development',
  entry: './workers/embeddingModel.js', // the web worker entry point
  output: {
    filename: 'worker.bundle.js',
    path: path.resolve(__dirname, 'dist'),
  },
  module: {
    rules: [
      {
        test: /\.js$/,
        exclude: /node_modules/,
        use: {
          loader: 'babel-loader',
          options: {
            presets: ['@babel/preset-env'],
          },
        },
      },
    ],
  },
  target: 'webworker',
};
@ -0,0 +1 @@

[{"text":"text2","vector":[4,5,6],"source":"source2"},{"text":"text2","vector":[4,5,6],"source":"source2"}]
@ -0,0 +1,105 @@

import { pipeline } from "@xenova/transformers";
import { VectorStorage } from "./vectorDB";

// Create an instance of VectorStorage
const vectorStorage = new VectorStorage();

// Load the embedding pipeline once, instead of on every message
const pipePromise = pipeline("feature-extraction", "Supabase/gte-small");

self.onmessage = async function (event) {
  let sentences = [];
  const pipe = await pipePromise;

  if (event.data.type === "text") {
    // Check that the text is a string before splitting
    if (typeof event.data.text === "string") {
      // Split the text into sentences
      sentences = event.data.text.split(". ");
    } else {
      console.error("event.data.text is not a string:", event.data.text);
    }

    // Generate an embedding for each sentence
    const embeddings = await Promise.all(
      sentences.map((sentence) =>
        pipe(sentence, {
          pooling: "mean",
          normalize: true,
        })
      )
    );

    // Store the sentences and their corresponding embeddings
    self.sentences = sentences;
    self.embeddings = embeddings.map((output) => Array.from(output.data));

    // Save each sentence and its embedding to the vector storage
    self.sentences.forEach((sentence, index) => {
      vectorStorage.addVector(
        sentence,
        self.embeddings[index],
        "embeddingModel"
      );
    });
  } else if (event.data.type === "query") {
    // Generate an embedding for the query string
    const queryEmbedding = Array.from(
      (
        await pipe(event.data.query, {
          pooling: "mean",
          normalize: true,
        })
      ).data
    );

    // Set the number of similar sentences to return
    const numSimilarSentences = 5;

    // Calculate the cosine similarity for each sentence
    const similarities = self.embeddings.map((embedding) =>
      cosineSimilarity(embedding, queryEmbedding)
    );

    // Create an array of indices sorted by their corresponding sentence's similarity to the query string
    const sortedIndices = Array.from(
      { length: similarities.length },
      (_, i) => i
    ).sort((a, b) => similarities[b] - similarities[a]);

    // Return the top n sentences
    postMessage(
      sortedIndices
        .slice(0, numSimilarSentences)
        .map((index) => self.sentences[index])
    );
  }
};

function cosineSimilarity(a, b) {
  const dotProduct = a.reduce((sum, a_i, i) => sum + a_i * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, a_i) => sum + a_i * a_i, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, b_i) => sum + b_i * b_i, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}
@ -0,0 +1,90 @@

// Import the idb library (IndexedDB with promises) for easier IndexedDB usage
import { openDB } from 'idb';

export class VectorStorage {
  constructor() {
    // Open (or create) the database
    this.dbPromise = openDB('vectorStorage', 1, {
      upgrade(db) {
        if (!db.objectStoreNames.contains('vectors')) {
          db.createObjectStore('vectors', { keyPath: 'text' });
        }
      },
    });
  }

  async addVector(text, vector, source) {
    // Open a transaction and get the object store
    const db = await this.dbPromise;
    const tx = db.transaction('vectors', 'readwrite');
    const store = tx.objectStore('vectors');

    // Check if a vector with the same text already exists
    const existingVector = await store.get(text);
    if (existingVector) {
      // If a vector with the same text already exists, ignore the new vector
      console.log(`A vector with the text "${text}" already exists.`);
    } else {
      // Otherwise, add the new vector
      await store.put({ text, vector, source });
    }

    // Wait for the transaction to complete
    await tx.done;
  }

  // Delete a vector
  async deleteVector(text) {
    const db = await this.dbPromise;
    const tx = db.transaction('vectors', 'readwrite');
    await tx.objectStore('vectors').delete(text);

    // Wait for the transaction to complete
    await tx.done;
  }

  // Get a vector by its text (the object store's key)
  async getVectorByText(text) {
    const db = await this.dbPromise;
    return db.get('vectors', text);
  }

  // Get all vectors from a given source
  async getVectorBySource(source) {
    const vectors = await this.getAllVectors();
    return vectors.filter((item) => item.source === source);
  }

  // Get all vectors
  async getAllVectors() {
    const db = await this.dbPromise;
    return db.getAll('vectors');
  }

  // Calculate cosine similarity
  cosineSimilarity(a, b) {
    const dotProduct = a.reduce((sum, a_i, i) => sum + a_i * b[i], 0);
    const magnitudeA = Math.sqrt(a.reduce((sum, a_i) => sum + a_i * a_i, 0));
    const magnitudeB = Math.sqrt(b.reduce((sum, b_i) => sum + b_i * b_i, 0));
    return dotProduct / (magnitudeA * magnitudeB);
  }

  // Search stored vectors by cosine similarity to a query vector
  async searchByCosineSimilarity(queryVector, k) {
    const vectors = await this.getAllVectors();

    // Calculate the cosine similarity for each vector
    const similarities = vectors.map((item) => this.cosineSimilarity(item.vector, queryVector));

    // Create an array of indices sorted by their corresponding vector's similarity to the query vector
    const sortedIndices = Array.from({ length: similarities.length }, (_, i) => i)
      .sort((a, b) => similarities[b] - similarities[a]);

    // Return the top k items
    return sortedIndices.slice(0, k).map((index) => vectors[index]);
  }
}

// Usage
// const vectorStorage = new VectorStorage();
// await vectorStorage.addVector('text1', [1, 2, 3], 'source1');
// await vectorStorage.addVector('text2', [4, 5, 6], 'source2');
// console.log(await vectorStorage.searchByCosineSimilarity([1, 2, 3], 1)); // [ { text: 'text1', vector: [ 1, 2, 3 ], source: 'source1' } ]
// console.log(await vectorStorage.getAllVectors());
// await vectorStorage.deleteVector('text1');
// console.log(await vectorStorage.getAllVectors()); // [ { text: 'text2', vector: [ 4, 5, 6 ], source: 'source2' } ]