This commit is contained in:
cre8ture 2024-01-11 18:22:08 -08:00
commit 56c50ef3c5
10 changed files with 5394 additions and 0 deletions

.gitignore vendored Normal file (81 lines)

@@ -0,0 +1,81 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
# Runtime data
pids
*.pid
*.seed
*.pid.lock
# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov
# Coverage directory used by tools like istanbul
coverage
# nyc test coverage
.nyc_output
# Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
.grunt
# Bower dependency directory (https://bower.io/)
bower_components
# node-waf configuration
.lock-wscript
# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release
# Dependency directories
node_modules/
jspm_packages/
# TypeScript v1 declaration files
typings/
# Optional npm cache directory
.npm
# Optional eslint cache
.eslintcache
# Optional REPL history
.node_repl_history
# Output of 'npm pack'
*.tgz
# Yarn Integrity file
.yarn-integrity
# dotenv environment variables file
.env
.env.test
# parcel-bundler cache (https://parceljs.org/)
.cache
# next.js build output
.next
# nuxt.js build / generate output
.nuxt
# vuepress build output
.vuepress/dist
# Serverless directories
.serverless/
# FuseBox cache
.fusebox/
# DynamoDB Local files
.dynamodb/
node_modules/

README.md Normal file (99 lines)

@@ -0,0 +1,99 @@
# Lightweight Browser-based NLP with Hugging Face Transformers
This project uses Transformers.js, the JavaScript port of Hugging Face's Transformers library, in the browser to perform Natural Language Processing (NLP) tasks. Specifically, we use it to embed text and find similar sentences.
## Webpack Configuration
We use Webpack to bundle our JavaScript code, including the Transformers.js library (`@xenova/transformers`), into a single file that can be run in the browser. Our Webpack configuration includes settings for handling JavaScript files and other assets.
To build the project, run the following command:
```bash
npm run build
```
This command will use Webpack to bundle the code according to the configuration specified in `webpack.config.js`.
Here's a basic overview of our Webpack configuration:
```javascript
const path = require('path');
module.exports = {
  mode: 'development',
  entry: './workers/embeddingModel.js', // the web worker is the entry point
  output: {
    filename: 'worker.bundle.js',
    path: path.resolve(__dirname, 'dist'),
  },
  module: {
    rules: [
      {
        test: /\.js$/,
        exclude: /node_modules/,
        use: {
          loader: 'babel-loader',
          options: {
            presets: ['@babel/preset-env'],
          },
        },
      },
    ],
  },
  target: 'webworker',
};
```
This configuration tells Webpack to bundle the worker starting from `workers/embeddingModel.js`, to output the bundle as `dist/worker.bundle.js`, to target the web-worker environment, and to transpile our JavaScript with Babel.
## Using Transformers.js for Text Embedding
We use the `pipeline` function from Transformers.js to generate embeddings for text. An embedding represents a piece of text as a vector in a high-dimensional space where semantically similar texts lie close together; embeddings are a common building block in NLP tasks.
Here's how we use Transformers.js to generate embeddings:
```javascript
const pipe = await pipeline('feature-extraction', 'Supabase/gte-small');
// Generate an embedding for each sentence
const embeddings = await Promise.all(
sentences.map((sentence) =>
pipe(sentence, {
pooling: 'mean',
normalize: true,
})
)
);
```
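The pipeline returns tensor-like outputs whose values live in a `.data` typed array; elsewhere in the project we convert them to plain arrays with `Array.from`. A minimal sketch using a hypothetical output object (only the `.data` field name is carried over from the snippets in this README; gte-small actually produces 384-dimensional vectors, shortened here for brevity):

```javascript
// Hypothetical shape of a pipeline output: a tensor-like object whose
// .data field is a flat Float32Array of embedding values.
const output = { data: Float32Array.from([0.1, 0.2, 0.3]) };

// Array.from converts the typed array into a plain JavaScript array
const embedding = Array.from(output.data);
console.log(embedding.length); // 3
```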
## Storing Text and Embeddings
We use IndexedDB, a low-level API for client-side storage of significant amounts of structured data, to store the text and its corresponding embedding. We have created a custom `VectorStorage` class that handles the storage and retrieval of vectors in IndexedDB.
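As a simplified in-memory sketch of that storage model (not the actual IndexedDB-backed class), records are keyed by their text, so adding a sentence that is already stored is a no-op:

```javascript
// Simplified in-memory sketch of the VectorStorage storage model
// (the real class persists records in IndexedDB via the idb library).
class InMemoryVectorStore {
  constructor() {
    this.records = new Map(); // keyed by text, like { keyPath: 'text' }
  }
  addVector(text, vector, source) {
    // A record with the same text already exists: ignore the new one
    if (this.records.has(text)) return;
    this.records.set(text, { text, vector, source });
  }
  getAllVectors() {
    return [...this.records.values()];
  }
}

const store = new InMemoryVectorStore();
store.addVector('text1', [1, 2, 3], 'source1');
store.addVector('text1', [9, 9, 9], 'source1'); // ignored: duplicate key
console.log(store.getAllVectors().length); // 1
```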
## Finding Similar Sentences
Once we have the embeddings, we can find the sentences most similar to a given query. We do this by calculating the cosine similarity between the query's embedding and the embedding of each sentence, then returning the sentences with the highest similarity scores.
```javascript
// Generate an embedding for the query string
const queryEmbedding = Array.from(
  (
    await pipe(event.data.query, {
      pooling: 'mean',
      normalize: true,
    })
  ).data
);
// Set the number of similar sentences to return
const numSimilarSentences = 5;
// Calculate the cosine similarity between the query and each sentence
const similarities = self.embeddings.map((embedding) =>
  cosineSimilarity(embedding, queryEmbedding)
);
// Sort sentence indices from most to least similar
const sortedIndices = Array.from(
  { length: similarities.length },
  (_, i) => i
).sort((a, b) => similarities[b] - similarities[a]);
// Return the top n most similar sentences
postMessage(
  sortedIndices
    .slice(0, numSimilarSentences)
    .map((index) => self.sentences[index])
);
```
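The `cosineSimilarity` helper used above is a plain implementation of dot(a, b) / (|a| * |b|), as defined in `workers/embeddingModel.js`:

```javascript
// Cosine similarity between two equal-length numeric vectors:
// dot(a, b) / (|a| * |b|), ranging from -1 to 1.
function cosineSimilarity(a, b) {
  const dotProduct = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1 (same direction)
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (orthogonal)
```

Since the embeddings are generated with `normalize: true`, their magnitudes are 1, so the cosine similarity reduces to a plain dot product.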
This project demonstrates how powerful NLP tools such as Hugging Face's Transformers library can run entirely in a lightweight, browser-based application.
## Source
https://huggingface.co/Supabase/gte-small

dist/worker.bundle.js vendored Normal file (525 lines)

File diff suppressed because one or more lines are too long

index.html Normal file (80 lines)

@@ -0,0 +1,80 @@
<!-- https://huggingface.co/Supabase/gte-small -->
<!-- This just runs the model without a web worker; not recommended, since the page will freeze while running inference. -->
<!-- <script type="module">
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.0';
const pipe = await pipeline(
'feature-extraction',
'Supabase/gte-small',
);
// Generate the embedding from text
const output = await pipe('Hello world', {
pooling: 'mean',
normalize: true,
});
// Extract the embedding output
const embedding = Array.from(output.data);
console.log(embedding);
</script>
-->
<!DOCTYPE html>
<html>
<head>
<title>Web Worker Example</title>
</head>
<body>
<h1>Web Worker Example</h1>
<form id="text-form">
<textarea id="text-input" required></textarea>
<input type="file" id="file-input" accept=".txt">
<button type="submit">Submit</button>
</form>
<form id="query-form" style="display: none;">
<input type="text" id="query-input" required>
<button type="submit">Query</button>
</form>
<div id="most-similar-sentences"></div>
<script type="module">
const worker = new Worker('./dist/worker.bundle.js');
const mostSimDiv = document.getElementById('most-similar-sentences');
document.getElementById('text-form').addEventListener('submit', async event => {
event.preventDefault();
mostSimDiv.innerText = '';
let text;
const file = document.getElementById('file-input').files[0];
if (file) {
text = await file.text();
} else {
text = document.getElementById('text-input').value;
}
worker.postMessage({ type: 'text', text });
document.getElementById('query-form').style.display = 'block';
});
document.getElementById('query-form').addEventListener('submit', event => {
event.preventDefault();
mostSimDiv.innerText = 'thinking...';
const query = document.getElementById('query-input').value;
worker.postMessage({ type: 'query', query });
});
worker.onmessage = function(event) {
console.log(event.data);
mostSimDiv.innerText = event.data.join('\n');
};
</script>
</body>
</html>

package-lock.json generated Normal file (4363 lines)

File diff suppressed because it is too large

package.json Normal file (24 lines)

@@ -0,0 +1,24 @@
{
"name": "usingwebpack",
"version": "1.0.0",
"description": "",
"main": "./workers/embeddingModel.js",
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1",
"build": "webpack"
},
"keywords": [],
"author": "",
"license": "ISC",
"devDependencies": {
"@babel/core": "^7.23.7",
"@babel/preset-env": "^7.23.7",
"babel-loader": "^9.1.3",
"webpack": "^5.89.0",
"webpack-cli": "^5.1.4"
},
"dependencies": {
"@xenova/transformers": "^2.13.4",
"idb": "^8.0.0"
}
}

webpack.config.js Normal file (26 lines)

@@ -0,0 +1,26 @@
// webpack.config.js
const path = require('path');
module.exports = {
mode: 'development', // add this line to set the mode to 'development'
entry: './workers/embeddingModel.js', // replace with the path to your worker file
output: {
filename: 'worker.bundle.js',
path: path.resolve(__dirname, 'dist'),
},
module: {
rules: [
{
test: /\.js$/,
exclude: /(node_modules)/,
use: {
loader: 'babel-loader',
options: {
presets: ['@babel/preset-env']
}
}
}
]
},
target: 'webworker',
};

workers/data/db.json Normal file (1 line)

@@ -0,0 +1 @@
[{"text":"text2","vector":[4,5,6],"source":"source2"},{"text":"text2","vector":[4,5,6],"source":"source2"}]

workers/embeddingModel.js Normal file (105 lines)

@@ -0,0 +1,105 @@
import { pipeline } from "@xenova/transformers";
import { VectorStorage } from "./vectorDB";
// Create an instance of VectorStorage
const vectorStorage = new VectorStorage();
self.onmessage = async function (event) {
let sentences = [];
// Initialize the embedding pipeline; bail out if the model fails to load
let pipe;
try {
pipe = await pipeline("feature-extraction", "Supabase/gte-small");
} catch (error) {
console.error("Error initializing pipeline:", error);
return;
}
if (event.data.type === "text") {
console.log("event.data.text", event.data.text);
// Check if the text is a string
if (typeof event.data.text === "string") {
// Split the text into sentences
sentences = event.data.text.split(". ");
// Rest of the code...
} else {
console.error("event.data.text is not a string:", event.data.text);
}
// Generate an embedding for each sentence
const embeddings = await Promise.all(
sentences.map((sentence) =>
pipe(sentence, {
pooling: "mean",
normalize: true,
})
)
);
// Store the sentences and their corresponding embeddings
self.sentences = sentences;
self.embeddings = embeddings.map((output) => Array.from(output.data));
// Save each sentence and its embedding to the vector storage
self.sentences.forEach((sentence, index) => {
vectorStorage.addVector(
sentence,
self.embeddings[index],
"embeddingModel"
);
});
} else if (event.data.type === "query") {
// Generate an embedding for the query string
const queryEmbedding = Array.from(
(
await pipe(event.data.query, {
pooling: "mean",
normalize: true,
})
).data
);
// Find the embeddings most similar to the query embedding
// Set the number of similar sentences to return
const numSimilarSentences = 5;
// Calculate the cosine similarity for each sentence
const similarities = self.embeddings.map((embedding) =>
cosineSimilarity(embedding, queryEmbedding)
);
// Create an array of indices sorted by their corresponding sentence's similarity to the query string
const sortedIndices = Array.from(
{ length: similarities.length },
(_, i) => i
).sort((a, b) => similarities[b] - similarities[a]);
// Return the top n sentences
postMessage(
sortedIndices
.slice(0, numSimilarSentences)
.map((index) => self.sentences[index])
);
}
};
function cosineSimilarity(a, b) {
const dotProduct = a.reduce((sum, a_i, i) => sum + a_i * b[i], 0);
const magnitudeA = Math.sqrt(a.reduce((sum, a_i) => sum + a_i * a_i, 0));
const magnitudeB = Math.sqrt(b.reduce((sum, b_i) => sum + b_i * b_i, 0));
return dotProduct / (magnitudeA * magnitudeB);
}

workers/vectorDB.js Normal file (90 lines)

@@ -0,0 +1,90 @@
// Import IndexedDB Promised library for easier IndexedDB usage
import { openDB } from 'idb';
export class VectorStorage {
constructor() {
// Open (or create) the database
this.dbPromise = openDB('vectorStorage', 1, {
upgrade(db) {
if (!db.objectStoreNames.contains('vectors')) {
db.createObjectStore('vectors', { keyPath: 'text' });
}
},
});
}
async addVector(text, vector, source) {
// Open a transaction, get the object store, and add the vector
const db = await this.dbPromise;
const tx = db.transaction('vectors', 'readwrite');
const store = tx.objectStore('vectors');
// Check if a vector with the same text already exists
const existingVector = await store.get(text);
if (existingVector) {
// If a vector with the same text already exists, ignore the new vector
console.log(`A vector with the text "${text}" already exists.`);
} else {
// Otherwise, add the new vector
await store.put({ text, vector, source });
}
// Wait for the transaction to complete
await tx.done;
}
// Delete a vector
async deleteVector(text) {
// Open a transaction, get the object store, and delete the vector
const db = await this.dbPromise;
const tx = db.transaction('vectors', 'readwrite');
const store = tx.objectStore('vectors');
await store.delete(text);
// Wait for the transaction to complete
await tx.done;
}
// Get a vector by its text
async getVectorByText(text) {
const db = await this.dbPromise;
return db.get('vectors', text);
}
// Get all vectors whose source matches
async getVectorBySource(source) {
const all = await this.getAllVectors();
return all.filter(item => item.source === source);
}
// Get all vectors
async getAllVectors() {
const db = await this.dbPromise;
return db.getAll('vectors');
}
// Calculate cosine similarity
cosineSimilarity(a, b) {
const dotProduct = a.reduce((sum, a_i, i) => sum + a_i * b[i], 0);
const magnitudeA = Math.sqrt(a.reduce((sum, a_i) => sum + a_i * a_i, 0));
const magnitudeB = Math.sqrt(b.reduce((sum, b_i) => sum + b_i * b_i, 0));
return dotProduct / (magnitudeA * magnitudeB);
}
// Search vectors by cosine similarity
async searchByCosineSimilarity(queryVector, k) {
const vectors = await this.getAllVectors();
// Calculate the cosine similarity for each stored vector
const similarities = vectors.map(item => this.cosineSimilarity(item.vector, queryVector));
// Create an array of indices sorted from most to least similar
const sortedIndices = Array.from({length: similarities.length}, (_, i) => i).sort((a, b) => similarities[b] - similarities[a]);
// Return the top k items
return sortedIndices.slice(0, k).map(index => vectors[index]);
}
}
// Usage (all methods are async)
// const vectorStorage = new VectorStorage();
// await vectorStorage.addVector('text1', [1, 2, 3], 'source1');
// await vectorStorage.addVector('text2', [4, 5, 6], 'source2');
// console.log(await vectorStorage.searchByCosineSimilarity([1, 2, 3], 1)); // [{ text: 'text1', vector: [1, 2, 3], source: 'source1' }]
// await vectorStorage.deleteVector('text1');
// console.log(await vectorStorage.getAllVectors()); // [{ text: 'text2', vector: [4, 5, 6], source: 'source2' }]