commit 56c50ef3c5

@ -0,0 +1,81 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*

# Runtime data
pids
*.pid
*.seed
*.pid.lock

# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov

# Coverage directory used by tools like istanbul
coverage

# nyc test coverage
.nyc_output

# Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
.grunt

# Bower dependency directory (https://bower.io/)
bower_components

# node-waf configuration
.lock-wscript

# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release

# Dependency directories
node_modules/
jspm_packages/

# TypeScript v1 declaration files
typings/

# Optional npm cache directory
.npm

# Optional eslint cache
.eslintcache

# Optional REPL history
.node_repl_history

# Output of 'npm pack'
*.tgz

# Yarn Integrity file
.yarn-integrity

# dotenv environment variables file
.env
.env.test

# parcel-bundler cache (https://parceljs.org/)
.cache

# next.js build output
.next

# nuxt.js build / generate output
.nuxt

# vuepress build output
.vuepress/dist

# Serverless directories
.serverless/

# FuseBox cache
.fusebox/

# DynamoDB Local files
.dynamodb/

@ -0,0 +1,99 @@
# Lightweight Browser-based NLP with Hugging Face Transformers

This project uses Hugging Face's Transformers.js library in a browser environment to perform Natural Language Processing (NLP) tasks. Specifically, we use it to embed text and find similar sentences.

## Webpack Configuration

We use Webpack to bundle our JavaScript code, including the Transformers.js library, into a single file that can run in the browser. Our Webpack configuration includes settings for handling JavaScript files and other assets.
To build the project, run the following command:

```bash
npm run build
```

This command uses Webpack to bundle the code according to the configuration in `webpack.config.js`.
Here's a basic overview of our Webpack configuration:

```javascript
const path = require('path');

module.exports = {
  entry: './workers/embeddingModel.js',
  output: {
    filename: 'worker.bundle.js',
    path: path.resolve(__dirname, 'dist'),
  },
  module: {
    rules: [
      {
        test: /\.js$/,
        exclude: /node_modules/,
        use: {
          loader: 'babel-loader',
          options: {
            presets: ['@babel/preset-env'],
          },
        },
      },
    ],
  },
  target: 'webworker',
};
```

This configuration tells Webpack to start bundling from `workers/embeddingModel.js`, to output the bundled file as `dist/worker.bundle.js`, and to use Babel to transpile our JavaScript code; the `webworker` target makes the bundle suitable for running inside a Web Worker.
## Using Transformers.js for Text Embedding

We use the `pipeline` function from the Transformers.js library to generate embeddings for text. An embedding represents text as a vector in a high-dimensional space that captures semantic meaning, which makes it useful for comparing texts in NLP tasks.

Here's an example of how we use Transformers.js to generate embeddings:
```javascript
const pipe = await pipeline('feature-extraction', 'Supabase/gte-small');

// Generate an embedding for each sentence
const embeddings = await Promise.all(
  sentences.map((sentence) =>
    pipe(sentence, {
      pooling: 'mean',
      normalize: true,
    })
  )
);
```
## Storing Text and Embeddings

We use IndexedDB, a low-level browser API for client-side storage of significant amounts of structured data, to store each sentence together with its embedding. A custom `VectorStorage` class handles saving and retrieving these vectors in IndexedDB.
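Conceptually, each record stored by `VectorStorage` has the shape `{ text, vector, source }`, keyed by `text`, and a sentence that is already stored is ignored rather than overwritten. Here is a minimal in-memory stand-in illustrating that behavior; the real class, in `vectorDB.js`, persists to IndexedDB via the `idb` package, and `InMemoryVectorStorage` is a name introduced only for this sketch:

```javascript
// In-memory stand-in for VectorStorage; records are keyed by text,
// mirroring the IndexedDB object store created with { keyPath: 'text' }
class InMemoryVectorStorage {
  constructor() {
    this.vectors = new Map();
  }

  addVector(text, vector, source) {
    // Same dedupe rule as the real class: ignore duplicate text keys
    if (this.vectors.has(text)) return;
    this.vectors.set(text, { text, vector, source });
  }

  getVectorByText(text) {
    return this.vectors.get(text);
  }

  getAllVectors() {
    return [...this.vectors.values()];
  }
}
```

Keying by the sentence text means re-submitting the same document does not create duplicate records.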
## Finding Similar Sentences

Once we have the embeddings, we can use them to find the sentences most similar to a given query. We do this by calculating the cosine similarity between the query's embedding and each sentence's embedding; the sentences with the highest similarity scores are considered the most similar.
```javascript
// Generate an embedding for the query string
const queryEmbedding = Array.from(
  (
    await pipe(event.data.query, {
      pooling: 'mean',
      normalize: true,
    })
  ).data
);

// Find the embedding that's most similar to the query embedding
const index = self.embeddings.reduce((bestIndex, embedding, index) => {
  const similarity = cosineSimilarity(embedding, queryEmbedding);
  return similarity > cosineSimilarity(self.embeddings[bestIndex], queryEmbedding)
    ? index
    : bestIndex;
}, 0);

// Return the corresponding sentence
postMessage(self.sentences[index]);
```
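The `cosineSimilarity` helper used above is defined in the worker, and in practice the worker returns the top five matches rather than a single best sentence. Both pieces can be sketched in plain JavaScript (`topSimilarIndices` is a name introduced here for illustration; the worker inlines this logic):

```javascript
// Cosine similarity between two equal-length numeric vectors
function cosineSimilarity(a, b) {
  const dotProduct = a.reduce((sum, a_i, i) => sum + a_i * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, a_i) => sum + a_i * a_i, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, b_i) => sum + b_i * b_i, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}

// Rank candidate embeddings by similarity to a query embedding
// and return the indices of the top n matches
function topSimilarIndices(embeddings, queryEmbedding, n) {
  const similarities = embeddings.map((e) => cosineSimilarity(e, queryEmbedding));
  return Array.from({ length: similarities.length }, (_, i) => i)
    .sort((a, b) => similarities[b] - similarities[a])
    .slice(0, n);
}
```

Since the embeddings are generated with `normalize: true`, they are unit vectors, and cosine similarity reduces to a plain dot product.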
This project demonstrates how a powerful NLP library like Hugging Face's Transformers.js can be used in a lightweight, browser-based application.

## Source

https://huggingface.co/Supabase/gte-small
@ -0,0 +1,80 @@

<!-- https://huggingface.co/Supabase/gte-small -->
<!-- This just runs the model without a web worker. Not recommended, since the page will freeze while running inference. -->
<!-- <script type="module">
  import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.0';

  const pipe = await pipeline(
    'feature-extraction',
    'Supabase/gte-small',
  );

  // Generate the embedding from text
  const output = await pipe('Hello world', {
    pooling: 'mean',
    normalize: true,
  });

  // Extract the embedding output
  const embedding = Array.from(output.data);

  console.log(embedding);
</script>
-->
<html>
<head>
  <title>Web Worker Example</title>
</head>
<body>
  <h1>Web Worker Example</h1>
  <form id="text-form">
    <textarea id="text-input" required></textarea>
    <input type="file" id="file-input" accept=".txt">
    <button type="submit">Submit</button>
  </form>

  <form id="query-form" style="display: none;">
    <input type="text" id="query-input" required>
    <button type="submit">Query</button>
  </form>

  <div id="most-similar-sentences"></div>

  <script type="module">
    const worker = new Worker('./dist/worker.bundle.js');
    const mostSimDiv = document.getElementById('most-similar-sentences');

    document.getElementById('text-form').addEventListener('submit', async event => {
      event.preventDefault();
      mostSimDiv.innerText = '';

      let text;
      const file = document.getElementById('file-input').files[0];
      if (file) {
        text = await file.text();
      } else {
        text = document.getElementById('text-input').value;
      }

      worker.postMessage({ type: 'text', text });
      document.getElementById('query-form').style.display = 'block';
    });

    document.getElementById('query-form').addEventListener('submit', event => {
      event.preventDefault();
      mostSimDiv.innerText = 'thinking...';

      const query = document.getElementById('query-input').value;
      worker.postMessage({ type: 'query', query });
    });

    worker.onmessage = function (event) {
      console.log(event.data);
      mostSimDiv.innerText = event.data.join('\n');
    };
  </script>
</body>
</html>
@ -0,0 +1,24 @@

{
  "name": "usingwebpack",
  "version": "1.0.0",
  "description": "",
  "main": "./workers/embeddingModel.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "build": "webpack"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "@babel/core": "^7.23.7",
    "@babel/preset-env": "^7.23.7",
    "babel-loader": "^9.1.3",
    "webpack": "^5.89.0",
    "webpack-cli": "^5.1.4"
  },
  "dependencies": {
    "@xenova/transformers": "^2.13.4",
    "idb": "^8.0.0"
  }
}
@ -0,0 +1,26 @@

// webpack.config.js
const path = require('path');

module.exports = {
  mode: 'development',
  entry: './workers/embeddingModel.js', // the web worker entry point
  output: {
    filename: 'worker.bundle.js',
    path: path.resolve(__dirname, 'dist'),
  },
  module: {
    rules: [
      {
        test: /\.js$/,
        exclude: /node_modules/,
        use: {
          loader: 'babel-loader',
          options: {
            presets: ['@babel/preset-env'],
          },
        },
      },
    ],
  },
  target: 'webworker',
};
@ -0,0 +1 @@

[{"text":"text2","vector":[4,5,6],"source":"source2"},{"text":"text2","vector":[4,5,6],"source":"source2"}]
@ -0,0 +1,105 @@

import { pipeline } from "@xenova/transformers";
import { VectorStorage } from "./vectorDB";

// Create an instance of VectorStorage
const vectorStorage = new VectorStorage();

// Load the embedding pipeline once, instead of on every message
const pipePromise = pipeline("feature-extraction", "Supabase/gte-small");

self.onmessage = async function (event) {
  let sentences = [];
  const pipe = await pipePromise;

  if (event.data.type === "text") {
    // Check that the text is a string before splitting
    if (typeof event.data.text === "string") {
      // Split the text into sentences
      sentences = event.data.text.split(". ");
    } else {
      console.error("event.data.text is not a string:", event.data.text);
    }

    // Generate an embedding for each sentence
    const embeddings = await Promise.all(
      sentences.map((sentence) =>
        pipe(sentence, {
          pooling: "mean",
          normalize: true,
        })
      )
    );

    // Store the sentences and their corresponding embeddings
    self.sentences = sentences;
    self.embeddings = embeddings.map((output) => Array.from(output.data));

    // Save each sentence and its embedding to the vector storage
    self.sentences.forEach((sentence, index) => {
      vectorStorage.addVector(
        sentence,
        self.embeddings[index],
        "embeddingModel"
      );
    });
  } else if (event.data.type === "query") {
    // Generate an embedding for the query string
    const queryEmbedding = Array.from(
      (
        await pipe(event.data.query, {
          pooling: "mean",
          normalize: true,
        })
      ).data
    );

    // Set the number of similar sentences to return
    const numSimilarSentences = 5;

    // Calculate the cosine similarity for each sentence
    const similarities = self.embeddings.map((embedding) =>
      cosineSimilarity(embedding, queryEmbedding)
    );

    // Create an array of indices sorted by their corresponding sentence's similarity to the query string
    const sortedIndices = Array.from(
      { length: similarities.length },
      (_, i) => i
    ).sort((a, b) => similarities[b] - similarities[a]);

    // Return the top n sentences
    postMessage(
      sortedIndices
        .slice(0, numSimilarSentences)
        .map((index) => self.sentences[index])
    );
  }
};

function cosineSimilarity(a, b) {
  const dotProduct = a.reduce((sum, a_i, i) => sum + a_i * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, a_i) => sum + a_i * a_i, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, b_i) => sum + b_i * b_i, 0));
  return dotProduct / (magnitudeA * magnitudeB);
}
@ -0,0 +1,90 @@

// Import the idb library (IndexedDB with promises) for easier IndexedDB usage
import { openDB } from 'idb';

export class VectorStorage {
  constructor() {
    // Open (or create) the database
    this.dbPromise = openDB('vectorStorage', 1, {
      upgrade(db) {
        if (!db.objectStoreNames.contains('vectors')) {
          db.createObjectStore('vectors', { keyPath: 'text' });
        }
      },
    });
  }

  async addVector(text, vector, source) {
    // Open a transaction and get the object store
    const db = await this.dbPromise;
    const tx = db.transaction('vectors', 'readwrite');
    const store = tx.objectStore('vectors');

    // Check if a vector with the same text already exists
    const existingVector = await store.get(text);
    if (existingVector) {
      // If a vector with the same text already exists, ignore the new vector
      console.log(`A vector with the text "${text}" already exists.`);
    } else {
      // Otherwise, add the new vector
      await store.put({ text, vector, source });
    }

    // Wait for the transaction to complete
    await tx.done;
  }

  // Delete a vector
  async deleteVector(text) {
    const db = await this.dbPromise;
    const tx = db.transaction('vectors', 'readwrite');
    await tx.objectStore('vectors').delete(text);

    // Wait for the transaction to complete
    await tx.done;
  }

  // Get a vector by its text (the object store's key)
  async getVectorByText(text) {
    const db = await this.dbPromise;
    return db.get('vectors', text);
  }

  // Get all vectors from a given source
  async getVectorBySource(source) {
    const vectors = await this.getAllVectors();
    return vectors.filter((item) => item.source === source);
  }

  // Get all vectors
  async getAllVectors() {
    const db = await this.dbPromise;
    return db.getAll('vectors');
  }

  // Calculate cosine similarity
  cosineSimilarity(a, b) {
    const dotProduct = a.reduce((sum, a_i, i) => sum + a_i * b[i], 0);
    const magnitudeA = Math.sqrt(a.reduce((sum, a_i) => sum + a_i * a_i, 0));
    const magnitudeB = Math.sqrt(b.reduce((sum, b_i) => sum + b_i * b_i, 0));
    return dotProduct / (magnitudeA * magnitudeB);
  }

  // Search stored vectors by cosine similarity to a query vector
  async searchByCosineSimilarity(queryVector, k) {
    const vectors = await this.getAllVectors();

    // Calculate the cosine similarity for each vector
    const similarities = vectors.map((item) => this.cosineSimilarity(item.vector, queryVector));

    // Create an array of indices sorted by their corresponding vector's similarity to the query vector
    const sortedIndices = Array.from({ length: similarities.length }, (_, i) => i)
      .sort((a, b) => similarities[b] - similarities[a]);

    // Return the top k items
    return sortedIndices.slice(0, k).map((index) => vectors[index]);
  }
}

// Usage
// const vectorStorage = new VectorStorage();
// await vectorStorage.addVector('text1', [1, 2, 3], 'source1');
// await vectorStorage.addVector('text2', [4, 5, 6], 'source2');
// console.log(await vectorStorage.searchByCosineSimilarity([1, 2, 3], 1)); // [ { text: 'text1', vector: [ 1, 2, 3 ], source: 'source1' } ]
// console.log(await vectorStorage.getAllVectors());
// await vectorStorage.deleteVector('text1');
// console.log(await vectorStorage.getAllVectors()); // [ { text: 'text2', vector: [ 4, 5, 6 ], source: 'source2' } ]