[docs] Add tutorial + example app for server-side whisper (#147)
* Update typo in node tutorial
* Create node audio processing tutorial
* Point to tutorial in `read_audio` function
* Rename `.md` to `.mdx`
* Add node audio processing tutorial to table of contents
* Add link to model in tutorial
* Update error message grammar
This commit is contained in:
parent 35b9e21193
commit 573012b434
@ -17,6 +17,8 @@
       title: Building an Electron Application
     - local: tutorials/node
       title: Server-side Inference in Node.js
+    - local: tutorials/node-audio-processing
+      title: Server-side Audio Processing in Node.js
     title: Tutorials
 - sections:
   - local: api/transformers
@ -0,0 +1,102 @@

# Server-side Audio Processing in Node.js

A major benefit of writing code for the web is that you can access the multitude of APIs that are available in modern browsers. Unfortunately, when writing server-side code, we are not afforded such luxury, so we have to find another way. In this tutorial, we will design a simple Node.js application that uses Transformers.js for speech recognition with [Whisper](https://huggingface.co/Xenova/whisper-tiny.en), and in the process, learn how to process audio on the server.

The main problem we need to solve is that the [Web Audio API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API) is not available in Node.js, meaning we can't use the [`AudioContext`](https://developer.mozilla.org/en-US/docs/Web/API/AudioContext) class to process audio. So, we will need to install third-party libraries to obtain the raw audio data. For this example, we will only consider `.wav` files, but the same principles apply to other audio formats.

<Tip>

This tutorial will be written as an ES module, but you can easily adapt it to use CommonJS instead. For more information, see the [node tutorial](https://huggingface.co/docs/transformers.js/tutorials/node).

</Tip>
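For reference, a CommonJS entry point could look something like the sketch below. This is an assumption about how you might structure it, not part of the tutorial's source: since `@xenova/transformers` is distributed as an ES module, CommonJS code has to load it with a dynamic `import()` rather than `require()`.

```javascript
// index.cjs — hypothetical CommonJS variant of this tutorial's entry point.
// @xenova/transformers ships as an ES module, so from CommonJS it must be
// loaded with a dynamic import() instead of require().
async function main() {
  const { pipeline } = await import('@xenova/transformers');
  const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
  // Audio loading and transcription then proceed exactly as in the ESM version.
}

main().catch(console.error);
```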
**Useful links:**
- [Source code](https://github.com/xenova/transformers.js/tree/main/examples/node-audio-processing)
- [Documentation](https://huggingface.co/docs/transformers.js)

## Prerequisites

- [Node.js](https://nodejs.org/en/) version 16+
- [npm](https://www.npmjs.com/) version 7+

## Getting started

Let's start by creating a new Node.js project and installing Transformers.js via [NPM](https://www.npmjs.com/package/@xenova/transformers):

```bash
npm init -y
npm i @xenova/transformers
```

<Tip>

Remember to add `"type": "module"` to your `package.json` to indicate that your project uses ECMAScript modules.

</Tip>

Next, let's install the [`wavefile`](https://www.npmjs.com/package/wavefile) package, which we will use for loading `.wav` files:

```bash
npm i wavefile
```

## Creating the application

Start by creating a new file called `index.js`, which will be the entry point for our application. Let's also import the necessary modules:

```js
import { pipeline } from '@xenova/transformers';
import wavefile from 'wavefile';
```

For this tutorial, we will use the `Xenova/whisper-tiny.en` model, but feel free to choose one of the other Whisper models from the [Hugging Face Hub](https://huggingface.co/models?library=transformers.js&search=whisper). Let's create our pipeline with:

```js
let transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
```

Next, let's load an audio file and convert it to the format required by Transformers.js:

```js
// Load audio data
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
let buffer = Buffer.from(await fetch(url).then(x => x.arrayBuffer()));

// Read .wav file and convert it to the required format
let wav = new wavefile.WaveFile(buffer);
wav.toBitDepth('32f'); // Pipeline expects input as a Float32Array
wav.toSampleRate(16000); // Whisper expects audio with a sampling rate of 16000
let audioData = wav.getSamples();
if (Array.isArray(audioData)) {
  // For this demo, if there are multiple channels for the audio file, we just select the first one.
  // In practice, you'd probably want to convert all channels to a single channel (e.g., stereo -> mono).
  audioData = audioData[0];
}
```
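If you do want a proper mono downmix rather than just the first channel, a minimal sketch (assuming equal-length `Float32Array` channels, as `wav.getSamples()` returns for a multi-channel file; the `toMono` helper is hypothetical, not part of either library) is to average the channels sample by sample:

```javascript
// Downmix a multi-channel recording to mono by averaging the channels
// sample by sample. Assumes all channels have the same length.
function toMono(channels) {
  const numSamples = channels[0].length;
  const mono = new Float32Array(numSamples);
  for (let i = 0; i < numSamples; ++i) {
    let sum = 0;
    for (const channel of channels) {
      sum += channel[i];
    }
    mono[i] = sum / channels.length;
  }
  return mono;
}

// Usage in the snippet above: audioData = toMono(audioData);
```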
Finally, let's run the model and measure execution duration:

```js
let start = performance.now();
let output = await transcriber(audioData);
let end = performance.now();
console.log(`Execution duration: ${(end - start) / 1000} seconds`);
console.log(output);
```
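One common way to put that duration in context is the real-time factor: how many seconds of audio are transcribed per second of compute. A small sketch (the `realTimeFactor` helper is hypothetical, not part of Transformers.js):

```javascript
// Real-time factor: seconds of audio transcribed per second of compute.
// E.g. an 11-second clip processed in ~0.65 seconds runs at roughly 17x real time.
function realTimeFactor(numSamples, sampleRate, executionSeconds) {
  const audioSeconds = numSamples / sampleRate;
  return audioSeconds / executionSeconds;
}

// With the 16 kHz samples and timings from above:
// realTimeFactor(audioData.length, 16000, (end - start) / 1000)
```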
You can now run the application with `node index.js`. Note that when running the script for the first time, it may take a while to download and cache the model. Subsequent requests will use the cached model, and model loading will be much faster.

You should see output similar to:
```
Execution duration: 0.6460317999720574 seconds
{
  text: ' And so my fellow Americans ask not what your country can do for you. Ask what you can do for your country.'
}
```

That's it! You've successfully created a Node.js application that uses Transformers.js for speech recognition with Whisper. You can now use this as a starting point for your own applications.
@ -45,7 +45,7 @@ We'll also create a helper class called `MyClassificationPipeline` control the l

 ### ECMAScript modules (ESM)

-To indicate that your project uses ECMAScript modules, you need to add `type: "module"` to your `package.json`:
+To indicate that your project uses ECMAScript modules, you need to add `"type": "module"` to your `package.json`:

 ```json
 {
@ -0,0 +1,28 @@
import { pipeline } from '@xenova/transformers';
import wavefile from 'wavefile';

// Load model
let transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');

// Load audio data
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
let buffer = Buffer.from(await fetch(url).then(x => x.arrayBuffer()));

// Read .wav file and convert it to the required format
let wav = new wavefile.WaveFile(buffer);
wav.toBitDepth('32f'); // Pipeline expects input as a Float32Array
wav.toSampleRate(16000); // Whisper expects audio with a sampling rate of 16000
let audioData = wav.getSamples();
if (Array.isArray(audioData)) {
  // For this demo, if there are multiple channels for the audio file, we just select the first one.
  // In practice, you'd probably want to convert all channels to a single channel (e.g., stereo -> mono).
  audioData = audioData[0];
}

// Run model
let start = performance.now();
let output = await transcriber(audioData);
let end = performance.now();
console.log(`Execution duration: ${(end - start) / 1000} seconds`);
console.log(output);
// { text: ' And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.' }
@ -0,0 +1,17 @@
{
  "name": "audio-processing",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "type": "module",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "@xenova/transformers": "^2.2.0",
    "wavefile": "^11.0.0"
  }
}
@ -23,14 +23,14 @@ export async function read_audio(url, sampling_rate) {
         // Running in node or an environment without AudioContext
         throw Error(
             "Unable to load audio from path/URL since `AudioContext` is not available in your environment. " +
-            "As a result, audio data must be passed directly to the processor. " +
-            "If you are running in node.js, you can use an external library (e.g., https://github.com/audiojs/web-audio-api) to do this."
+            "Instead, audio data should be passed directly to the pipeline/processor. " +
+            "For more information and some example code, see https://huggingface.co/docs/transformers.js/tutorials/node-audio-processing."
         )
     }

     const response = await (await getFile(url)).arrayBuffer();
     const audioCTX = new AudioContext({ sampleRate: sampling_rate });
-    if(typeof sampling_rate === 'undefined') {
+    if (typeof sampling_rate === 'undefined') {
         console.warn(`No sampling rate provided, using default of ${audioCTX.sampleRate}Hz.`)
     }
     const decoded = await audioCTX.decodeAudioData(response);