[docs] Add tutorial + example app for server-side whisper (#147)

* Update typo in node tutorial

* Create node audio processing tutorial

* Point to tutorial in `read_audio` function

* Rename `.md` to `.mdx`

* Add node audio processing tutorial to table of contents

* Add link to model in tutorial

* Update error message grammar
Joshua Lochner 2023-06-20 23:10:33 +02:00 committed by GitHub
parent 35b9e21193
commit 573012b434
6 changed files with 153 additions and 4 deletions

View File

@ -17,6 +17,8 @@
title: Building an Electron Application
- local: tutorials/node
title: Server-side Inference in Node.js
- local: tutorials/node-audio-processing
title: Server-side Audio Processing in Node.js
title: Tutorials
- sections:
- local: api/transformers

View File

@ -0,0 +1,102 @@
# Server-side Audio Processing in Node.js
A major benefit of writing code for the web is that you can access the multitude of APIs that are available in modern browsers. Unfortunately, when writing server-side code, we are not afforded such luxury, so we have to find another way. In this tutorial, we will design a simple Node.js application that uses Transformers.js for speech recognition with [Whisper](https://huggingface.co/Xenova/whisper-tiny.en), and in the process, learn how to process audio on the server.
The main problem we need to solve is that the [Web Audio API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API) is not available in Node.js, meaning we can't use the [`AudioContext`](https://developer.mozilla.org/en-US/docs/Web/API/AudioContext) class to process audio. So, we will need to install third-party libraries to obtain the raw audio data. For this example, we will only consider `.wav` files, but the same principles apply to other audio formats.
<Tip>
This tutorial will be written as an ES module, but you can easily adapt it to use CommonJS instead. For more information, see the [node tutorial](https://huggingface.co/docs/transformers.js/tutorials/node).
</Tip>
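If you do want to use CommonJS instead, a minimal sketch of the adapted imports is shown below. Since `@xenova/transformers` is distributed as an ES module, the CommonJS version loads it with a dynamic `import()` inside an async wrapper (the file name `index.cjs` is just an example):
```js
// index.cjs: hypothetical CommonJS variant of this tutorial's entry point.
// The ESM-only Transformers.js package is loaded via dynamic import().
const wavefile = require('wavefile');

(async () => {
  const { pipeline } = await import('@xenova/transformers');
  const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
  // ... the rest of the code from this tutorial is unchanged ...
})();
```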
**Useful links:**
- [Source code](https://github.com/xenova/transformers.js/tree/main/examples/node-audio-processing)
- [Documentation](https://huggingface.co/docs/transformers.js)
## Prerequisites
- [Node.js](https://nodejs.org/en/) version 16+
- [npm](https://www.npmjs.com/) version 7+
## Getting started
Let's start by creating a new Node.js project and installing Transformers.js via [NPM](https://www.npmjs.com/package/@xenova/transformers):
```bash
npm init -y
npm i @xenova/transformers
```
<Tip>
Remember to add `"type": "module"` to your `package.json` to indicate that your project uses ECMAScript modules.
</Tip>
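For reference, a minimal `package.json` with this field might look something like the following (the other fields are whatever `npm init -y` generated; the project name here is just an example):
```json
{
  "name": "audio-processing",
  "version": "1.0.0",
  "type": "module"
}
```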
Next, let's install the [`wavefile`](https://www.npmjs.com/package/wavefile) package, which we will use for loading `.wav` files:
```bash
npm i wavefile
```
## Creating the application
Start by creating a new file called `index.js`, which will be the entry point for our application. Let's also import the necessary modules:
```js
import { pipeline } from '@xenova/transformers';
import wavefile from 'wavefile';
```
For this tutorial, we will use the `Xenova/whisper-tiny.en` model, but feel free to choose one of the other Whisper models from the [Hugging Face Hub](https://huggingface.co/models?library=transformers.js&search=whisper). Let's create our pipeline with:
```js
let transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
```
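Since the model files are downloaded on first use, you may want to log loading progress. As a rough sketch (assuming the optional `progress_callback` supported by the pipeline factory), you could create the pipeline like this instead:
```js
// Same pipeline as above, but with a callback that logs download/loading progress.
// The exact shape of the progress events may vary between versions.
let transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en', {
  progress_callback: (data) => console.log(data.status, data.file ?? ''),
});
```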
Next, let's load an audio file and convert it to the format required by Transformers.js:
```js
// Load audio data
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
let buffer = Buffer.from(await fetch(url).then(x => x.arrayBuffer()));
// Read .wav file and convert it to required format
let wav = new wavefile.WaveFile(buffer);
wav.toBitDepth('32f'); // Pipeline expects input as a Float32Array
wav.toSampleRate(16000); // Whisper expects audio with a sampling rate of 16000
let audioData = wav.getSamples();
if (Array.isArray(audioData)) {
  // For this demo, if there are multiple channels for the audio file, we just select the first one.
  // In practice, you'd probably want to convert all channels to a single channel (e.g., stereo -> mono).
  audioData = audioData[0];
}
```
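If you would rather merge the channels than discard all but the first, one simple approach (a sketch, not part of the original example) is to average the samples of every channel into a single mono array:
```js
// Hypothetical alternative to picking the first channel: average all channels into mono.
function toMono(channels) {
  const mono = new Float32Array(channels[0].length);
  for (let i = 0; i < mono.length; ++i) {
    let sum = 0;
    for (const channel of channels) sum += channel[i];
    mono[i] = sum / channels.length;
  }
  return mono;
}

// Usage: replaces the `if (Array.isArray(audioData))` block above.
if (Array.isArray(audioData)) {
  audioData = toMono(audioData);
}
```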
Finally, let's run the model and measure execution duration.
```js
let start = performance.now();
let output = await transcriber(audioData);
let end = performance.now();
console.log(`Execution duration: ${(end - start) / 1000} seconds`);
console.log(output);
```
You can now run the application with `node index.js`. Note that when running the script for the first time, it may take a while to download and cache the model. Subsequent runs will use the cached model, so model loading will be much faster.
You should see output similar to:
```
Execution duration: 0.6460317999720574 seconds
{
  text: ' And so my fellow Americans ask not what your country can do for you. Ask what you can do for your country.'
}
```
That's it! You've successfully created a Node.js application that uses Transformers.js for speech recognition with Whisper. You can now use this as a starting point for your own applications.
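As one possible extension (a sketch that assumes your audio is stored locally and that your installed version supports the `return_timestamps` option), you could transcribe a local `.wav` file and ask for timestamps instead of fetching the sample file over HTTP:
```js
import { readFileSync } from 'fs';

// Read a local .wav file instead of fetching one over HTTP.
let wav = new wavefile.WaveFile(readFileSync('./audio.wav'));
wav.toBitDepth('32f');
wav.toSampleRate(16000);
let audioData = wav.getSamples();
if (Array.isArray(audioData)) audioData = audioData[0];

// Request timestamps along with the transcription (if your version supports this option).
let output = await transcriber(audioData, { return_timestamps: true });
console.log(output);
```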

View File

@ -45,7 +45,7 @@ We'll also create a helper class called `MyClassificationPipeline` control the l
### ECMAScript modules (ESM)
To indicate that your project uses ECMAScript modules, you need to add `type: "module"` to your `package.json`:
To indicate that your project uses ECMAScript modules, you need to add `"type": "module"` to your `package.json`:
```json
{

View File

@ -0,0 +1,28 @@
import { pipeline } from '@xenova/transformers';
import wavefile from 'wavefile';
// Load model
let transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
// Load audio data
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
let buffer = Buffer.from(await fetch(url).then(x => x.arrayBuffer()));
// Read .wav file and convert it to required format
let wav = new wavefile.WaveFile(buffer);
wav.toBitDepth('32f'); // Pipeline expects input as a Float32Array
wav.toSampleRate(16000); // Whisper expects audio with a sampling rate of 16000
let audioData = wav.getSamples();
if (Array.isArray(audioData)) {
  // For this demo, if there are multiple channels for the audio file, we just select the first one.
  // In practice, you'd probably want to convert all channels to a single channel (e.g., stereo -> mono).
  audioData = audioData[0];
}
// Run model
let start = performance.now();
let output = await transcriber(audioData);
let end = performance.now();
console.log(`Execution duration: ${(end - start) / 1000} seconds`);
console.log(output);
// { text: ' And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.' }

View File

@ -0,0 +1,17 @@
{
"name": "audio-processing",
"version": "1.0.0",
"description": "",
"main": "index.js",
"type": "module",
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1"
},
"keywords": [],
"author": "",
"license": "ISC",
"dependencies": {
"@xenova/transformers": "^2.2.0",
"wavefile": "^11.0.0"
}
}

View File

@ -23,14 +23,14 @@ export async function read_audio(url, sampling_rate) {
        // Running in node or an environment without AudioContext
        throw Error(
            "Unable to load audio from path/URL since `AudioContext` is not available in your environment. " +
            "As a result, audio data must be passed directly to the processor. " +
            "If you are running in node.js, you can use an external library (e.g., https://github.com/audiojs/web-audio-api) to do this."
            "Instead, audio data should be passed directly to the pipeline/processor. " +
            "For more information and some example code, see https://huggingface.co/docs/transformers.js/tutorials/node-audio-processing."
        )
    }
    const response = await (await getFile(url)).arrayBuffer();
    const audioCTX = new AudioContext({ sampleRate: sampling_rate });
    if(typeof sampling_rate === 'undefined') {
    if (typeof sampling_rate === 'undefined') {
        console.warn(`No sampling rate provided, using default of ${audioCTX.sampleRate}Hz.`)
    }
    const decoded = await audioCTX.decodeAudioData(response);