<!---
Copyright 2021 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Speech Recognition Pre-Training

## Wav2Vec2 Speech Pre-Training

The script [`run_wav2vec2_pretraining_no_trainer.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py) can be used to pre-train a [Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html?highlight=wav2vec2) model from scratch.

In the script [`run_wav2vec2_pretraining_no_trainer.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py), a Wav2Vec2 model is pre-trained on audio data alone using [Wav2Vec2's contrastive loss objective](https://arxiv.org/abs/2006.11477).

The following examples show how to pre-train a `"base"`-sized Wav2Vec2 model as well as a `"large"`-sized Wav2Vec2 model using [`accelerate`](https://github.com/huggingface/accelerate).
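
The snippet below is a simplified, illustrative sketch of that contrastive objective, not the script's actual implementation (which lives in the Wav2Vec2 modeling code of `transformers`): for every masked time step, the transformer's context vector has to identify the true quantized latent among `K` distractors sampled from other masked time steps.

```python
# Illustrative sketch only -- simplified from https://arxiv.org/abs/2006.11477.
import torch
import torch.nn.functional as F

def contrastive_loss(context, target, distractors, temperature=0.1):
    # context:     (B, T, D) transformer outputs at masked time steps
    # target:      (B, T, D) true quantized latents for those steps
    # distractors: (B, T, K, D) quantized latents sampled from other masked steps
    candidates = torch.cat([target.unsqueeze(2), distractors], dim=2)  # (B, T, K+1, D)
    # Cosine similarity between each context vector and all K+1 candidates
    logits = F.cosine_similarity(context.unsqueeze(2), candidates, dim=-1) / temperature
    # The true latent always sits at index 0, so this is a (K+1)-way classification
    labels = torch.zeros(logits.shape[:2], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```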

---
**NOTE 1**

Wav2Vec2's pre-training is known to be quite unstable.
It is advised to do a couple of test runs on a smaller dataset,
*e.g.* `--dataset_config_names clean clean --dataset_split_names validation test`,
to find good values for `learning_rate`, `batch_size`, `num_warmup_steps`,
and the optimizer hyper-parameters.
A good metric to observe during training is the gradient norm, which should ideally stay between 0.5 and 2.

---
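
If you want to observe this quantity yourself, here is a minimal sketch, assuming a standard PyTorch training loop (call it after `loss.backward()` and before `optimizer.step()`); the script itself logs training statistics at every `--logging_steps` interval:

```python
import torch

def gradient_norm(model: torch.nn.Module) -> float:
    """Global L2 norm over all parameter gradients."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().norm(2).item() ** 2
    return total ** 0.5
```

If the value consistently leaves the 0.5-2 range, `learning_rate` and `num_warmup_steps` are good first candidates to adjust.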

---
**NOTE 2**

When training a model on large datasets, it is recommended to run the data preprocessing
in a first run in **non-distributed** mode via `--preprocessing_only`, so that
when running the model in **distributed** mode in a second step, the preprocessed data
can easily be loaded on each distributed device.

---
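
The mechanism behind this is the caching of the `datasets` library. A sketch of the idea, assuming the default caching behavior rather than the script's exact preprocessing code: the expensive `.map(...)` is executed once in the first run and written to the cache, so the second, distributed run only has to load the cached Arrow files.

```python
from datasets import load_dataset

# First, non-distributed run: download and preprocess once.
raw = load_dataset("librispeech_asr", "clean", split="validation")

def prepare(example):
    # Placeholder for the real feature extraction / duration filtering.
    example["input_length"] = len(example["audio"]["array"])
    return example

# With load_from_cache_file=True (the default), re-running this map in the
# second, distributed run is a near no-op: the cached result is loaded instead.
processed = raw.map(prepare, load_from_cache_file=True)
```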

### Demo

In this demo run, we pre-train a `"base"`-sized Wav2Vec2 model on just the validation
and test data of [librispeech_asr](https://huggingface.co/datasets/librispeech_asr).

The demo is run on two Titan RTX GPUs (24 GB RAM each). In case you have less RAM available
per device, consider reducing `--per_device_train_batch_size` and/or `--max_duration_in_seconds`.

```bash
accelerate launch run_wav2vec2_pretraining_no_trainer.py \
    --dataset_name="librispeech_asr" \
    --dataset_config_names clean clean \
    --dataset_split_names validation test \
    --model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
    --output_dir="./wav2vec2-pretrained-demo" \
    --max_train_steps="20000" \
    --num_warmup_steps="32000" \
    --gradient_accumulation_steps="8" \
    --learning_rate="0.005" \
    --weight_decay="0.01" \
    --max_duration_in_seconds="20.0" \
    --min_duration_in_seconds="2.0" \
    --logging_steps="1" \
    --saving_steps="10000" \
    --per_device_train_batch_size="8" \
    --per_device_eval_batch_size="8" \
    --adam_beta1="0.9" \
    --adam_beta2="0.98" \
    --adam_epsilon="1e-06" \
    --gradient_checkpointing \
    --mask_time_prob="0.65" \
    --mask_time_length="10"
```

The results of this run can be seen [here](https://wandb.ai/patrickvonplaten/wav2vec2-pretrained-demo/reports/Wav2Vec2-PreTraining-Demo-Run--VmlldzoxMDk3MjAw?accessToken=oa05s1y57lizo2ocxy3k01g6db1u4pt8m6ur2n8nl4cb0ug02ms2cw313kb8ruch).

### Base

To pre-train a `"base"`-sized Wav2Vec2 model, *e.g.* [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base),
on [librispeech_asr](https://huggingface.co/datasets/librispeech_asr), the following command can be run:
```bash
|
|
accelerate launch run_wav2vec2_pretraining_no_trainer.py \
|
|
--dataset_name=librispeech_asr \
|
|
--dataset_config_names clean clean other \
|
|
--dataset_split_names train.100 train.360 train.500 \
|
|
--model_name_or_path="patrickvonplaten/wav2vec2-base-v2" \
|
|
--output_dir="./wav2vec2-pretrained-demo" \
|
|
--max_train_steps="200000" \
|
|
--num_warmup_steps="32000" \
|
|
--gradient_accumulation_steps="4" \
|
|
--learning_rate="0.001" \
|
|
--weight_decay="0.01" \
|
|
--max_duration_in_seconds="20.0" \
|
|
--min_duration_in_seconds="2.0" \
|
|
--logging_steps="1" \
|
|
--saving_steps="10000" \
|
|
--per_device_train_batch_size="8" \
|
|
--per_device_eval_batch_size="8" \
|
|
--adam_beta1="0.9" \
|
|
--adam_beta2="0.98" \
|
|
--adam_epsilon="1e-06" \
|
|
--gradient_checkpointing \
|
|
--mask_time_prob="0.65" \
|
|
--mask_time_length="10"
|
|
```

The experiment was run on 8 V100 GPUs (16 GB RAM each) for 4 days.
In case you have more than 8 GPUs available for a higher effective `batch_size`,
it is recommended to increase the `learning_rate` to `0.005` for faster convergence.
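
As a quick sanity check on what "effective batch size" means here (values taken from the base command above; the 8-GPU count is the reference setup of this section):

```python
per_device_batch_size = 8   # --per_device_train_batch_size
num_gpus = 8                # reference run: 8 V100s
grad_accumulation = 4       # --gradient_accumulation_steps
effective_batch_size = per_device_batch_size * num_gpus * grad_accumulation
print(effective_batch_size)  # 256 audio samples per optimizer step
```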

The results of this run can be seen [here](https://wandb.ai/patrickvonplaten/test/reports/Wav2Vec2-Base--VmlldzoxMTUyODQ0?accessToken=rg6e8u9yizx964k8q47zctq1m4afpvtn1i3qi9exgdmzip6xwkfzvagfajpzj55n) and the checkpoint pre-trained for 85,000 steps can be accessed [here](https://huggingface.co/patrickvonplaten/wav2vec2-base-repro-960h-libri-85k-steps).
### Large

To pre-train a `"large"`-sized Wav2Vec2 model, *e.g.* [facebook/wav2vec2-large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60),
on [librispeech_asr](https://huggingface.co/datasets/librispeech_asr), the following command can be run:
```bash
|
|
accelerate launch run_wav2vec2_pretraining_no_trainer.py \
|
|
--dataset_name=librispeech_asr \
|
|
--dataset_config_names clean clean other \
|
|
--dataset_split_names train.100 train.360 train.500 \
|
|
--output_dir=./test \
|
|
--max_train_steps=200000 \
|
|
--num_warmup_steps=32000 \
|
|
--gradient_accumulation_steps=8 \
|
|
--learning_rate=0.001 \
|
|
--weight_decay=0.01 \
|
|
--max_duration_in_seconds=20.0 \
|
|
--min_duration_in_seconds=2.0 \
|
|
--model_name_or_path=./
|
|
--logging_steps=1 \
|
|
--saving_steps=10000 \
|
|
--per_device_train_batch_size=2 \
|
|
--per_device_eval_batch_size=4 \
|
|
--adam_beta1=0.9 \
|
|
--adam_beta2=0.98 \
|
|
--adam_epsilon=1e-06 \
|
|
--gradient_checkpointing \
|
|
--mask_time_prob=0.65 \
|
|
--mask_time_length=10
|
|
```

The experiment was run on 8 V100 GPUs (16 GB RAM each) for 7 days.
In case you have more than 8 GPUs available for a higher effective `batch_size`,
it is recommended to increase the `learning_rate` to `0.005` for faster convergence.

The results of this run can be seen [here](https://wandb.ai/patrickvonplaten/pretraining-wav2vec2/reports/Wav2Vec2-Large--VmlldzoxMTAwODM4?accessToken=wm3qzcnldrwsa31tkvf2pdmilw3f63d4twtffs86ou016xjbyilh55uoi3mo1qzc) and the checkpoint pre-trained for 120,000 steps can be accessed [here](https://huggingface.co/patrickvonplaten/wav2vec2-large-repro-960h-libri-120k-steps).