# Wav2Vec2 Contrastive Loss PreTraining examples
The following example showcases how to pretrain a wav2vec2 model using the JAX/Flax backend. Pretraining Wav2Vec2 is rather complex, so it is highly recommended to read the official paper.
JAX/Flax allows you to trace pure functions and compile them into efficient, fused accelerator code on both GPU and TPU. Models written in JAX/Flax are immutable and updated in a purely functional way which enables simple and efficient model parallelism.
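As a tiny illustration of this programming model (a standalone sketch, not taken from the example script):

```python
import jax
import jax.numpy as jnp

@jax.jit  # trace the pure function once, then run the compiled XLA version
def sgd_step(params, grads, lr=1e-3):
    # parameters are never mutated in place; a new pytree of parameters is returned
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = {"w": jnp.ones((2, 2)), "b": jnp.zeros(2)}
grads = {"w": jnp.full((2, 2), 0.1), "b": jnp.full((2,), 0.1)}
params = sgd_step(params, grads)
```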
`run_wav2vec2_pretrain_flax.py` is a lightweight example of how to download and preprocess a dataset from the 🤗 Datasets library or use your own files (jsonlines or csv), then pretrain a wav2vec2 model on it.
For custom datasets in jsonlines format, please see the 🤗 Datasets documentation; you will also find examples of these below.
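For instance, a jsonlines dataset could be loaded like this (a minimal sketch; the `file` column name is an assumption, so match whatever columns the script expects):

```python
from datasets import load_dataset

# train.json contains one JSON object per line, e.g.:
#   {"file": "/path/to/first.wav"}
#   {"file": "/path/to/second.wav"}
dataset = load_dataset("json", data_files={"train": "train.json"})
```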
Let's start by creating a model repository to save the trained model and logs. Here we call the model `wav2vec2-base-robust`, but you can change the model name as you like.
You can do this either directly on huggingface.co (assuming that you are logged in) or via the command line:
```bash
huggingface-cli repo create wav2vec2-base-robust
```
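Equivalently (an alternative we sketch here, assuming the `huggingface_hub` library is installed), the repository can be created from Python:

```python
from huggingface_hub import create_repo

# creates https://huggingface.co/<your-username>/wav2vec2-base-robust
create_repo("wav2vec2-base-robust")
```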
Next we clone the model repository to add the tokenizer and model files.
```bash
git clone https://huggingface.co/<your-username>/wav2vec2-base-robust
```
To ensure that all tensorboard traces will be uploaded correctly, we need to track them. You can run the following command inside your model repo to do so.
```bash
cd wav2vec2-base-robust
git lfs track "*tfevents*"
```
Great, we have set up our model repository. During training, we will automatically push the training logs and model weights to the repo.
Next, let's add a symbolic link to the `run_wav2vec2_pretrain_flax.py` script.
```bash
export MODEL_DIR="./wav2vec2-base-robust"
ln -s ~/transformers/examples/research_projects/jax-projects/wav2vec2/run_wav2vec2_pretrain_flax.py ./
```
## Create the model configuration
Let's first create the model configuration and store it in the model repository. Note that many training parameters can be set in the model configuration, including the parameters of the masking distribution (`mask_time_length`, `mask_time_prob`), dropout (`attention_dropout`, ...), and the trade-off between the contrastive loss and the diversity loss.
Most likely you will need to change these parameters depending on your use case. Again, we highly recommend reading the official paper to better understand which parameters can be set for pretraining.
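For orientation (our paraphrase of the paper, not part of the official steps), the overall pretraining objective is the weighted sum

$$\mathcal{L} = \mathcal{L}_m + \alpha \mathcal{L}_d,$$

where $\mathcal{L}_m$ is the contrastive loss, $\mathcal{L}_d$ is the diversity loss, and $\alpha$ corresponds to `diversity_loss_weight` in the configuration below.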
For this example, we will be using a `"base"`-sized model of Wav2Vec2 with robust layer norm and keep most of the default settings.
model_dir="./wav2vec2-base-robust"
from transformers import Wav2Vec2Config
config = Wav2Vec2Config.from_pretrained(
"facebook/wav2vec2-base",
mask_time_length=10,
mask_time_prob=0.05,
diversity_loss_weight=0.1,
num_negatives=100,
do_stable_layer_norm=True,
feat_extract_norm="layer",
)
config.save_pretrained(model_dir)
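As an optional sanity check (our addition, not a required step), the saved configuration can be loaded back and used to instantiate the Flax pretraining model:

```python
from transformers import FlaxWav2Vec2ForPreTraining, Wav2Vec2Config

# re-load the config we just saved and build a randomly initialized model from it
config = Wav2Vec2Config.from_pretrained("./wav2vec2-base-robust")
model = FlaxWav2Vec2ForPreTraining(config)
```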
## Create a feature extractor configuration
Before we can start the training, we need to define a feature extractor that takes care of normalization, etc. Here we can re-use the feature extractor of `facebook/wav2vec2-base` while making sure that padding is allowed.
model_dir="./wav2vec2-base-robust"
from transformers import Wav2Vec2FeatureExtractor
config = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base", return_attention_mask=True)
config.save_pretrained(model_dir)
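To see what `return_attention_mask=True` buys us, here is a small hypothetical usage example batching two utterances of different lengths (dummy zero arrays stand in for real 16 kHz audio):

```python
import numpy as np
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("./wav2vec2-base-robust")
batch = [np.zeros(16000), np.zeros(24000)]  # dummy 1.0 s and 1.5 s clips at 16 kHz
inputs = feature_extractor(batch, sampling_rate=16000, padding=True, return_tensors="np")
# the shorter clip is zero-padded up to the longest one; the attention mask flags real samples
print(inputs["input_values"].shape)    # (2, 24000)
print(inputs["attention_mask"].shape)  # (2, 24000)
```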
## Train the model
Finally, we can run the example script to train the model:
```bash
./run_wav2vec2_pretrain_flax.py \
    --output_dir=${MODEL_DIR} \
    --num_train_epochs="5" \
    --per_device_train_batch_size="32" \
    --per_device_eval_batch_size="32" \
    --learning_rate="5e-4" \
    --weight_decay="0.01" \
    --warmup_steps="2000" \
    --model_name_or_path=${MODEL_DIR} \
    --dataset_name="librispeech_asr" \
    --dataset_config_name="clean" \
    --train_split_name="train.100" \
    --preprocessing_num_workers="4" \
    --max_duration_in_seconds="10.0" \
    --adam_beta1="0.9" \
    --adam_beta2="0.98" \
    --pad_to_multiple_of="16384" \
    --push_to_hub
```
Note that this script is not fully tested yet, so we cannot guarantee that the above command leads to satisfactory results.