From 4ce5f36f78d5c5de6509616110fd4d3c97e2297c Mon Sep 17 00:00:00 2001
From: thomwolf
Date: Wed, 28 Aug 2019 12:14:31 +0200
Subject: [PATCH] update readmes

---
 README.md                       |  5 ++--
 examples/distillation/README.md | 43 ++++++++++++++++++---------------
 2 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/README.md b/README.md
index fdb160d898..de69e69788 100644
--- a/README.md
+++ b/README.md
@@ -12,8 +12,9 @@ The library currently contains PyTorch implementations, pre-trained model weight
4. **[Transformer-XL](https://github.com/kimiyoung/transformer-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
-7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
-8. **[DilBERT](https://github.com/huggingface/pytorch-transformers/tree/master/examples/distillation)** (from HuggingFace), a smaller, faster, and lighter version of BERT leveraging knowledge distillation by Victor Sanh, Thomas Wolf and Lysandre Debut
+7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
+8. **[DilBERT](https://github.com/huggingface/pytorch-transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blog post [Smaller, faster, cheaper, lighter: Introducing DilBERT, a distilled version of BERT](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5) by Victor Sanh, Lysandre Debut and Thomas Wolf.

These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html).

diff --git a/examples/distillation/README.md b/examples/distillation/README.md
index 2eb4b59f8a..c037bd0c24 100644
--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
@@ -1,23 +1,25 @@
# DilBERT

-This section contains examples showcasing how to use DilBERT and the original code to train DilBERT.
+This folder contains the original code used to train DilBERT as well as examples showcasing how to use DilBERT.

-## What is DilBERT?
+## What is DilBERT

-DilBERT stands for DistiLlation-BERT. DilBERT is a small, fast, cheap and light Transformer model: it has 40% less parameters than `bert-base-uncased`, runs 40% faster while preserving 96% on the language understanding capabilties (as shown on the GLUE benchmark). DilBERT is trained by distillation: a technique to compress a large model called the teacher into a smaller model called the student. By applying this compression technique, we obtain a smaller Transformer model that bears a lot of similarities with the original BERT model, while being lighter, smaller and faster. Thus, DilBERT can be an interesting solution to put large Transformer model into production.
+DilBERT stands for Distilled-BERT. DilBERT is a small, fast, cheap and light Transformer model based on the BERT architecture. It has 40% fewer parameters than `bert-base-uncased` and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark. DilBERT is trained using knowledge distillation, a technique to compress a large model (called the teacher) into a smaller model (called the student). By distilling BERT, we obtain a smaller Transformer model that bears a lot of similarity to the original BERT model while being lighter, smaller and faster to run. DilBERT is thus an interesting option for putting large-scale pretrained Transformer models into production.

-For more information on DilBERT, we refer to [our blog post](TODO(Link)).
+For more information on DilBERT, please refer to our [detailed blog post](https://medium.com/huggingface/smaller-faster-cheaper-lighter-introducing-dilbert-a-distilled-version-of-bert-8cf3380435b5).

-## How to use DilBERT?
+## How to use DilBERT

-PyTorch-Transformers includes two pre-trained models:
-- `dilbert-base-uncased`: The language model pretrained by distillation under the supervision of `bert-base-uncased`. The model has 6 layers, 768 dimension and 12 heads, totalizing 66M parameters.
-- `dilbert-base-uncased-distilled-squad`: The `dilbert-base-uncased` finetune by distillation on SQuAD. It reaches a F1 score of 86.2 on the dev set, while `bert-base-uncased` reaches a 88.5 F1 score.
+PyTorch-Transformers includes two pre-trained DilBERT models, currently only provided for English (we are investigating the possibility of training and releasing a multilingual version of DilBERT):

-Using DilBERT is really similar to using BERT. DilBERT uses the same tokenizer as BERT and more specifically `bert-base-uncased`. You should only use this tookenizer as the only pre-trained weights available for now are supervised by `bert-base-uncased`.
+- `dilbert-base-uncased`: A DilBERT English language model pretrained on the same data used to pretrain BERT (a concatenation of the Toronto Book Corpus and full English Wikipedia) using distillation under the supervision of the `bert-base-uncased` version of BERT. The model has 6 layers, a hidden size of 768 and 12 heads, for a total of 66M parameters.
+- `dilbert-base-uncased-distilled-squad`: A version of `dilbert-base-uncased` fine-tuned with (a second step of) knowledge distillation on SQuAD 1.0. This model reaches an F1 score of 86.2 on the dev set (for comparison, the `bert-base-uncased` version of BERT reaches an F1 score of 88.5).
+
+Using DilBERT is very similar to using BERT. DilBERT shares the same tokenizer as BERT's `bert-base-uncased`, even though we also expose it under the `DilBertTokenizer` name to keep the naming consistent across the library's models.

```python
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+tokenizer = DilBertTokenizer.from_pretrained('dilbert-base-uncased')
model = DilBertModel.from_pretrained('dilbert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
@@ -25,17 +27,17 @@
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
```
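+
+Since the student is trained to stay close to its teacher, you can get a quick feel for how similar the two models are by comparing their hidden states on the same input. The snippet below is only a minimal sketch, and it assumes that `BertModel` and `BertTokenizer` as well as `DilBertModel` are importable from `pytorch_transformers`:
+
+```python
+import torch
+from pytorch_transformers import BertModel, BertTokenizer, DilBertModel
+
+# Teacher: bert-base-uncased
+bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+bert = BertModel.from_pretrained('bert-base-uncased')
+
+# Student: dilbert-base-uncased (it shares bert-base-uncased's tokenizer, see above)
+dilbert = DilBertModel.from_pretrained('dilbert-base-uncased')
+
+bert.eval()
+dilbert.eval()
+
+input_ids = torch.tensor(bert_tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)
+
+with torch.no_grad():
+    bert_hidden = bert(input_ids)[0]        # (1, sequence_length, 768)
+    dilbert_hidden = dilbert(input_ids)[0]  # (1, sequence_length, 768)
+
+# Both models produce 768-dimensional token representations, so they can be
+# compared directly, e.g. with a per-token cosine similarity.
+similarity = torch.nn.functional.cosine_similarity(bert_hidden, dilbert_hidden, dim=-1)
+print(similarity.mean())
+```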

-## How to train DilBERT?
+## How to train DilBERT

In the following, we will explain how you can train your own compressed model.

### A. Preparing the data

-The weights we release are trained using a concatenation of Toronto Book Corpus and English Wikipedia (same training data as BERT).
+The weights we release are trained using a concatenation of Toronto Book Corpus and English Wikipedia (same training data as the English version of BERT).

To avoid processing the data several times, we do it once and for all before training. From now on, we will suppose that you have a text file `dump.txt` which contains one sequence per line (a sequence being composed of one or several coherent sentences).

-First, we will binarize the data: we tokenize the data and associate each token to an id.
+First, we will binarize the data, i.e. tokenize the data and convert each token to an index in our model's vocabulary.

```bash
python scripts/binarized_data.py \
@@ -44,7 +46,7 @@ python scripts/binarized_data.py \
    --dump_file data/binarized_text
```

-In the masked language modeling loss, we follow [XLM](https://github.com/facebookresearch/XLM) and smooth the probability of masking with a factor that put more emphasis on rare words. Thus we count the occurences of each tokens in the data:
+Our implementation of the masked language modeling loss follows [XLM](https://github.com/facebookresearch/XLM)'s and smooths the masking probability with a factor that puts more emphasis on rare words. We therefore count the occurrences of each token in the data:

```bash
python scripts/token_counts.py \
    --data_file data/binarized_text.bert-base-uncased.pickle \
@@ -54,19 +56,20 @@

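+For illustration, here is a rough sketch of how such smoothed masking probabilities can be derived from these counts. This is not the exact code used in `train.py`; the pickle format is assumed to be one occurrence count per vocabulary index, and the smoothing exponent `alpha` below is only an example (`alpha = 0.5` corresponds to XLM's square root of inverse frequencies):
+
+```python
+import pickle
+
+import numpy as np
+
+# Assumed format: a pickled sequence with one occurrence count per vocabulary index.
+with open('data/token_counts.bert-base-uncased.pickle', 'rb') as f:
+    counts = pickle.load(f)
+
+alpha = 0.5  # smoothing exponent (illustrative value)
+counts = np.maximum(np.asarray(counts, dtype=np.float64), 1.0)  # avoid zero counts
+weights = counts ** -alpha       # rare tokens get relatively larger weights
+probs = weights / weights.sum()  # normalized sampling distribution over the vocabulary
+
+# During training, the positions to mask in a sequence are then sampled with
+# probabilities proportional to probs[token_id] instead of uniformly.
+```
+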
### B. Training

-Launching a distillation is really simple once you have setup the data:
+Training with distillation is really simple once you have pre-processed the data:

```bash
python train.py \
    --dump_path serialization_dir/my_first_training \
    --data_file data/binarized_text.bert-base-uncased.pickle \
    --token_counts data/token_counts.bert-base-uncased.pickle \
-    --force # It overwrites the `dump_path` if it already exists.
-```
+    --force # overwrites the `dump_path` if it already exists.
+```

-By default, this will launch a training on a single GPU (even if more are available on the cluster). Other parameters are available in the command line, please refer to `train.py`.
+By default, this will launch training on a single GPU (even if more are available on the cluster). Other parameters are available on the command line; please look in `train.py` or run `python train.py --help` to list them.
+
+We highly encourage you to use distributed training when training DilBERT, as the training corpus is quite large. Here's an example that runs distributed training on a single node with 4 GPUs:

-We also highly encourage using distributed training. Here's an example that launchs a distributed traininng on a single node with 4 GPUs:
```bash
export NODE_RANK=0
export N_NODES=1
@@ -92,6 +95,6 @@ python -m torch.distributed.launch \
    --dump_path serialization_dir/with_transform/last_word
```

-**Tips** Start the distillation from some sort of structure initialization is crucial to reach a good final performance. In our experiments, we use initialization from some of the layers of the teacher itself! Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint and add `from_pretrained_weights` and `from_pretrained_config` when launching your distillation!
+**Tips:** Starting the distilled training from a good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (BERT) itself! Please refer to `scripts/extract_for_distil.py` to create a valid initialization checkpoint, and use the `--from_pretrained_weights` and `--from_pretrained_config` arguments to use this initialization for the distilled training!

Happy distillation!
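+
+For the curious, the initialization mentioned in the tips above boils down to keeping the teacher's embeddings and a subset of its encoder layers. The snippet below is only a rough, hypothetical sketch of that idea, not the actual `scripts/extract_for_distil.py` (which defines the exact layer selection and the parameter-name mapping expected by `--from_pretrained_weights`/`--from_pretrained_config`); the layer choice and output path are purely illustrative:
+
+```python
+import torch
+from pytorch_transformers import BertForMaskedLM
+
+teacher = BertForMaskedLM.from_pretrained('bert-base-uncased')
+state_dict = teacher.state_dict()
+
+# Illustrative choice: keep the embeddings, the MLM head and every other encoder layer (6 of 12).
+layers_to_keep = [0, 2, 4, 6, 8, 10]
+
+compressed = {}
+for name, param in state_dict.items():
+    if name.startswith('bert.embeddings') or name.startswith('cls'):
+        compressed[name] = param
+    elif name.startswith('bert.encoder.layer'):
+        layer_idx = int(name.split('.')[3])  # 'bert.encoder.layer.<idx>.<...>'
+        if layer_idx in layers_to_keep:
+            # Re-index the kept layers so the student sees consecutive layers 0..5.
+            new_idx = layers_to_keep.index(layer_idx)
+            compressed[name.replace('layer.%d.' % layer_idx, 'layer.%d.' % new_idx)] = param
+
+torch.save(compressed, 'serialization_dir/bert-base-uncased_6_layers.pth')  # illustrative output path
+```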