From d844db4005dead629cb31465ca8a0ef6212a2af4 Mon Sep 17 00:00:00 2001
From: VictorSanh
Date: Fri, 11 Oct 2019 16:55:42 -0400
Subject: [PATCH] Add citation bibtex

---
 examples/distillation/README.md | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/examples/distillation/README.md b/examples/distillation/README.md
index 86a729e830..0fbcb5628b 100644
--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
@@ -2,7 +2,7 @@
 
 This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT and DistilGPT2.
 
-**2019, October 3rd - Update** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach on **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2.
+**2019, October 3rd - Update** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach on **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2. **The paper supersedes our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performance. Please use the paper as a reference when comparing/reporting results on DistilBERT.**
 
 **2019, September 19th - Update:** We fixed bugs in the code and released an updated version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 97% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on the SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
 
@@ -12,7 +12,7 @@ Distil* is a class of compressed models that started with DistilBERT. DistilBERT
 
 We have applied the same method to GPT2 and release the weights of the compressed model. On the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test set of 15.0 compared to 18.5 for DistilGPT2 (after fine-tuning on the train set).
 
-For more information on DistilBERT, please refer to our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108). The paper superseeds our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performances.
+For more information on DistilBERT, please refer to our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108).
 
 Here are the results on the dev sets of GLUE:
 
@@ -25,7 +25,7 @@ Here are the results on the dev sets of GLUE:
 
 This part of the library has only been tested with Python 3.6+. There are a few specific dependencies to install before launching a distillation; you can install them with the command `pip install -r requirements.txt`.
 
-**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breakings changes compared to v1.1.0). It is important to note that there is a small internal bug in the current version of PyTorch available on pip that causes a memory leak in our training/distillation. It has been recently fixed and will likely be integrated into the next release. For the moment, we recommend to [compile PyTorch from source](https://github.com/pytorch/pytorch#from-source). Please refer to [issue 1179](https://github.com/huggingface/transformers/issues/1179) for more details.
+**Important note:** The training scripts have been updated to support PyTorch v1.2.0 (there are breaking changes compared to v1.1.0).
 
 ## How to use DistilBERT
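The diff only shows the `## How to use DistilBERT` heading as context; the body of that section is untouched by this patch and is not reproduced here. Purely as an illustration of the kind of usage that section covers, below is a minimal sketch of loading the released checkpoints with the `transformers` library. The shortcut names `distilbert-base-uncased` and `distilgpt2` are assumed to be the published ones, and none of this code comes from the patch itself:

```python
# Not part of the patch: a minimal usage sketch, assuming the published
# checkpoint shortcuts `distilbert-base-uncased` and `distilgpt2`.
import torch
from transformers import (
    DistilBertModel,
    DistilBertTokenizer,
    GPT2LMHeadModel,
    GPT2Tokenizer,
)

# DistilBERT: encode a sentence and take the last hidden states.
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
input_ids = torch.tensor([tokenizer.encode("Hello, my dog is cute")])
with torch.no_grad():
    last_hidden_states = model(input_ids)[0]  # shape: (batch, seq_len, hidden_size)

# DistilGPT2: score a prefix and greedily pick the next token.
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("distilgpt2")
context = torch.tensor([gpt2_tokenizer.encode("The distilled model")])
with torch.no_grad():
    logits = gpt2_model(context)[0]  # shape: (batch, seq_len, vocab_size)
next_token_id = logits[0, -1].argmax().item()
print(gpt2_tokenizer.decode([next_token_id]))
```

In the library version contemporary with this patch (transformers v2.x), model calls return tuples, hence the `[0]` indexing to pick out the hidden states and logits.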
@@ -134,3 +134,16 @@ python -m torch.distributed.launch \
 **Tips:** Starting distilled training with a good initialization of the model weights is crucial to reach decent performance. In our experiments, we initialized our model from a few layers of the teacher (BERT) itself! Please refer to `scripts/extract.py` and `scripts/extract_distilbert.py` to create a valid initialization checkpoint, and use the `--student_pretrained_weights` argument to use this initialization for the distilled training!
 
 Happy distillation!
+
+## Citation
+
+If you find this resource useful, please cite the following paper:
+
+```
+@inproceedings{sanh2019distilbert,
+  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
+  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
+  booktitle={NeurIPS EMC^2 Workshop},
+  year={2019}
+}
+```
\ No newline at end of file
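The **Tips** context above points to `scripts/extract.py` and `scripts/extract_distilbert.py` for building the student's initialization from a few of the teacher's own layers. The sketch below only illustrates that general idea under simplifying assumptions: it keeps the teacher's parameter names (the real scripts also remap them to the student architecture), and the layer selection shown is an arbitrary example, not necessarily the one used for the released models.

```python
# Sketch of the layer-extraction idea only: the real scripts
# (scripts/extract.py / scripts/extract_distilbert.py) also remap parameter
# names to the student architecture; here the teacher's key names are kept.
import re

import torch
from transformers import BertForMaskedLM

teacher = BertForMaskedLM.from_pretrained("bert-base-uncased")
teacher_sd = teacher.state_dict()

# Illustrative choice of 6 teacher layers for a 6-layer student;
# not necessarily the selection used for the released DistilBERT.
KEPT_LAYERS = [0, 2, 4, 6, 8, 10]

student_sd = {}
for name, tensor in teacher_sd.items():
    match = re.search(r"encoder\.layer\.(\d+)\.", name)
    if match is None:
        # Embeddings, LM head, etc.: copy as-is.
        student_sd[name] = tensor
        continue
    teacher_idx = int(match.group(1))
    if teacher_idx in KEPT_LAYERS:
        # Renumber kept layers so the student sees contiguous indices 0..5.
        student_idx = KEPT_LAYERS.index(teacher_idx)
        new_name = name.replace(
            f"encoder.layer.{teacher_idx}.", f"encoder.layer.{student_idx}."
        )
        student_sd[new_name] = tensor

torch.save(student_sd, "student_init.pth")
```

In the actual workflow, a checkpoint produced by the extraction scripts is then handed to the training script through the `--student_pretrained_weights` argument, as described in the tips above.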