From 35071007cb1600acf7e8197e42f16e3698dc5f35 Mon Sep 17 00:00:00 2001
From: VictorSanh
Date: Wed, 2 Oct 2019 23:01:36 -0400
Subject: [PATCH] =?UTF-8?q?incoming=20release=20=F0=9F=94=A5=20update=20li?=
 =?UTF-8?q?nks=20to=20arxiv=20preprint?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md                       | 3 +--
 examples/distillation/README.md | 6 +++---
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index b5b7245bd9..4b4f6d5def 100644
--- a/README.md
+++ b/README.md
@@ -120,8 +120,7 @@ At some point in the future, you'll be able to seamlessly move from pre-training
 5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
 6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
 7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
-8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blogpost [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5
-) by Victor Sanh, Lysandre Debut and Thomas Wolf.
+8. **[DistilBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/master/examples/distillation).
 
 These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Peason R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/transformers/examples.html).
 
diff --git a/examples/distillation/README.md b/examples/distillation/README.md
index ad439cf5f8..8436ab95bf 100644
--- a/examples/distillation/README.md
+++ b/examples/distillation/README.md
@@ -2,7 +2,7 @@
 
 This folder contains the original code used to train Distil* as well as examples showcasing how to use DistilBERT and DistilGPT2.
 
-**2019, October 3rd - Update** We release our [NeurIPS workshop paper](TODO LINK) explaining our approach on DistilBERT. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of DistilGPT2. DistilGPT2 is two times faster and 33% smaller than GPT2.
+**2019, October 3rd - Update:** We release our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108) explaining our approach to **DistilBERT**. It includes updated results and further experiments. We applied the same method to GPT2 and release the weights of **DistilGPT2**. DistilGPT2 is two times faster and 33% smaller than GPT2.
 
 **2019, September 19th - Update:** We fixed bugs in the code and released an upadted version of the weights trained with a modification of the distillation loss. DistilBERT now reaches 97% of `BERT-base`'s performance on GLUE, and 86.9 F1 score on SQuAD v1.1 dev set (compared to 88.5 for `BERT-base`). We will publish a formal write-up of our approach in the near future!
 
@@ -12,7 +12,7 @@ Distil* is a class of compressed models that started with DistilBERT. DistilBERT
 
 We have applied the same method to GPT2 and release the weights of the compressed model. On the [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) benchmark, GPT2 reaches a perplexity on the test of 15.84 compared to 19.91 for DistilGPT2 (after fine-tuning on the train set).
 
-For more information on DistilBERT, please refer to our [NeurIPS workshop paper](TODO LINK). The paper superseeds our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performances.
+For more information on DistilBERT, please refer to our [NeurIPS workshop paper](https://arxiv.org/abs/1910.01108). The paper supersedes our [previous blogpost](https://medium.com/huggingface/distilbert-8cf3380435b5) with a different distillation loss and better performance.
 
 Here are the results on the dev sets of GLUE:
 
@@ -88,7 +88,7 @@ python train.py \
     --student_config training_configs/distilbert-base-uncased.json \
     --teacher_type bert \
     --teacher_name bert-base-uncased \
-    --alpha_ce 0.33 --alpha_mlm 0.33 --alpha_cos 0.33 --mlm \
+    --alpha_ce 5.0 --alpha_mlm 2.0 --alpha_cos 1.0 --mlm \
     --freeze_pos_embs \
     --dump_path serialization_dir/my_first_training \
     --data_file data/binarized_text.bert-base-uncased.pickle \
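
The last hunk changes the example `--alpha_ce` / `--alpha_mlm` / `--alpha_cos` weights from an equal 0.33 split to 5.0 / 2.0 / 1.0, putting most emphasis on the soft-target term. As a rough illustration of what those flags weight, here is a minimal sketch of a combined distillation objective, assuming the standard formulation (temperature-scaled soft-target KL, hard-label MLM cross-entropy, hidden-state cosine alignment). The function and argument names are illustrative only; this is not the repository's actual `Distiller` implementation, which also handles token masking and padding that the sketch omits.

```python
# Hedged sketch only: shows how --alpha_ce / --alpha_mlm / --alpha_cos could
# weight three loss terms. Names are made up for this example and do not
# reproduce the code in examples/distillation.
import torch
import torch.nn.functional as F


def weighted_distillation_loss(
    student_logits: torch.Tensor,   # (batch, seq_len, vocab_size)
    teacher_logits: torch.Tensor,   # (batch, seq_len, vocab_size)
    mlm_labels: torch.Tensor,       # (batch, seq_len); -100 marks unmasked tokens
    student_hidden: torch.Tensor,   # (batch, seq_len, hidden_dim)
    teacher_hidden: torch.Tensor,   # (batch, seq_len, hidden_dim)
    alpha_ce: float = 5.0,          # weight of the soft-target (distillation) term
    alpha_mlm: float = 2.0,         # weight of the hard-label MLM term
    alpha_cos: float = 1.0,         # weight of the hidden-state cosine term
    temperature: float = 2.0,
) -> torch.Tensor:
    # Soft-target term: KL between temperature-scaled student/teacher distributions.
    loss_ce = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard-label term: regular masked-language-modeling cross-entropy.
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        mlm_labels.view(-1),
        ignore_index=-100,
    )

    # Cosine term: align student and teacher hidden states (target = +1 everywhere).
    flat_student = student_hidden.view(-1, student_hidden.size(-1))
    flat_teacher = teacher_hidden.view(-1, teacher_hidden.size(-1))
    loss_cos = F.cosine_embedding_loss(
        flat_student, flat_teacher, flat_student.new_ones(flat_student.size(0))
    )

    # Weighted sum controlled by the CLI flags.
    return alpha_ce * loss_ce + alpha_mlm * loss_mlm + alpha_cos * loss_cos
```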