## Sequence to Sequence

This directory contains examples for finetuning and evaluating transformers on summarization and translation tasks.
Summarization support is more mature than translation support.
Please tag @sshleifer with any issues/unexpected behaviors, or send a PR!
For `bertabs` instructions, see `bertabs/README.md`. 


### Data
XSUM Data:
```bash
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/xsum.tar.gz
tar -xzvf xsum.tar.gz
export XSUM_DIR=${PWD}/xsum
```
This should make a directory called `xsum/` with files like `test.source`.
To use your own data, copy that file format. Each article to be summarized is on its own line.

CNN/DailyMail Data:
```bash
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm.tgz
tar -xzvf cnn_dm.tgz

export CNN_DIR=${PWD}/cnn_dm
```

WMT16 English-Romanian Translation Data:
```bash
cd examples/seq2seq
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
export ENRO_DIR=${PWD}/wmt_en_ro
```

If you are using your own data, it must be formatted as one directory with 6 files: train.source, train.target, val.source, val.target, test.source, test.target.  
The `.source` files are the input, the `.target` files are the desired output.
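
Before launching a long training run on your own data, it can help to sanity-check the directory layout described above. The following is a minimal sketch, not part of this repository: `check_seq2seq_dir` is a hypothetical helper, and it assumes that line `i` of a `.source` file pairs with line `i` of the matching `.target` file.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")


def check_seq2seq_dir(data_dir):
    """Return {split: n_examples}; raise if a file is missing or a pair is misaligned."""
    data_dir = Path(data_dir)
    counts = {}
    for split in SPLITS:
        src = data_dir / f"{split}.source"
        tgt = data_dir / f"{split}.target"
        for path in (src, tgt):
            if not path.is_file():
                raise FileNotFoundError(f"missing {path}")
        # Assumption: one example per line, so line counts must match.
        with src.open(encoding="utf-8") as f:
            n_src = sum(1 for _ in f)
        with tgt.open(encoding="utf-8") as f:
            n_tgt = sum(1 for _ in f)
        if n_src != n_tgt:
            raise ValueError(f"{split}: {n_src} source lines but {n_tgt} target lines")
        counts[split] = n_src
    return counts
```

For example, `check_seq2seq_dir("xsum")` should return the number of examples per split if the download and extraction succeeded.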

 
### Tips and Tricks

General Tips:
- Since you need to run from `examples/seq2seq`, and likely need to modify code, the easiest workflow is to fork transformers, clone your fork, and run `pip install -e .` before you get started.
- Try `--freeze_encoder` or `--freeze_embeds` for faster training/larger batch size. (3hr per epoch with bs=8, see the "xsum_shared_task" command below)
- `fp16_opt_level=O1` (the default) works best.
- In addition to the pytorch-lightning .ckpt checkpoint, a transformers checkpoint will be saved.
Load it with `BartForConditionalGeneration.from_pretrained(f'{output_dir}/best_tfmr')`.
- At the moment, `--do_predict` does not work in a multi-gpu setting. You need to use `evaluate_checkpoint` or the `run_eval.py` code.
- This warning can be safely ignored: 
    > "Some weights of BartForConditionalGeneration were not initialized from the model checkpoint at facebook/bart-large-xsum and are newly initialized: ['final_logits_bias']"
- Both finetuning and eval are 30% faster with `--fp16`. For that you need to [install apex](https://github.com/NVIDIA/apex#quick-start).
- Read scripts before you run them! 

Summarization Tips:
- 1 epoch at batch size 1 for bart-large takes 24 hours and requires 13GB GPU RAM with fp16 on an NVIDIA-V100.
- If you want to run experiments on improving the summarization finetuning process, try the XSUM Shared Task (below). It's faster to train than CNNDM because the summaries are shorter.
- For CNN/DailyMail, the default `val_max_target_length` and `test_max_target_length` will truncate the ground truth labels, resulting in slightly higher ROUGE scores. To get accurate ROUGE scores, you should rerun `calculate_rouge` on the `{output_dir}/test_generations.txt` file saved by `trainer.test()`.
- `--max_target_length=60 --val_max_target_length=60 --test_max_target_length=100 ` is a reasonable setting for XSUM.
- `wandb` can be used by specifying `--logger_name wandb`. It is useful for reproducibility. Specify the environment variable `WANDB_PROJECT='hf_xsum'` to do the XSUM shared task.
- If you are finetuning on your own dataset, start from `distilbart-cnn-12-6` if you want long summaries and `distilbart-xsum-12-6` if you want short summaries.
(It rarely makes sense to start from `bart-large` unless you are researching finetuning methods).
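
To see why truncated ground truth labels inflate scores (the CNN/DailyMail tip above), here is a toy illustration. It is a sketch under loud assumptions: `unigram_f1` is a simplified stand-in for ROUGE-1, not the repository's `calculate_rouge`, and `truncate` models `max_target_length` with whitespace tokens rather than the model's subword tokens.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """Toy unigram-overlap F1; a stand-in for ROUGE-1, not the repo's calculate_rouge."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def truncate(text, max_tokens):
    """Keep the first max_tokens whitespace tokens (a crude model of max_target_length)."""
    return " ".join(text.split()[:max_tokens])


generated = "the cat sat on the mat"
reference = "the cat sat on the mat while the dog slept by the door"

# Scoring against a truncated reference hides the content the summary missed.
score_truncated = unigram_f1(generated, truncate(reference, 6))
score_full = unigram_f1(generated, reference)
print(score_truncated, score_full)
```

The truncated reference drops exactly the words the generated summary failed to cover, so recall (and therefore F1) looks better than it is against the full reference.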

**Update 2018-07-18**
Datasets: `Seq2SeqDataset` will be used for all models besides MBart, for which `MBartDataset` will be used.
A new dataset is needed to support multilingual tasks.

### Summarization Finetuning
Run/modify `finetune.sh`

The following command should work on a 16GB GPU:
```bash
./finetune.sh \
    --data_dir $XSUM_DIR \
    --train_batch_size=1 \
    --eval_batch_size=1 \
    --output_dir=xsum_results \
    --num_train_epochs 1 \
    --model_name_or_path facebook/bart-large
```

### Translation Finetuning

First, follow the wmt_en_ro download instructions.
Then you can finetune mbart_cc25 on English-Romanian with the following command.
**Recommendation:** Read and potentially modify the fairly opinionated defaults in the `train_mbart_cc25_enro.sh` script before running it.
```bash
export ENRO_DIR=${PWD}/wmt_en_ro   # may need to be fixed depending on where you downloaded
export MAX_LEN=128
export BS=4
export GAS=8
./train_mbart_cc25_enro.sh --output_dir cc25_v1_frozen/
```


### Finetuning Outputs 
As you train, `output_dir` will be filled with files that look roughly like this (comments are mine).
Some of them are metrics, some of them are checkpoints, some of them are metadata. Here is a quick tour:

```bash
output_dir
├── best_tfmr  # this is a huggingface checkpoint generated by save_pretrained. It is the same model as the PL .ckpt file below
│   ├── config.json
│   ├── merges.txt
│   ├── pytorch_model.bin
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── git_log.json   # repo, branch, and commit hash
├── val_avg_rouge2=0.1984-step_count=11.ckpt  # this is a pytorch lightning checkpoint associated with the best val score.
├── metrics.json  # new validation metrics will continually be appended to this
├── student  # this is a huggingface checkpoint generated by SummarizationDistiller. It is the student before it gets finetuned.
│   ├── config.json
│   └── pytorch_model.bin
├── test_generations.txt 
# ^^ are the summaries or translations produced by your best checkpoint on the test data. Populated when training is done 
├── test_results.txt  # a convenience file with the test set metrics. This data is also in metrics.json['test']
├── hparams.pkl  # the command line args passed after some light preprocessing. Should be saved fairly quickly.
```
After training, you can recover the best checkpoint by running
```python
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(f'{output_dir}/best_tfmr')
```
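
If several `.ckpt` files accumulate in `output_dir`, the metric value embedded in the file name can be used to pick one. The sketch below is hypothetical helper code: it assumes every checkpoint follows the `metric=value-step_count=N.ckpt` pattern shown in the tree above, and that a higher metric is better (true for ROUGE and BLEU, not for a loss).

```python
import re

# Pattern inferred from the example name 'val_avg_rouge2=0.1984-step_count=11.ckpt'.
CKPT_PATTERN = re.compile(
    r"^(?P<metric>\w+)=(?P<value>\d+(?:\.\d+)?)-step_count=(?P<step>\d+)\.ckpt$"
)


def parse_ckpt_name(filename):
    """Split a checkpoint file name into (metric, value, step), or return None."""
    match = CKPT_PATTERN.match(filename)
    if match is None:
        return None
    return match["metric"], float(match["value"]), int(match["step"])


def best_ckpt(filenames):
    """Return the file name with the highest metric value, ignoring non-matching names."""
    parsed = [(parse_ckpt_name(name), name) for name in filenames]
    scored = [(info[1], name) for info, name in parsed if info is not None]
    return max(scored)[1] if scored else None
```

In practice you would pass it something like `os.listdir(output_dir)`; files such as `metrics.json` are skipped because they do not match the pattern.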

### Evaluation Commands

To create summaries for each article in a dataset, we use `run_eval.py`. Here are a few commands that run eval for different tasks and models.
If 'translation' is in your task name, the computed metric will be BLEU. Otherwise, ROUGE will be used.
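
The metric rule above amounts to a substring check on the task name. As a minimal sketch (the function name `metric_for_task` is hypothetical and only mirrors the sentence above, not the actual code in `run_eval.py`):

```python
def metric_for_task(task):
    """Mirror the stated rule: BLEU if 'translation' is in the task name, else ROUGE."""
    return "bleu" if "translation" in task else "rouge"


for task in ("translation_en_to_ro", "translation", "summarization"):
    print(task, "->", metric_for_task(task))
```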

For t5, you need to specify `--task translation_{src}_to_{tgt}` as follows:
```bash
export DATA_DIR=wmt_en_ro
python run_eval.py t5-base \
    $DATA_DIR/val.source t5_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path enro_bleu.json \
    --task translation_en_to_ro \
    --n_obs 100 \
    --device cuda \
    --fp16 \
    --bs 32
```

This command works for MBART, although the BLEU score is suspiciously low.
```bash
export DATA_DIR=wmt_en_ro
python run_eval.py facebook/mbart-large-en-ro $DATA_DIR/val.source mbart_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path enro_bleu.json \
    --task translation \
    --n_obs 100 \
    --device cuda \
    --fp16 \
    --bs 32
```

Summarization (xsum will be very similar):
```bash
export DATA_DIR=cnn_dm
python run_eval.py sshleifer/distilbart-cnn-12-6 $DATA_DIR/val.source dbart_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path cnn_rouge.json \
    --task summarization \
    --n_obs 100 \
    --device cuda \
    --fp16 \
    --bs 32
```


### DistilBART
![DBART](https://huggingface.co/front/thumbnails/distilbart_large.png)

For the CNN/DailyMail dataset (relatively longer, more extractive summaries), we found a simple technique that works:
you just copy alternating layers from `bart-large-cnn` and finetune more on the same data.
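
The layer-copying idea can be sketched in a few lines. This is a simplified illustration, not the logic in `initialization_utils.py`: the helper names are hypothetical, the real mapping from student layers to teacher layers may differ, and the "layers" here are placeholder strings rather than weight-carrying modules (which would need to be deep-copied).

```python
def alternating_layer_ids(n_teacher_layers, n_student_layers):
    """Pick every k-th teacher layer index, where k = n_teacher_layers // n_student_layers."""
    if not 0 < n_student_layers <= n_teacher_layers:
        raise ValueError("student must have between 1 and n_teacher_layers layers")
    step = n_teacher_layers // n_student_layers
    return list(range(0, n_teacher_layers, step))[:n_student_layers]


def init_student_layers(teacher_layers, n_student_layers):
    """Build the student's layer list from the selected teacher layers."""
    ids = alternating_layer_ids(len(teacher_layers), n_student_layers)
    return [teacher_layers[i] for i in ids]


teacher = [f"teacher_layer_{i}" for i in range(12)]  # placeholder strings, not real modules
print(alternating_layer_ids(12, 6))
print(init_student_layers(teacher, 3))
```

With a 12-layer teacher and a 6-layer student, this picks every second layer; the student then starts finetuning from weights that already encode much of what the teacher learned.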

For the XSUM dataset, that approach didn't work as well, so we used the same initialization strategy followed by a combination of DistilBERT's ce_loss and the hidden states MSE loss used in the TinyBERT paper.

You can see the performance tradeoffs of model sizes [here](https://docs.google.com/spreadsheets/d/1EkhDMwVO02m8jCD1cG3RoFPLicpcL1GQHTQjfvDYgIM/edit#gid=0)
and more granular timing results [here](https://docs.google.com/spreadsheets/d/1EkhDMwVO02m8jCD1cG3RoFPLicpcL1GQHTQjfvDYgIM/edit#gid=1753259047&range=B2:I23).

#### No Teacher Distillation
To run the simpler distilbart-cnn style distillation, all you need is data, a GPU, and a properly initialized student.
You don't even need `distillation.py`.

Some [un-finetuned students](https://huggingface.co/models?search=sshleifer%2Fstudent) are available for replication purposes.
They are initialized by copying layers from the associated `bart-large-{cnn|xsum}` teacher using `--init_strategy alternate`. (You can read about that in `initialization_utils.py`)
The command that produced `sshleifer/distilbart-cnn-12-6` is
```bash
./train_distilbart_cnn.sh
```  
runtime: 6H on NVIDIA RTX 24GB GPU

*Note*: You can get the same simple distillation logic by using `./run_distiller.sh --no_teacher` followed by identical arguments as the ones in `train_distilbart_cnn.sh`.
If you are using `wandb` and comparing the two distillation methods, using this entry point will make your logs consistent,
because you will have the same hyperparameters logged in every run.

#### With a teacher
*Note*: only BART variants are supported.

In this method, we try to enforce that the student and teacher produce similar encoder_outputs, logits, and hidden_states using `BartSummarizationDistiller`.
This is how `sshleifer/distilbart-xsum*` checkpoints were produced.

The command that produced `sshleifer/distilbart-xsum-12-6` is:

```bash
./train_distilbart_xsum.sh  
```

runtime: 13H on V-100 16GB GPU. 

### Contributing
- Follow the standard contributing guidelines and code of conduct.
- Add tests to `test_seq2seq_examples.py`.
- To run only the seq2seq tests, you must be in the root of the repository and run:
```bash
pytest examples/seq2seq/  
```