[s2s] create doc for pegasus/fsmt replication (#7934)

This commit is contained in:
Stas Bekman 2020-10-20 12:07:52 -07:00 committed by GitHub
parent 96f4828ace
commit 0e24e4c136
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed files with 18 additions and 4 deletions

View File

@ -15,7 +15,8 @@ For `bertabs` instructions, see [`bertabs/README.md`](bertabs/README.md).
## Datasets
#### XSUM:
#### XSUM
```bash
cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
@ -26,6 +27,7 @@ this should make a directory called `xsum/` with files like `test.source`.
To use your own data, copy that files format. Each article to be summarized is on its own line.
#### CNN/DailyMail
```bash
cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
@ -35,7 +37,8 @@ export CNN_DIR=${PWD}/cnn_dm
```
this should make a directory called `cnn_dm/` with 6 files.
#### WMT16 English-Romanian Translation Data:
#### WMT16 English-Romanian Translation Data
download with this command:
```bash
wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
@ -44,13 +47,25 @@ export ENRO_DIR=${PWD}/wmt_en_ro
```
this should make a directory called `wmt_en_ro/` with 6 files.
#### WMT English-German:
#### WMT English-German
```bash
wget https://cdn-datasets.huggingface.co/translation/wmt_en_de.tgz
tar -xzvf wmt_en_de.tgz
export DATA_DIR=${PWD}/wmt_en_de
```
#### FSMT datasets (wmt)
Refer to the scripts starting with `eval_` under:
https://github.com/huggingface/transformers/tree/master/scripts/fsmt
#### Pegasus (multiple datasets)
Multiple eval datasets are available for download from:
https://github.com/stas00/porting/tree/master/datasets/pegasus
#### Private Data
If you are using your own data, it must be formatted as one directory with 6 files:
@ -64,7 +79,6 @@ test.target
```
The `.source` files are the input, the `.target` files are the desired output.
### Tips and Tricks
General Tips: