[s2s] create doc for pegasus/fsmt replication (#7934)

2020-10-20 12:07:52 -07:00 · 0e24e4c136
parent 96f4828ace
commit 0e24e4c136
1 changed file with 18 additions and 4 deletions
--- a/examples/seq2seq/README.md
+++ b/examples/seq2seq/README.md
@@ -15,7 +15,8 @@ For `bertabs` instructions, see [`bertabs/README.md`](bertabs/README.md).

 ## Datasets

-#### XSUM:
+#### XSUM
+
 ```bash
 cd examples/seq2seq
 wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
@@ -26,6 +27,7 @@ this should make a directory called `xsum/` with files like `test.source`.
 To use your own data, copy that files format. Each article to be summarized is on its own line.

 #### CNN/DailyMail
+
 ```bash
 cd examples/seq2seq
 wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
@@ -35,7 +37,8 @@ export CNN_DIR=${PWD}/cnn_dm
 ```
 this should make a directory called `cnn_dm/` with 6 files.

-#### WMT16 English-Romanian Translation Data:
+#### WMT16 English-Romanian Translation Data
+
 download with this command:
 ```bash
 wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
@@ -44,13 +47,25 @@ export ENRO_DIR=${PWD}/wmt_en_ro
 ```
 this should make a directory called `wmt_en_ro/` with 6 files.

-#### WMT English-German:
+#### WMT English-German
+
 ```bash
 wget https://cdn-datasets.huggingface.co/translation/wmt_en_de.tgz
 tar -xzvf wmt_en_de.tgz
 export DATA_DIR=${PWD}/wmt_en_de
 ```

+#### FSMT datasets (wmt)
+
+Refer to the scripts starting with `eval_` under:
+https://github.com/huggingface/transformers/tree/master/scripts/fsmt
+
+#### Pegasus (multiple datasets)
+
+Multiple eval datasets are available for download from:
+https://github.com/stas00/porting/tree/master/datasets/pegasus
+
+
 #### Private Data

 If you are using your own data, it must be formatted as one directory with 6 files:
@@ -64,7 +79,6 @@ test.target
 ```
 The `.source` files are the input, the `.target` files are the desired output.

-
 ### Tips and Tricks

 General Tips: