@@ -12,33 +12,21 @@ https://arxiv.org/abs/1806.00187
 
 ### Training a new model on WMT'16 En-De
 
-First download the [preprocessed WMT'16 En-De data provided by Google](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8).
-
-Then:
-
-##### 1. Extract the WMT'16 En-De data
-```bash
-TEXT=wmt16_en_de_bpe32k
-mkdir -p $TEXT
-tar -xzvf wmt16_en_de.tar.gz -C $TEXT
-```
-
-##### 2. Preprocess the dataset with a joined dictionary
+##### 1. Preprocess the dataset with a joined dictionary (optional)
 ```bash
-RAW=raw
 TOK=tok
 BIN=bin
 rm -rf $TOK $BIN
 mkdir -p $TOK $BIN
 # train
-cp wmt16_en_de_bpe32k/train.tok.clean.bpe.32000.en $TOK/train.bpe.source
-cp wmt16_en_de_bpe32k/train.tok.clean.bpe.32000.de $TOK/train.bpe.target
+wget -O $TOK/train.bpe.source https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k/tok/train.bpe.source
+wget -O $TOK/train.bpe.target https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k/tok/train.bpe.target
 # val
-cp wmt16_en_de_bpe32k/newstest2013.tok.bpe.32000.en $TOK/val.bpe.source
-cp wmt16_en_de_bpe32k/newstest2013.tok.bpe.32000.de $TOK/val.bpe.target
+wget -O $TOK/val.bpe.source https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k/tok/val.bpe.source
+wget -O $TOK/val.bpe.target https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k/tok/val.bpe.target
 # test
-cat wmt16_en_de_bpe32k/newstest201[456].tok.bpe.32000.en > $TOK/test.bpe.source
-cat wmt16_en_de_bpe32k/newstest201[456].tok.bpe.32000.de > $TOK/test.bpe.target
+wget -O $TOK/test.bpe.source https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k/tok/test.bpe.source
+wget -O $TOK/test.bpe.target https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k/tok/test.bpe.target
 fairseq-preprocess \
     --source-lang source --target-lang target \
     --validpref $TOK/val.bpe \
@@ -50,7 +38,21 @@ fairseq-preprocess \
     --workers 20
 ```
 
-##### 3. Train a model (optional)
+Or you can download the preprocessed data directly
+```bash
+TOK=tok
+BIN=bin
+rm -rf $TOK $BIN
+mkdir -p $TOK $BIN
+wget -O $BIN/dict.source.txt https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k/bin/dict.source.txt
+wget -O $BIN/dict.target.txt https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k/bin/dict.target.txt
+wget -O $BIN/test.source-target.source.bin https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k/bin/test.source-target.source.bin
+wget -O $BIN/test.source-target.source.idx https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k/bin/test.source-target.source.idx
+wget -O $BIN/test.source-target.target.bin https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k/bin/test.source-target.target.bin
+wget -O $BIN/test.source-target.target.idx https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k/bin/test.source-target.target.idx
+```
+
+##### 2. Train a model (optional)
 ```bash
 fairseq-train \
     bin/ \
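
The six `wget` calls added for the tokenized splits differ only in split name (`train`, `val`, `test`) and side (`source`, `target`), so they could also be generated by a loop. The following is a minimal sketch of that idea, not part of the change itself: `BASE_URL`, `list_tok_downloads`, and `DO_DOWNLOAD` are hypothetical names introduced here for illustration, and the script only prints what it would fetch unless `DO_DOWNLOAD=1` is set.

```shell
#!/usr/bin/env bash
# Sketch only: BASE_URL, list_tok_downloads and DO_DOWNLOAD are illustrative
# names, not part of the README or of fairseq.
BASE_URL=https://fastseq.blob.core.windows.net/data/tasks/wmt16_en_de_bpe32k
TOK=tok

# Print one "<local path> <url>" pair per line for every split/side combination.
list_tok_downloads() {
  local split side
  for split in train val test; do
    for side in source target; do
      printf '%s %s\n' "$TOK/$split.bpe.$side" "$BASE_URL/tok/$split.bpe.$side"
    done
  done
}

# Dry run by default; set DO_DOWNLOAD=1 to actually fetch the files with wget.
list_tok_downloads | while read -r dest url; do
  if [ "${DO_DOWNLOAD:-0}" = 1 ]; then
    mkdir -p "$(dirname "$dest")"
    wget -O "$dest" "$url"
  else
    echo "would fetch $url -> $dest"
  fi
done
```

Running it without `DO_DOWNLOAD=1` lists the six destination/URL pairs, which makes it easy to compare them against the `wget` lines in the diff before downloading anything.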