@@ -79,7 +79,7 @@ The latest wikipedia dump can be downloaded [at this link](https://dumps.wikimed
 or via command line:
 
 ```shell
-curl https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
+curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
 ```
 The dump can be extracted with the `wikiextractor` tool.
 
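As a sketch of the extraction step mentioned above (the exact flags depend on the installed `wikiextractor` version, so treat them as assumptions rather than part of this change):

```shell
# Install the extraction tool (assumed to be available on PyPI as `wikiextractor`).
pip install wikiextractor

# Expand the dump into plain-text shards; -o names the output directory and
# --processes controls parallelism. The output path is a placeholder.
python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 \
    -o path/to/raw-wiki-data \
    --processes 4
```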
@@ -126,7 +126,12 @@ The `create_vocabulary.py` script allows you to compute your own WordPiece
 vocabulary for use with BERT. In most cases however, it is desirable to use the
 standard BERT vocabularies from the original models. You can download the
 English uncased vocabulary
-[here](https://storage.googleapis.com/tensorflow/keras-nlp/examples/bert/bert_vocab_uncased.txt).
+[here](https://storage.googleapis.com/tensorflow/keras-nlp/examples/bert/bert_vocab_uncased.txt),
+or in your terminal run:
+
+```shell
+curl -O https://storage.googleapis.com/tensorflow/keras-nlp/examples/bert/bert_vocab_uncased.txt
+```
 
 ### Tokenize, mask, and combine sentences into training examples
 
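A quick sanity check of the downloaded file with standard shell tools (purely illustrative; the expected token count is not stated in this change):

```shell
# Each line of the WordPiece vocabulary file is a single token.
head -n 5 bert_vocab_uncased.txt
wc -l bert_vocab_uncased.txt
```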
@@ -169,7 +174,7 @@ for file in path/to/sentence-split-data/*; do
   output="path/to/pretraining-data/$(basename -- "$file" .txt).tfrecord"
   python examples/bert/create_pretraining_data.py \
     --input_files ${file} \
-    --vocab_file vocab.txt \
+    --vocab_file bert_vocab_uncased.txt \
     --output_file ${output}
 done
 ```
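One small setup note: the loop above writes into `path/to/pretraining-data/`, so that directory may need to exist beforehand (a minimal sketch using the same placeholder path as the diff):

```shell
# Create the destination directory used by the loop above (placeholder path).
mkdir -p path/to/pretraining-data
```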
@@ -183,7 +188,7 @@ for file in path/to/sentence-split-data/*; do
   output="path/to/pretraining-data/$(basename -- "$file" .txt).tfrecord"
   echo python examples/bert/create_pretraining_data.py \
     --input_files ${file} \
-    --vocab_file vocab.txt \
+    --vocab_file bert_vocab_uncased.txt \
     --output_file ${output}
 done | parallel -j ${NUM_JOBS}
 ```
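The parallel variant assumes GNU `parallel` is installed and `NUM_JOBS` is set; one way to set it to the number of available cores (an assumption, not part of this change) is:

```shell
# Run one job per available CPU core; `nproc` reports the core count on Linux.
NUM_JOBS=$(nproc)
```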