```

</details>

### Pre-train a BERT from scratch

This example pre-trains a BERT model with the Wikipedia and Books datasets used
by the original BERT paper. The
[BERT repo](https://github.com/tensorflow/models/blob/master/official/nlp/data/create_pretraining_data.py)
contains detailed information about the Wikipedia dump and
[BookCorpus](https://yknzhu.wixsite.com/mbweb). The pre-training recipe itself
is generic, so you can also apply it to your own corpus.

Please use the script
[`create_pretraining_data.py`](https://github.com/tensorflow/models/blob/master/official/nlp/data/create_pretraining_data.py),
which is branched from the
[BERT research repo](https://github.com/google-research/bert) and adapted for
TF2 symbols and Python 3 compatibility, to generate the processed pre-training
data.
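
As in the original BERT data pipeline, the raw text fed to this script is
expected to contain one sentence per line, with documents separated by a blank
line. A minimal sketch of preparing such an input file, reusing the
`$WORKING_DIR` placeholder from the command below:

```shell
# Illustrative input layout only: one sentence per line, blank line between
# documents. 'local disk or cloud location' is a placeholder, as in the
# command below.
export WORKING_DIR='local disk or cloud location'
mkdir -p $WORKING_DIR/input
cat > $WORKING_DIR/input/input.txt <<'EOF'
The first document starts with this sentence.
It continues with a second sentence on its own line.

A blank line marks the start of the second document.
Its sentences are also written one per line.
EOF
```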

Running the pre-training script requires an input and output directory, as well
as a vocab file. Note that `max_seq_length` will need to match the sequence
length parameter you specify when you run pre-training.

```shell
export WORKING_DIR='local disk or cloud location'
export BERT_DIR='local disk or cloud location'
python models/official/nlp/data/create_pretraining_data.py \
  --input_file=$WORKING_DIR/input/input.txt \
  --output_file=$WORKING_DIR/output/tf_examples.tfrecord \
  --vocab_file=$BERT_DIR/wwm_uncased_L-24_H-1024_A-16/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=512 \
  --max_predictions_per_seq=76 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
```
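
For a full Wikipedia plus BookCorpus corpus you would normally split the raw
text into many shards and process each shard separately, which also lets the
training config pick the outputs up with a wildcard later on. A hedged sketch,
assuming a `shard_*.txt` naming scheme (the flag values simply mirror the
command above):

```shell
# Assumes the raw text was split into $WORKING_DIR/input/shard_*.txt;
# each shard produces its own TFRecord file.
for shard in $WORKING_DIR/input/shard_*.txt; do
  name=$(basename "$shard" .txt)
  python models/official/nlp/data/create_pretraining_data.py \
    --input_file="$shard" \
    --output_file=$WORKING_DIR/output/${name}.tfrecord \
    --vocab_file=$BERT_DIR/wwm_uncased_L-24_H-1024_A-16/vocab.txt \
    --do_lower_case=True \
    --max_seq_length=512 \
    --max_predictions_per_seq=76 \
    --masked_lm_prob=0.15 \
    --random_seed=12345 \
    --dupe_factor=5
done
```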

Then, you can update the YAML configuration file, e.g.
`configs/experiments/wiki_books_pretrain.yaml`, to point to your data paths and
to set the masking-related hyperparameters so that they match the values used
when generating the pre-training data. When your data has multiple shards, you
can use `*` to include multiple files.
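
As a rough sketch of what such an override might look like (the exact keys
should be taken from `configs/experiments/wiki_books_pretrain.yaml` itself; the
paths and the `task.train_data` field names here are assumptions based on the
`bert/pretraining` experiment):

```yaml
# Sketch only -- verify field names against configs/experiments/wiki_books_pretrain.yaml.
task:
  train_data:
    input_path: 'local disk or cloud location/output/*.tfrecord'  # '*' picks up all shards
    seq_length: 512                # must match --max_seq_length above
    max_predictions_per_seq: 76    # must match --max_predictions_per_seq above
  validation_data:
    input_path: 'local disk or cloud location/output/*.tfrecord'
    seq_length: 512
    max_predictions_per_seq: 76
```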

To train BERT models of different sizes, you need to adjust

```yaml
model:
  cls_heads: [{activation: tanh, cls_token_idx: 0, dropout_rate: 0.1, inner_dim: 768, name: next_sentence, num_classes: 2}]
```

so that `inner_dim` matches the model's hidden dimension.
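
For example, with a BERT-large backbone, whose hidden size is 1024, the head's
`inner_dim` changes accordingly (a sketch of just this override; the rest of
the settings come from whichever large-model config file you use):

```yaml
# Sketch for a BERT-large backbone (hidden size 1024); only inner_dim differs
# from the base configuration shown above.
model:
  cls_heads: [{activation: tanh, cls_token_idx: 0, dropout_rate: 0.1, inner_dim: 1024, name: next_sentence, num_classes: 2}]
```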

Then, you can start the training and evaluation jobs, which run the
[`bert/pretraining`](https://github.com/tensorflow/models/blob/master/official/nlp/configs/pretraining_experiments.py#L51)
experiment:

```shell
export OUTPUT_DIR=gs://some_bucket/my_output_dir
export PARAMS=$PARAMS,runtime.distribution_strategy=tpu

python3 train.py \
  --experiment=bert/pretraining \
  --mode=train_and_eval \
  --model_dir=$OUTPUT_DIR \
  --config_file=configs/models/bert_en_uncased_base.yaml \
  --config_file=configs/experiments/wiki_books_pretrain.yaml \
  --tpu=${TPU_NAME} \
  --params_override=$PARAMS
```
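
If you do not have a TPU, the same experiment can also be launched on GPUs by
switching the distribution strategy. A hedged sketch, assuming eight local GPUs
and no other parameter overrides:

```shell
# Sketch: run the same experiment on GPUs instead of a TPU, using the
# standard Model Garden runtime options 'mirrored' and num_gpus.
export PARAMS=runtime.distribution_strategy=mirrored,runtime.num_gpus=8

python3 train.py \
  --experiment=bert/pretraining \
  --mode=train_and_eval \
  --model_dir=$OUTPUT_DIR \
  --config_file=configs/models/bert_en_uncased_base.yaml \
  --config_file=configs/experiments/wiki_books_pretrain.yaml \
  --params_override=$PARAMS
```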

Note: More examples about pre-training with TFDS datasets will come soon.