Commit 1db7588

saberkun authored and tensorflower-gardener committed
[Docs] Add BERT pre-training experiment documentation to train.md
#10074 PiperOrigin-RevId: 463195812
1 parent e22d7d8 commit 1db7588

File tree

2 files changed: +116, -1 lines changed

official/nlp/configs/experiments/wiki_books_pretrain.yaml

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@

task:
  init_checkpoint: ''
  model:
    cls_heads: [{activation: tanh, cls_token_idx: 0, dropout_rate: 0.1, inner_dim: 768, name: next_sentence, num_classes: 2}]
  train_data:
    drop_remainder: true
    global_batch_size: 512
    input_path: '[Your processed wiki data path]*,[Your processed books data path]*'
    is_training: true
    max_predictions_per_seq: 76
    seq_length: 512
    use_next_sentence_label: true
    use_position_id: false
    use_v2_feature_names: true
  validation_data:
    drop_remainder: false
    global_batch_size: 512
    input_path: '[Your processed wiki data path]-00000-of-00500,[Your processed books data path]-00000-of-00500'
    is_training: false
    max_predictions_per_seq: 76
    seq_length: 512
    use_next_sentence_label: true
    use_position_id: false
    use_v2_feature_names: true
trainer:
  checkpoint_interval: 20000
  max_to_keep: 5
  optimizer_config:
    learning_rate:
      polynomial:
        cycle: false
        decay_steps: 1000000
        end_learning_rate: 0.0
        initial_learning_rate: 0.0001
        power: 1.0
      type: polynomial
    optimizer:
      type: adamw
    warmup:
      polynomial:
        power: 1
        warmup_steps: 10000
      type: polynomial
  steps_per_loop: 1000
  summary_interval: 1000
  train_steps: 1000000
  validation_interval: 1000
  validation_steps: 64

official/nlp/docs/train.md

Lines changed: 68 additions & 1 deletion
@@ -229,6 +229,73 @@ python3 train.py \

### Pre-train a BERT from scratch

</details>

This example pre-trains a BERT model with the Wikipedia and Books datasets used
by the original BERT paper. The
[BERT repo](https://github.com/tensorflow/models/blob/master/official/nlp/data/create_pretraining_data.py)
contains detailed information about the Wikipedia dump and
[BookCorpus](https://yknzhu.wixsite.com/mbweb). The pre-training recipe is
generic, so you can apply the same recipe to your own corpus.

Please use the script
[`create_pretraining_data.py`](https://github.com/tensorflow/models/blob/master/official/nlp/data/create_pretraining_data.py),
which is branched from the [BERT research repo](https://github.com/google-research/bert)
and adapted to TF2 symbols and Python 3, to generate the processed pre-training
data.

Running the pre-training script requires an input and output directory, as well
as a vocab file. Note that `max_seq_length` needs to match the sequence length
parameter you specify when you run pre-training.

```shell
export WORKING_DIR='local disk or cloud location'
export BERT_DIR='local disk or cloud location'
python models/official/nlp/data/create_pretraining_data.py \
  --input_file=$WORKING_DIR/input/input.txt \
  --output_file=$WORKING_DIR/output/tf_examples.tfrecord \
  --vocab_file=$BERT_DIR/wwm_uncased_L-24_H-1024_A-16/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=512 \
  --max_predictions_per_seq=76 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
```
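
For a full Wikipedia plus BookCorpus run, the raw text is usually split into
many shard files rather than a single `input.txt`. The loop below is only a
sketch of one way to produce sharded TFRecords whose names line up with the
`-00000-of-00500` pattern used in `wiki_books_pretrain.yaml`; the `wiki-*.txt`
input names and the shard count are placeholders, and only the flags already
shown above are used.

```shell
# Sketch: process N pre-split raw-text shards into N TFRecord shards.
# The input naming scheme and the shard count are hypothetical placeholders.
NUM_SHARDS=500
TOTAL=$(printf "%05d" "$NUM_SHARDS")
for i in $(seq 0 $((NUM_SHARDS - 1))); do
  SHARD=$(printf "%05d" "$i")
  python models/official/nlp/data/create_pretraining_data.py \
    --input_file=$WORKING_DIR/input/wiki-${SHARD}.txt \
    --output_file=$WORKING_DIR/output/wiki_examples.tfrecord-${SHARD}-of-${TOTAL} \
    --vocab_file=$BERT_DIR/wwm_uncased_L-24_H-1024_A-16/vocab.txt \
    --do_lower_case=True \
    --max_seq_length=512 \
    --max_predictions_per_seq=76 \
    --masked_lm_prob=0.15 \
    --random_seed=12345 \
    --dupe_factor=5
done
```
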
Then, you can update the yaml configuration file, e.g.
`configs/experiments/wiki_books_pretrain.yaml`, to specify your data paths and
update the masking-related hyperparameters to match the settings you used to
generate the pre-training data. When your data has multiple shards, you can use
`*` to include multiple files.
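
If you prefer not to edit the yaml file, the same data paths can be supplied on
the command line through `--params_override`, which the training command below
already uses for the runtime settings. This is a sketch that assumes the
override keys mirror the `task.train_data.input_path` and
`task.validation_data.input_path` fields of the config above; it sets a single
glob per field, so for the comma-separated wiki-plus-books form it is simpler
to edit the yaml directly.

```shell
# Hypothetical location of the TFRecords generated in the previous step.
export DATA_DIR=$WORKING_DIR/output

# Mirror task.train_data.input_path / task.validation_data.input_path in the yaml.
export PARAMS="task.train_data.input_path=${DATA_DIR}/wiki_examples.tfrecord*"
export PARAMS="${PARAMS},task.validation_data.input_path=${DATA_DIR}/wiki_examples.tfrecord-00000-of-00500"
```
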
To train different BERT sizes, you need to adjust:

```
model:
  cls_heads: [{activation: tanh, cls_token_idx: 0, dropout_rate: 0.1, inner_dim: 768, name: next_sentence, num_classes: 2}]
```

to match the hidden dimensions.
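
For example, a BERT-large style encoder uses a hidden size of 1024, so the
head's `inner_dim` has to be raised to match. One way to do this without
touching the shared experiment yaml is sketched below: write a small override
config and pass it as an additional `--config_file` flag, assuming that later
config files override earlier ones in the same way the experiment yaml refines
the model yaml in the command below; the file name is a placeholder.

```shell
# Sketch: override the next_sentence head for a 1024-dim (BERT-large style)
# encoder. The file path is a hypothetical placeholder.
cat > configs/experiments/my_large_cls_head.yaml <<'EOF'
task:
  model:
    cls_heads: [{activation: tanh, cls_token_idx: 0, dropout_rate: 0.1, inner_dim: 1024, name: next_sentence, num_classes: 2}]
EOF
```

The extra file would then be appended to the training command, e.g.
`--config_file=configs/experiments/my_large_cls_head.yaml`.
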
Then, you can start the training and evaluation jobs, which run the
[`bert/pretraining`](https://github.com/tensorflow/models/blob/master/official/nlp/configs/pretraining_experiments.py#L51)
experiment:

```shell
export OUTPUT_DIR=gs://some_bucket/my_output_dir
export PARAMS=$PARAMS,runtime.distribution_strategy=tpu

python3 train.py \
  --experiment=bert/pretraining \
  --mode=train_and_eval \
  --model_dir=$OUTPUT_DIR \
  --config_file=configs/models/bert_en_uncased_base.yaml \
  --config_file=configs/experiments/wiki_books_pretrain.yaml \
  --tpu=${TPU_NAME} \
  --params_override=$PARAMS
```
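
The command above targets a TPU. As a rough sketch of a GPU equivalent,
assuming the standard Model Garden runtime keys
(`runtime.distribution_strategy`, `runtime.num_gpus`) and dropping the `--tpu`
flag:

```shell
# Sketch: the same experiment on a single multi-GPU host instead of a TPU.
# runtime.num_gpus is assumed from the Model Garden runtime config; adjust it
# to your machine. A separate variable avoids clobbering the TPU $PARAMS above.
export OUTPUT_DIR=/tmp/bert_pretrain_output
export GPU_PARAMS=runtime.distribution_strategy=mirrored,runtime.num_gpus=8

python3 train.py \
  --experiment=bert/pretraining \
  --mode=train_and_eval \
  --model_dir=$OUTPUT_DIR \
  --config_file=configs/models/bert_en_uncased_base.yaml \
  --config_file=configs/experiments/wiki_books_pretrain.yaml \
  --params_override=$GPU_PARAMS
```
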
Note: More examples about pre-training with TFDS datasets will come soon.
