Description
Describe the bug
After building the index for openwebtext, constructing the trainer fails (at line 161 of train.py) because no validation dataset exists. I believe this is because the lm_dataset object is built with Hugging Face's load_dataset on the named openwebtext dataset, which has no validation split. The validation_ratio quinine config option is only used when building custom_eval_datasets, not the lm_dataset object, so it is never used to hold out part of openwebtext as a validation set.
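One possible workaround, as a minimal sketch rather than the project's actual fix: carve a validation split out of the train split whenever the loaded dataset lacks one. The helper name ensure_validation_split and its seed parameter are hypothetical and do not exist in train.py. The only thing it assumes about the train split is a train_test_split(test_size=..., seed=...) method that returns "train" and "test" parts, which Hugging Face's datasets.Dataset provides.

```python
from typing import Any, Dict, Mapping


def ensure_validation_split(
    splits: Mapping[str, Any],
    validation_ratio: float,
    seed: int = 42,  # hypothetical parameter, only here for a reproducible split
) -> Dict[str, Any]:
    """Return a plain dict of splits that always has a 'validation' key.

    `splits` is expected to look like the result of load_dataset(...):
    a mapping from split name to a dataset object.
    """
    # Leave datasets that already ship a validation split (e.g. wikitext) alone.
    if "validation" in splits:
        return dict(splits)

    if not 0.0 < validation_ratio < 1.0:
        raise ValueError("validation_ratio must be strictly between 0 and 1")

    # Hold out a fraction of the train split; the held-out part comes back
    # under the key "test", so it is renamed to "validation" below.
    parts = splits["train"].train_test_split(test_size=validation_ratio, seed=seed)

    out = dict(splits)
    out["train"] = parts["train"]
    out["validation"] = parts["test"]
    return out
```

Applied to the object returned by load_dataset before the trainer is built, this would make lm_dataset['validation'] available for openwebtext, with validation_ratio taken from the existing config option. Whether the split should happen before or after tokenization and indexing is a design choice this sketch does not settle.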
To Reproduce
Replace datasets/wikitext2.yaml with datasets/openwebtext.yaml in mistral-micro.yaml (making the other artefact location changes as needed), then run:
deepspeed --num_gpus 4 --num_nodes 1 --master_addr machine1 train.py --config conf/mistral-micro.yaml --nnodes 1 --nproc_per_node 4 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z2-small-conf.json --run_id repro-bug-openweb-novalid
Expected behavior
No failure occurs at line 161 of train.py when lm_dataset['validation'] is accessed.