No validation set in openwebtext leads to failure. #195

**Describe the bug**
After building the index for openwebtext, building the trainer fails (at line 161 of `train.py`) because no `validation` dataset is constructed. I believe this is because the `lm_dataset` object is built with Hugging Face's `load_dataset` on the named dataset `openwebtext`, which has no validation split. The `validation_ratio` quinine config option is only used when building the `custom_eval_datasets`, not the `lm_dataset` object, so it is never used to split off part of `openwebtext` as a validation set.
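A possible workaround, sketched under assumptions: if `lm_dataset` is a `datasets.DatasetDict` with only a `train` split, part of it could be held out with `Dataset.train_test_split` before the trainer is built. The helper name `ensure_validation_split` and the default `seed` are hypothetical and not part of `train.py`; only `train_test_split` is an existing `datasets` call.

```python
def ensure_validation_split(lm_dataset, validation_ratio, seed=42):
    """Return `lm_dataset` with a 'validation' split, creating one if missing.

    Hypothetical helper, not part of train.py. `lm_dataset` is assumed to be a
    `datasets.DatasetDict` (or any mapping of split name -> `datasets.Dataset`).
    `seed=42` is an arbitrary default chosen for this sketch.
    """
    if "validation" in lm_dataset:
        # Datasets such as wikitext2 already ship a validation split.
        return lm_dataset

    # Dataset.train_test_split returns a mapping with 'train' and 'test' keys;
    # a float test_size is interpreted as a proportion of the rows.
    split = lm_dataset["train"].train_test_split(
        test_size=validation_ratio, seed=seed
    )

    # Copy into the same mapping type (DatasetDict is a dict subclass),
    # leaving the original object unmodified.
    out = type(lm_dataset)(dict(lm_dataset))
    out["train"] = split["train"]
    out["validation"] = split["test"]
    return out
```

Something like `lm_dataset = ensure_validation_split(lm_dataset, validation_ratio)` would then need to run before `lm_dataset['validation']` is accessed, with `validation_ratio` taken from the existing quinine config option.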

**To Reproduce**
In `mistral-micro.yaml`, replace `datasets/wikitext2.yaml` with `datasets/openwebtext.yaml` (adjusting the other artefact locations as needed), then run:

    deepspeed --num_gpus 4 --num_nodes 1 --master_addr machine1 train.py --config conf/mistral-micro.yaml --nnodes 1 --nproc_per_node 4 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z2-small-conf.json --run_id repro-bug-openweb-novalid

**Expected behavior**
Building the trainer should not fail at line 161 of `train.py`, where `lm_dataset['validation']` is accessed.
