diff --git a/training/README.md b/training/README.md
index b0badcc..bb5ffd9 100644
--- a/training/README.md
+++ b/training/README.md
@@ -31,4 +31,37 @@ torchrun --nnodes=24 --node_rank=0 --nproc_per_node=8 \
 --report_interval=100 \
 --checkpoint_interval=20000 \
 ```
-To reproduce the exact model as Bamba-9B, you can find the training configs [here](data/README.md).
+To reproduce the exact Bamba-9B model, or to train with your own data or models, further configuration details can be found [here](data/README.md).
+
+## Continuing Training
+
+Training can be continued from a completed run's final saved checkpoint in one of three ways:
+ 1. Restore the model state only.
+ 2. Restore the model and optimizer state.
+ 3. Restore the model, optimizer, and dataloader state.
+
+If the completed run was configured with `--ckpt_save_path="/path/to/prev/ckpt"`, then a single
+`consolidated.00.pth` file containing only the final model weights is created under `"/path/to/prev/ckpt/pth"`, while
+sharded checkpoint files, which also capture the optimizer and dataloader state, exist under
+`"/path/to/prev/ckpt/checkpoints"`; a sketch of this layout follows the list below. The three scenarios above are then achieved by specifying:
+ 1. **Model Only**: `--ckpt_load_path="/path/to/prev/ckpt/pth/consolidated.00.pth"`
+ 2. **Model + Optimizer**: `--ckpt_load_path="/path/to/prev/ckpt/"`
+ 3. **Model + Optimizer + Dataloader**: `--ckpt_load_path="/path/to/prev/ckpt/" --resuming_dataset`
+
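+The save layout described above then looks roughly like this (sharded checkpoint file names vary by run configuration and are omitted):
+```
+/path/to/prev/ckpt/
+├── pth/
+│   └── consolidated.00.pth   # final model weights only
+└── checkpoints/              # sharded model, optimizer, and dataloader state
+    └── ...
+```
+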
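+For example, to continue a run with the full model, optimizer, and dataloader state restored, relaunch the original command with the load flags added. A minimal sketch, where `main_training.py` is a placeholder for the actual entry point and all other flags are assumed to match the original run:
+```bash
+# main_training.py is a placeholder; use the same entry point and flags as the original run
+torchrun --nnodes=24 --node_rank=0 --nproc_per_node=8 \
+    main_training.py \
+    --ckpt_load_path="/path/to/prev/ckpt/" \
+    --resuming_dataset
+```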