training/README.md (18 changes: 17 additions & 1 deletion)

@@ -31,4 +31,20 @@
torchrun --nnodes=24 --node_rank=0 --nproc_per_node=8 \
--report_interval=100 \
--checkpoint_interval=20000 \
```
To reproduce the exact Bamba-9B model, or to train with your own data or models, further config details are [here](data/README.md).

## Continuing Training

Training can be continued from a completed run's final saved checkpoint in one of three ways:
1. Restore the model state only.
2. Restore the model and optimizer state.
3. Restore the entire model, optimizer, and dataloader state.

If the completed run was configured with `--ckpt_save_path="/path/to/prev/ckpt"`, then a single
`consolidated.00.pth` file containing only the final model weights is created under `"/path/to/prev/ckpt/pth"`, while
sharded checkpoint files that also capture the optimizer and dataloader state exist under
`"/path/to/prev/ckpt/checkpoints"`. The three scenarios above are then achieved by specifying:
1. **Model Only**: `--ckpt_load_path="/path/to/prev/ckpt/pth/consolidated.00.pth"`
2. **Model + Optimizer**: `--ckpt_load_path="/path/to/prev/ckpt/"`
3. **Model + Optimizer + Dataloader**: `--ckpt_load_path="/path/to/prev/ckpt/" --resuming_dataset`
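
As a concrete illustration, here is a sketch of a resume invocation for each scenario. The entry script name (`main_training.py`) and the new `--ckpt_save_path` value are placeholders, not taken from this repo; the `torchrun` flags mirror the training example above.

```bash
# Checkpoint layout after a run with --ckpt_save_path="/path/to/prev/ckpt":
#   /path/to/prev/ckpt/pth/consolidated.00.pth  <- final model weights only
#   /path/to/prev/ckpt/checkpoints/             <- sharded model + optimizer + dataloader state

# 1. Model only: start from the consolidated weights.
torchrun --nnodes=24 --node_rank=0 --nproc_per_node=8 \
    main_training.py \
    --ckpt_load_path="/path/to/prev/ckpt/pth/consolidated.00.pth" \
    --ckpt_save_path="/path/to/new/ckpt"

# 2. Model + optimizer: load from the sharded checkpoint directory.
torchrun --nnodes=24 --node_rank=0 --nproc_per_node=8 \
    main_training.py \
    --ckpt_load_path="/path/to/prev/ckpt/" \
    --ckpt_save_path="/path/to/new/ckpt"

# 3. Model + optimizer + dataloader: also resume the dataset position.
torchrun --nnodes=24 --node_rank=0 --nproc_per_node=8 \
    main_training.py \
    --ckpt_load_path="/path/to/prev/ckpt/" \
    --ckpt_save_path="/path/to/new/ckpt" \
    --resuming_dataset
```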