training/README.md (18 changes: 17 additions & 1 deletion)

@@ -31,4 +31,20 @@
torchrun --nnodes=24 --node_rank=0 --nproc_per_node=8 \
--report_interval=100 \
--checkpoint_interval=20000 \
```
To reproduce the exact Bamba-9B model, or to train with your own data or models, further config details are [here](data/README.md).

## Continuing Training

Training can be continued from a completed run's final saved checkpoint in one of three ways:
1. Restore the model state only.
2. Restore the model and optimizer state.
3. Restore the entire model, optimizer, and dataloader state.

If the completed run was configured with `--ckpt_save_path="/path/to/prev/ckpt"`, then a single
`consolidated.00.pth` file containing only the final model weights is created under `"/path/to/prev/ckpt/pth"`, while
sharded checkpoint files that also capture the optimizer and dataloader state exist under
`"/path/to/prev/ckpt/checkpoints"`. The three scenarios above are then achieved by specifying:
1. **Model Only**: `--ckpt_load_path="/path/to/prev/ckpt/pth/consolidated.00.pth"`
2. **Model + Optimizer**: `--ckpt_load_path="/path/to/prev/ckpt/"`
3. **Model + Optimizer + Dataloader**: `--ckpt_load_path="/path/to/prev/ckpt/" --resuming_dataset`
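
As a concrete illustration, here is a sketch of a resume invocation for each scenario. The entry script name (`main_training.py`) and the new `--ckpt_save_path` value are placeholders, not taken from this repo; the `torchrun` flags mirror the training example above.

```bash
# Checkpoint layout after a run with --ckpt_save_path="/path/to/prev/ckpt":
#   /path/to/prev/ckpt/pth/consolidated.00.pth  <- final model weights only
#   /path/to/prev/ckpt/checkpoints/             <- sharded model + optimizer + dataloader state

# 1. Model only: start from the consolidated weights.
torchrun --nnodes=24 --node_rank=0 --nproc_per_node=8 \
    main_training.py \
    --ckpt_load_path="/path/to/prev/ckpt/pth/consolidated.00.pth" \
    --ckpt_save_path="/path/to/new/ckpt"

# 2. Model + optimizer: load from the sharded checkpoint directory.
torchrun --nnodes=24 --node_rank=0 --nproc_per_node=8 \
    main_training.py \
    --ckpt_load_path="/path/to/prev/ckpt/" \
    --ckpt_save_path="/path/to/new/ckpt"

# 3. Model + optimizer + dataloader: also resume the dataset position.
torchrun --nnodes=24 --node_rank=0 --nproc_per_node=8 \
    main_training.py \
    --ckpt_load_path="/path/to/prev/ckpt/" \
    --ckpt_save_path="/path/to/new/ckpt" \
    --resuming_dataset
```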