Commit 1698fa0

update
1 parent 00327f8 commit 1698fa0

1 file changed: +7 -5 lines changed

train/tr11-176B-ml/README.md

Lines changed: 7 additions & 5 deletions
@@ -45,7 +45,7 @@ Hardware:
 
 Software:
 
-- [Megatron-DeepSpeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) @ `ds_ckpt_reshape-with-layer-norm-auto-sync` PR branch
+- [Megatron-DeepSpeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) @ `ds_ckpt_reshape` PR branch
 - [DeepSpeed](https://github.com/microsoft/DeepSpeed) @ olruwase/elastic-ckpt-refresh PR branch
 - [PyTorch](https://github.com/pytorch/pytorch)-1.11 w/ CUDA-11.5
 - [apex](https://github.com/NVIDIA/apex) @ master
@@ -679,15 +679,17 @@ So say you want to switch from 48 to 24 nodes.
 1. allocate a new cpu node
 
 ```
-srun --pty --account=six@cpu --nodes=1 --ntasks=1 --partition=cpu_p1 --cpus-per-task=40 --time 6:00:00 --hint=nomultithread --tasks-per-node=1 bash
+srun --pty --account=six@cpu --nodes=1 --ntasks=1 --partition=cpu_p1 --cpus-per-task=40 --time 6:00:00 --hint=nomultithread --tasks-per-node=1 bash --rcfile $six_ALL_CCFRWORK/start-tr11-176B-ml
+
 ```
 
-2. convert the checkpoint, e.g. for `global_step90751`
+2. convert the checkpoint, e.g. for `global_step94767`
 
 ```
+cd $six_ALL_CCFRWORK/code/tr11-176B-ml/Megatron-DeepSpeed-checkpoint-reshape
 /usr/bin/time -v python tools/convert_checkpoint/ds_to_universal.py \
---input_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step90751 \
---output_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step90751_universal \
+--input_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step94767 \
+--output_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step94767_universal \
 --num_extract_workers 10 --num_merge_workers 4
 ```
 
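Because this commit only swaps the hard-coded step number (`global_step90751` to `global_step94767`) in the conversion step, the same command could be derived from a single step variable. The following is a minimal dry-run sketch, not part of the README: the function name `build_convert_cmd` and the `/tmp/scratch` fallback for `$six_ALL_CCFRSCRATCH` are assumptions made so the snippet runs outside the cluster. It only prints the command and does not execute the converter.

```shell
#!/bin/sh
# Hypothetical helper (not in the README): derive the ds_to_universal.py
# command line from a step number instead of hard-coding it twice.

# Assumption: fall back to /tmp/scratch when the cluster variable is unset,
# so this sketch can be tried outside the training environment.
ckpt_root="${six_ALL_CCFRSCRATCH:-/tmp/scratch}/checkpoints/tr11-176B-ml/checkpoints/main"

build_convert_cmd() {
    step="$1"
    # Input is the regular checkpoint folder; output gets the _universal suffix,
    # mirroring the paths shown in the diff above.
    printf '%s\n' "/usr/bin/time -v python tools/convert_checkpoint/ds_to_universal.py --input_folder $ckpt_root/global_step$step --output_folder $ckpt_root/global_step${step}_universal --num_extract_workers 10 --num_merge_workers 4"
}

# Dry run: print the command for the step used in this commit.
build_convert_cmd 94767
```

Once the printed paths look right, the command would be run from the `Megatron-DeepSpeed-checkpoint-reshape` directory, as the updated README does with its `cd` line.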