@@ -45,7 +45,7 @@ Hardware:

  Software:

- - [Megatron-DeepSpeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) @ `ds_ckpt_reshape-with-layer-norm-auto-sync` PR branch
+ - [Megatron-DeepSpeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) @ `ds_ckpt_reshape` PR branch
  - [DeepSpeed](https://github.com/microsoft/DeepSpeed) @ olruwase/elastic-ckpt-refresh PR branch
  - [PyTorch](https://github.com/pytorch/pytorch)-1.11 w/ CUDA-11.5
  - [apex](https://github.com/NVIDIA/apex) @ master
@@ -679,15 +679,17 @@ So say you want to switch from 48 to 24 nodes.
  1. allocate a new cpu node

  ```
- srun --pty --account=six@cpu --nodes=1 --ntasks=1 --partition=cpu_p1 --cpus-per-task=40 --time 6:00:00 --hint=nomultithread --tasks-per-node=1 bash
+ srun --pty --account=six@cpu --nodes=1 --ntasks=1 --partition=cpu_p1 --cpus-per-task=40 --time 6:00:00 --hint=nomultithread --tasks-per-node=1 bash --rcfile $six_ALL_CCFRWORK/start-tr11-176B-ml
+
  ```

- 2. convert the checkpoint, e.g. for `global_step90751`
+ 2. convert the checkpoint, e.g. for `global_step94767`

  ```
+ cd $six_ALL_CCFRWORK/code/tr11-176B-ml/Megatron-DeepSpeed-checkpoint-reshape
  /usr/bin/time -v python tools/convert_checkpoint/ds_to_universal.py \
- --input_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step90751 \
- --output_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step90751_universal \
+ --input_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step94767 \
+ --output_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step94767_universal \
  --num_extract_workers 10 --num_merge_workers 4
  ```
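Changing the example step in this hunk meant editing the step name in two places (`--input_folder` and `--output_folder`). A minimal sketch of deriving both paths from one argument might look like the following. The function name `universal_cmd` and its two arguments are hypothetical and not part of the repository. It only prints the command and uses the same script path and worker flags as the documented example.

```shell
# Hypothetical helper (not part of Megatron-DeepSpeed): print the conversion
# command for one checkpoint step so the step name is written only once.
#   $1 = checkpoint root directory (assumed to contain the global_step* folders)
#   $2 = step folder name, e.g. global_step94767
universal_cmd() {
    printf 'python tools/convert_checkpoint/ds_to_universal.py --input_folder %s/%s --output_folder %s/%s_universal --num_extract_workers 10 --num_merge_workers 4\n' \
        "$1" "$2" "$1" "$2"
}
```

Calling `universal_cmd "$six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main" global_step94767` prints a command line that can be inspected before being run from the Megatron-DeepSpeed checkout.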