
Commit b01e3af

Checkpoint reshaping
1 parent a82c71d commit b01e3af

1 file changed: +54 −2 lines

train/tr11-176B-ml/README.md

```
@@ -45,8 +45,8 @@ Hardware:

 Software:

-- [Megatron-DeepSpeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) @ master / BigScience fork - currently using `layer-norm-auto-syn` PR branch
-- [DeepSpeed](https://github.com/microsoft/DeepSpeed) @ master (soon) at the moment 93e9307d609620943565e639f30ef15513c76f4f
+- [Megatron-DeepSpeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed) @ `ds_ckpt_reshape-with-layer-norm-auto-sync` PR branch
+- [DeepSpeed](https://github.com/microsoft/DeepSpeed) @ olruwase/elastic-ckpt-refresh PR branch
 - [PyTorch](https://github.com/pytorch/pytorch)-1.11 w/ CUDA-11.5
 - [apex](https://github.com/NVIDIA/apex) @ master
```

@@ -649,6 +649,58 @@ NHIDDEN=14336; NLAYERS=70; SEQ_LEN=2048; VOCAB_SIZE=250680; python -c "h=$NHIDDE

BF16 Transformer block size: 4.59GB, the rest is: 6.75GB, total 328.34GB

### Checkpoint reshaping

It's not trivial to switch from one 3D topology to another due to the TP and DP logic of DeepSpeed. So we developed a special mechanism called the universal checkpoint, which converts a checkpoint saved under whatever topology was last used into a universal checkpoint that stores each weight and optimizer state as a separate file. This is done after carefully merging the weights split across TP ranks (some weights are averaged, some are concatenated on the first dimension and some on the second), and then the DP ZeRO sharding is unsharded. This universal checkpoint can now be used to start training with any new topology or to create an HF Transformers checkpoint. Note that all weights are kept in fp32, so no data is lost.
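
To make the merging rules concrete, here is a minimal sketch of the per-weight TP merge logic described above. It is an illustration only: the name-based dispatch and the choice of which weights are column- vs row-parallel are assumptions, not the converter's actual code.

```
# Sketch: merge the TP slices of one parameter into a full fp32 tensor.
# Assumed rules (illustrative, not the converter's actual dispatch):
#   - replicated params (e.g. layernorm) are averaged across TP ranks
#   - column-parallel weights are concatenated on dim 0
#   - row-parallel weights are concatenated on dim 1
import torch

def merge_tp_slices(name, slices):
    slices = [s.float() for s in slices]  # universal checkpoint keeps fp32
    if "layernorm" in name:
        return torch.stack(slices).mean(dim=0)   # average replicated weights
    if "dense_h_to_4h" in name:
        return torch.cat(slices, dim=0)          # column-parallel: dim 0
    return torch.cat(slices, dim=1)              # row-parallel: dim 1

# e.g. TP=4: each rank holds an 8x2 slice of a row-parallel 8x8 weight
full = merge_tp_slices("dense_4h_to_h.weight",
                       [torch.ones(8, 2) for _ in range(4)])
print(full.shape)  # torch.Size([8, 8])
```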

As this is all new, it currently requires that the code run on the following 2 branches:

- `microsoft/DeepSpeed|olruwase/elastic-ckpt-refresh`
- `bigscience-workshop/Megatron-DeepSpeed|ds_ckpt_reshape-with-layer-norm-auto-sync`

The latter is really `bigscience-workshop/Megatron-DeepSpeed|ds_ckpt_reshape`, but since we also have another band-aid branch in use, it is merged with `layer-norm-auto-sync`.
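
Getting onto these branches could look roughly like this, assuming you already have clones of the two repos and the branches exist on the usual remotes:

```
cd Megatron-DeepSpeed
git fetch origin && git checkout ds_ckpt_reshape-with-layer-norm-auto-sync

cd ../DeepSpeed
git fetch origin && git checkout olruwase/elastic-ckpt-refresh
```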
662+
663+
So say you want to switch from 48 to 24 nodes.
664+
665+
1. allocate a new cpu node

```
srun --pty --account=six@cpu --nodes=1 --ntasks=1 --partition=cpu_p1 --cpus-per-task=40 --time 6:00:00 --hint=nomultithread --tasks-per-node=1 bash
```

2. Convert the checkpoint, e.g. for `global_step90751`:
672+
673+
```
674+
/usr/bin/time -v python tools/convert_checkpoint/ds_to_universal.py \
675+
--input_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step90751 \
676+
--output_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step90751_universal \
677+
--num_extract_workers 10 --num_merge_workers 4
678+
```
679+
680+
it takes about 50min for 176B
681+

3. Now edit the normal slurm script (a sketch of the edits follows this list):

   a. change its topology to the desired one

   b. add `--universal-checkpoint` to the script

   c. start the slurm job normally with the edited script
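
As an illustration, the relevant edits could look like the excerpt below. The variable names follow the usual tr11 slurm scripts, but the concrete 24-node topology values are assumptions, just one valid way to split 24 nodes:

```
#SBATCH --nodes=24       # was 48

TP_SIZE=4                # unchanged
PP_SIZE=12               # unchanged
# DP is derived: 24 nodes * 8 gpus / (TP_SIZE * PP_SIZE) = 4 (was 8)

# and add to the script's args:
#   --universal-checkpoint \
```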

You should now be running with the new topology. It's expected that a tiny difference will be seen in the lm loss, due to the averaging of TP slices.

4. Using a kill-switch or any other way, save a new checkpoint, which will be a normal Megatron-DeepSpeed checkpoint.
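
For example, assuming the job was launched with Megatron-DeepSpeed's `--kill-switch-path <file>` argument (as the tr11 scripts do), saving and exiting is just a matter of creating that file; the path below is illustrative:

```
# training saves a checkpoint and exits cleanly once this file appears
touch $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/kill-switch   # illustrative path
```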

5. Remove `--universal-checkpoint` from the script.

6. Resume training normally.

Stages 5-6 are important because, in addition to `latest`, there is currently a `latest-universal` tag, which is generated by `ds_to_universal.py` and is not updated by the main training. So if you stop and restart while still having the `--universal-checkpoint` arg in the slurm script, it'll restart from the same checkpoint as the first time, and we don't want that.
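
For concreteness, assuming the usual DeepSpeed layout where each tag is a one-line file naming a checkpoint folder, the stale-tag situation looks roughly like this (the step numbers are illustrative):

```
cd $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main
cat latest              # advanced by every normal checkpoint save
global_step90800
cat latest-universal    # written once by ds_to_universal.py, never advanced
global_step90751_universal
```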

So basically the conversion to universal is a transitional process: it takes just a single step, plus saving a new checkpoint in the new topology, which is no longer universal. As you can tell, converting to the universal checkpoint is a very slow and expensive process, and we can't afford it at every checkpoint save/load point.

### Times

- 1 train iteration ~100sec
