BF16 Transformer block size: 4.59GB, the rest is: 6.75GB, total 328.34GB
```
### Checkpoint reshaping
It's not trivial to switch from one 3D topology to another due to the TP and DP logic of DeepSpeed. So we developed a special mechanism called the universal checkpoint, which converts whatever topology the last checkpoint was created with into a universal checkpoint that stores each weight and optimizer state as a separate file. This is done after carefully merging the weights split across TP ranks (some weights are averaged, some are concatenated on the first dimension and some on the second), and then the DP ZeRO sharding gets unsharded. This universal checkpoint can then be used to start any new topology or to create an HF Transformers checkpoint. Note that all weights are in fp32, so no data is lost.
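The merging rules above can be sketched in plain Python, with nested lists standing in for tensors (the function names are illustrative, not Megatron-DeepSpeed's actual API):

```python
# Sketch of merging TP-sharded weights into one "universal" tensor.
# Replicated params (e.g. layernorm) are averaged; other weights are
# concatenated on the first or the second dimension, depending on how
# they were split across TP ranks.

def average(shards):
    # elementwise mean across TP ranks (for replicated 1D params)
    return [sum(vals) / len(vals) for vals in zip(*shards)]

def cat_dim0(shards):
    # stack the row blocks of each rank's shard (split on dim 0)
    return [row for shard in shards for row in shard]

def cat_dim1(shards):
    # join each row across ranks (split on dim 1)
    return [sum(rows, []) for rows in zip(*shards)]

# two TP ranks, each holding a 2x2 shard of a weight
rank0 = [[1, 2], [3, 4]]
rank1 = [[5, 6], [7, 8]]

print(cat_dim0([rank0, rank1]))  # [[1, 2], [3, 4], [5, 6], [7, 8]]
print(cat_dim1([rank0, rank1]))  # [[1, 2, 5, 6], [3, 4, 7, 8]]
print(average([[1.0, 3.0], [3.0, 5.0]]))  # [2.0, 4.0]
```

The averaging of replicated params is also why a tiny lm-loss difference is expected after reshaping, as noted below.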
As this is all new, currently this requires that the code runs on the following 2 branches.
The latter is really `bigscience-workshop/Megatron-DeepSpeed|ds_ckpt_reshape`, but since we also have another bandaid branch in use, it's merged with `layer-norm-auto-sync`.
c. start the slurm job normally with the edited script
You should now be running with the new topology. It's expected that a tiny difference will be seen in lm loss, due to the averaging of TP slices.
4. using a kill-switch or any other means, save a new checkpoint, which will be a normal Megatron-DeepSpeed checkpoint
5. remove `--universal-checkpoint` from the script
6. resume training normally
Stages 5-6 are important because currently there is a `latest-universal` tag in addition to `latest`, and it will not be updated by the main training: it's generated only by `ds_to_universal.py`. So if you stop and restart while still having the `--universal-checkpoint` arg in the slurm script, it'll restart from the same checkpoint as the first time, and we don't want that.
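A minimal sketch of why this matters, assuming DeepSpeed-style tag files that record which checkpoint step to resume from (directory names and step numbers here are made up):

```shell
mkdir -p ckpts
# `latest` is updated on every normal save; `latest-universal` is only
# ever written by ds_to_universal.py and then stays frozen:
echo global_step95000 > ckpts/latest
echo global_step95000_universal > ckpts/latest-universal

# training saves a new checkpoint -> only `latest` moves forward
echo global_step96000 > ckpts/latest

cat ckpts/latest            # global_step96000
cat ckpts/latest-universal  # still global_step95000_universal
```

With `--universal-checkpoint` still set, a restart would follow the frozen `latest-universal` tag instead of the advancing `latest` one, hence steps 5-6.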
So basically the conversion to universal is a transitional process: it takes just a single step plus saving a new checkpoint in the new topology, which is no longer universal. As you can tell, converting to the universal checkpoint is a very slow and expensive process, and we can't afford it on every checkpoint save/load.
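For reference, the conversion itself is a one-off offline run of the `ds_to_universal.py` script mentioned above. A sketch of the invocation, where the folder names are placeholders and the exact arguments should be checked with `--help` on your branch:

```shell
python ds_to_universal.py \
    --input_folder checkpoints/global_step95000 \
    --output_folder checkpoints/global_step95000_universal
```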