Commit 00327f8

update

1 parent b01e3af commit 00327f8


train/tr11-176B-ml/README.md

Lines changed: 14 additions & 0 deletions
@@ -649,6 +649,20 @@ NHIDDEN=14336; NLAYERS=70; SEQ_LEN=2048; VOCAB_SIZE=250680; python -c "h=$NHIDDE
BF16 Transformer block size: 4.59GB, the rest is: 6.75GB, total 328.34GB
```

### Important checkpoints

The first epoch finished at:

```
[default7]: iteration 85376/ 115311 | consumed samples: 158692272 | consumed tokens: 325001773056 | elapsed time per iteration (s): 104.70 | learning rate: 1.150E-05 | global batch size: 2048 | lm loss: 1.979558E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 19.561 | TFLOPs: 149.77
```

So if someone wants the nearest checkpoint that is guaranteed to have seen only one pass of the data, it is the one at iteration 85k.
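
As a quick sanity check (a back-of-the-envelope calculation, not something taken from the training logs), the consumed tokens figure is exactly consumed samples × SEQ_LEN:

```
SEQ_LEN=2048; python -c "print(158692272*$SEQ_LEN)"
# 325001773056 -- matches 'consumed tokens' in the log line above
```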
### Checkpoint reshaping

It's not trivial to switch from one 3D topology to another due to the TP and DP logic of DeepSpeed. So we developed a special mechanism called a universal checkpoint, which converts a checkpoint created with whatever topology was last used into a universal checkpoint that stores each weight and optimizer state as a separate file. This is done after carefully merging the weights split across TP ranks (some weights are averaged, some are concatenated on the first dimension and some on the second), and then the DP ZeRO sharding gets unsharded. This universal checkpoint can then be used to start training with any new topology or to create an HF Transformers checkpoint. Note that all weights are kept in fp32, so no data is lost.
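
To make the merging rules concrete, here is a minimal sketch of the per-weight TP merge described above. It is an illustration only, not the actual conversion code from the repo; the function name and the `rule` values are hypothetical, and which rule applies to which parameter is determined by the model definition in the real tooling.

```
import torch

def merge_tp_shards(shards, rule):
    """Merge the per-TP-rank shards of one parameter into the full fp32 tensor.

    rule is one of (illustrative names):
      - "replicate": the parameter is duplicated on every TP rank (e.g. layernorms),
        so the shards are averaged;
      - "cat0": the parameter is split on its first dimension (column-parallel
        weights, embeddings), so the shards are concatenated on dim 0;
      - "cat1": the parameter is split on its second dimension (row-parallel
        weights), so the shards are concatenated on dim 1.
    """
    shards = [s.float() for s in shards]  # everything is kept in fp32
    if rule == "replicate":
        return torch.stack(shards).mean(dim=0)
    if rule == "cat0":
        return torch.cat(shards, dim=0)
    if rule == "cat1":
        return torch.cat(shards, dim=1)
    raise ValueError(f"unknown merge rule: {rule}")
```

After the TP merge, unsharding the DP ZeRO state is conceptually just stitching each DP rank's fp32 partition of a parameter back into one contiguous tensor, which is why the resulting universal checkpoint can seed any new topology.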
