The latter is really `bigscience-workshop/Megatron-DeepSpeed|ds_ckpt_reshape`, but since we also have another bandaid branch in use, it's merged with `layer-norm-auto-sync`.
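For reference, combining the two bandaid branches described above would look roughly like this (a sketch only; the branch names come from the text, but the exact workflow is an assumption):

```shell
# Sketch: fold the layer-norm bandaid branch into the checkpoint-reshape branch.
# Branch names are from the note above; the checkout/merge flow is illustrative.
git checkout ds_ckpt_reshape
git merge layer-norm-auto-sync
```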
File: `train/tr11-176B-ml/chronicles.md`
So really the only new thing I added was `NCCL_ASYNC_ERROR_HANDLING=1`.
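For context, a minimal sketch of how this variable is typically set in the launcher environment (the launch line shown is illustrative, not the exact script from this log):

```shell
# NCCL_ASYNC_ERROR_HANDLING=1 makes torch.distributed watch NCCL collectives
# asynchronously: a failed or timed-out collective crashes the process instead
# of hanging it, so SLURM (with KillOnBadExit=1) can kill and requeue the job.
export NCCL_ASYNC_ERROR_HANDLING=1

# illustrative launch line; the real training script and args are elided
# srun --kill-on-bad-exit=1 python pretrain_gpt.py ...
echo "NCCL_ASYNC_ERROR_HANDLING=$NCCL_ASYNC_ERROR_HANDLING"
```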
### 2022-06-28 finished epoch 1 training
We finished going through all of the data on June 28 - finished a few days early - Yay!
We are going to give it a few more days on epoch 2 - while we still have the resources - to make the model even better.
### 2022-07-04 switched from 48 to 24 nodes
Our allocation of 52 nodes has expired, so we switched the training to 24 nodes after converting to the universal checkpoint, which took about 45 min.
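The universal checkpoint conversion reshapes the sharded optimizer and parameter state into a topology-independent form, which is what lets training resume on a different node count. A rough sketch of the flow (the script path and flags below are assumptions for illustration, not taken from this log):

```shell
# Hypothetical sketch of resuming at a new topology via a universal checkpoint;
# the script name and flags are assumptions for illustration only.
CKPT=checkpoints/global_step95000

# 1. convert the latest sharded checkpoint to the universal format
python tools/convert_checkpoint/ds_to_universal.py \
    --input_folder  "$CKPT" \
    --output_folder "${CKPT}_universal"

# 2. restart training on the smaller allocation, pointing at the converted
#    checkpoint and telling the trainer to load it as a universal checkpoint
# srun --nodes=24 python pretrain_gpt.py ... --universal-checkpoint
```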