
Commit b43180a

various updates
1 parent 95f8a8b commit b43180a

2 files changed: +83 -4 lines changed


train/tr11-176B-ml/README.md

Lines changed: 6 additions & 4 deletions
@@ -8,7 +8,11 @@ Auto-regressive objective using regular Megatron-LM GPT2 language model w/ multi

 Model size: 176B

-The training started on March 11, 2022 11:42am PST
+Brief chronology:
+
+1. The training started on March 11, 2022 11:42am PST
+2. Epoch one finished on June 28, 2022 (iteration 85376), and then we continued a bit more as we still had the resources
+3. The training switched from 48 to 24 nodes on July 4, 2022 9pm PST

 To calculate how many days are left to the 341B-token goal, take the current consumed tokens and feed it to: (e.g. with 192755367936)
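The helper command the README refers to isn't shown in this hunk. As a rough illustration of the arithmetic involved, here is a minimal bash sketch; the `TOKENS_PER_DAY` figure is an assumed placeholder for the average training throughput, not a value taken from this training's logs:

```bash
# Hypothetical sketch: days remaining until the 341B-token goal.
# TOKENS_PER_DAY is an assumed average throughput, for illustration only.
CONSUMED_TOKENS=192755367936
GOAL_TOKENS=341000000000
TOKENS_PER_DAY=2000000000

# Integer division is enough for a rough day count.
echo $(( (GOAL_TOKENS - CONSUMED_TOKENS) / TOKENS_PER_DAY ))
```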

@@ -670,9 +674,7 @@ It's not trivial to switch from one 3D topology to another due to TP and DP logi

 As this is all new, currently this requires that the code runs on the following 2 branches:
 - `microsoft/DeepSpeed|olruwase/elastic-ckpt-refresh`
-- `bigscience-workshop/Megatron-DeepSpeed|ds_ckpt_reshape-with-layer-norm-auto-sync`
-
-The latter is really `bigscience-workshop/Megatron-DeepSpeed|ds_ckpt_reshape` but since we also have another bandaid branch that is being used it's merged with `layer-norm-auto-sync`.
+- `bigscience-workshop/Megatron-DeepSpeed|ds_ckpt_reshape`

 So say you want to switch from 48 to 24 nodes.
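The hunk above trims the branch list used for the topology-reshaping workflow. As a sketch of why reshaping is needed at all: with the tensor- and pipeline-parallel grid held fixed, changing the node count changes only the data-parallel degree, so the checkpoint's optimizer shards have to be regrouped. The `TP=4`, `PP=12`, and 8-GPUs-per-node figures below are assumptions for illustration, not values stated in this diff:

```bash
# Illustrative only: how the data-parallel degree (DP) falls out of the node
# count for an assumed fixed TP x PP grid (TP=4, PP=12, 8 GPUs per node).
TP=4; PP=12; GPUS_PER_NODE=8
for NODES in 48 24; do
    WORLD_SIZE=$(( NODES * GPUS_PER_NODE ))
    DP=$(( WORLD_SIZE / (TP * PP) ))
    echo "nodes=$NODES world_size=$WORLD_SIZE -> DP=$DP"
done
```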

train/tr11-176B-ml/chronicles.md

Lines changed: 77 additions & 0 deletions
@@ -615,3 +615,80 @@ KillOnBadExit=1

So really the only new thing I added was `NCCL_ASYNC_ERROR_HANDLING=1`


### 2022-06-28 finished epoch 1 training

We finished going through all of the data on June 28 - finished a few days early - Yay!

We are going to give it a few more days with epoch 2 - while we still have the resources - and make the model even better.


### 2022-07-04 switched from 48 to 24 nodes

Our allocation of 52 nodes has expired, so we switched the training to 24 nodes after going through a conversion to the universal checkpoint, which took about 45 minutes.

Using these 2 branches:
- `microsoft/DeepSpeed|olruwase/elastic-ckpt-refresh`
- `bigscience-workshop/Megatron-DeepSpeed|ds_ckpt_reshape`

The following was done:

1. allocated a new cpu node

```
srun --pty --account=six@cpu --nodes=1 --ntasks=1 --partition=cpu_p1 --cpus-per-task=40 --time 6:00:00 --hint=nomultithread --tasks-per-node=1 bash --rcfile $six_ALL_CCFRWORK/start-tr11-176B-ml
```

2. converted the checkpoint `global_step94767` (the last one on 48 nodes) to the universal checkpoint format

```
cd $six_ALL_CCFRWORK/code/tr11-176B-ml/Megatron-DeepSpeed-checkpoint-reshape
/usr/bin/time -v python tools/convert_checkpoint/ds_to_universal.py \
    --input_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step94767 \
    --output_folder $six_ALL_CCFRSCRATCH/checkpoints/tr11-176B-ml/checkpoints/main/global_step94767_universal \
    --num_extract_workers 10 --num_merge_workers 4
```

Took 47 minutes:

```
User time (seconds): 9864.93
System time (seconds): 6987.00
Percent of CPU this job got: 586%
Elapsed (wall clock) time (h:mm:ss or m:ss): 47:55.47
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 65719976
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 17097777
Voluntary context switches: 1082873
Involuntary context switches: 526464
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

3. edited the normal slurm script (see the sketch after this list):

   a. changed its topology to 24 nodes

   b. added `--universal-checkpoint` to the script

   c. started the slurm job normally with the edited script

4. used a kill-switch to save a new checkpoint `global_step94768`, which will be a normal Megatron-DeepSpeed checkpoint

5. removed `--universal-checkpoint` from the slurm script

6. resumed training normally
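Steps 3-6 are described only in prose above; the following bash sketch shows roughly what they might look like in practice. The script name `tr11-176B-ml.slurm` and the `--kill-switch-path` file mechanism are assumptions for illustration; only the 24-node topology and the `--universal-checkpoint` flag come from the notes themselves:

```bash
# Hedged sketch of steps 3-6; the script name and kill-switch path are assumptions.

# 3a. in the slurm script, switch the topology to 24 nodes, e.g.
#     #SBATCH --nodes=24
# 3b. add --universal-checkpoint to the training arguments so the run loads
#     the universal checkpoint produced in step 2
# 3c. launch the edited script normally
sbatch tr11-176B-ml.slurm            # assumed script name

# 4. trigger a clean save: if the script passes a flag such as
#    --kill-switch-path $KILL_SWITCH_FILE, creating that file tells the
#    training to save a regular Megatron-DeepSpeed checkpoint and exit
touch "$KILL_SWITCH_FILE"            # assumed kill-switch file path

# 5. edit the slurm script again to remove --universal-checkpoint

# 6. resume training normally from the new regular checkpoint (global_step94768)
sbatch tr11-176B-ml.slurm
```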
