Description
I'm running Darshan with an application launched by DeepSpeed, which spawns subprocesses to carry out some checkpointing I/O.
When the application completes, Darshan only gets partway through darshan_core_shutdown before it is killed, losing the desired logs in the process.
As a test we edited darshan_core_shutdown in darshan-core to return immediately instead of running any of its code; after that change the desired logs appear in /tmp and are in a valid state to be parsed with darshan-parser.
We are using Darshan 3.4.7.
Below is console output from runs with each version:
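For context, the "[DEEPSPEED DEBUG] Darshan is loaded:" lines in the output come from a check that scans the process's memory mappings for Darshan. A minimal sketch of that kind of check is below; it reads /proc/self/maps for simplicity rather than /proc/self/map_files, and the "darshan" substring match is my assumption, not the actual DeepSpeed debug code.

```python
# Hedged sketch (Linux-only): detect whether libdarshan or a Darshan mmap
# log is mapped into the current (sub)process, similar in spirit to the
# "[DEEPSPEED DEBUG] Darshan is loaded:" messages shown below.
def find_darshan_mappings(maps_path="/proc/self/maps"):
    """Return unique mapped file paths whose name contains 'darshan'."""
    hits = set()
    with open(maps_path) as f:
        for line in f:
            # Entries backed by a file carry the path in the 6th field.
            parts = line.split(None, 5)
            if len(parts) == 6 and "darshan" in parts[5]:
                hits.add(parts[5].strip())
    return sorted(hits)

if __name__ == "__main__":
    for path in find_darshan_mappings():
        print("[DEEPSPEED DEBUG] Darshan is loaded:", path)
```

Running this inside each checkpointing subprocess confirms whether the Darshan shared library (and its mmap log file, when DARSHAN_MMAP_LOGPATH-style logging is active) is present in that process.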
Unedited `darshan_core_shutdown`:

```
[DEEPSPEED DEBUG] Darshan is loaded: /tmp/mgossman_python_id1382732_mmap-log-7779381447136745040-0.darshan
[DEEPSPEED DEBUG] Darshan is loaded: /home/mgossman/restart_perf/software/installs/darshan/lib/libdarshan.so.0.0.0
[DEEPSPEED DEBUG] Darshan is loaded: /home/mgossman/restart_perf/software/installs/darshan/lib/libdarshan.so.0.0.0
[DEEPSPEED DEBUG] Darshan is loaded: /home/mgossman/restart_perf/software/installs/darshan/lib/libdarshan.so.0.0.0
[2025-07-14 19:27:22,111] [INFO] [torch_checkpoint_engine.py:49:save] [Torch] Saving /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4/global_step1/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-07-14 19:27:45,907] [INFO] [torch_checkpoint_engine.py:51:save] [Torch] Saved /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4/global_step1/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-07-14 19:27:45,911] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4/global_step1/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-07-14 19:27:45,911] [INFO] [torch_checkpoint_engine.py:61:commit] [Torch] Checkpoint global_step1 is ready now!
successfully saved checkpoint at iteration 1 to /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4
[SAVE CHECKPOINT] Save checkpoint took 27.97262 seconds.
Checkpoint Save GB: 22.271, GB/Sec: 0.8, Latency(second): 27.973
(min, max) time across ranks (ms):
save-checkpoint ................................: (27972.78, 27972.78)
[2025-07-14 19:27:46,227] [INFO] [logging.py:107:log_dist] [Rank 0] step=2, skipped=0, lr=[0.0002918585038060976, 0.0002918585038060976], mom=[(0.9, 0.95), (0.9, 0.95)]
Exiting after iteration 1 for testing restart...
[2025-07-14 19:27:49,341] [INFO] [launch.py:351:main] Process 1382732 exits successfully.
(dspeed_env) mgossman@x3004c0s37b0n0:/restart_perf/llm-restart-perf> darshan-parser /tmp/mgossman_python_id1382732_mmap-log-7779381447136745040-0.darshan
Error: unable to inflate darshan log data.
Error: failed to read darshan log file job data.
(dspeed_env) mgossman@x3004c0s37b0n0:~/restart_perf/llm-restart-perf>
```
Edited `darshan_core_shutdown`:

```
[DEEPSPEED DEBUG] searching for darshan in checkpointng procs memory (/proc/self/map_files)
[DEEPSPEED DEBUG] Darshan is loaded: /tmp/mgossman_python_id1386493_mmap-log-8406181767851323125-0.darshan
[DEEPSPEED DEBUG] Darshan is loaded: /home/mgossman/restart_perf/software/installs/darshan/lib/libdarshan.so.0.0.0
[DEEPSPEED DEBUG] Darshan is loaded: /home/mgossman/restart_perf/software/installs/darshan/lib/libdarshan.so.0.0.0
[DEEPSPEED DEBUG] Darshan is loaded: /home/mgossman/restart_perf/software/installs/darshan/lib/libdarshan.so.0.0.0
[2025-07-14 19:32:14,616] [INFO] [torch_checkpoint_engine.py:49:save] [Torch] Saving /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4/global_step1/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-07-14 19:32:38,551] [INFO] [torch_checkpoint_engine.py:51:save] [Torch] Saved /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4/global_step1/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-07-14 19:32:38,555] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4/global_step1/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-07-14 19:32:38,555] [INFO] [torch_checkpoint_engine.py:61:commit] [Torch] Checkpoint global_step1 is ready now!
successfully saved checkpoint at iteration 1 to /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4
[SAVE CHECKPOINT] Save checkpoint took 28.02339 seconds.
Checkpoint Save GB: 22.271, GB/Sec: 0.79, Latency(second): 28.024
(min, max) time across ranks (ms):
save-checkpoint ................................: (28023.56, 28023.56)
[2025-07-14 19:32:38,864] [INFO] [logging.py:107:log_dist] [Rank 0] step=2, skipped=0, lr=[0.0002918585038060976, 0.0002918585038060976], mom=[(0.9, 0.95), (0.9, 0.95)]
Exiting after iteration 1 for testing restart...
[2025-07-14 19:32:42,771] [INFO] [launch.py:351:main] Process 1386493 exits successfully.
(dspeed_env) mgossman@x3004c0s37b0n0:/restart_perf/llm-restart-perf> darshan-parser /tmp/mgossman_python_id1386493_mmap-log-8406181767851323125-0.darshan > output.txt
(dspeed_env) mgossman@x3004c0s37b0n0:~/restart_perf/llm-restart-perf> head -n 10 output.txt
# darshan log version: 3.41
# compression method: NONE
# exe: /home/mgossman/venvs/dspeed_env/bin/python -u /home/mgossman/restart_perf/Megatron-DeepSpeed//pretrain_gpt.py --local_rank=0 --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 24 --hidden-size 2048 --num-attention-heads 32 --micro-batch-size 1 --global-batch-size 1 --ffn-hidden-size 8192 --seq-length 2048 --max-position-embeddings 2048 --train-iters 10 --save /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4 --load /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4 --data-path /home/mgossman/restart_perf/dataset/my-gpt2_text_document --vocab-file /home/mgossman/restart_perf/dataset/gpt2-vocab.json --merge-file /home/mgossman/restart_perf/dataset/gpt2-merges.txt --data-impl mmap --tokenizer-type GPTSentencePieceTokenizer --tokenizer-model /home/mgossman/restart_perf/dataset/tokenizer.model --split 949,50,1 --distributed-backend nccl --lr 3e-4 --lr-decay-style cosine --min-lr 3e-5 --weight-decay 0.1 --clip-grad 1 --lr-warmup-iters 1 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.95 --log-interval 1 --save-interval 1 --eval-interval 1000 --eval-iters 0 --bf16 --no-query-key-layer-scaling --attention-dropout 0 --hidden-dropout 0 --use-rotary-position-embeddings --untie-embeddings-and-output-weights --swiglu --normalization rmsnorm --disable-bias-linear --num-key-value-heads 8 --deepspeed --exit-interval 20 --deepspeed_config=/grand/VeloC/mikailg/DeepSpeed-restart-perf/1B-outputs/tp1_pp1_dp1-iter2/ds_config.json --zero-stage=1 --checkpoint-activations --deepspeed-activation-checkpointing --no-pipeline-parallel
# uid: 35495
# jobid: 1386493
# start_time: 1752521515
# start_time_asci: Mon Jul 14 19:31:55 2025
# end_time: 0
# end_time_asci: Thu Jan 1 00:00:00 1970
# nprocs: 1
```