
Processes being abruptly killed in the window of time between the start of shutdown and the shutdown completing #1052

@mikail-g

Description


I'm running Darshan with an application launched by DeepSpeed, which spawns subprocesses to carry out some checkpointing I/O.

When the application completes, Darshan only gets partway through `darshan_core_shutdown` before the process is killed, so the desired logs are lost.

As a test, we edited `darshan_core_shutdown` in darshan-core to return immediately instead of running any of its code. After that change, the desired logs show up in /tmp and are in a valid state to be parsed with darshan-parser.

We are using Darshan 3.4.7.
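For reference, our test edit is essentially the no-op below (a minimal sketch of what we tried, shown with the `int write_log` signature from the darshan-core source we built against; it is only how we confirmed the logs survive, not a proposed fix):

```c
/* Sketch of the test edit to darshan-core.c (assumed 3.4.7 signature):
 * return before any of the shutdown work (module cleanup, reduction,
 * compression, log finalization) so the mmap-backed log in /tmp is
 * left untouched for offline inspection. */
void darshan_core_shutdown(int write_log)
{
    (void)write_log;  /* intentionally unused; skip all shutdown logic */
    return;
}
```

With this in place the mmap log in /tmp is never finalized (note the `end_time: 0` in the header output below), but it is still parseable.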

I include the terminal output below from runs with each of these two variants:

Unedited `darshan_core_shutdown`:

```
[DEEPSPEED DEBUG] Darshan is loaded: /tmp/mgossman_python_id1382732_mmap-log-7779381447136745040-0.darshan
[DEEPSPEED DEBUG] Darshan is loaded: /home/mgossman/restart_perf/software/installs/darshan/lib/libdarshan.so.0.0.0
[DEEPSPEED DEBUG] Darshan is loaded: /home/mgossman/restart_perf/software/installs/darshan/lib/libdarshan.so.0.0.0
[DEEPSPEED DEBUG] Darshan is loaded: /home/mgossman/restart_perf/software/installs/darshan/lib/libdarshan.so.0.0.0
[2025-07-14 19:27:22,111] [INFO] [torch_checkpoint_engine.py:49:save] [Torch] Saving /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4/global_step1/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-07-14 19:27:45,907] [INFO] [torch_checkpoint_engine.py:51:save] [Torch] Saved /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4/global_step1/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-07-14 19:27:45,911] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4/global_step1/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-07-14 19:27:45,911] [INFO] [torch_checkpoint_engine.py:61:commit] [Torch] Checkpoint global_step1 is ready now!
  successfully saved checkpoint at iteration       1 to /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4
[SAVE CHECKPOINT] Save checkpoint took 27.97262 seconds.
Checkpoint Save GB: 22.271, GB/Sec: 0.8, Latency(second): 27.973
(min, max) time across ranks (ms):
    save-checkpoint ................................: (27972.78, 27972.78)
[2025-07-14 19:27:46,227] [INFO] [logging.py:107:log_dist] [Rank 0] step=2, skipped=0, lr=[0.0002918585038060976, 0.0002918585038060976], mom=[(0.9, 0.95), (0.9, 0.95)]
Exiting after iteration 1 for testing restart...
[2025-07-14 19:27:49,341] [INFO] [launch.py:351:main] Process 1382732 exits successfully.

(dspeed_env) mgossman@x3004c0s37b0n0:/restart_perf/llm-restart-perf> darshan-parser /tmp/mgossman_python_id1382732_mmap-log-7779381447136745040-0.darshan  
Error: unable to inflate darshan log data.
Error: failed to read darshan log file job data.
(dspeed_env) mgossman@x3004c0s37b0n0:~/restart_perf/llm-restart-perf>
```

Edited `darshan_core_shutdown`:

```
[DEEPSPEED DEBUG] searching for darshan in checkpointng procs memory (/proc/self/map_files)
[DEEPSPEED DEBUG] Darshan is loaded: /tmp/mgossman_python_id1386493_mmap-log-8406181767851323125-0.darshan
[DEEPSPEED DEBUG] Darshan is loaded: /home/mgossman/restart_perf/software/installs/darshan/lib/libdarshan.so.0.0.0
[DEEPSPEED DEBUG] Darshan is loaded: /home/mgossman/restart_perf/software/installs/darshan/lib/libdarshan.so.0.0.0
[DEEPSPEED DEBUG] Darshan is loaded: /home/mgossman/restart_perf/software/installs/darshan/lib/libdarshan.so.0.0.0
[2025-07-14 19:32:14,616] [INFO] [torch_checkpoint_engine.py:49:save] [Torch] Saving /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4/global_step1/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-07-14 19:32:38,551] [INFO] [torch_checkpoint_engine.py:51:save] [Torch] Saved /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4/global_step1/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-07-14 19:32:38,555] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4/global_step1/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-07-14 19:32:38,555] [INFO] [torch_checkpoint_engine.py:61:commit] [Torch] Checkpoint global_step1 is ready now!
  successfully saved checkpoint at iteration       1 to /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4
[SAVE CHECKPOINT] Save checkpoint took 28.02339 seconds.
Checkpoint Save GB: 22.271, GB/Sec: 0.79, Latency(second): 28.024
(min, max) time across ranks (ms):
    save-checkpoint ................................: (28023.56, 28023.56)
[2025-07-14 19:32:38,864] [INFO] [logging.py:107:log_dist] [Rank 0] step=2, skipped=0, lr=[0.0002918585038060976, 0.0002918585038060976], mom=[(0.9, 0.95), (0.9, 0.95)]
Exiting after iteration 1 for testing restart...
[2025-07-14 19:32:42,771] [INFO] [launch.py:351:main] Process 1386493 exits successfully.
(dspeed_env) mgossman@x3004c0s37b0n0:/restart_perf/llm-restart-perf> darshan-parser /tmp/mgossman_python_id1386493_mmap-log-8406181767851323125-0.darshan > output.txt
(dspeed_env) mgossman@x3004c0s37b0n0:~/restart_perf/llm-restart-perf> head -n 10 output.txt 
# darshan log version: 3.41
# compression method: NONE
# exe: /home/mgossman/venvs/dspeed_env/bin/python -u /home/mgossman/restart_perf/Megatron-DeepSpeed//pretrain_gpt.py --local_rank=0 --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 24 --hidden-size 2048 --num-attention-heads 32 --micro-batch-size 1 --global-batch-size 1 --ffn-hidden-size 8192 --seq-length 2048 --max-position-embeddings 2048 --train-iters 10 --save /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4 --load /grand/VeloC/mikailg/DeepSpeed-restart-perf/modelsize1_tp1_pp1_dp4 --data-path /home/mgossman/restart_perf/dataset/my-gpt2_text_document --vocab-file /home/mgossman/restart_perf/dataset/gpt2-vocab.json --merge-file /home/mgossman/restart_perf/dataset/gpt2-merges.txt --data-impl mmap --tokenizer-type GPTSentencePieceTokenizer --tokenizer-model /home/mgossman/restart_perf/dataset/tokenizer.model --split 949,50,1 --distributed-backend nccl --lr 3e-4 --lr-decay-style cosine --min-lr 3e-5 --weight-decay 0.1 --clip-grad 1 --lr-warmup-iters 1 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.95 --log-interval 1 --save-interval 1 --eval-interval 1000 --eval-iters 0 --bf16 --no-query-key-layer-scaling --attention-dropout 0 --hidden-dropout 0 --use-rotary-position-embeddings --untie-embeddings-and-output-weights --swiglu --normalization rmsnorm --disable-bias-linear --num-key-value-heads 8 --deepspeed --exit-interval 20 --deepspeed_config=/grand/VeloC/mikailg/DeepSpeed-restart-perf/1B-outputs/tp1_pp1_dp1-iter2/ds_config.json --zero-stage=1 --checkpoint-activations --deepspeed-activation-checkpointing --no-pipeline-parallel 
# uid: 35495
# jobid: 1386493
# start_time: 1752521515
# start_time_asci: Mon Jul 14 19:31:55 2025
# end_time: 0
# end_time_asci: Thu Jan  1 00:00:00 1970
# nprocs: 1
```
