
Integrate CUDA Graph 4D in test_megatron_e2e_pipeline.py #121

Closed

GindaChen wants to merge 5 commits into main from junda/cuda-graph-fix-tp
Conversation

GindaChen (Collaborator) commented Dec 14, 2025

Integrates #119 into main, fixing TP.

Test

Single Node

TP=4
PP=2

torchrun test_megatron_e2e_pipeline.py \
        --num-nodes 1 \
        --num-gpus-per-node 8 \
        --pp-size $PP \
        --tp-size $TP \
        --num-microbatch 2

Multi Node

  1. Allocate a Slurm job with salloc. Note the Slurm job ID and the IP of the head node.
  2. In another terminal, run the following:
JOBID=$SLURM_JOB_ID #<slurm-job-id>
HEAD_NODE=<ip-of-the-head-node> 
NNODES=4
NGPU_PER_NODE=8
TP=8
PP=2
NGPUS=$((NNODES * NGPU_PER_NODE))

srun --jobid=$JOBID -N $NNODES -G $NGPUS -w $HEAD_NODE \
  --ntasks=$NNODES --ntasks-per-node=1 \
  bash -lc '
    torchrun \
      --nproc_per_node='"$NGPU_PER_NODE"' \
      --nnodes='"$NNODES"' \
      --node_rank=$SLURM_NODEID \
      --rdzv-backend=c10d \
      --rdzv-endpoint='"$HEAD_NODE"':29500 \
      --rdzv-id=pp01 \
      test_megatron_e2e_pipeline.py \
        --num-nodes '"$NNODES"' \
        --num-gpus-per-node '"$NGPU_PER_NODE"' \
        --pp-size '"$PP"' \
        --tp-size '"$TP"' \
        --num-microbatch 2
  '

Next PR:

  • Fix the numerical error when CUDA Graph is enabled. This PR can still be merged despite the error, because it does not show up when CUDA Graph is not enabled.
  • Integrate into test_megatron_e2e_pipeline_cp.py (this is where we do e2e training).

Known issue

Mismatched elements: 2094282 / 2097152 (99.9%)
Greatest absolute difference: 5.9111328125 at index (1743, 216) (up to 0.0011 allowed)
Greatest relative difference: inf at index (0, 27) (up to 0.0011 allowed)
    raise error_metas[0].to_error(msg)

Note

Adds CUDA Graph-backed pre/post-attention execution to transformer layers and ping-pong block, with test harness and utilities updated to enable and validate it.

  • Runtime/Transformer:
    • Add CUDA Graph paths in d2/runtime/megatron/base_transformer_layer.py with init_pre_attn_cuda_graph, init_post_attn_cuda_graph, and graphed pre/post-attention helpers (an illustrative capture sketch follows this list).
    • Wire CUDA Graph execution into ping-pong flow via new _forward_pre_attn_cuda_graph and _forward_post_attn_cuda_graph.
  • Ping-Pong Scheduling:
    • Refactor tick_ops with comm/compute split helpers and add CUDA Graph variants: forward_pre_core_attn_cuda_graph, forward_post_core_attn_cuda_graph, and forward_post_then_pre_core_attn_cuda_graph.
    • In transformer_block, add init_layer_cuda_graphs() and route tick_nonca_compute to CUDA Graph path when enabled; requires D2_SEQ_LEN.
  • Pipeline/FB Schedule:
    • Extend PP schedule hooks and dummy backward integration; add NVTX ranges and optional dummy backward function in forward_backward_func.
  • Tests:
    • Enable CUDA Graphs in tests/test_megatron_e2e.py and tests/test_megatron_e2e_pipeline.py (set D2_SEQ_LEN, call decoder.init_layer_cuda_graphs(), adjust buffer size and microbatch count).
  • Utils:
    • Make network inspect file writes robust with try/except in d2/utils/network_inspect.py.
    • Update dummy doc-len generation in tests/test_util.py to use total_token_on_rank // dp_size for CUDA Graph compatibility.
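
For context on the "graphed pre/post-attention helpers" above: the helper below is an illustrative sketch only (make_graphed_segment and its arguments are hypothetical, not the PR's code). It shows how a static-shape segment can be captured and replayed with torch.cuda.CUDAGraph, and why D2_SEQ_LEN must be known before initialization.

import os
import torch

def make_graphed_segment(segment_fn, hidden_size, device="cuda"):
    # Shapes must be static for capture; D2_SEQ_LEN fixes the token dimension,
    # which is why the env var has to be set before graphs are initialized.
    seq_len = int(os.environ["D2_SEQ_LEN"])
    static_input = torch.zeros(seq_len, hidden_size, device=device)

    # Warm up on a side stream before capture, as CUDA Graph capture requires.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        for _ in range(3):
            static_output = segment_fn(static_input)
    torch.cuda.current_stream().wait_stream(side_stream)

    # Capture one invocation into a graph; later calls just replay it.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = segment_fn(static_input)

    def run(new_input):
        static_input.copy_(new_input)  # replays operate on fixed buffers
        graph.replay()
        return static_output

    return run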

Written by Cursor Bugbot for commit fa9cf1e.

cursor bot left a comment


This PR is being reviewed by Cursor Bugbot


self.tick_nonca_compute = tick_nonca_compute_cuda_graph if self.use_cuda_graph else tick_nonca_compute

def init_layer_cuda_graphs(self):
    self.use_cuda_graph = True

Bug: CUDA graph function reference not updated after initialization

In init_value(), self.tick_nonca_compute is set based on self.use_cuda_graph which is False at initialization. When init_layer_cuda_graphs() is called later to enable CUDA graphs, it sets use_cuda_graph = True but never updates the tick_nonca_compute attribute. This means the CUDA graph version (tick_nonca_compute_cuda_graph) is never used despite being initialized.
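
One way to address this, sketched below on the assumption that tick_nonca_compute_cuda_graph is in scope at this point (as the excerpt above suggests), is to rebind the dispatch attribute when the flag flips:

def init_layer_cuda_graphs(self):
    self.use_cuda_graph = True
    # Re-select the tick function now that the flag has changed; otherwise the
    # binding chosen in init_value() (with use_cuda_graph still False) stays in effect.
    self.tick_nonca_compute = tick_nonca_compute_cuda_graph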


if is_last_layer_post_attn:
    return forward_post_core_attn_cuda_graph(layer, arg_group)
if prev_layer is None:
    return forward_pre_core_attn(layer, arg_group)

Bug: Wrong function called for first layer in CUDA graph mode

In tick_nonca_compute_cuda_graph, when prev_layer is None (first layer case), it calls forward_pre_core_attn (non-CUDA-graph version) instead of forward_pre_core_attn_cuda_graph. This defeats the purpose of using CUDA graphs and creates inconsistency where the first layer uses the non-graphed path while other layers use the graphed path.
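
A minimal sketch of the consistent routing, assuming forward_pre_core_attn_cuda_graph accepts the same (layer, arg_group) arguments as its non-graphed counterpart:

if is_last_layer_post_attn:
    return forward_post_core_attn_cuda_graph(layer, arg_group)
if prev_layer is None:
    # First layer: use the graphed pre-attention path as well.
    return forward_pre_core_attn_cuda_graph(layer, arg_group)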


set_random_seed(seed, set_megatron=False)

# torch.distributed.breakpoint()
worker.train_module[0].module.module.decoder.init_layer_cuda_graphs() # FIXME: hardcode for now, where to put?

Bug: Missing D2_SEQ_LEN environment variable before initialization

The call to init_layer_cuda_graphs() is added without setting the D2_SEQ_LEN environment variable beforehand. The init_layer_cuda_graphs() method requires this environment variable and will raise ValueError("D2_SEQ_LEN is not set"). Unlike test_megatron_e2e_pipeline.py which sets this variable before initialization, this file does not.
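
A minimal sketch of the missing setup, assuming the test's sequence length is available as a variable seq_len (the variable name is a placeholder):

import os

# init_layer_cuda_graphs() sizes its CUDA Graph buffers from D2_SEQ_LEN, so the
# variable must be set before initialization.
os.environ["D2_SEQ_LEN"] = str(seq_len)

set_random_seed(seed, set_megatron=False)
worker.train_module[0].module.module.decoder.init_layer_cuda_graphs()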


def forward_post_then_pre_core_attn_cuda_graph(layer: TransformerLayer, args: Dict[str, Any]):
    log_memory_usage(f"(L{layer.layer_number}) forward_post_then_pre_core_attn:(start)")
    assert args["context"] is None and args["context_mask"] is None, "not supported in cudagraph"
    forward_post_core_attn_comm(layer, args)

Bug: Previous layer not passed to CUDA graph communication function

In forward_post_then_pre_core_attn_cuda_graph, the function forward_post_core_attn_comm is called with layer (the current layer), but it should use the previous layer for processing the previous layer's attention output. The tick_nonca_compute_cuda_graph function has access to prev_layer but doesn't pass it to forward_post_then_pre_core_attn_cuda_graph. In the non-CUDA-graph version tick_nonca_compute, forward_post_core_attn(prev_layer, arg_group) correctly uses prev_layer. This causes layer._post_attn_to_mlp and config values to be read from the wrong layer.
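
One possible shape of the fix, sketched here with an extra prev_layer parameter; the signature change is an assumption, not the PR's actual code:

def forward_post_then_pre_core_attn_cuda_graph(layer, prev_layer, args):
    log_memory_usage(f"(L{layer.layer_number}) forward_post_then_pre_core_attn:(start)")
    assert args["context"] is None and args["context_mask"] is None, "not supported in cudagraph"
    # The post-attention half consumes the previous layer's attention output, so
    # its comm step should read _post_attn_to_mlp and config from prev_layer.
    forward_post_core_attn_comm(prev_layer, args)
    # ... then run the graphed pre-attention step for the current layer.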


cursor bot left a comment



Bug: Commented import breaks function that still uses module

The import of wlbllm.registry is commented out, but wlbllm.registry is still used in the wlb_swap_next_forward_metadata() and wlb_swap_next_backward_metadata() functions. When these functions are called (with WLBLLM_MODE=1), they raise a NameError because the module was never imported.

d2/runtime/megatron/forward_backward_func.py#L92-L101

https://github.com/GindaChen/d2/blob/fa9cf1ef0a1c35fb8639ece84530fcf66feb2c85/d2/runtime/megatron/forward_backward_func.py#L92-L101
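
A sketch of one way to resolve the name without paying the import cost unconditionally; guarding on WLBLLM_MODE here is an assumption about the intent behind commenting the import out:

import os

# Only import the registry when the WLBLLM code path can actually run; the
# swap functions then see wlbllm.registry resolved as before.
if os.environ.get("WLBLLM_MODE") == "1":
    import wlbllm.registry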



GindaChen (Collaborator, Author) commented:

Superseded by #124

GindaChen closed this Dec 17, 2025