
Integrate CUDA Graph 4D in test_megatron_e2e_pipeline.py #121

Closed

GindaChen wants to merge 5 commits into main from junda/cuda-graph-fix-tp
Conversation

GindaChen (Collaborator) commented Dec 14, 2025

Integrates #119 into main, fixing TP.

Test

Single Node

TP=4
PP=2

torchrun test_megatron_e2e_pipeline.py \
        --num-nodes 1 \
        --num-gpus-per-node 8 \
        --pp-size $PP \
        --tp-size $TP \
        --num-microbatch 2

Multi Node

  1. Allocate a Slurm job with salloc. Note the Slurm job ID and the IP of the head node.
  2. In another terminal, run the following:
JOBID=$SLURM_JOB_ID #<slurm-job-id>
HEAD_NODE=<ip-of-the-head-node> 
NNODES=4
NGPU_PER_NODE=8
TP=8
PP=2
NGPUS=$((NNODES * NGPU_PER_NODE))

srun --jobid=$JOBID -N $NNODES -G $NGPUS -w $HEAD_NODE \
  --ntasks=$NNODES --ntasks-per-node=1 \
  bash -lc '
    torchrun \
      --nproc_per_node='"$NGPU_PER_NODE"' \
      --nnodes='"$NNODES"' \
      --node_rank=$SLURM_NODEID \
      --rdzv-backend=c10d \
      --rdzv-endpoint='"$HEAD_NODE"':29500 \
      --rdzv-id=pp01 \
      test_megatron_e2e_pipeline.py \
        --num-nodes '"$NNODES"' \
        --num-gpus-per-node '"$NGPU_PER_NODE"' \
        --pp-size '"$PP"' \
        --tp-size '"$TP"' \
        --num-microbatch 2
  '

Next PR:

  • Fix the numerical error when CUDA Graph is enabled. This PR can still be merged despite the error, because it does not show up when CUDA Graph is not enabled.
  • Integrate into test_megatron_e2e_pipeline_cp.py (this is where we do e2e training).

Known issue

Mismatched elements: 2094282 / 2097152 (99.9%)
Greatest absolute difference: 5.9111328125 at index (1743, 216) (up to 0.0011 allowed)
Greatest relative difference: inf at index (0, 27) (up to 0.0011 allowed)
    raise error_metas[0].to_error(msg)

Note

Adds CUDA Graph-backed pre/post-attention execution to transformer layers and ping-pong block, with test harness and utilities updated to enable and validate it.

  • Runtime/Transformer:
    • Add CUDA Graph paths in d2/runtime/megatron/base_transformer_layer.py with init_pre_attn_cuda_graph, init_post_attn_cuda_graph, and graphed pre/post-attention helpers (an illustrative capture sketch follows this list).
    • Wire CUDA Graph execution into ping-pong flow via new _forward_pre_attn_cuda_graph and _forward_post_attn_cuda_graph.
  • Ping-Pong Scheduling:
    • Refactor tick_ops with comm/compute split helpers and add CUDA Graph variants: forward_pre_core_attn_cuda_graph, forward_post_core_attn_cuda_graph, and forward_post_then_pre_core_attn_cuda_graph.
    • In transformer_block, add init_layer_cuda_graphs() and route tick_nonca_compute to CUDA Graph path when enabled; requires D2_SEQ_LEN.
  • Pipeline/FB Schedule:
    • Extend PP schedule hooks and dummy backward integration; add NVTX ranges and optional dummy backward function in forward_backward_func.
  • Tests:
    • Enable CUDA Graphs in tests/test_megatron_e2e.py and tests/test_megatron_e2e_pipeline.py (set D2_SEQ_LEN, call decoder.init_layer_cuda_graphs(), adjust buffer size and microbatch count).
  • Utils:
    • Make network inspect file writes robust with try/except in d2/utils/network_inspect.py.
    • Update dummy doc-len generation in tests/test_util.py to use total_token_on_rank // dp_size for CUDA Graph compatibility.
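
For context on the "graphed pre/post-attention helpers" above: the helper below is an illustrative sketch only (make_graphed_segment and its arguments are hypothetical, not the PR's code). It shows how a static-shape segment can be captured and replayed with torch.cuda.CUDAGraph, and why D2_SEQ_LEN must be known before initialization.

import os
import torch

def make_graphed_segment(segment_fn, hidden_size, device="cuda"):
    # Shapes must be static for capture; D2_SEQ_LEN fixes the token dimension,
    # which is why the env var has to be set before graphs are initialized.
    seq_len = int(os.environ["D2_SEQ_LEN"])
    static_input = torch.zeros(seq_len, hidden_size, device=device)

    # Warm up on a side stream before capture, as CUDA Graph capture requires.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        for _ in range(3):
            static_output = segment_fn(static_input)
    torch.cuda.current_stream().wait_stream(side_stream)

    # Capture one invocation into a graph; later calls just replay it.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = segment_fn(static_input)

    def run(new_input):
        static_input.copy_(new_input)  # replays operate on fixed buffers
        graph.replay()
        return static_output

    return run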

Written by Cursor Bugbot for commit fa9cf1e.

cursor bot left a comment


This PR is being reviewed by Cursor Bugbot


self.tick_nonca_compute = tick_nonca_compute_cuda_graph if self.use_cuda_graph else tick_nonca_compute

def init_layer_cuda_graphs(self):
    self.use_cuda_graph = True

Bug: CUDA graph function reference not updated after initialization

In init_value(), self.tick_nonca_compute is set based on self.use_cuda_graph which is False at initialization. When init_layer_cuda_graphs() is called later to enable CUDA graphs, it sets use_cuda_graph = True but never updates the tick_nonca_compute attribute. This means the CUDA graph version (tick_nonca_compute_cuda_graph) is never used despite being initialized.
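
One way to address this, sketched below on the assumption that tick_nonca_compute_cuda_graph is in scope at this point (as the excerpt above suggests), is to rebind the dispatch attribute when the flag flips:

def init_layer_cuda_graphs(self):
    self.use_cuda_graph = True
    # Re-select the tick function now that the flag has changed; otherwise the
    # binding chosen in init_value() (with use_cuda_graph still False) stays in effect.
    self.tick_nonca_compute = tick_nonca_compute_cuda_graph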


if is_last_layer_post_attn:
    return forward_post_core_attn_cuda_graph(layer, arg_group)
if prev_layer is None:
    return forward_pre_core_attn(layer, arg_group)

Bug: Wrong function called for first layer in CUDA graph mode

In tick_nonca_compute_cuda_graph, when prev_layer is None (first layer case), it calls forward_pre_core_attn (non-CUDA-graph version) instead of forward_pre_core_attn_cuda_graph. This defeats the purpose of using CUDA graphs and creates inconsistency where the first layer uses the non-graphed path while other layers use the graphed path.
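
A minimal sketch of the consistent routing, assuming forward_pre_core_attn_cuda_graph accepts the same (layer, arg_group) arguments as its non-graphed counterpart:

if is_last_layer_post_attn:
    return forward_post_core_attn_cuda_graph(layer, arg_group)
if prev_layer is None:
    # First layer: use the graphed pre-attention path as well.
    return forward_pre_core_attn_cuda_graph(layer, arg_group)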


set_random_seed(seed, set_megatron=False)

# torch.distributed.breakpoint()
worker.train_module[0].module.module.decoder.init_layer_cuda_graphs() # FIXME: hardcode for now, where to put?

Bug: Missing D2_SEQ_LEN environment variable before initialization

The call to init_layer_cuda_graphs() is added without setting the D2_SEQ_LEN environment variable beforehand. The init_layer_cuda_graphs() method requires this environment variable and will raise ValueError("D2_SEQ_LEN is not set"). Unlike test_megatron_e2e_pipeline.py which sets this variable before initialization, this file does not.
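
A minimal sketch of the missing setup, assuming the test's sequence length is available as a variable seq_len (the variable name is a placeholder):

import os

# init_layer_cuda_graphs() sizes its CUDA Graph buffers from D2_SEQ_LEN, so the
# variable must be set before initialization.
os.environ["D2_SEQ_LEN"] = str(seq_len)

set_random_seed(seed, set_megatron=False)
worker.train_module[0].module.module.decoder.init_layer_cuda_graphs()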


def forward_post_then_pre_core_attn_cuda_graph(layer: TransformerLayer, args: Dict[str, Any]):
    log_memory_usage(f"(L{layer.layer_number}) forward_post_then_pre_core_attn:(start)")
    assert args["context"] is None and args["context_mask"] is None, "not supported in cudagraph"
    forward_post_core_attn_comm(layer, args)

Bug: Previous layer not passed to CUDA graph communication function

In forward_post_then_pre_core_attn_cuda_graph, the function forward_post_core_attn_comm is called with layer (the current layer), but it should use the previous layer for processing the previous layer's attention output. The tick_nonca_compute_cuda_graph function has access to prev_layer but doesn't pass it to forward_post_then_pre_core_attn_cuda_graph. In the non-CUDA-graph version tick_nonca_compute, forward_post_core_attn(prev_layer, arg_group) correctly uses prev_layer. This causes layer._post_attn_to_mlp and config values to be read from the wrong layer.
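
One possible shape of the fix, sketched here with an extra prev_layer parameter; the signature change is an assumption, not the PR's actual code:

def forward_post_then_pre_core_attn_cuda_graph(layer, prev_layer, args):
    log_memory_usage(f"(L{layer.layer_number}) forward_post_then_pre_core_attn:(start)")
    assert args["context"] is None and args["context_mask"] is None, "not supported in cudagraph"
    # The post-attention half consumes the previous layer's attention output, so
    # its comm step should read _post_attn_to_mlp and config from prev_layer.
    forward_post_core_attn_comm(prev_layer, args)
    # ... then run the graphed pre-attention step for the current layer.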


cursor bot left a comment



Bug: Commented import breaks function that still uses module

The import of wlbllm.registry is commented out, but wlbllm.registry is still used in the wlb_swap_next_forward_metadata() and wlb_swap_next_backward_metadata() functions. When these functions are called (with WLBLLM_MODE=1), they raise a NameError because the module was never imported.

d2/runtime/megatron/forward_backward_func.py#L92-L101

https://github.com/GindaChen/d2/blob/fa9cf1ef0a1c35fb8639ece84530fcf66feb2c85/d2/runtime/megatron/forward_backward_func.py#L92-L101
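
A sketch of one way to resolve the name without paying the import cost unconditionally; guarding on WLBLLM_MODE here is an assumption about the intent behind commenting the import out:

import os

# Only import the registry when the WLBLLM code path can actually run; the
# swap functions then see wlbllm.registry resolved as before.
if os.environ.get("WLBLLM_MODE") == "1":
    import wlbllm.registry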



GindaChen (Collaborator, Author) commented:

Superseded by #124

GindaChen closed this Dec 17, 2025