Integrate CUDA Graph 4D in test_megatron_e2e_pipeline.py#121
Conversation
```python
self.tick_nonca_compute = tick_nonca_compute_cuda_graph if self.use_cuda_graph else tick_nonca_compute

def init_layer_cuda_graphs(self):
    self.use_cuda_graph = True
```
Bug: CUDA graph function reference not updated after initialization
In init_value(), self.tick_nonca_compute is set based on self.use_cuda_graph which is False at initialization. When init_layer_cuda_graphs() is called later to enable CUDA graphs, it sets use_cuda_graph = True but never updates the tick_nonca_compute attribute. This means the CUDA graph version (tick_nonca_compute_cuda_graph) is never used despite being initialized.
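A minimal sketch of one possible fix, assuming `init_layer_cuda_graphs` is a method on the same block whose `init_value()` picked the eager function, and that `tick_nonca_compute_cuda_graph` is in scope in that module (all names taken from the excerpt above):

```python
def init_layer_cuda_graphs(self):
    self.use_cuda_graph = True
    # Re-bind the dispatch target: init_value() chose the eager
    # tick_nonca_compute while use_cuda_graph was still False, so the
    # graphed variant must be installed once CUDA graphs are enabled.
    self.tick_nonca_compute = tick_nonca_compute_cuda_graph
    # ... per-layer graph capture (init_pre_attn_cuda_graph /
    # init_post_attn_cuda_graph) would follow here.
```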
```python
if is_last_layer_post_attn:
    return forward_post_core_attn_cuda_graph(layer, arg_group)
if prev_layer is None:
    return forward_pre_core_attn(layer, arg_group)
```
Bug: Wrong function called for first layer in CUDA graph mode
In tick_nonca_compute_cuda_graph, when prev_layer is None (first layer case), it calls forward_pre_core_attn (non-CUDA-graph version) instead of forward_pre_core_attn_cuda_graph. This defeats the purpose of using CUDA graphs and creates inconsistency where the first layer uses the non-graphed path while other layers use the graphed path.
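A hedged sketch of the corrected dispatch fragment, assuming `forward_pre_core_attn_cuda_graph` accepts the same `(layer, arg_group)` arguments as its eager counterpart:

```python
if is_last_layer_post_attn:
    return forward_post_core_attn_cuda_graph(layer, arg_group)
if prev_layer is None:
    # First layer of the tick: stay on the graphed path instead of
    # falling back to the eager forward_pre_core_attn.
    return forward_pre_core_attn_cuda_graph(layer, arg_group)
```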
```python
set_random_seed(seed, set_megatron=False)

# torch.distributed.breakpoint()
worker.train_module[0].module.module.decoder.init_layer_cuda_graphs()  # FIXME: hardcode for now, where to put?
```
Bug: Missing D2_SEQ_LEN environment variable before initialization
The call to init_layer_cuda_graphs() is added without setting the D2_SEQ_LEN environment variable beforehand. The init_layer_cuda_graphs() method requires this environment variable and will raise ValueError("D2_SEQ_LEN is not set"). Unlike test_megatron_e2e_pipeline.py which sets this variable before initialization, this file does not.
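A minimal sketch of the missing guard, mirroring what test_megatron_e2e_pipeline.py reportedly does before initialization; the sequence-length value is a placeholder, not taken from the PR:

```python
import os

# D2_SEQ_LEN must be set before graph capture, otherwise
# init_layer_cuda_graphs() raises ValueError("D2_SEQ_LEN is not set").
# "4096" is a placeholder; use the sequence length the test actually runs with.
os.environ.setdefault("D2_SEQ_LEN", "4096")
worker.train_module[0].module.module.decoder.init_layer_cuda_graphs()
```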
```python
def forward_post_then_pre_core_attn_cuda_graph(layer: TransformerLayer, args: Dict[str, Any]):
    log_memory_usage(f"(L{layer.layer_number}) forward_post_then_pre_core_attn:(start)")
    assert args["context"] is None and args["context_mask"] is None, "not supported in cudagraph"
    forward_post_core_attn_comm(layer, args)
```
Bug: Previous layer not passed to CUDA graph communication function
In forward_post_then_pre_core_attn_cuda_graph, the function forward_post_core_attn_comm is called with layer (the current layer), but it should use the previous layer for processing the previous layer's attention output. The tick_nonca_compute_cuda_graph function has access to prev_layer but doesn't pass it to forward_post_then_pre_core_attn_cuda_graph. In the non-CUDA-graph version tick_nonca_compute, forward_post_core_attn(prev_layer, arg_group) correctly uses prev_layer. This causes layer._post_attn_to_mlp and config values to be read from the wrong layer.
Additional Locations (1)
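One possible shape of the fix, sketched under the assumption that `tick_nonca_compute_cuda_graph` can thread its `prev_layer` through to this helper (typing imports and `TransformerLayer` as in the surrounding module):

```python
def forward_post_then_pre_core_attn_cuda_graph(
    prev_layer: TransformerLayer, layer: TransformerLayer, args: Dict[str, Any]
):
    log_memory_usage(f"(L{layer.layer_number}) forward_post_then_pre_core_attn:(start)")
    assert args["context"] is None and args["context_mask"] is None, "not supported in cudagraph"
    # Post-attention communication consumes the *previous* layer's attention
    # output, matching forward_post_core_attn(prev_layer, arg_group) in the
    # eager tick, so _post_attn_to_mlp and config are read from prev_layer.
    forward_post_core_attn_comm(prev_layer, args)
    # ... pre-attention work for the current layer would continue below.
```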
Bug: Commented import breaks function that still uses module
The import wlbllm.registry statement is commented out, but wlbllm.registry is still used in wlb_swap_next_forward_metadata() and wlb_swap_next_backward_metadata() functions. When these functions are called (when WLBLLM_MODE=1), they will raise a NameError because the module was never imported.
d2/runtime/megatron/forward_backward_func.py#L92-L101
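The most direct fix is to restore the commented-out top-level import. A hedged alternative sketch, if the intent was to avoid a hard dependency at import time, is a lazy import inside the helpers (body elided; the same applies to wlb_swap_next_backward_metadata):

```python
def wlb_swap_next_forward_metadata(*args, **kwargs):
    # Import lazily so the module still loads without WLBLLM installed;
    # the name wlbllm is then bound locally, so wlbllm.registry resolves
    # on the WLBLLM_MODE=1 path instead of raising NameError.
    import wlbllm.registry
    ...  # existing body, which references wlbllm.registry, unchanged
```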
Superseded by #124
Integrating #119 into main, fixing TP.
Test
Single Node
```bash
TP=4 PP=2 torchrun test_megatron_e2e_pipeline.py \
  --num-nodes 1 \
  --num-gpus-per-node 8 \
  --pp-size $PP \
  --tp-size $TP \
  --num-microbatch 2
```

Multi Node

salloc a slurm job. Get the slurm job id and the head node IP.

Next PR:

- `test_megatron_e2e_pipeline_cp.py` (because this is where we do e2e training)

Known issue
Note
Adds CUDA Graph-backed pre/post-attention execution to transformer layers and ping-pong block, with test harness and utilities updated to enable and validate it.
- `d2/runtime/megatron/base_transformer_layer.py`: extended with `init_pre_attn_cuda_graph`, `init_post_attn_cuda_graph`, and graphed pre/post-attention helpers `_forward_pre_attn_cuda_graph` and `_forward_post_attn_cuda_graph`.
- `tick_ops`: refactored with comm/compute split helpers, adding the CUDA Graph variants `forward_pre_core_attn_cuda_graph`, `forward_post_core_attn_cuda_graph`, and `forward_post_then_pre_core_attn_cuda_graph`.
- `transformer_block`: adds `init_layer_cuda_graphs()` and routes `tick_nonca_compute` to the CUDA Graph path when enabled; requires `D2_SEQ_LEN`.
- `forward_backward_func`: related changes.
- `tests/test_megatron_e2e.py` and `tests/test_megatron_e2e_pipeline.py`: set `D2_SEQ_LEN`, call `decoder.init_layer_cuda_graphs()`, and adjust buffer size and microbatch count.
- `d2/utils/network_inspect.py`: related changes.
- `tests/test_util.py`: uses `total_token_on_rank // dp_size` for CUDA Graph compatibility (see the illustrative snippet below).

Written by Cursor Bugbot for commit fa9cf1e.
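For context on why the tests pin `D2_SEQ_LEN` and a constant `total_token_on_rank // dp_size`, here is a generic PyTorch illustration (not code from this PR, and using the stock `torch.cuda.make_graphed_callables` API rather than the PR's own capture helpers): CUDA Graph capture records work for fixed tensor shapes, so every replay must reuse the captured shape.

```python
import torch

# Capture a tiny layer with a static input shape.
layer = torch.nn.Linear(1024, 1024).cuda()
static_x = torch.randn(16, 1024, device="cuda")
graphed_layer = torch.cuda.make_graphed_callables(layer, (static_x,))

# Replays must use the captured (16, 1024) shape; varying the sequence
# length or per-rank token count between calls is not supported once captured.
y = graphed_layer(torch.randn(16, 1024, device="cuda"))
```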