
Enable CUDA graph for ADAM optimizer#3429

Merged
Phlip79 merged 9 commits into NVIDIA:main from vasunvidia:vrengasamy/optimizer_cuda_graph_main
Apr 3, 2026

Conversation

@vasunvidia
Contributor

Add wait stream before copying next batch to CG input

Add OptimizerCudaGraphWrapper to CUDA graph optimizer

Cleanup

What does this PR do ?

⚠️ For major changes (either in lines of code or in their impact), please make sure to first share a design doc with the team. If you're unsure of the best way to do so, contact @mcore-oncall.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see the Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@vasunvidia vasunvidia requested review from a team as code owners February 14, 2026 05:55
@copy-pr-bot

copy-pr-bot bot commented Feb 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ko3n1g ko3n1g requested a review from a team February 14, 2026 05:56
@Phlip79 Phlip79 added the Final Review PR is in the "final review" stage label Mar 4, 2026
@vasunvidia vasunvidia force-pushed the vrengasamy/optimizer_cuda_graph_main branch from 86eb0f6 to d64787e Compare March 19, 2026 17:19
@vasunvidia vasunvidia requested review from a team as code owners March 19, 2026 17:19
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the Final Review PR is in the "final review" stage label Mar 19, 2026
@gautham-kollu gautham-kollu requested review from deepakn94, gautham-kollu and jiemingz and removed request for deepakn94 March 23, 2026 16:47
@erhoo82 erhoo82 added the complexity: low, 26.04 (this PR is high priority and should be merged asap), and Expert Review [deprecated] labels Mar 23, 2026
@asolergi-nv
Contributor

/claude review

torch.cuda.synchronize()
torch.distributed.barrier()
logger.info(f'Optimizer CUDA graph capture done!!!')
if OptimizerCudaGraphWrapper.cuda_graph is None:
Contributor

Bug: On the capture iteration (curr_iteration == cuda_graph_warmup_steps), the optimizer step runs once during graph capture (line 45), then falls through to the else branch (line 52) which calls replay() — executing the optimizer step a second time. This will silently corrupt training on that iteration.

This if should be elif:

Suggested change
if OptimizerCudaGraphWrapper.cuda_graph is None:
elif OptimizerCudaGraphWrapper.cuda_graph is None:
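The scenario the reviewer describes can be reproduced with a small CPU-only sketch. Everything here is illustrative (`FakeGraph`, `run_iteration`, the state layout), not the actual Megatron-LM code: with a bare `if`, the capture iteration both records the step and immediately replays it.

```python
class FakeGraph:
    """Stands in for a captured CUDA graph; replay() re-runs the recorded work."""

    def __init__(self, step_fn):
        self._step_fn = step_fn

    def replay(self):
        self._step_fn()


def run_iteration(state, curr_iteration, warmup_steps, fixed):
    """Dispatch one optimizer step.

    fixed=False models the bare `if` the reviewer flagged; fixed=True models
    turning it into `elif`, i.e. skipping the dispatch when we just captured.
    """

    def do_step():
        state["step_count"] += 1

    captured_now = False
    if curr_iteration == warmup_steps:
        do_step()  # the step executes once while the graph is being captured
        state["graph"] = FakeGraph(do_step)
        captured_now = True
    if not (fixed and captured_now):  # bare `if` falls through after capture
        if state["graph"] is None:
            do_step()  # eager step during warmup iterations
        else:
            state["graph"].replay()  # replays the captured step


def steps_per_iteration(fixed, iterations=5, warmup_steps=2):
    """Count optimizer step executions on each iteration."""
    state = {"graph": None, "step_count": 0}
    deltas = []
    for it in range(iterations):
        before = state["step_count"]
        run_iteration(state, it, warmup_steps, fixed)
        deltas.append(state["step_count"] - before)
    return deltas
```

With `warmup_steps=2`, `steps_per_iteration(fixed=False)` returns `[1, 1, 2, 1, 1]`: the capture iteration steps twice (once during capture, once via replay), which is the silent corruption the review flags. The `elif` variant returns `[1, 1, 1, 1, 1]`.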


import torch

from megatron.core.tensor_parallel.random import get_all_rng_states
Contributor

Unused import — get_all_rng_states is never referenced in this file.

@asolergi-nv
Contributor

/ok to test 5284819

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Mar 24, 2026
@@ -0,0 +1,70 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.

"""Full iteration CUDA graph for training."""
Contributor

This description doesn't make much sense given the name of the file.

Contributor Author

Fixed it.

config_logger_dir: str = ""
"""When non-empty, dumps entry-point configs to config_logger_dir"""

on_device_clip_grad: bool = False
Contributor

Is there a downside to doing this? Should we just default to on, or even just do it with no knob?

Contributor Author

Added a commit to remove on_device_clip_grad knob and use on_device_clip_grad if the kernel is present.
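The no-knob approach described here typically amounts to feature detection: enable the on-device path only when the optional kernel is importable. A minimal sketch of that pattern; `_hypothetical_fused_kernels` and `clip_grad_norm_on_device` are made-up names, not the actual Megatron-LM symbols.

```python
import importlib


def has_on_device_clip_kernel(module_name="_hypothetical_fused_kernels"):
    """Feature-detect the optional fused kernel instead of exposing a config knob."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return False  # extension not installed: fall back to the host-side path
    return hasattr(mod, "clip_grad_norm_on_device")


# The fast path is chosen automatically whenever the kernel is present.
on_device_clip_grad = has_on_device_clip_kernel()
```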

@Phlip79 Phlip79 removed the Expert Review [deprecated] Apply this label to indicate that your PR is ready for expert review. label Mar 25, 2026
Contributor

@maanug-nv maanug-nv left a comment


LGTM. Left one nit. Approving to unblock, but please add the comment before merging.

cuda_graph_helper.delete_cuda_graphs()

if args.optimizer_cuda_graph:
del optimizer.step
Contributor

This seems a bit unintuitive. Can we add a comment explaining that it resets `step` back from the wrapper to the normal method implementation?

Contributor Author

Added comment.
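For readers wondering why `del optimizer.step` restores the original behavior: assigning the graphed wrapper creates an instance attribute that shadows the class's method, and `del` removes only that instance attribute, so attribute lookup falls back to the class. A generic sketch (the class and return values are illustrative, not the Megatron-LM optimizer):

```python
class TinyOptimizer:
    def step(self):
        return "normal step"


opt = TinyOptimizer()

# Wrap: the instance attribute shadows the class method.
opt.step = lambda: "graphed step"
assert opt.step() == "graphed step"

# Unwrap: deleting the instance attribute exposes the class method again.
del opt.step
assert opt.step() == "normal step"
```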

@ko3n1g
Contributor

ko3n1g commented Mar 30, 2026

/ok to test 131c546

@Phlip79
Member

Phlip79 commented Mar 31, 2026

/ok to test b7066fc

@Phlip79 Phlip79 requested review from a team March 31, 2026 19:02
@@ -0,0 +1,61 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Contributor

Wrong year.

Contributor Author

Fixed.

@jiemingz jiemingz requested a review from a team April 1, 2026 12:02
@vasunvidia vasunvidia requested a review from jaredcasper April 1, 2026 21:23
@Phlip79
Member

Phlip79 commented Apr 2, 2026

/ok to test 83d686c

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review PR is in the "final review" stage label Apr 2, 2026
@ko3n1g ko3n1g added the core_r0.17.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. label Apr 2, 2026
@Phlip79
Member

Phlip79 commented Apr 2, 2026

/ok to test 180dd44

@svcnvidia-nemo-ci svcnvidia-nemo-ci added Approved All necessary approvals have been made and removed Final Review PR is in the "final review" stage labels Apr 3, 2026
@Phlip79
Member

Phlip79 commented Apr 3, 2026

/ok to test cc1e076

Use param_group.get('lr') instead of param_group['lr'] to avoid
KeyError on the first step() call from __init__, where param groups
may not yet have an 'lr' key.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
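The dict-access difference this commit relies on, as a minimal sketch (the param-group contents here are illustrative):

```python
# A freshly constructed param group may not carry an 'lr' key yet.
param_group = {"params": [], "weight_decay": 0.01}

# Subscripting raises KeyError for a missing key; .get() returns a default.
try:
    param_group["lr"]
    raised = False
except KeyError:
    raised = True

assert raised
assert param_group.get("lr") is None        # safe on the first step() call
assert param_group.get("lr", 1e-4) == 1e-4  # or supply an explicit default
```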
@Phlip79
Member

Phlip79 commented Apr 3, 2026

/ok to test f0dd8ab

@Phlip79 Phlip79 added this pull request to the merge queue Apr 3, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23965168729

Merged via the queue into NVIDIA:main with commit 3d87bfc Apr 3, 2026
61 of 63 checks passed
ko3n1g pushed a commit that referenced this pull request Apr 3, 2026
Co-authored-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>
Co-authored-by: Philip Petrakian <ppetrakian@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

Labels

  • 26.04: this PR is high priority and should be merged asap
  • Approved: all necessary approvals have been made
  • complexity: low
  • core_r0.17.0: auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.


9 participants