Enable CUDA graph for ADAM optimizer #3429
Conversation
Commits:
- Add wait stream before copying next batch to CG input
- Add OptimizerCudaGraphWrapper to CUDA graph optimizer
- Cleanup
Force-pushed from 86eb0f6 to d64787e.
```python
torch.cuda.synchronize()
torch.distributed.barrier()
logger.info(f'Optimizer CUDA graph capture done!!!')
```

```python
if OptimizerCudaGraphWrapper.cuda_graph is None:
```
Bug: On the capture iteration (curr_iteration == cuda_graph_warmup_steps), the optimizer step runs once during graph capture (line 45), then falls through to the else branch (line 52) which calls replay() — executing the optimizer step a second time. This will silently corrupt training on that iteration.
This `if` should be an `elif`:
```diff
- if OptimizerCudaGraphWrapper.cuda_graph is None:
+ elif OptimizerCudaGraphWrapper.cuda_graph is None:
```
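To make the fall-through concrete, here is a hedged, CPU-only model of the warmup/capture/replay flow described in the comment. `wrapper_step`, the state dict, and the run counter are illustrative assumptions, not Megatron-LM's code; real CUDA graph capture and replay are replaced by a counter so the double-execution bug is observable without a GPU.

```python
# Hypothetical sketch of the control flow the review describes; real
# capture/replay is modeled with a counter so the bug shows up on CPU.
def wrapper_step(state, use_elif_fix):
    did_capture = False
    if state["graph"] is None and state["iter"] == state["warmup"]:
        # Capture iteration: the optimizer step executes once while the
        # graph is being recorded.
        state["runs"] += 1
        state["graph"] = object()          # stand-in for torch.cuda.CUDAGraph
        did_capture = True
    # In the PR this next check was a plain `if`, so on the capture
    # iteration it sees `graph` already set and falls through to replay().
    # Making it an `elif` of the capture branch skips it on that iteration.
    if not (use_elif_fix and did_capture):
        if state["graph"] is None:
            state["runs"] += 1             # eager optimizer step (warmup)
        else:
            state["runs"] += 1             # cuda_graph.replay()
    state["iter"] += 1

buggy = {"iter": 0, "warmup": 2, "graph": None, "runs": 0}
fixed = {"iter": 0, "warmup": 2, "graph": None, "runs": 0}
for _ in range(4):
    wrapper_step(buggy, use_elif_fix=False)
    wrapper_step(fixed, use_elif_fix=True)
assert buggy["runs"] == 5   # the capture iteration stepped twice
assert fixed["runs"] == 4   # exactly one step per iteration
```

With the plain `if`, four iterations run the optimizer math five times; the `elif` restores one step per iteration.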
```python
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.

"""Full iteration CUDA graph for training."""
```
This description doesn't make much sense given the name of the file.
```python
config_logger_dir: str = ""
"""When non-empty, dumps entry-point configs to config_logger_dir"""

on_device_clip_grad: bool = False
```
Is there a downside to doing this? Should we just default to on, or even just do it with no knob?
Added a commit to remove the `on_device_clip_grad` knob and use on-device clipping whenever the kernel is present.
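For context on what the knob controls, here is a hedged, pure-Python stand-in for global-norm gradient clipping; the function name and structure are illustrative assumptions, not Megatron-LM's kernel. The benefit of doing this on-device is that the total norm and the clip coefficient never leave the GPU, avoiding a device-to-host sync before the optimizer step.

```python
import math

def clip_grads_by_global_norm(grads, max_norm, eps=1e-6):
    # Hypothetical stand-in: compute the global L2 norm across all grads,
    # then scale every gradient by min(1, max_norm / norm). An on-device
    # implementation keeps total_norm and clip_coef on the GPU, so no
    # host-side branch (and no sync) is needed here.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    for grad in grads:
        for i in range(len(grad)):
            grad[i] *= clip_coef
    return total_norm

grads = [[3.0, 4.0]]                 # global norm is 5.0
norm = clip_grads_by_global_norm(grads, max_norm=1.0)
assert abs(norm - 5.0) < 1e-9
assert abs(grads[0][0] - 0.6) < 1e-4 and abs(grads[0][1] - 0.8) < 1e-4
```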
maanug-nv left a comment:
LGTM. Left one nit; approving to unblock, but please add the comment before merging.
```python
cuda_graph_helper.delete_cuda_graphs()

if args.optimizer_cuda_graph:
    del optimizer.step
```
Seems a bit unintuitive; can we add a comment explaining that this resets `optimizer.step` from the wrapper back to the normal method implementation?
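The reason `del optimizer.step` works is plain Python attribute lookup: assigning the graphed wrapper sets an *instance* attribute that shadows the class method, and `del` removes only that instance attribute, so lookup falls back to the class and the normal `step()` becomes visible again. A minimal sketch with hypothetical names:

```python
# Hypothetical illustration of instance-attribute shadowing; not the
# Megatron-LM Optimizer class.
class Optimizer:
    def step(self):
        return "normal step"

opt = Optimizer()
opt.step = lambda: "graphed step"    # wrapper shadows the class method
assert opt.step() == "graphed step"

del opt.step                         # removes only the instance attribute
assert opt.step() == "normal step"   # class method is reachable again
```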
Use param_group.get('lr') instead of param_group['lr'] to avoid
KeyError on the first step() call from __init__, where param groups
may not yet have an 'lr' key.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
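The commit message above describes a plain dict-access change; a hedged illustration of the difference (the param-group dict here is hypothetical, not Megatron-LM's actual structure):

```python
# Before the scheduler populates 'lr', subscripting raises KeyError,
# while dict.get returns None so the first step() call from __init__
# can proceed without crashing.
param_group = {"params": [], "weight_decay": 0.01}   # no 'lr' key yet

assert param_group.get("lr") is None     # .get(): safe default

raised = False
try:
    _ = param_group["lr"]                # subscripting: KeyError
except KeyError:
    raised = True
assert raised
```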
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23965168729
Add wait stream before copying next batch to CG input
Add OptimizerCudaGraphWrapper to CUDA graph optimizer
Cleanup

Co-authored-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>
Co-authored-by: Philip Petrakian <ppetrakian@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
What does this PR do?

Contribution process

```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks

Core 0.8)

Code review

The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add the `Expert Review` PR label when your PR is ready for review.
(Step 2): Collect the expert reviewers' reviews; Final Review might get declined if these requirements are not fulfilled.
(Step 3): Add the `Final Review` label.
(Optional Step 4): Cherry-pick into release branch. If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of `core-adlr` and `core-nemo` will be able to merge your PR.