Fix losses and aggregators when using CUDA graphs#280

Merged
ktangsali merged 10 commits into NVIDIA:main from jasooney23:main
Feb 19, 2026
Conversation

@jasooney23
Contributor

PhysicsNeMo Pull Request

Description

closes #279

Passes `step` as an int scalar tensor to the gradient computation functions so that CUDA graphs can track it properly.
Refactors loss functions and aggregators to replace `if` statements with `torch.where` for CUDA graph compatibility.
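The second change can be illustrated with a minimal sketch (the function name and constants below are hypothetical, not taken from the PhysicsNeMo code): data-dependent Python branching like `if step < warmup:` breaks CUDA graph capture, so the branch is rewritten as a `torch.where` over tensors, and `step` is carried as an int scalar tensor whose value can be updated in place between graph replays.

```python
import torch

def decayed_weight(step: torch.Tensor, warmup: int = 100) -> torch.Tensor:
    """Branchless loss-weight schedule, safe to capture in a CUDA graph.

    `step` must be an int scalar tensor (not a Python int) so that a
    captured graph sees its updated value on each replay.
    """
    # Graph-breaking version of the same logic:
    #   if step < warmup:
    #       return torch.zeros(())
    #   return torch.exp(-0.01 * (step - warmup).float())
    return torch.where(
        step < warmup,
        torch.zeros((), dtype=torch.float32),
        torch.exp(-0.01 * (step - warmup).clamp(min=0).float()),
    )

step = torch.tensor(50)       # int scalar tensor, not a Python int
print(decayed_weight(step))   # tensor(0.)
step.fill_(100)               # in-place update; a replayed graph sees the new value
print(decayed_weight(step))   # tensor(1.)
```

Note that both branches of `torch.where` are always evaluated, which is why the `.clamp(min=0)` guard is needed to keep the "dead" branch numerically well-behaved before the warmup step.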

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • The CHANGELOG.md is up to date with these changes.
  • An issue is linked to this pull request.

Dependencies

jasooney23 and others added 5 commits February 6, 2026 22:57
Signed-off-by: Jason Ye <jasonyecanada@gmail.com>
@ktangsali ktangsali self-requested a review February 13, 2026 19:22
Collaborator

@ktangsali ktangsali left a comment


Hi @jasooney23 , thank you for your PR. I have started to take a look at this and we will keep you posted soon

Collaborator

@loliverhennigh loliverhennigh left a comment


I went through the larger code blocks fairly carefully and it seems correct to me. I would say if the unit tests run and we are able to do some sanity checks on the examples, we are good to go.

@ktangsali
Collaborator

/blossom-ci

@ktangsali
Collaborator

/blossom-ci

Collaborator

@ktangsali ktangsali left a comment


We also ran a few end-to-end tests to verify, and everything works. CI passes as well, so I don't have any issues merging this fix.

Thank you for addressing the issue

@ktangsali ktangsali merged commit 160d3ad into NVIDIA:main Feb 19, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

🐛[BUG]: Loss Functions and Aggregators break with CUDA graphs
