Conversation

@navsud (Contributor) commented Sep 26, 2025

Summary:
As part of enabling QAT for the HTP model, we need to run QAT on the same model that we use during export. Currently, RoPE is explicitly hardcoded to "cpu". This change enables us to create the rope params on "cuda" when running on a GPU machine.
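For illustration, a minimal sketch of the device flexibility this change is after, assuming a standalone frequency helper (hypothetical names, not the actual executorch rope code):

```python
import torch

def precompute_rope_freqs(dim: int, max_seq_len: int, theta: float = 10000.0,
                          device=None) -> torch.Tensor:
    # Compute inverse frequencies on the requested device instead of a hardcoded "cpu".
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, device=device).float() / dim))
    t = torch.arange(max_seq_len, device=device).float()
    return torch.outer(t, inv_freq)  # [max_seq_len, dim // 2]

# QAT on a GPU machine can pass device="cuda"; export keeps "cpu" (or "meta").
freqs = precompute_rope_freqs(dim=64, max_seq_len=128, device="cpu")
```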

Differential Revision: D82239525

@pytorch-bot (bot) commented Sep 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14619

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit d369723 with merge base deb42f2:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Sep 26, 2025
@facebook-github-bot commented:
@navsud has exported this pull request. If you are a Meta employee, you can view the originating diff in D82239525.

Review thread on this change:

```diff
 # Within the device="meta" context, tensors that are created do not carry data.
 # They possess all other metadata a tensor carries such as size, stride, requires_grad.
-with torch.device("meta"):
+with torch.device("cpu"):
```
@lucylq (Contributor) commented:

Hi @navsud, could you provide some more context on this?

Update: This was failing multiple export tests, as the Llama2Model (in llama/model.py) was instantiating the transformer on the "meta" device, which required the rope params to be explicitly instantiated on the "cpu" device. Changed "meta" to "cpu" to fix this issue.

Using torch.device("meta") provides some memory benefits during export because we do not load all the tensors into memory until necessary.

@navsud (Contributor Author) commented:

Full context: we use the same rope implementation for QAT of the HTP model via the pt2e flow. But since "cpu" was hardcoded for rope, it was messy to move to GPU (especially for multi-GPU training), so I removed the "device" argument from the rope implementation.
However, that caused failures in export: during export, models are built on the "meta" device, and the weights are then "assigned" during checkpoint load. The rope freq params are not present in the checkpoint, so they are never loaded, which makes the export fail.
So, I'm creating the model directly on "cpu"; the rope freq params are then also created on the CPU, which avoids hardcoding "cpu" just for rope.
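A toy reproduction of that failure mode, assuming the rope freqs live in a non-persistent buffer (hypothetical module; the real model code differs): weights loaded with assign=True get real storage, while anything absent from the checkpoint stays on "meta".

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)
        # Rope-style derived buffer: computed at init, typically not saved in checkpoints.
        self.register_buffer("freqs", torch.arange(4).float(), persistent=False)

with torch.device("meta"):
    m = Block()  # all params and buffers are data-free meta tensors

ckpt = {"linear.weight": torch.randn(8, 8), "linear.bias": torch.randn(8)}
m.load_state_dict(ckpt, assign=True)  # checkpoint tensors are "assigned" in place
print(m.linear.weight.device)  # cpu: loaded from the checkpoint
print(m.freqs.device)          # meta: never materialized, so any use of it fails
```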

@navsud (Contributor Author) commented:

@lucylq

> Using torch.device("meta") provides some memory benefits during export because we do not load all the tensors into memory until necessary.

Well, it is really a very short-term benefit in the export flow: right after the model is built with "meta", we either (a) load the weights from a checkpoint, at which point the model takes memory, or (b) initialize the model weights with zeros, which also takes memory.

@lucylq (Contributor) commented:

Thanks for the context @navsud.

Noob q: if the device is set to cpu when creating the model, does this cause issues when moving to GPU later?

I'm concerned we're regressing export performance; do you see memory/latency regressions at export with this change? FWIW, we were able to see significant savings when using device=meta: D54871495

@navsud (Contributor Author) commented:

@lucylq
Thanks for the context around the memory/latency savings with export: D54871495.
So I'm left with creating the model with the device set to "cpu" and then moving it to GPU later. I will try that out and revise this PR accordingly.

@navsud (Contributor Author) commented:

@lucylq
Updated to an implementation that doesn't disturb the way export works (so all the memory/latency benefits of device=meta are preserved) but still works for my use case.
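The final diff isn't quoted inline here, but one pattern that satisfies both constraints is to derive the rope params from the activation's device at call time instead of fixing a device at construction; a sketch of that idea (hypothetical, not necessarily this PR's exact fix):

```python
import torch

class Rope(torch.nn.Module):
    def __init__(self, dim: int, theta: float = 10000.0):
        super().__init__()
        self.dim, self.theta = dim, theta

    def freqs_for(self, x: torch.Tensor, seq_len: int) -> torch.Tensor:
        # Follow the input's device: "meta" during export, "cuda" during QAT on GPU.
        half = torch.arange(0, self.dim, 2, device=x.device).float()
        inv_freq = 1.0 / (self.theta ** (half / self.dim))
        t = torch.arange(seq_len, device=x.device).float()
        return torch.outer(t, inv_freq)  # [seq_len, dim // 2]
```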

@lucylq (Contributor) commented:

Thanks for the changes @navsud, appreciate it.

@navsud (Contributor Author) commented:

@lucylq
Had to make another revision as some other tests were failing. Please do another quick round of review. Thanks.

@navsud added the release notes: none label Sep 29, 2025
@facebook-github-bot commented:
@navsud has exported this pull request. If you are a Meta employee, you can view the originating Diff in D82239525.

navsud added a commit to navsud/executorch that referenced this pull request Oct 2, 2025

Summary:
Pull Request resolved: pytorch#14619

As part of enabling QAT for the HTP model, we need to run QAT on the same model that we use during export. Currently, RoPE is explicitly hardcoded to "cpu". This change enables us to create the rope params on "cuda" when running on a GPU machine.

Differential Revision: D82239525
navsud added a commit to navsud/executorch that referenced this pull request Oct 2, 2025

Summary:
As part of enabling QAT for the HTP model, we need to run QAT on the same model that we use during export. Currently, RoPE is explicitly hardcoded to "cpu". This change enables us to switch between "cpu" and "cuda" based on the use case.

Reviewed By: billmguo

Differential Revision: D82239525
@facebook-github-bot merged commit c997fe4 into pytorch:main Oct 3, 2025
130 of 133 checks passed