Conversation

@navsud (Contributor) commented Sep 26, 2025

Summary:
As part of enabling QAT for the HTP model, we need to run QAT on the same model that we use during export. Currently, RoPE is explicitly hardcoded to "cpu". This change enables us to create the rope params on "cuda" when running on a GPU machine.
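For illustration, a minimal sketch of the device flexibility this change is after, assuming a standalone frequency helper (hypothetical names, not the actual executorch rope code):

```python
import torch

def precompute_rope_freqs(dim: int, max_seq_len: int, theta: float = 10000.0,
                          device=None) -> torch.Tensor:
    # Compute inverse frequencies on the requested device instead of a hardcoded "cpu".
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, device=device).float() / dim))
    t = torch.arange(max_seq_len, device=device).float()
    return torch.outer(t, inv_freq)  # [max_seq_len, dim // 2]

# QAT on a GPU machine can pass device="cuda"; export keeps "cpu" (or "meta").
freqs = precompute_rope_freqs(dim=64, max_seq_len=128, device="cpu")
```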

Differential Revision: D82239525

@pytorch-bot (bot) commented Sep 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14619

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit d369723 with merge base deb42f2:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Sep 26, 2025
@facebook-github-bot commented:
@navsud has exported this pull request. If you are a Meta employee, you can view the originating diff in D82239525.

Review thread on this change:

```diff
 # Within the device="meta" context, tensors that are created do not carry data.
 # They possess all other metadata a tensor carries such as size, stride, requires_grad.
-with torch.device("meta"):
+with torch.device("cpu"):
```
@lucylq (Contributor) commented:

Hi @navsud, could you provide some more context on this?

Update: This was failing multiple export tests, as the Llama2Model (in llama/model.py) was instantiating the transformer on the "meta" device, which required the rope params to be explicitly instantiated on the "cpu" device. Changed "meta" to "cpu" to fix this issue.

Using torch.device("meta") provides some memory benefits during export because we do not load all the tensors into memory until necessary.

@navsud (Contributor Author) commented:

Full context: we use the same rope implementation for QAT of the HTP model via the pt2e flow. But since "cpu" was hardcoded for rope, it was messy to move to GPU (especially for multi-GPU training), so I removed the "device" argument from the rope implementation.
However, that caused failures in export: during export, models are built on the "meta" device, and the weights are then "assigned" during checkpoint load. The rope freq params are not present in the checkpoint, so they are never loaded, which makes the export fail.
So, I'm creating the model directly on "cpu"; the rope freq params are then also created on the CPU, which avoids hardcoding "cpu" just for rope.
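A toy reproduction of that failure mode, assuming the rope freqs live in a non-persistent buffer (hypothetical module; the real model code differs): weights loaded with assign=True get real storage, while anything absent from the checkpoint stays on "meta".

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)
        # Rope-style derived buffer: computed at init, typically not saved in checkpoints.
        self.register_buffer("freqs", torch.arange(4).float(), persistent=False)

with torch.device("meta"):
    m = Block()  # all params and buffers are data-free meta tensors

ckpt = {"linear.weight": torch.randn(8, 8), "linear.bias": torch.randn(8)}
m.load_state_dict(ckpt, assign=True)  # checkpoint tensors are "assigned" in place
print(m.linear.weight.device)  # cpu: loaded from the checkpoint
print(m.freqs.device)          # meta: never materialized, so any use of it fails
```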

@navsud (Contributor Author) commented:

@lucylq

> Using torch.device("meta") provides some memory benefits during export because we do not load all the tensors into memory until necessary.

Well, it is really a very short-term benefit in the export flow: right after the model is built with "meta", we either (a) load the weights from a checkpoint, at which point the model takes memory, or (b) initialize the model weights with zeros, which also takes memory.

@lucylq (Contributor) commented:

Thanks for the context @navsud.

Noob q: if the device is set to cpu when creating the model, does this cause issues when moving to GPU later?

I'm concerned we're regressing export performance; do you see memory/latency regressions at export with this change? FWIW, we were able to see significant savings when using device=meta: D54871495

@navsud (Contributor Author) commented:

@lucylq
Thanks for the context around the memory/latency savings with export: D54871495.
So I'm left with creating the model with the device set to "cpu" and then moving it to GPU later. I will try that out and revise this PR accordingly.

@navsud (Contributor Author) commented:

@lucylq
Updated to an implementation that doesn't disturb the way export works (so all the memory/latency benefits of device=meta are preserved) but still works for my use case.
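The final diff isn't quoted inline here, but one pattern that satisfies both constraints is to derive the rope params from the activation's device at call time instead of fixing a device at construction; a sketch of that idea (hypothetical, not necessarily this PR's exact fix):

```python
import torch

class Rope(torch.nn.Module):
    def __init__(self, dim: int, theta: float = 10000.0):
        super().__init__()
        self.dim, self.theta = dim, theta

    def freqs_for(self, x: torch.Tensor, seq_len: int) -> torch.Tensor:
        # Follow the input's device: "meta" during export, "cuda" during QAT on GPU.
        half = torch.arange(0, self.dim, 2, device=x.device).float()
        inv_freq = 1.0 / (self.theta ** (half / self.dim))
        t = torch.arange(seq_len, device=x.device).float()
        return torch.outer(t, inv_freq)  # [seq_len, dim // 2]
```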

@lucylq (Contributor) commented:

Thanks for the changes @navsud, appreciate it.

@navsud (Contributor Author) commented:

@lucylq
Had to make another revision as some other tests were failing. Please do another quick round of review. Thanks.

@navsud added the release notes: none label Sep 29, 2025
@facebook-github-bot commented:
@navsud has exported this pull request. If you are a Meta employee, you can view the originating Diff in D82239525.

navsud added a commit to navsud/executorch that referenced this pull request Oct 2, 2025

Summary:
Pull Request resolved: pytorch#14619

As part of enabling QAT for the HTP model, we need to run QAT on the same model that we use during export. Currently, RoPE is explicitly hardcoded to "cpu". This change enables us to create the rope params on "cuda" when running on a GPU machine.

Differential Revision: D82239525
navsud added a commit to navsud/executorch that referenced this pull request Oct 2, 2025

Summary:
As part of enabling QAT for the HTP model, we need to run QAT on the same model that we use during export. Currently, RoPE is explicitly hardcoded to "cpu". This change enables us to switch between "cpu" and "cuda" based on the use case.

Reviewed By: billmguo

Differential Revision: D82239525
@facebook-github-bot merged commit c997fe4 into pytorch:main Oct 3, 2025
130 of 133 checks passed