[ET-VK] Allocate memory for weight and activation tensors lazily #13500

pytorchbot · 2025-08-19T02:24:34Z

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #13474 by @SS-JIA
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/292/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/292/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/291/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/292/orig
@diff-train-skip-merge

Pull Request resolved: #13474

* Allocate memory for weight tensors right before the prepacking shader is dispatched, rather than while building the graph
* Move allocation of shared objects (i.e. memory for intermediate tensors) to occur after prepacking

## Motivation

Prevent screen blackout (Llama 3.2 1B) / device crash (Llama 3.2 3B) when running Llama 3.2 models on Samsung Galaxy S24. This behaviour is related to high peak memory usage when loading the model.

## Full Context

During model loading, Vulkan delegate needs to store 3 copies of constant data in memory at various points:

* source data obtained from loading the model
* staging buffer
* GPU texture/buffer

The general rationale of this change is to allocate memory for each copy only when necessary to minimize the "overlap" when all 3 exist at once.

### Current Order of operations

Legend:

* `W` represents total weight nbytes
* `w` represents weight nbytes for one tensor
* `A` represents total activations nbytes
* `M` represents approximation of total memory footprint

First, model file is loaded

Then, when building compute graph, for each weight tensor:

1. Weight data is loaded from NamedDataMap (`M = W`)
2. GPU texture/buffer for weight is initialized + memory allocated (`M = 2W`)
3. After building the graph, `graph->prepare()` is called which currently allocates memory for the activation tensors as well (`M = 2W + A`)

Then, during the prepacking stage for each weight tensor, each weight tensor is copied individually:

1. Staging buffer initialized (`M = 2W + A + w`)
2. Copy CPU weight data to staging + CPU Weight data is freed (`M = 2W + A`)
3. Compute shader dispatch to copy staging to GPU texture/buffer + free staging buffer (`M = 2W + A - w`)

The peak usage in mainline will be `M = 2W + A + w`

### Revised order of operations

This change revises the order of operations:

1. Weight data is loaded from NamedDataMap (`M = W`)
2. GPU texture/buffer for weight is initialized, but **memory is not allocated** (`M = W`)

Then, during the prepacking stage for each weight tensor, each weight tensor is copied individually:

1. Staging buffer initialized (`M = W + w`)
2. **Memory allocated for GPU texture/buffer** (`M = W + 2w`)
3. Copy CPU weight data to staging + CPU Weight data is freed (`M = W + w`)
4. Compute shader dispatch to copy staging to GPU texture/buffer + free staging buffer (`M = W`)

**Then, after all prepacking operations complete, only then is Activation memory allocated** (`M = W + A`)

Under this scheme, peak memory is reduced to `M = W + A` (or alternatively `M = W + 2w` if `2w > A`) which is (or at least very close to) the theoretical minimum.

ghstack-source-id: 303862303

Differential Revision: [D80460033](https://our.internmc.facebook.com/intern/diff/D80460033/)
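The memory accounting for the two orderings described above can be checked with a small simulation. This is only a rough sketch, not code from the Vulkan delegate: the function names `peak_current` and `peak_revised` are hypothetical, sizes are abstract byte counts, and the model ignores the model file itself, allocator overhead, and alignment. It assumes, as the description does, that each tensor's CPU data and staging buffer are released once that tensor has been prepacked.

```python
from typing import List


def peak_current(weights: List[int], activations: int) -> int:
    """Peak footprint when GPU memory is allocated while building the graph."""
    m = 0
    peak = 0
    # Graph build: for every weight, hold the CPU copy and allocate GPU memory.
    for w in weights:
        m += w  # CPU weight data loaded from the data map
        m += w  # GPU texture/buffer memory allocated eagerly
        peak = max(peak, m)
    # Activation memory is allocated right after the graph is built.
    m += activations
    peak = max(peak, m)
    # Prepacking: one staging buffer per weight, released after the copy.
    for w in weights:
        m += w  # staging buffer
        peak = max(peak, m)
        m -= w  # CPU weight data freed after the copy to staging
        m -= w  # staging buffer freed after the shader dispatch
    return peak


def peak_revised(weights: List[int], activations: int) -> int:
    """Peak footprint when GPU and activation memory are allocated lazily."""
    m = sum(weights)  # only the CPU copies exist after graph build
    peak = m
    for w in weights:
        m += w  # staging buffer
        m += w  # GPU memory allocated just before the prepack dispatch
        peak = max(peak, m)
        m -= w  # CPU weight data freed
        m -= w  # staging buffer freed
    # Activation memory is allocated only after all prepacking is done.
    m += activations
    peak = max(peak, m)
    return peak


if __name__ == "__main__":
    sizes = [4, 2, 6]  # illustrative per-tensor weight sizes
    print(peak_current(sizes, activations=5))
    print(peak_revised(sizes, activations=5))
```

In this simplified model the original ordering peaks at `2W + A` plus the size of the first prepacked tensor, while the revised ordering peaks at the larger of `W + A` and `W + 2w` for the largest tensor, in line with the figures stated in the description.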

pytorch-bot · 2025-08-19T02:24:38Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13500

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 6c18621 with merge base 5ff0208:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-binary-size-linux-gcc / linux-job (gh) (trunk failure)
/pytorch/executorch/kernels/portable/cpu/op_stack.cpp:129:26: error: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘ssize_t’ {aka ‘long int’} [-Werror=sign-compare]

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorchbot requested a review from SS-JIA as a code owner August 19, 2025 02:24

meta-cla bot added the CLA Signed label Aug 19, 2025

An error occurred while trying to automatically change base from gh/SS-JIA/291/orig to main August 19, 2025 03:06

SS-JIA deleted the branch gh/SS-JIA/291/orig October 15, 2025 18:00

SS-JIA closed this Oct 15, 2025

SS-JIA deleted the gh/SS-JIA/292/orig branch October 15, 2025 18:00
