[CUDA]: GPU Device Caching for Encoder Output in CUDA Backend #16060
Conversation
Summary: In encoder-decoder models like Whisper, the encoder output tensor is fed to every decoder iteration, which currently incurs unnecessary CPU->GPU->CPU->GPU copies. This PR implements a "keep on device" caching mechanism in the CUDA backend that:
- Caches the encoder output in persistent GPU memory after the encoder runs
- Uses fast GPU-to-GPU copies during decoder iterations instead of slow CPU-to-GPU copies
CI status (Dr. CI): 1 new failure as of commit 137e6da with merge base 33ec615. Artifacts and rendered test results: hud.pytorch.org/pr/pytorch/executorch/16060
Pull request overview
This PR implements GPU device caching for encoder output in the CUDA backend to optimize ASR (Automatic Speech Recognition) model inference. The caching mechanism avoids redundant CPU-to-GPU memory copies of encoder output during decoder iterations by keeping the encoder output on the GPU and using fast GPU-to-GPU copies instead.
Key Changes:
- Added a global GPU tensor cache (g_gpu_tensors) to store encoder outputs on the GPU across multiple execute() calls
- Implemented backend options API (cache_output, use_cache_input, clear_cache_input) to control caching behavior
- Modified the ASR runner to set caching options before encoder execution and reuse cached tensors during the decoder loop (see the sketch after this list)
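The bullets above describe a runner-side control flow. The sketch below is a minimal illustration of that flow, not code from the PR: the helper set_cuda_backend_option() and the encoder/decoder stubs are hypothetical stand-ins, and only the option names (cache_output, use_cache_input, clear_cache_input) come from this review.

```cpp
// Minimal sketch of the runner-side flow described above (not the PR's code).
// set_cuda_backend_option() and the encoder/decoder stubs are hypothetical;
// only the option names come from the review summary.
#include <iostream>
#include <string>

// Hypothetical: forwards one key/value option to the CUDA backend.
void set_cuda_backend_option(const std::string& key, const std::string& value) {
  std::cout << "set_option(" << key << ", " << value << ")\n";
}
void run_encoder() { /* lowered encoder method executes here */ }
bool run_decoder_step() { /* lowered decoder method executes here */ return false; }

int main() {
  // 1. Ask the backend to keep the encoder output resident on the GPU.
  set_cuda_backend_option("cache_output", "encoder_output");
  run_encoder();

  // 2. Decoder iterations reuse the cached tensor via fast D2D copies.
  set_cuda_backend_option("use_cache_input", "encoder_output");
  while (run_decoder_step()) {
  }

  // 3. Clear the cached tensor and reset the option after decoding.
  set_cuda_backend_option("clear_cache_input", "encoder_output");
  return 0;
}
```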
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| backends/cuda/runtime/cuda_backend.cpp | Implements GPU tensor caching infrastructure with a set_option API, memory management for cached tensors, and GPU-to-GPU copy logic for cached inputs (see the sketch below) |
| extension/asr/runner/runner.cpp | Adds cache control flow: sets cache_output before encoder execution, sets use_cache_input before the decoder loop, and clears the settings after completion |
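As a rough illustration of what the caching infrastructure in cuda_backend.cpp could look like: only the name g_gpu_tensors appears in the PR, while the struct, function names, and reduced error handling below are invented for this sketch.

```cpp
// Rough sketch of a persistent GPU tensor cache (not the PR's actual code).
// Only the name g_gpu_tensors comes from the PR; everything else is invented
// for illustration, with error handling reduced to a bool.
#include <cuda_runtime.h>
#include <string>
#include <unordered_map>

namespace {

struct CachedGpuTensor {
  void* device_ptr = nullptr;  // persistent device allocation
  size_t nbytes = 0;           // size of the cached payload
};

// Keyed by a caller-chosen name (e.g. "encoder_output"); survives across execute() calls.
std::unordered_map<std::string, CachedGpuTensor> g_gpu_tensors;

// Store a copy of a just-produced GPU output under `name` (device-to-device copy).
bool cache_gpu_output(const std::string& name, const void* src_dev, size_t nbytes) {
  CachedGpuTensor& slot = g_gpu_tensors[name];
  if (slot.nbytes != nbytes) {
    cudaFree(slot.device_ptr);  // no-op for a fresh (nullptr) entry
    if (cudaMalloc(&slot.device_ptr, nbytes) != cudaSuccess) {
      slot = {};
      return false;
    }
    slot.nbytes = nbytes;
  }
  return cudaMemcpy(slot.device_ptr, src_dev, nbytes,
                    cudaMemcpyDeviceToDevice) == cudaSuccess;
}

// Release one cached tensor, e.g. when the runner clears the cache.
void clear_cached_tensor(const std::string& name) {
  auto it = g_gpu_tensors.find(name);
  if (it != g_gpu_tensors.end()) {
    cudaFree(it->second.device_ptr);
    g_gpu_tensors.erase(it);
  }
}

}  // namespace
```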
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
```cpp
// This backend supports storing GPU tensors between execute() calls to enable
// device-to-device (D2D) copies instead of slower host-to-device (H2D)
```
I'm curious why we still need to copy. Can you just make_tensor using the GPU data pointer?
Yeah, I tried not copying initially and it was segfaulting. Because they're two completely different graphs, the output of the first graph and the input of the second graph had different underlying layout assumptions, so I had to copy explicitly.
```cpp
// TYPICAL USAGE PATTERN (encoder-decoder model):
//
// 1. Before encoder: set_option("store_output", "encoder_output")
// 2. Execute encoder (output is stored on GPU)
// 3. Before decoder loop: set_option("use_stored_input", "encoder_output")
// 4. Execute decoder N times (D2D copies for encoder output input)
// 5. After decoder loop:
//      set_option("reset_stored_input", true)
//      set_option("clear_stored_tensor", "encoder_output")
//
// ============================================================================
```
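For context, here is a simplified sketch of how a set_option handler could dispatch on the keys named in the comment above. The plain (key, value) interface and the state variables are assumptions made for this sketch; the backend's real set_option signature is not shown in this excerpt.

```cpp
// Simplified sketch of option dispatch for the keys in the comment above.
// The (key, value) interface and the state variables are assumptions, not
// the backend's real set_option signature.
#include <string>

namespace {

std::string g_store_output_name;      // set by "store_output"
std::string g_use_stored_input_name;  // set by "use_stored_input"

// Stub: the real backend would free the persistent GPU allocation here.
void clear_cached_tensor(const std::string& /*name*/) {}

bool handle_option(const std::string& key, const std::string& value) {
  if (key == "store_output") {
    g_store_output_name = value;       // next execute() keeps this output on the GPU
  } else if (key == "use_stored_input") {
    g_use_stored_input_name = value;   // subsequent execute() calls D2D-copy this input
  } else if (key == "reset_stored_input") {
    g_use_stored_input_name.clear();   // fall back to normal H2D copies
  } else if (key == "clear_stored_tensor") {
    clear_cached_tensor(value);        // drop the persistent GPU allocation
  } else {
    return false;  // unknown option
  }
  return true;
}

}  // namespace
```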
Trying to understand the intention: is it trying to use a backend option to let the encode/decode methods share the output memory? In an ideal world, if the encode/decode methods could share memory planning, does that mean we wouldn't have to use this?
> is it trying to use a backend option to let the encode/decode methods share the output memory?

It's trying to avoid CPU->GPU copies. If we had a device tensor we wouldn't need this, but that's WIP and perf here is time sensitive, so Mergen is hacking around it until it's properly fixed upstream.
Ah I see, that seems fine to me. Maybe worth adding this as part of the comment, because I can't tell from the PR.
Summary
This PR implements a "keep on device" optimization for the CUDA backend that stores encoder output tensors on the GPU and reuses them via fast device-to-device (D2D) copies during decoder iterations. This avoids redundant CPU→GPU transfers in encoder-decoder models like Whisper.
Motivation
In encoder-decoder architectures like Whisper, the encoder runs once and produces an output tensor that the decoder consumes on every iteration. Without this optimization, the flow is:
Encoder: CPU input → [H2D copy] → GPU compute → [D2H copy] → CPU output
Decoder (×N tokens): CPU inputs (including encoder output) → [H2D copy] → GPU compute → [D2H copy] → CPU output
The encoder output (~2.3 MB for Whisper) is copied from CPU→GPU on every decoder iteration, even though it never changes. With N=109 tokens, that's 109 redundant H2D copies.
Design
The optimization introduces a simple GPU tensor storage mechanism:
Encoder: CPU input → [H2D copy] → GPU compute → [store on GPU] → [D2H copy] → CPU output
Decoder (×N tokens): CPU inputs → [D2D copy for encoder output, H2D for others] → GPU compute → [D2H copy] → CPU output
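A rough sketch of the per-input copy decision described by the flow above. The state (g_gpu_tensors, g_use_stored_input_name) repeats the placeholder names used in the earlier sketches, and matching inputs by name with a flat byte view of each tensor is a simplifying assumption rather than the PR's actual execute() code.

```cpp
// Rough sketch of the per-input staging decision in execute() (not the PR's
// code): D2D copy for the cached encoder output, H2D copy for everything else.
#include <cuda_runtime.h>
#include <string>
#include <unordered_map>

struct CachedGpuTensor {
  void* device_ptr = nullptr;
  size_t nbytes = 0;
};

// Placeholders for state assumed to live in cuda_backend.cpp.
std::unordered_map<std::string, CachedGpuTensor> g_gpu_tensors;
std::string g_use_stored_input_name;

// Copy one input into its GPU staging buffer before launching the graph.
cudaError_t stage_input(const std::string& input_name,
                        const void* host_src,
                        void* device_dst,
                        size_t nbytes) {
  auto it = g_gpu_tensors.find(input_name);
  if (input_name == g_use_stored_input_name &&
      it != g_gpu_tensors.end() && it->second.nbytes == nbytes) {
    // Fast path: the encoder output already lives on the GPU, so a
    // device-to-device copy feeds the decoder's input buffer.
    return cudaMemcpy(device_dst, it->second.device_ptr, nbytes,
                      cudaMemcpyDeviceToDevice);
  }
  // Default path: ordinary host-to-device copy for all other inputs.
  return cudaMemcpy(device_dst, host_src, nbytes, cudaMemcpyHostToDevice);
}
```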
Key Design Decisions
API
Three backend options control the behavior:
Lifecycle
Changes
backends/cuda/runtime/cuda_backend.cpp (+212 lines)
extension/asr/runner/runner.cpp (+42 lines)
Performance
Profiling with nsys confirms the optimization works:
The 109 D2D copies (one per decoder token) each transfer 2.304 MB (the encoder output) and are ~9x faster than the equivalent H2D copies would be.
Test plan