Conversation

@mergennachin
Contributor

@mergennachin mergennachin commented Dec 3, 2025

Summary

This PR implements a "keep on device" optimization for the CUDA backend that stores encoder output tensors on the GPU and reuses them via fast device-to-device (D2D) copies
during decoder iterations. This avoids redundant CPU→GPU transfers in encoder-decoder models like Whisper.

Motivation

In encoder-decoder architectures like Whisper, the encoder runs once and produces an output tensor that the decoder consumes on every iteration. Without this optimization,
the flow is:

Encoder: CPU input → [H2D copy] → GPU compute → [D2H copy] → CPU output
Decoder (×N tokens): CPU inputs (including encoder output) → [H2D copy] → GPU compute → [D2H copy] → CPU output

The encoder output (~2.3 MB for Whisper) is copied from CPU→GPU on every decoder iteration, even though it never changes. With N=109 tokens, that's 109 redundant H2D copies.

Design

The optimization introduces a simple GPU tensor storage mechanism:

Encoder: CPU input → [H2D copy] → GPU compute → [store on GPU] → [D2H copy] → CPU output
Decoder (×N tokens): CPU inputs → [D2D copy for encoder output, H2D for others] → GPU compute → [D2H copy] → CPU output

Key Design Decisions

  1. Name-based storage: Tensors are stored by name (e.g., "encoder_output") rather than by slot index, making the API more intuitive and less fragile.
  2. Size-based matching: When looking for a stored tensor to use as input, the backend matches by tensor size rather than requiring explicit slot mapping.
  3. Explicit opt-in: The optimization is controlled via set_option() calls, making it backwards compatible and non-intrusive to existing code paths.
  4. RAII cleanup: A TensorCleanup guard ensures GPU tensors are freed on all exit paths, preventing memory leaks when errors occur during execution.
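
To make these decisions concrete, here is a minimal sketch of what the storage struct and RAII guard could look like. The names GpuTensorRef, TensorCleanup, and gpu_tensors_ come from the change list below; every other detail (fields, layout) is an assumption, not the actual code in cuda_backend.cpp.

```cpp
#include <cuda_runtime.h>

#include <string>
#include <unordered_map>

// Hypothetical sketch only; field layout is assumed.
struct GpuTensorRef {
  void* device_ptr = nullptr;  // owning pointer to cudaMalloc'd device memory
  size_t nbytes = 0;           // byte size, used for size-based input matching
};

// Named storage, e.g. gpu_tensors_["encoder_output"].
std::unordered_map<std::string, GpuTensorRef> gpu_tensors_;

// RAII guard: frees a freshly allocated device buffer on every exit path
// unless ownership has been handed over to gpu_tensors_.
class TensorCleanup {
 public:
  explicit TensorCleanup(void* ptr) : ptr_(ptr) {}
  ~TensorCleanup() {
    if (ptr_ != nullptr) {
      cudaFree(ptr_);  // runs on error/early-return paths
    }
  }
  void release() { ptr_ = nullptr; }  // call once the map owns the pointer

 private:
  void* ptr_;
};
```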

API

Three backend options control the behavior:

Option              Type    Description
store_output        string  Store the first output tensor under this name
use_stored_input    string  Use the stored tensor for inputs matching by size
reset_stored_input  bool    Clear the input setting (the tensor remains in GPU memory)
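
A simplified sketch of how these options might be recorded inside the backend follows. The helper name, the member variables pending_store_name_ and active_input_name_, and the exact signature are assumptions; the real set_option() plumbing and BackendOption types are omitted.

```cpp
#include <executorch/runtime/core/error.h>

#include <string>

using executorch::runtime::Error;

// Assumed member-style state, shown as statics so the sketch is self-contained.
static std::string pending_store_name_;  // name to store the next output under
static std::string active_input_name_;   // name of the stored tensor to reuse as input

Error handle_backend_option(
    const std::string& key,
    const std::string& str_value,
    bool bool_value) {
  if (key == "store_output") {
    // The next execute() copies its (single) output into
    // gpu_tensors_[str_value] before the usual D2H copy.
    pending_store_name_ = str_value;
    return Error::Ok;
  }
  if (key == "use_stored_input") {
    // Subsequent execute() calls satisfy any input whose byte size matches
    // gpu_tensors_[str_value] with a D2D copy instead of an H2D copy.
    active_input_name_ = str_value;
    return Error::Ok;
  }
  if (key == "reset_stored_input") {
    if (bool_value) {
      active_input_name_.clear();  // the stored tensor stays resident on the GPU
    }
    return Error::Ok;
  }
  return Error::InvalidArgument;  // unknown option
}
```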

Lifecycle

  1. Before encoder: set_option("store_output", "encoder_output")
  2. Run encoder: output tensor stored on GPU
  3. Before decoder loop: set_option("use_stored_input", "encoder_output")
  4. Run decoder ×N: each iteration uses D2D copy for encoder output
  5. After decoder loop: set_option("reset_stored_input", true)
  6. On destroy(): all stored tensors freed
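
Seen from the runner's side, the same lifecycle might look roughly like the sketch below. All names here (the set-option helpers, the Encoder/Decoder/Audio types) are hypothetical stand-ins; the actual runner.cpp changes wire this through the real ExecuTorch module APIs.

```cpp
#include <string>

// Hypothetical wrappers around whatever mechanism the runner uses to reach
// the CUDA backend's set_option().
void cuda_backend_set_option(const std::string& key, const std::string& value);
void cuda_backend_set_bool_option(const std::string& key, bool value);

template <typename Encoder, typename Decoder, typename Audio>
void transcribe(Encoder& encoder, Decoder& decoder, const Audio& audio) {
  // 1. Ask the backend to keep the encoder's output resident on the GPU.
  cuda_backend_set_option("store_output", "encoder_output");

  // 2. Run the encoder once; its output is stored on device and still
  //    copied back to the CPU as usual.
  auto encoder_output = encoder.run(audio);

  // 3. Redirect size-matching decoder inputs to the stored tensor (D2D copy).
  cuda_backend_set_option("use_stored_input", "encoder_output");

  // 4. Decode token by token; each step reuses the on-device encoder output.
  while (!decoder.done()) {
    decoder.step(encoder_output);
  }

  // 5. Stop redirecting inputs; the stored tensor itself is freed later in
  //    the backend's destroy().
  cuda_backend_set_bool_option("reset_stored_input", true);
}
```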

Changes

backends/cuda/runtime/cuda_backend.cpp (+212 lines)

  • Added GpuTensorRef struct to hold GPU tensor references with ownership
  • Added gpu_tensors_ map for named tensor storage with documented lifetime contract
  • Implemented set_option() with validation for option types
  • Added RAII TensorCleanup guard to prevent memory leaks on error paths
  • Added validation for single-output constraint
  • Cleanup of stored tensors in destroy()
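
As an illustration of how the size-based matching and D2D path could fit together at execute() time, here is a self-contained sketch. It reuses the hypothetical names from the earlier sketch (gpu_tensors_, active_input_name_) and omits the real AOTI/ExecuTorch tensor handling; it is not the code added in this PR.

```cpp
#include <cuda_runtime.h>

#include <string>
#include <unordered_map>

// Same hypothetical struct as in the earlier sketch.
struct GpuTensorRef {
  void* device_ptr = nullptr;
  size_t nbytes = 0;
};
extern std::unordered_map<std::string, GpuTensorRef> gpu_tensors_;
extern std::string active_input_name_;

// Fill one device-side input buffer, preferring a D2D copy from the stored
// tensor when the input's byte size matches it (size-based matching).
cudaError_t fill_device_input(
    void* device_dst,
    const void* host_src,
    size_t nbytes) {
  if (!active_input_name_.empty()) {
    auto it = gpu_tensors_.find(active_input_name_);
    if (it != gpu_tensors_.end() && it->second.nbytes == nbytes) {
      // Device-to-device copy of the stored encoder output.
      return cudaMemcpy(
          device_dst, it->second.device_ptr, nbytes, cudaMemcpyDeviceToDevice);
    }
  }
  // All other inputs take the normal host-to-device path.
  return cudaMemcpy(device_dst, host_src, nbytes, cudaMemcpyHostToDevice);
}
```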

extension/asr/runner/runner.cpp (+42 lines)

  • Set store_output before encoder execution
  • Set use_stored_input before decoder loop
  • Reset after decoder loop completes
  • Consistent error logging at Warning level

Performance

Profiling with nsys confirms the optimization works:

Operation   Count  Total Size  Avg Time
H2D copies  722    521.7 MB    125 µs
D2D copies  109    251.1 MB    13.6 µs

The 109 D2D copies (one per decoder token) each transfer the 2.304 MB encoder output in ~13.6 µs, roughly 9x faster than the equivalent H2D copy (~125 µs) would take.

Test plan

  • Build succeeds
  • Whisper transcription produces correct output
  • nsys profile confirms D2D copies are occurring
  • No memory leaks (RAII cleanup on all paths)

Summary:

In encoder-decoder models like Whisper, the encoder output tensor is
used as input to every decoder iteration, which otherwise incurs
unnecessary CPU->GPU->CPU->GPU round-trip copies.

Implemented a "keep on device" caching mechanism in the CUDA backend
that:

-  Caches the encoder output in persistent GPU memory after the encoder runs
-  Uses fast GPU-to-GPU copies during decoder iterations instead of slow CPU-to-GPU copies

@pytorch-bot

pytorch-bot bot commented Dec 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16060

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 137e6da with merge base 33ec615:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 3, 2025
@github-actions

github-actions bot commented Dec 3, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Contributor

Copilot AI left a comment

Pull request overview

This PR implements GPU device caching for encoder output in the CUDA backend to optimize ASR (Automatic Speech Recognition) model inference. The caching mechanism avoids redundant CPU-to-GPU memory copies of encoder output during decoder iterations by keeping the encoder output on the GPU and using fast GPU-to-GPU copies instead.

Key Changes:

  • Added a global GPU tensor cache (g_gpu_tensors) to store encoder outputs on the GPU across multiple execute() calls
  • Implemented backend options API (cache_output, use_cache_input, clear_cache_input) to control caching behavior
  • Modified the ASR runner to set caching options before encoder execution and reuse cached tensors during the decoder loop

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 16 comments.

File                                    Description
backends/cuda/runtime/cuda_backend.cpp  Implements GPU tensor caching infrastructure with the set_option API, memory management for cached tensors, and GPU-to-GPU copy logic for cached inputs
extension/asr/runner/runner.cpp         Adds cache control flow: sets cache_output before encoder execution, sets use_cache_input before the decoder loop, and clears settings after completion


@mergennachin mergennachin changed the title [WIP][CUDA]: GPU Device Caching for Encoder Output in CUDA Backend [CUDA]: GPU Device Caching for Encoder Output in CUDA Backend Dec 3, 2025
@mergennachin mergennachin requested review from Gasoonjia, JacobSzwejbka and larryliu0820 and removed request for Copilot December 3, 2025 14:31
Copilot AI left a comment

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Copilot AI left a comment

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

Copilot AI left a comment

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

Comment on lines +66 to +67
// This backend supports storing GPU tensors between execute() calls to enable
// device-to-device (D2D) copies instead of slower host-to-device (H2D)
Contributor

I'm curious, why do we still need to copy? Can you just make_tensor using the GPU data pointer?

Contributor Author

Yeah, I tried not copying initially and it was segfaulting. Because they're two completely different graphs, the output from the first graph and the input to the second graph had different underlying layout assumptions, so I had to copy explicitly.

Comment on lines +92 to +102
// TYPICAL USAGE PATTERN (encoder-decoder model):
//
// 1. Before encoder: set_option("store_output", "encoder_output")
// 2. Execute encoder (output is stored on GPU)
// 3. Before decoder loop: set_option("use_stored_input", "encoder_output")
// 4. Execute decoder N times (D2D copies for encoder output input)
// 5. After decoder loop:
// set_option("reset_stored_input", true)
// set_option("clear_stored_tensor", "encoder_output")
//
// ============================================================================
Contributor

Trying to understand the intention: is this using a backend option to let the encode/decode methods share the output memory? In an ideal world, if the encode/decode methods can share memory planning, does that mean we don't have to use this?

Contributor

@JacobSzwejbka JacobSzwejbka Dec 8, 2025

is this using a backend option to let the encode/decode methods share the output memory?

It's trying to avoid CPU->GPU copies. If we had a device tensor we wouldn't need this, but that's WIP and perf here is time-sensitive, so Mergen is hacking around it until it's properly fixed upstream.

Contributor

@cccclai cccclai Dec 8, 2025

Ah I see, that seems fine to me. Maybe worth adding this as part of the comment because I can't tell from the PR
