Custom op to update cache for torch.cond #15937

larryliu0820 · 2025-11-21T07:42:12Z

torch.cond does not allow its branches to alias or mutate their inputs. This PR adds 2 ops to support conditionally updating the KV cache:

executorch::alias: takes 2 tensors and returns the same 2 tensors.
executorch::update_cross_attn_cache: takes a tensor cache and a tensor value, and copies value into cache in place.

With these 2 ops, we can rewrite the model definition from:

if is_cross_attention and past_key_values and is_updated:
    # reuse k,v, cross_attentions
    key_states = past_key_values.layers[self.layer_idx].keys
    value_states = past_key_values.layers[self.layer_idx].values
else:
    key_states = self.k_proj(current_states).view(bsz, -1, self.num_heads, self.head_dim)
    value_states = self.v_proj(current_states).view(bsz, -1, self.num_heads, self.head_dim)
    key_states = key_states.transpose(1, 2).contiguous()
    value_states = value_states.transpose(1, 2).contiguous()
    if past_key_values is not None:
        # save all key/value_states to cache to be re-used for fast auto-regressive generation
        cache_position = cache_position if not is_cross_attention else None
        key_states, value_states = past_key_values.update(
            key_states, value_states, self.layer_idx, {"cache_position": cache_position}
        )

Into:

def use_cached_kv(
    cached_keys: Tensor,
    cached_values: Tensor,
    key_value_states: Tensor,
) -> tuple[Tensor, Tensor]:
    # Just reuse cached K/V
    return torch.ops.executorch.alias(cached_keys, cached_values)

def recompute_kv(
    cached_keys: Tensor,  # overwritten in place with the fresh keys
    cached_values: Tensor,  # overwritten in place with the fresh values
    key_value_states: Tensor,
) -> tuple[Tensor, Tensor]:
    # Compute fresh K/V (export-friendly: use custom op to mutate cache)
    key_states = self.k_proj(key_value_states).view(bsz, -1, self.num_heads, self.head_dim)
    value_states = self.v_proj(key_value_states).view(bsz, -1, self.num_heads, self.head_dim)
    key_states = key_states.transpose(1, 2).contiguous()
    value_states = value_states.transpose(1, 2).contiguous()
    k = torch.ops.executorch.update_cross_attn_cache(key_states, cached_keys)
    v = torch.ops.executorch.update_cross_attn_cache(value_states, cached_values)
    return k, v

if past_key_values is not None and self.layer_idx is not None:
    # Grab cached tensors (these are Tensors, so they are OK for export)
    cached_keys = past_key_values.layers[self.layer_idx].keys
    cached_values = past_key_values.layers[self.layer_idx].values

    # Tensor predicate: True if any element is non-zero
    # Result is a 0-dim bool tensor suitable for torch.cond
    cache_is_initialized = (cached_keys != 0).any()

    # Use torch.cond to select branch in a traceable way.
    # All operands must be (nested) tensors or simple Python values.
    key_states, value_states = torch.cond(
        cache_is_initialized,
        use_cached_kv,
        recompute_kv,
        operands=(cached_keys, cached_values, key_value_states),
    )


pytorch-bot · 2025-11-21T07:42:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15937

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 20 New Failures

As of commit 00e89d7 with merge base fee1b2d:

NEW FAILURES - The following jobs have failed:

pull / test-llama-runner-linux (bf16, custom, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
RuntimeError: Command docker exec -t 718da02962daaae84420d79e58f1455cd052349f5a804751282c210a779ee801 /exec failed with exit code 1
pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh)
RuntimeError: Command docker exec -t e030b6486cc584411920e5da7b7ebfc00e38fc34b8bf9b300f207c1daba2c69b /exec failed with exit code 1
pull / test-models-linux (resnet50, portable, linux.2xlarge) / linux-job (gh)
RuntimeError: Command docker exec -t 87b14ef0ae7706cbfb8c7a79ba0975d2df7477cabe8f684ddd0f90cfb7973ede /exec failed with exit code 1
pull / test-qnn-wheel-packages-linux (3.11) / linux-job (gh)
RuntimeError: Command docker exec -t d7cac78334cdf972e5b0dd83443355eaa689f8fa8faf3f6e6d6171c8d30d89a8 /exec failed with exit code 1
pull / test-static-llama-qnn-linux (stories_260k_bc) / linux-job (gh)
RuntimeError: Command docker exec -t b5f57ad4cd89b78ad6062c94fc2dfed2805c0fcb44131724b3db141af51d95f3 /exec failed with exit code 92
pull / unittest / linux / linux-job (gh)
extension/llm/modules/test/test_attention.py::AttentionTest::test_attention_torch_cond_export
pull / unittest / macos / macos-job (gh)
extension/llm/modules/test/test_attention.py::AttentionTest::test_attention_torch_cond_export
pull / unittest-arm-backend-with-no-fvp (test_pytest_models) / linux-job (gh)
RuntimeError: Command docker exec -t 2d14f56eadf74874e237a61d26a11d9a69685cd976df937271a6592e97e564ad /exec failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
extension/llm/modules/test/test_attention.py::AttentionTest::test_attention_torch_cond_export
pull / unittest-editable / macos / macos-job (gh)
extension/llm/custom_ops/test_update_cross_attn_cache.py::TestUpdateCrossAttnCache::test_update_cross_attn_cache_in_cond
Test CUDA Builds / check-all-cuda-builds (gh)
Process completed with exit code 1.
Test CUDA Builds / export-model-cuda-artifact (google, gemma-3-4b-it, quantized-int4-tile-packed) / linux-job (gh)
RuntimeError: Command docker exec -t ce5195bf2708b0556c7559101a79e0097cd790091d16889d40ac641654b0c4e1 /exec failed with exit code 1
Test CUDA Builds / export-model-cuda-artifact (mistralai, Voxtral-Mini-3B-2507, non-quantized) / linux-job (gh)
RuntimeError: Command docker exec -t 26a7fb3b727c76938ded7ea4d62423e417b66e3d42180d9e249c8155e3bec4de /exec failed with exit code 1
Test CUDA Builds / export-model-cuda-artifact (mistralai, Voxtral-Mini-3B-2507, quantized-int4-tile-packed) / linux-job (gh)
RuntimeError: Command docker exec -t 30e7253846b722a3e5d912a9483ccf9b0a1821e80bdd37e300de413d271f84df /exec failed with exit code 1
Test CUDA Builds / export-model-cuda-artifact (mistralai, Voxtral-Mini-3B-2507, quantized-int4-weight-only) / linux-job (gh)
RuntimeError: Command docker exec -t a1c59b55a0c6f67089fd0251f9cf8f6b9be7759e50ebdc6d06447f6a9535d5e3 /exec failed with exit code 1
Test CUDA Builds / export-model-cuda-artifact (openai, whisper-large-v3-turbo, quantized-int4-tile-packed) / linux-job (gh)
RuntimeError: Command docker exec -t 878000caa18a65ff51640a24d1b3393de606cdc9be7c1afc258543685ce5c0c1 /exec failed with exit code 1
Test CUDA Builds / export-model-cuda-artifact (openai, whisper-large-v3-turbo, quantized-int4-weight-only) / linux-job (gh)
RuntimeError: Command docker exec -t b193eb747e9c98d59b39046ab5c0e912a8ed5c11e5a710e487bc80dca9eed0b6 /exec failed with exit code 1
Test CUDA Builds / test-executorch-cuda-build-12.6 / linux-job (gh)
RuntimeError: Command docker exec -t 15acb6fd36acf86e29c9bc75847ad40dfd3919caa9b9d58b09ecded28130fc61 /exec failed with exit code 1
Test CUDA Builds / test-executorch-cuda-build-12.8 / linux-job (gh)
RuntimeError: Command docker exec -t 8548fd977681b954d3aeb3edf6e9d1bc616e700c3a36ef2cfa549971690905e9 /exec failed with exit code 1
Test CUDA Builds / test-models-cuda (linear) / linux-job (gh)
RuntimeError: Command docker exec -t dcea8572ee687709beaadec3e4da17cc1e12d1fadef0d56d27c868b3674f792d /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.


meta-cla bot added the "CLA Signed" label · Nov 21, 2025

larryliu0820 added the "release notes: desktop" label · Nov 21, 2025

larryliu0820 force-pushed the cache_custom_op branch from 5da9504 to 00e89d7 · November 21, 2025 18:08
