
@hanhanW
Contributor

@hanhanW hanhanW commented Jan 27, 2026

The K1 dimension (head_dim) in attention was unconditionally left untiled, which leads to large stack allocations when the dimension is dynamic.
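To illustrate why an untiled dynamic K1 is a problem, the sketch below computes the stack footprint of one buffer whose size scales with K1. As an example it uses a slice of Q with shape M_tile x K1. The helper names, the choice of buffer, and the tile sizes are hypothetical. The 32768-byte budget mirrors the `max_stack_allocation_size` shown in the target attributes of the Android failure log later in this thread. A static K1 gives a size that can be checked against the budget. A dynamic K1 (modeled as `std::nullopt`) has no compile-time bound, so the check cannot succeed.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper: bytes for an mTile x K1 slice kept on the stack when K1
// is left untiled. std::nullopt models a dynamic K1 (unknown at compile time).
std::optional<int64_t> sliceBytes(int64_t mTile, std::optional<int64_t> k1,
                                  int64_t elemBytes) {
  if (!k1) return std::nullopt;  // Dynamic K1: size is unknown until runtime.
  return mTile * *k1 * elemBytes;
}

// True only if the allocation is statically known to fit the stack budget.
bool fitsStackBudget(std::optional<int64_t> bytes, int64_t maxStackBytes) {
  return bytes.has_value() && *bytes <= maxStackBytes;
}
```

For example, with f32 elements and `mTile = 32`, a static K1 of 128 needs 16384 bytes and fits. K1 = 4096 needs 524288 bytes, and a dynamic K1 cannot be bounded at all.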

K1 is typically small (64/128 per the AttentionOpDetail docs), so the original heuristic of leaving it untiled was reasonable. This revision sets a tile size for K1 when the dimension is dynamic or falls outside the typical range (i.e., is larger than 128).
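The decision rule above can be sketched as a small function. This is a hypothetical illustration, not IREE's actual code: the function name, the `defaultTile` parameter, and the constant name are assumptions. It follows the usual convention in MLIR's tiling utilities that a tile size of 0 means "do not tile this dimension."

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical upper bound of the "typical" head_dim range (64/128).
constexpr int64_t kTypicalMaxK1 = 128;

// Returns the K1 tile size; 0 means "leave K1 untiled".
// std::nullopt models a dynamic K1.
int64_t selectK1TileSize(std::optional<int64_t> k1, int64_t defaultTile) {
  // Static and within the typical range: keep the original behavior.
  if (k1 && *k1 <= kTypicalMaxK1) return 0;
  // Dynamic or unusually large head_dim: tile it so buffers stay bounded.
  return defaultTile;
}
```

With this rule, static head_dims of 64 or 128 stay untiled as before, and only dynamic or larger values change behavior. This keeps the change low-risk for the common case.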

E2E tests are added. They use the same inputs and expected outputs as attention.mlir (the static version). Some backends, e.g., AMDGPU, do not support dynamic attention, so the tests live in a new file. In this revision, the test is enabled on the CPU and VMVX backends.

Fixes #23277


Signed-off-by: hanhanW <[email protected]>
@hanhanW hanhanW requested a review from Groverkss January 27, 2026 23:55
@hanhanW hanhanW requested a review from bjacob as a code owner January 27, 2026 23:55
Comment on lines -2399 to -2400
// Due to the way attention works, K1 dimensions cannot be tiled. Mark k1
// reduction dimensions not to distribute.
Contributor

I wonder what I was thinking when I wrote this comment; of course you can tile it serially.

@hanhanW hanhanW merged commit 761bd9c into main Jan 28, 2026
59 of 62 checks passed
@hanhanW hanhanW deleted the users/hanhanW/cpu-dynamic-attention branch January 28, 2026 17:56
@amd-eochoalo
Contributor

@hanhanW this PR produces a build error while generating tests for Android:

FAILED: [code=1] tests/e2e/linalg_ext_ops/check_llvm-cpu_local-task_dynamic_attention.mlir_module.vmfb /home/runner/work/iree/iree/build-android-arm_64/tests/e2e/linalg_ext_ops/check_llvm-cpu_local-task_dynamic_attention.mlir_module.vmfb 
cd /home/runner/work/iree/iree/build-android-arm_64/tests/e2e/linalg_ext_ops && /home/runner/work/iree/iree/.venv/bin/iree-compile --output-format=vm-bytecode --mlir-print-op-on-diagnostic=false --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu=generic /home/runner/work/iree/iree/tests/e2e/linalg_ext_ops/dynamic_attention.mlir -o check_llvm-cpu_local-task_dynamic_attention.mlir_module.vmfb --iree-hal-executable-object-search-path=\"/home/runner/work/iree/iree/build-android-arm_64\" --iree-llvmcpu-target-triple=aarch64-none-linux-android29
/home/runner/work/iree/iree/tests/e2e/linalg_ext_ops/dynamic_attention.mlir:26:13: error: Yield operand #1 is not equivalent to the corresponding iter bbArg
  %result = iree_linalg_ext.attention {
            ^
/home/runner/work/iree/iree/tests/e2e/linalg_ext_ops/dynamic_attention.mlir:26:13: error: Yield operand #1 is not equivalent to the corresponding iter bbArg
  %result = iree_linalg_ext.attention {
            ^
/home/runner/work/iree/iree/tests/e2e/linalg_ext_ops/dynamic_attention.mlir:26:13: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"llvm-cpu", "embedded-elf-arm_64", {cpu = "generic", cpu_features = "+reserve-x18", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128-Fn32", iree.encoding.resolver = #iree_cpu.cpu_encoding_resolver<>, max_stack_allocation_size = 32768 : i64, native_vector_size = 16 : i64, target_triple = "aarch64-unknown-unknown-eabi-elf"}>
  %result = iree_linalg_ext.attention {
            ^

@hanhanW
Contributor Author

hanhanW commented Jan 28, 2026

It looks like it triggers a bug; I'm looking into it. Let's revert it for now.

hanhanW added a commit that referenced this pull request Jan 28, 2026
…23313)

Reverts #23304

It triggers a bug in the Android build. To repro: `iree-compile
--output-format=vm-bytecode --iree-hal-target-device=local
--iree-hal-local-target-device-backends=llvm-cpu
--iree-llvmcpu-target-cpu=generic
--iree-llvmcpu-target-triple=aarch64-none-linux-android29
tests/e2e/linalg_ext_ops/dynamic_attention.mlir`
@hanhanW
Contributor Author

hanhanW commented Jan 29, 2026

The issue happens when masking is disabled. It traces back to an older issue: #16956

I have a local fix and I'm polishing it.

@hanhanW
Contributor Author

hanhanW commented Jan 29, 2026

#23318 fixes the issue. I'll re-land the PR once the other change is landed.


Development

Successfully merging this pull request may close these issues.

Compiler crash in LLVMCPUSelectLoweringStrategy with dynamic-shape iree_linalg_ext.attention
