[CPU] Support dynamic attention by tiling K1 when needed. #23304
The K1 dimension (head_dim) in attention was unconditionally left untiled, which leads to large stack allocations when the dimension is dynamic. K1 is typically small (64/128 per the AttentionOpDetail docs), so the original heuristic of leaving it untiled was reasonable. The revision sets tile sizes for K1 if the dimension is dynamic or falls outside the typical range (<= 128). An e2e test is added. Signed-off-by: hanhanW <[email protected]>
// Due to the way attention works, K1 dimensions cannot be tiled. Mark k1
// reduction dimensions not to distribute.
I wonder what I was thinking when I wrote this comment. Of course you can serially tile it.
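To illustrate what serially tiling a reduction dimension means here, the following is a minimal sketch, not IREE code. The function name `dotTiledK1` and its parameters are hypothetical. It computes one Q·K dot product over K1 in fixed-size chunks, so any per-chunk scratch space is bounded by the tile size rather than by a possibly dynamic K1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of IREE): dot product over the K1 (head_dim)
// reduction axis, walked serially in chunks of at most `tile` elements.
float dotTiledK1(const std::vector<float> &q, const std::vector<float> &k,
                 std::size_t tile) {
  assert(q.size() == k.size() && "q and k must have the same K1 extent");
  tile = std::max<std::size_t>(tile, 1); // Guard against a zero tile size.
  float acc = 0.0f;
  for (std::size_t start = 0; start < q.size(); start += tile) {
    // The last chunk may be shorter than `tile` when K1 is not a multiple.
    std::size_t end = std::min(start + tile, q.size());
    // Only elements [start, end) are touched in this step; the running sum
    // carries the partial reduction from one chunk to the next.
    for (std::size_t i = start; i < end; ++i)
      acc += q[i] * k[i];
  }
  return acc;
}
```

The chunks are processed one after another and share a single accumulator, which is why the reduction can be tiled serially even though it cannot be distributed across workers without an extra combine step.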
@hanhanW this PR produces a build error while generating tests for Android:
It looks like it triggers a bug, looking. Let's revert it for now.
…23313) Reverts #23304. It triggers a bug on the Android build. To repro: `iree-compile --output-format=vm-bytecode --iree-hal-target-device=local --iree-hal-local-target-device-backends=llvm-cpu --iree-llvmcpu-target-cpu=generic --iree-llvmcpu-target-triple=aarch64-none-linux-android29 tests/e2e/linalg_ext_ops/dynamic_attention.mlir`
The issue happens when masking is disabled. It traces back to an older issue: #16956. I have a local fix and I'm polishing it.
#23318 fixes the issue. I'll re-land the PR once that change has landed.
The K1 dimension (head_dim) in attention was unconditionally left untiled, which leads to large stack allocations when the dimension is dynamic.
K1 is typically small (64/128 per the AttentionOpDetail docs), so the original heuristic of leaving it untiled was reasonable. The revision sets tile sizes for K1 if the dimension is dynamic or falls outside the typical range (<= 128).
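The decision described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `chooseK1TileSize`, `kDynamicDim`, `kTypicalMaxK1`, and the `defaultTile` parameter are hypothetical names, and the sketch assumes the usual convention that a tile size of 0 means "do not tile this dimension".

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumed sentinel for a dimension whose size is unknown at compile time.
constexpr int64_t kDynamicDim = std::numeric_limits<int64_t>::min();
// Upper bound of the "typical" head_dim range mentioned in the description.
constexpr int64_t kTypicalMaxK1 = 128;

// Hypothetical helper: returns the tile size to use for K1.
// A result of 0 means "leave K1 untiled" (assumed convention).
int64_t chooseK1TileSize(int64_t k1Size, int64_t defaultTile) {
  assert(defaultTile > 0 && "a real tile size is needed when tiling");
  // Dynamic K1: the extent is unbounded, so an untiled buffer could be huge.
  if (k1Size == kDynamicDim)
    return defaultTile;
  // Static but unusually large K1: tile it to keep allocations bounded.
  if (k1Size > kTypicalMaxK1)
    return defaultTile;
  // Static and small (e.g. 64 or 128): keep the original untiled behavior.
  return 0;
}
```

For example, under these assumptions a static head_dim of 64 or 128 stays untiled, while a head_dim of 256 or a dynamic head_dim receives `defaultTile`.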
E2E tests are added, and they have the same inputs and expected outputs as attention.mlir (which is the static version). Some backends, e.g., AMDGPU, do not support dynamic attention, so the tests live in a new file. The test is enabled on the CPU and VMVX backends in this revision.
Fixes #23277