[GPU] Skip subgroup-level tiling for tensor.pad fusion in coalesced DMA
The subgroup-level tiling was creating an outer loop (1, 4, 64) that
distributed the padded buffer across multiple iterations, so each
iteration produced a 1×64 dest subview. The lowering pass then used the
1×64 dest shape for delinearization, making every iteration load from
source row 0 instead of rows 0-3.
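For illustration only, here is a minimal plain-C++ sketch (hypothetical
names, not the pass code) of the row-major delinearization described
above: against a 1×64 dest shape, every flat index in 0-63 recovers
row 0, which is why each tiled iteration reads the same source row.

```cpp
// Minimal sketch: row-major delinearization of a flat index against the
// dest shape. With the tiled 1x64 dest subview, indices 0..63 all recover
// row 0, so every iteration ends up reading source row 0.
#include <cstdio>

int main() {
  const int cols = 64;  // inner dimension of the dest subview
  for (int idx : {0, 31, 63}) {
    int row = idx / cols;  // always 0 for a 1x64 subview
    int col = idx % cols;
    std::printf("1x64 dest: idx %3d -> row %d, col %2d\n", idx, row, col);
  }
}
```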
This fix skips subgroup-level tiling for tensor.pad fusion cases by:
1. Detecting tensor.pad in applySubgroupTiling() before calling
tileAtSubgroupLevel() (see the sketch after this list)
2. Adding a new ConvertPadFusionCopyToCoalescedDMA pattern that
converts these operations directly without requiring warp-mapped
forall parent
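A rough illustration of the detection in step 1 is sketched below; the
copy op type (linalg::CopyOp) and the helper name isPadFusionCopy are
assumptions made for the sketch, not the actual code in the pass.

```cpp
// Hypothetical sketch: recognize a copy whose input comes from tensor.pad,
// so applySubgroupTiling() can skip tileAtSubgroupLevel() for it and leave
// the op to the ConvertPadFusionCopyToCoalescedDMA pattern instead.
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"

using namespace mlir;

static bool isPadFusionCopy(Operation *op) {
  auto copy = dyn_cast<linalg::CopyOp>(op);
  if (!copy)
    return false;
  // The copy is a pad-fusion candidate if its input is produced by tensor.pad.
  Value source = copy.getInputs().front();
  return source.getDefiningOp<tensor::PadOp>() != nullptr;
}
```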
This allows coalesced_gather_dma to operate on full 4×64 buffers with
a single lane-mapped forall, letting the lowering pass correctly
generate 4 transfers per lane to cover all source rows.
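As a back-of-the-envelope check (plain C++; the 64-lane subgroup size and
the lane-strided addressing are assumptions), each lane owns 4 of the 256
elements in the 4×64 buffer, one per source row:

```cpp
// Hypothetical sketch: per-lane transfer count over the full 4x64 buffer.
// 4*64 elements / 64 lanes = 4 transfers per lane, covering rows 0..3.
#include <cstdio>

int main() {
  const int rows = 4, cols = 64, subgroupSize = 64;
  const int transfersPerLane = (rows * cols) / subgroupSize;  // 4
  const int lane = 5;  // any lane id in [0, 63]
  for (int t = 0; t < transfersPerLane; ++t) {
    // Stride by the subgroup size so adjacent lanes stay coalesced.
    int linearIdx = lane + t * subgroupSize;
    std::printf("lane %d, transfer %d -> row %d, col %d\n",
                lane, t, linearIdx / cols, linearIdx % cols);
  }
}
```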
Fixes unaligned matmul tests (65x64x121, 133x97x65).