
[AMDGPU][LDS] Add in_bounds attribute to CoalescedGatherDMAOp for tensor.pad fusion #23365

Open
lialan wants to merge 13 commits into main from users/lialan/coalesced_gather_dma_padding

Conversation

@lialan
Contributor

@lialan lialan commented Feb 3, 2026

Summary

  • Add in_bounds attribute to CoalescedGatherDMAOp to indicate which dimensions may have out-of-bounds reads
  • Fuse tensor.pad into coalesced_gather_dma by tracing through to the original source tensor
  • Enable coalesced DMA for unaligned matmuls that require padding

Motivation

When matmul dimensions don't align with tile sizes, tensor.pad is inserted to pad operands. Previously, this caused coalesced_gather_dma to fail because:

  1. The padded tensor becomes a private memory allocation during bufferization
  2. amdgpu.gather_to_lds requires source to be fat_raw_buffer or global memory

This PR fuses tensor.pad into the DMA operation, allowing it to read directly from buffer memory. AMD hardware's OOB behavior (returns 0 for out-of-bounds reads with boundsCheck=true) provides implicit padding.
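
For reference, this is roughly what the source side looks like once it is inferred to use a buffer pointer. This is a sketch only; the cast placement, value names, and shapes are illustrative rather than the exact IR this PR produces:

// Source cast into the fat_raw_buffer address space. With the buffer
// descriptor's bounds check enabled (the default), a read whose offset
// lands past the end of the buffer returns 0 instead of faulting, which
// is what supplies the implicit zero padding.
%src = amdgpu.fat_raw_buffer_cast %global
    : memref<65x121xf32> to memref<65x121xf32, #amdgpu.address_space<fat_raw_buffer>>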

Example

Before:

%extracted_slice_0 = tensor.extract_slice %6[%22, %19] [%17, %20] [1, 1] : tensor<65x121xf32> to tensor<?x?xf32>
%padded_3 = tensor.pad %extracted_slice_2 low[0, 0] high[%21, 0] {
      ^bb0(%arg4: index, %arg5: index):
        tensor.yield %cst : f32
} : tensor<?x64xf32> to tensor<4x64xf32>
%26 = tensor.empty() : tensor<4x64xf32>
%27 = linalg.copy {lowering_config = #iree_gpu.use_global_load_dma} ins(%padded_3 : tensor<4x64xf32>) outs(%26 : tensor<4x64xf32>) -> tensor<4x64xf32> 

We cannot handle such a case because:

  • When inferring the address space of tensor.pad, the result is always assigned to private memory.
  • Global load DMA cannot handle private sources.

Notice that the semantics of the op is essentially to extract a slice of an unaligned tensor into an aligned tensor. We can still do a DMA from the buffer pointer and rely on address clamping, as long as the source is inferred to use a fat raw buffer pointer.

So the solution is to fuse tensor.pad into coalesced_gather_dma when converting linalg.copy to iree_gpu.coalesced_gather_dma:

%extracted_slice_0 = tensor.extract_slice %6[%22, %19] [%17, %20] [1, 1] : tensor<65x121xf32> to tensor<?x?xf32>
%23 = tensor.empty() : tensor<32x4xf32>
%24 = scf.forall (%arg4, %arg5) = (0, 0) to (32, 4) step (32, 4) shared_outs(%arg6 = %23) -> (tensor<32x4xf32>) {
    %extracted_slice_7 = tensor.extract_slice %arg6[%arg4, %arg5] [32, 4] [1, 1] : tensor<32x4xf32> to tensor<32x4xf32>
    %30 = scf.forall (%arg7) in (64) shared_outs(%arg8 = %extracted_slice_7) -> (tensor<32x4xf32>) {
     ...
     %extracted_slice_9 = tensor.extract_slice %arg8[0, %arg7] [32, %33] [1, 1] : tensor<32x4xf32> to tensor<32x?xf32>
      ...
      scf.forall.in_parallel {
        iree_gpu.coalesced_gather_dma %extracted_slice_0 into %arg8 lane(%arg7) in_bounds [false, false] : tensor<?x?xf32>, tensor<32x4xf32>, index
      }
    } {mapping = [#iree_gpu.lane_id<0>]}
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %30 into %arg6[%arg4, %arg5] [32, 4] [1, 1] : tensor<32x4xf32> into tensor<32x4xf32>
    }
} {mapping = [#gpu.warp<linear_dim_1>, #gpu.warp<linear_dim_0>]}
  • The implicit fusion happens when we convert linalg.copy to the iree_gpu.coalesced_gather_dma op (along with tiling).
  • iree_gpu.coalesced_gather_dma now carries an in_bounds attribute that records, per dimension, whether the load stays in bounds, so the op now supports masking (see the lowering sketch after this list).
  • Make sure the source tensor is inferred to use a buffer pointer.
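
To illustrate what the masking amounts to after lowering, here is a hedged sketch modeled on the lit-test excerpt quoted later in this thread; value names, shapes, and the surrounding index arithmetic are illustrative, not the exact IR the pass emits. For a dimension with in_bounds = false, the computed source index is compared against the real extent and, when it falls past it, replaced with the dimension size so the resulting address is guaranteed out of bounds and the hardware bounds check returns 0:

// Source has 62 real rows; the destination tile in LDS has 64 rows.
%c62 = arith.constant 62 : index
%oob = arith.cmpi sge, %row, %c62 : index
// Force an out-of-bounds row index for the padded rows so the
// fat_raw_buffer bounds check yields 0 for the whole transfer.
%fixed = arith.select %oob, %c62, %row : index
amdgpu.gather_to_lds %src[%fixed, %col], %dst[%drow, %dcol]
    : vector<4xf32>, memref<62x64xf32, #amdgpu.address_space<fat_raw_buffer>>,
      memref<64x64xf32, #gpu.address_space<workgroup>>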

@krzysz00
Contributor

krzysz00 commented Feb 3, 2026

I'd like to flag that we'll need to handle the slow path (address less than 4 bytes from the edge of the buffer)

I'll also note that we should check the semantics of gather to LDS within an if statement because it's possible this might work for global loads as well

@lialan
Contributor Author

lialan commented Feb 3, 2026

I'll also note that we should check the semantics of gather to LDS within an if statement because it's possible this might work for global loads as well

@krzysz00 Did you mean we avoid using buffer read but stick with global loads? I am confused.

@lialan lialan force-pushed the users/lialan/coalesced_gather_dma_padding branch 2 times, most recently from 2cdff45 to 5353caf on February 3, 2026 19:02
@krzysz00
Contributor

krzysz00 commented Feb 3, 2026

What I mean is that we should confirm that

x = ...; 
if (isInBounds(x)) {
  gather_to_lds global[x], lds[...] , ...
}

works in a way we can work with

If the answer is something like "yes, that causes the lanes that don't take the if to not write a value / write 0", then that'll be useful as a fallback for when we can't guarantee bufferability
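
For concreteness, the guarded form being asked about would look roughly like this (a sketch; value names, shapes, and the memref types on the gather are illustrative):

%in_bounds = arith.cmpi slt, %idx, %src_rows : index
scf.if %in_bounds {
  // Only lanes whose index is in range issue the LDS gather; the open
  // question is what the skipped lanes leave behind in LDS.
  amdgpu.gather_to_lds %src[%idx, %col], %lds[%drow, %dcol]
      : vector<4xf32>, memref<?x64xf32, #amdgpu.address_space<fat_raw_buffer>>,
        memref<64x64xf32, #gpu.address_space<workgroup>>
}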

@lialan lialan force-pushed the users/lialan/coalesced_gather_dma_padding branch from 5353caf to bed7be1 on February 3, 2026 21:32
@lialan lialan mentioned this pull request Feb 3, 2026
@lialan lialan marked this pull request as ready for review February 5, 2026 04:12
@lialan lialan requested a review from Yu-Zhewen February 5, 2026 16:31
Contributor

@krzysz00 krzysz00 left a comment


Hold on this PR because I have correctness concerns

Contributor

@qedawkins qedawkins left a comment


A few comments on top of what Krzysztof added.

lialan added a commit that referenced this pull request Feb 5, 2026
* Change gfx942 → gfx950 in gpu_convert_to_coalesced_dma tests.
* Add in_bounds semantics documentation to CoalescedGatherDMAOp.
* Remove hardware-specific references from op verifier comment.
* Rewrite misleading "ONE level of extract_slice" fallback comment.
* Add inner-dim padding OOB lowering test (64x62xf32 → 64x64xf32).
* Fix missing trailing periods on comments.
@lialan lialan force-pushed the users/lialan/coalesced_gather_dma_padding branch from 6beab38 to 64c5041 on February 5, 2026 21:56
Contributor

@Yu-Zhewen Yu-Zhewen left a comment


Thanks! Some Nits.

lialan added a commit that referenced this pull request Feb 6, 2026
* Use gfx950 target and dma_sizes = [32, 128] in tests.
* Use explicit tensor::PadOp type instead of auto.
* Add trailing periods to comments.
@lialan lialan changed the title from [GPU] Add in_bounds attribute to CoalescedGatherDMAOp for tensor.pad fusion to [AMDGPU][LDS] Add in_bounds attribute to CoalescedGatherDMAOp for tensor.pad fusion Feb 6, 2026
@lialan lialan requested a review from krzysz00 February 6, 2026 19:59
lialan and others added 11 commits February 10, 2026 13:30
…fusion

Add support for fusing tensor.pad into coalesced_gather_dma when the copy
source is a padded tensor. This enables DMA operations to read directly
from global memory (fat_raw_buffer) instead of creating private memory
allocations for padded data.

Key changes:
* Add optional in_bounds attribute to CoalescedGatherDMAOp (per-dim bool array)
* Update verifier to allow source/init shape mismatches when in_bounds[dim]=false
* Modify GPUConvertToCoalescedDMA to trace through tensor.pad and extract_slice
* Compute in_bounds based on padding: true if no padding, false if OOB allowed

Constraints:
* Low padding must be [0, 0] (no low padding)
* Padding value must be constant 0.0 (matches AMD hardware OOB behavior)

AMD fat_raw_buffer with boundsCheck=true returns 0 for out-of-bounds reads,
providing hardware-level padding semantics without explicit software masking.
* Change gfx942 → gfx950 in gpu_convert_to_coalesced_dma tests.
* Add in_bounds semantics documentation to CoalescedGatherDMAOp.
* Remove hardware-specific references from op verifier comment.
* Rewrite misleading "ONE level of extract_slice" fallback comment.
* Add inner-dim padding OOB lowering test (64x62xf32 → 64x64xf32).
* Fix missing trailing periods on comments.
* Use gfx950 target and dma_sizes = [32, 128] in tests.
* Use explicit tensor::PadOp type instead of auto.
* Add trailing periods to comments.
* Test fat_raw_buffer source → DMA applied (normal and small tensor)
* Test storage_buffer source → DMA skipped (>2GB binding)
* Test dispatch.tensor.load source → DMA skipped (>2GB binding)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Emit an error when in_bounds has OOB dimensions but the source memref
  lacks fat_raw_buffer address space, since hardware OOB clamping is
  unavailable without it.
* Add lit test for the rejection case.
…ilds.

* Move the unlowered-DMA verification walk out of #ifndef NDEBUG so it
  runs in both debug and release builds.
* Use notifyMatchFailure (silent) in the pattern guard instead of
  emitOpError, letting the post-pattern walk produce the single error.
* Remove the extra expected-error that only fired in debug builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lialan lialan force-pushed the users/lialan/coalesced_gather_dma_padding branch from c8a0e72 to e231d46 on February 10, 2026 21:57
lialan added a commit that referenced this pull request Feb 11, 2026
* Change gfx942 → gfx950 in gpu_convert_to_coalesced_dma tests.
* Add in_bounds semantics documentation to CoalescedGatherDMAOp.
* Remove hardware-specific references from op verifier comment.
* Rewrite misleading "ONE level of extract_slice" fallback comment.
* Add inner-dim padding OOB lowering test (64x62xf32 → 64x64xf32).
* Fix missing trailing periods on comments.
lialan added a commit that referenced this pull request Feb 11, 2026
* Use gfx950 target and dma_sizes = [32, 128] in tests.
* Use explicit tensor::PadOp type instead of auto.
* Add trailing periods to comments.
Contributor

@qedawkins qedawkins left a comment


I'm closing my eyes to the chipset leak. This should go in an AMD specific directory if it's going to be AMD specific.

// Replace outermost index with 64 (source dim 0 size) to force hardware OOB.
// CHECK: %[[C64_OOB:.+]] = arith.constant 64 : index
// CHECK: %[[FIXED_IDX:.+]] = arith.select %[[OOB]], %[[C64_OOB]], %[[SRC_DELIN0]]#0 : index
// CHECK: amdgpu.gather_to_lds %[[SRC]][%[[FIXED_IDX]], %[[SRC_DELIN0]]#1], %[[DST]][%[[DST_DELIN0]]#0, %[[DST_DELIN0]]#1] : vector<4xf32>
Contributor


This is the partial-read-off-the-end case. Does this work the way we expect? cc @krzysz00
