
[AMDGPU][LDS] Add in_bounds attribute to CoalescedGatherDMAOp for tensor.pad fusion #23365

Open
lialan wants to merge 13 commits into main from users/lialan/coalesced_gather_dma_padding

Conversation

@lialan
Contributor

@lialan lialan commented Feb 3, 2026

Summary

  • Add in_bounds attribute to CoalescedGatherDMAOp to indicate which dimensions may have out-of-bounds reads
  • Fuse tensor.pad into coalesced_gather_dma by tracing through to the original source tensor
  • Enable coalesced DMA for unaligned matmuls that require padding

Motivation

When matmul dimensions don't align with tile sizes, tensor.pad is inserted to pad operands. Previously, this caused coalesced_gather_dma to fail because:

  1. The padded tensor becomes a private memory allocation during bufferization
  2. amdgpu.gather_to_lds requires source to be fat_raw_buffer or global memory

This PR fuses tensor.pad into the DMA operation, allowing it to read directly from buffer memory. AMD hardware's OOB behavior (returns 0 for out-of-bounds reads with boundsCheck=true) provides implicit padding.
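
For reference, this is roughly what the source side looks like once it is inferred to use a buffer pointer. This is a sketch only; the cast placement, value names, and shapes are illustrative rather than the exact IR this PR produces:

// Source cast into the fat_raw_buffer address space. With the buffer
// descriptor's bounds check enabled (the default), a read whose offset
// lands past the end of the buffer returns 0 instead of faulting, which
// is what supplies the implicit zero padding.
%src = amdgpu.fat_raw_buffer_cast %global
    : memref<65x121xf32> to memref<65x121xf32, #amdgpu.address_space<fat_raw_buffer>>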

Example

Before:

%extracted_slice_0 = tensor.extract_slice %6[%22, %19] [%17, %20] [1, 1] : tensor<65x121xf32> to tensor<?x?xf32>
%padded_3 = tensor.pad %extracted_slice_2 low[0, 0] high[%21, 0] {
      ^bb0(%arg4: index, %arg5: index):
        tensor.yield %cst : f32
} : tensor<?x64xf32> to tensor<4x64xf32>
%26 = tensor.empty() : tensor<4x64xf32>
%27 = linalg.copy {lowering_config = #iree_gpu.use_global_load_dma} ins(%padded_3 : tensor<4x64xf32>) outs(%26 : tensor<4x64xf32>) -> tensor<4x64xf32> 

We cannot handle such a case because:

  • When inferring the address space of tensor.pad, the result is always assigned to private memory.
  • Global load DMA cannot handle private sources.

Notice that the semantics of the op is essentially to extract a slice of an unaligned tensor into an aligned tensor. We can still do a DMA from the buffer pointer and rely on address clamping, as long as the source is inferred to use a fat raw buffer pointer.

So the solution is to fuse tensor.pad into coalesced_gather_dma when converting linalg.copy to iree_gpu.coalesced_gather_dma:

%extracted_slice_0 = tensor.extract_slice %6[%22, %19] [%17, %20] [1, 1] : tensor<65x121xf32> to tensor<?x?xf32>
%23 = tensor.empty() : tensor<32x4xf32>
%24 = scf.forall (%arg4, %arg5) = (0, 0) to (32, 4) step (32, 4) shared_outs(%arg6 = %23) -> (tensor<32x4xf32>) {
    %extracted_slice_7 = tensor.extract_slice %arg6[%arg4, %arg5] [32, 4] [1, 1] : tensor<32x4xf32> to tensor<32x4xf32>
    %30 = scf.forall (%arg7) in (64) shared_outs(%arg8 = %extracted_slice_7) -> (tensor<32x4xf32>) {
     ...
     %extracted_slice_9 = tensor.extract_slice %arg8[0, %arg7] [32, %33] [1, 1] : tensor<32x4xf32> to tensor<32x?xf32>
      ...
      scf.forall.in_parallel {
        iree_gpu.coalesced_gather_dma %extracted_slice_0 into %arg8 lane(%arg7) in_bounds [false, false] : tensor<?x?xf32>, tensor<32x4xf32>, index
      }
    } {mapping = [#iree_gpu.lane_id<0>]}
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %30 into %arg6[%arg4, %arg5] [32, 4] [1, 1] : tensor<32x4xf32> into tensor<32x4xf32>
    }
} {mapping = [#gpu.warp<linear_dim_1>, #gpu.warp<linear_dim_0>]}
  • The implicit fusion happens when we convert linalg.copy to the iree_gpu.coalesced_gather_dma op (along with tiling).
  • iree_gpu.coalesced_gather_dma now carries an in_bounds attribute that records, per dimension, whether the load stays in bounds, so the op now supports masking (see the lowering sketch after this list).
  • Make sure the source tensor is inferred to use a buffer pointer.
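
To illustrate what the masking amounts to after lowering, here is a hedged sketch modeled on the lit-test excerpt quoted later in this thread; value names, shapes, and the surrounding index arithmetic are illustrative, not the exact IR the pass emits. For a dimension with in_bounds = false, the computed source index is compared against the real extent and, when it falls past it, replaced with the dimension size so the resulting address is guaranteed out of bounds and the hardware bounds check returns 0:

// Source has 62 real rows; the destination tile in LDS has 64 rows.
%c62 = arith.constant 62 : index
%oob = arith.cmpi sge, %row, %c62 : index
// Force an out-of-bounds row index for the padded rows so the
// fat_raw_buffer bounds check yields 0 for the whole transfer.
%fixed = arith.select %oob, %c62, %row : index
amdgpu.gather_to_lds %src[%fixed, %col], %dst[%drow, %dcol]
    : vector<4xf32>, memref<62x64xf32, #amdgpu.address_space<fat_raw_buffer>>,
      memref<64x64xf32, #gpu.address_space<workgroup>>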

@krzysz00
Contributor

krzysz00 commented Feb 3, 2026

I'd like to flag that we'll need to handle the slow path (address less than 4 bytes from the edge of the buffer)

I'll also note that we should check the semantics of gather to LDS within an if statement because it's possible this might work for global loads as well

@lialan
Contributor Author

lialan commented Feb 3, 2026

I'll also note that we should check the semantics of gather to LDS within an if statement because it's possible this might work for global loads as well

@krzysz00 Did you mean we avoid using buffer read but stick with global loads? I am confused.

@lialan lialan force-pushed the users/lialan/coalesced_gather_dma_padding branch 2 times, most recently from 2cdff45 to 5353caf on February 3, 2026 19:02
@krzysz00
Contributor

krzysz00 commented Feb 3, 2026

What I mean is that we should confirm that

x = ...; 
if (isInBounds(x)) {
  gather_to_lds global[x], lds[...] , ...
}

works in a way we can work with

If the answer is something like "yes, that causes the lanes that don't take the if to not write a value / write 0", then that'll be useful as a fallback for when we can't guarantee bufferability
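
For concreteness, the guarded form being asked about would look roughly like this (a sketch; value names, shapes, and the memref types on the gather are illustrative):

%in_bounds = arith.cmpi slt, %idx, %src_rows : index
scf.if %in_bounds {
  // Only lanes whose index is in range issue the LDS gather; the open
  // question is what the skipped lanes leave behind in LDS.
  amdgpu.gather_to_lds %src[%idx, %col], %lds[%drow, %dcol]
      : vector<4xf32>, memref<?x64xf32, #amdgpu.address_space<fat_raw_buffer>>,
        memref<64x64xf32, #gpu.address_space<workgroup>>
}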

@lialan lialan force-pushed the users/lialan/coalesced_gather_dma_padding branch from 5353caf to bed7be1 on February 3, 2026 21:32
@lialan lialan mentioned this pull request Feb 3, 2026
@lialan lialan marked this pull request as ready for review February 5, 2026 04:12
@lialan lialan requested a review from Yu-Zhewen February 5, 2026 16:31
Contributor

@krzysz00 krzysz00 left a comment


Hold on this PR because I have correctness concerns

Contributor

@qedawkins qedawkins left a comment


A few comments on top of what Krzysztof added.

lialan added a commit that referenced this pull request Feb 5, 2026
* Change gfx942 → gfx950 in gpu_convert_to_coalesced_dma tests.
* Add in_bounds semantics documentation to CoalescedGatherDMAOp.
* Remove hardware-specific references from op verifier comment.
* Rewrite misleading "ONE level of extract_slice" fallback comment.
* Add inner-dim padding OOB lowering test (64x62xf32 → 64x64xf32).
* Fix missing trailing periods on comments.
@lialan lialan force-pushed the users/lialan/coalesced_gather_dma_padding branch from 6beab38 to 64c5041 on February 5, 2026 21:56
Contributor

@Yu-Zhewen Yu-Zhewen left a comment


Thanks! Some Nits.

lialan added a commit that referenced this pull request Feb 6, 2026
* Use gfx950 target and dma_sizes = [32, 128] in tests.
* Use explicit tensor::PadOp type instead of auto.
* Add trailing periods to comments.
@lialan lialan changed the title from [GPU] Add in_bounds attribute to CoalescedGatherDMAOp for tensor.pad fusion to [AMDGPU][LDS] Add in_bounds attribute to CoalescedGatherDMAOp for tensor.pad fusion Feb 6, 2026
@lialan lialan requested a review from krzysz00 February 6, 2026 19:59
lialan and others added 11 commits February 10, 2026 13:30
…fusion

Add support for fusing tensor.pad into coalesced_gather_dma when the copy
source is a padded tensor. This enables DMA operations to read directly
from global memory (fat_raw_buffer) instead of creating private memory
allocations for padded data.

Key changes:
* Add optional in_bounds attribute to CoalescedGatherDMAOp (per-dim bool array)
* Update verifier to allow source/init shape mismatches when in_bounds[dim]=false
* Modify GPUConvertToCoalescedDMA to trace through tensor.pad and extract_slice
* Compute in_bounds based on padding: true if no padding, false if OOB allowed

Constraints:
* Low padding must be [0, 0] (no low padding)
* Padding value must be constant 0.0 (matches AMD hardware OOB behavior)

AMD fat_raw_buffer with boundsCheck=true returns 0 for out-of-bounds reads,
providing hardware-level padding semantics without explicit software masking.
* Change gfx942 → gfx950 in gpu_convert_to_coalesced_dma tests.
* Add in_bounds semantics documentation to CoalescedGatherDMAOp.
* Remove hardware-specific references from op verifier comment.
* Rewrite misleading "ONE level of extract_slice" fallback comment.
* Add inner-dim padding OOB lowering test (64x62xf32 → 64x64xf32).
* Fix missing trailing periods on comments.
* Use gfx950 target and dma_sizes = [32, 128] in tests.
* Use explicit tensor::PadOp type instead of auto.
* Add trailing periods to comments.
* Test fat_raw_buffer source → DMA applied (normal and small tensor)
* Test storage_buffer source → DMA skipped (>2GB binding)
* Test dispatch.tensor.load source → DMA skipped (>2GB binding)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Emit an error when in_bounds has OOB dimensions but the source memref
  lacks fat_raw_buffer address space, since hardware OOB clamping is
  unavailable without it.
* Add lit test for the rejection case.
…ilds.

* Move the unlowered-DMA verification walk out of #ifndef NDEBUG so it
  runs in both debug and release builds.
* Use notifyMatchFailure (silent) in the pattern guard instead of
  emitOpError, letting the post-pattern walk produce the single error.
* Remove the extra expected-error that only fired in debug builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lialan lialan force-pushed the users/lialan/coalesced_gather_dma_padding branch from c8a0e72 to e231d46 on February 10, 2026 21:57
lialan added a commit that referenced this pull request Feb 11, 2026
* Change gfx942 → gfx950 in gpu_convert_to_coalesced_dma tests.
* Add in_bounds semantics documentation to CoalescedGatherDMAOp.
* Remove hardware-specific references from op verifier comment.
* Rewrite misleading "ONE level of extract_slice" fallback comment.
* Add inner-dim padding OOB lowering test (64x62xf32 → 64x64xf32).
* Fix missing trailing periods on comments.
lialan added a commit that referenced this pull request Feb 11, 2026
* Use gfx950 target and dma_sizes = [32, 128] in tests.
* Use explicit tensor::PadOp type instead of auto.
* Add trailing periods to comments.
Contributor

@qedawkins qedawkins left a comment


I'm closing my eyes to the chipset leak. This should go in an AMD specific directory if it's going to be AMD specific.

// Replace outermost index with 64 (source dim 0 size) to force hardware OOB.
// CHECK: %[[C64_OOB:.+]] = arith.constant 64 : index
// CHECK: %[[FIXED_IDX:.+]] = arith.select %[[OOB]], %[[C64_OOB]], %[[SRC_DELIN0]]#0 : index
// CHECK: amdgpu.gather_to_lds %[[SRC]][%[[FIXED_IDX]], %[[SRC_DELIN0]]#1], %[[DST]][%[[DST_DELIN0]]#0, %[[DST_DELIN0]]#1] : vector<4xf32>
Contributor


This is the partial-read-off-the-end case. Does this work the way we expect? cc @krzysz00
