[AMDGPU][LDS] Add in_bounds attribute to CoalescedGatherDMAOp for tensor.pad fusion#23365
Conversation
I'd like to flag that we'll need to handle the slow path (address less than 4 bytes from the edge of the buffer). I'll also note that we should check the semantics of gather to LDS within an if statement, because it's possible this might work for global loads as well.
@krzysz00 Did you mean we should avoid using buffer reads and stick with global loads? I am confused.
Force-pushed 2cdff45 to 5353caf
What I mean is that we should confirm that it works in a way we can work with. If the answer is something like "yes, that causes the lanes that don't take the if to not write a value / write 0", then that'll be useful as a fallback for when we can't guarantee bufferability.
Force-pushed 5353caf to bed7be1
krzysz00 left a comment:
Hold on this PR because I have correctness concerns.
Resolved review threads on:
* compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_convert_to_coalesced_dma.mlir
* compiler/src/iree/compiler/Codegen/Common/GPU/GPUConvertToCoalescedDMA.cpp
* compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUOps.cpp
qedawkins left a comment:
A few comments on top of what Krzysztof added.
Resolved review threads on:
* compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUOps.cpp
* compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_convert_to_coalesced_dma.mlir
* ...iler/src/iree/compiler/Codegen/Common/GPU/test/amdgpu_lower_coalesced_dma_to_gather_lds.mlir
* Change gfx942 → gfx950 in gpu_convert_to_coalesced_dma tests.
* Add in_bounds semantics documentation to CoalescedGatherDMAOp.
* Remove hardware-specific references from op verifier comment.
* Rewrite misleading "ONE level of extract_slice" fallback comment.
* Add inner-dim padding OOB lowering test (64x62xf32 → 64x64xf32).
* Fix missing trailing periods on comments.
Force-pushed 6beab38 to 64c5041
Resolved review threads on:
* compiler/src/iree/compiler/Codegen/Common/GPU/AMDGPULowerCoalescedDMAToGatherLDS.cpp
* ...iler/src/iree/compiler/Codegen/Common/GPU/test/amdgpu_lower_coalesced_dma_to_gather_lds.mlir
* compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_convert_to_coalesced_dma.mlir
* compiler/src/iree/compiler/Codegen/Common/GPU/GPUConvertToCoalescedDMA.cpp
* compiler/src/iree/compiler/Codegen/Dialect/GPU/Transforms/BufferizationInterfaces.cpp
* Use gfx950 target and dma_sizes = [32, 128] in tests.
* Use explicit tensor::PadOp type instead of auto.
* Add trailing periods to comments.
Resolved review thread on compiler/src/iree/compiler/Codegen/Common/GPU/AMDGPULowerCoalescedDMAToGatherLDS.cpp
…fusion

Add support for fusing tensor.pad into coalesced_gather_dma when the copy source is a padded tensor. This enables DMA operations to read directly from global memory (fat_raw_buffer) instead of creating private memory allocations for padded data.

Key changes:
* Add optional in_bounds attribute to CoalescedGatherDMAOp (per-dim bool array).
* Update verifier to allow source/init shape mismatches when in_bounds[dim] = false.
* Modify GPUConvertToCoalescedDMA to trace through tensor.pad and extract_slice.
* Compute in_bounds based on padding: true if no padding, false if OOB allowed.

Constraints:
* Low padding must be [0, 0] (no low padding).
* Padding value must be constant 0.0 (matches AMD hardware OOB behavior).

AMD fat_raw_buffer with boundsCheck=true returns 0 for out-of-bounds reads, providing hardware-level padding semantics without explicit software masking.
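The implicit-padding behavior described above can be modeled in plain Python (a sketch for intuition only; `buffer_load` and `gather_row` are illustrative names, not IREE APIs): with boundsCheck=true, an out-of-bounds lane reads 0 rather than faulting, so copying a 64x62 source into a 64x64 destination needs no explicit software masking.

```python
def buffer_load(src_row, col):
    # Model of an AMD fat_raw_buffer load with boundsCheck=true:
    # out-of-bounds lanes return 0 instead of faulting.
    if 0 <= col < len(src_row):
        return src_row[col]
    return 0.0

def gather_row(src_row, dst_width):
    # Copy one 62-wide source row into a 64-wide destination row.
    # Columns 62 and 63 fall off the end and become hardware zeros,
    # which matches the padding tensor.pad would have produced
    # (given low padding [0, 0] and a constant 0.0 pad value).
    return [buffer_load(src_row, c) for c in range(dst_width)]

row = [1.0] * 62
padded = gather_row(row, 64)
```

This is why the constraints above matter: a nonzero pad value or nonzero low padding could not be reproduced by the hardware's zero-fill clamp.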
* Test fat_raw_buffer source → DMA applied (normal and small tensor).
* Test storage_buffer source → DMA skipped (>2GB binding).
* Test dispatch.tensor.load source → DMA skipped (>2GB binding).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Emit an error when in_bounds has OOB dimensions but the source memref lacks the fat_raw_buffer address space, since hardware OOB clamping is unavailable without it.
* Add lit test for the rejection case.
…ilds.

* Move the unlowered-DMA verification walk out of #ifndef NDEBUG so it runs in both debug and release builds.
* Use notifyMatchFailure (silent) in the pattern guard instead of emitOpError, letting the post-pattern walk produce the single error.
* Remove the extra expected-error that only fired in debug builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed c8a0e72 to e231d46
qedawkins left a comment:
I'm closing my eyes to the chipset leak. This should go in an AMD-specific directory if it's going to be AMD-specific.
Resolved review thread on ...iler/src/iree/compiler/Codegen/Common/GPU/test/amdgpu_lower_coalesced_dma_to_gather_lds.mlir
```mlir
// Replace outermost index with 64 (source dim 0 size) to force hardware OOB.
// CHECK: %[[C64_OOB:.+]] = arith.constant 64 : index
// CHECK: %[[FIXED_IDX:.+]] = arith.select %[[OOB]], %[[C64_OOB]], %[[SRC_DELIN0]]#0 : index
// CHECK: amdgpu.gather_to_lds %[[SRC]][%[[FIXED_IDX]], %[[SRC_DELIN0]]#1], %[[DST]][%[[DST_DELIN0]]#0, %[[DST_DELIN0]]#1] : vector<4xf32>
```
This is the partial read off the end case. Does this work how we expect, cc @krzysz00?
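The index fixup in the CHECK lines above can be modeled roughly as follows (a sketch of the intent, not the actual lowering code; the exact OOB predicate is computed earlier in the lowering and is an assumption here): when the delinearized row index is out of bounds, the lowering substitutes the source's dim-0 size (64, one past the last valid row), so the subsequent buffer access is guaranteed to trip the hardware bounds check and read zeros.

```python
def fix_row_index(row_idx, src_dim0=64):
    # Mirrors: %fixed = arith.select %oob, %c64, %row_idx
    # Any index >= src_dim0 is replaced by src_dim0 itself, an address
    # known to be past the end of the buffer, so gather_to_lds reads
    # hardware zeros instead of garbage from an unrelated address.
    oob = row_idx >= src_dim0
    return src_dim0 if oob else row_idx
```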
Summary
* Add `in_bounds` attribute to `CoalescedGatherDMAOp` to indicate which dimensions may have out-of-bounds reads.
* Fuse `tensor.pad` into `coalesced_gather_dma` by tracing through to the original source tensor.

Motivation
When matmul dimensions don't align with tile sizes, `tensor.pad` is inserted to pad operands. Previously, this caused `coalesced_gather_dma` to fail because:
* `amdgpu.gather_to_lds` requires the source to be `fat_raw_buffer` or global memory.

This PR fuses `tensor.pad` into the DMA operation, allowing it to read directly from buffer memory. AMD hardware's OOB behavior (returns 0 for out-of-bounds reads with `boundsCheck=true`) provides implicit padding.

Example
Before:
We cannot handle such a case because:
* After bufferization, the result of `tensor.pad` will always be assigned to `private`.
* `amdgpu.gather_to_lds` cannot read from `private` sources.

Notice that the semantics of the op is basically to extract a slice of an unaligned tensor into an aligned tensor. We can still do a DMA from the buffer pointer to utilize the address clamping, as long as the source is inferred to use a fat raw pointer.
So the solution is to just fuse `tensor.pad` into `coalesced_gather_dma` when we are converting `linalg.copy` to `iree_gpu.coalesced_gather_dma`:
* Convert `linalg.copy` to the `iree_gpu.coalesced_gather_dma` op (along with tiling).
* `iree_gpu.coalesced_gather_dma` now has the attribute `in_bounds`, which tells whether we are loading in bounds, so it now supports masking.