Commit 8ae1b54
[GPU] Use padding in IGEMM pipeline to support unaligned to intrinsic shapes (#19484)
This PR does two things:

1. Allow all GEMM shapes to use the padded TileAndFuse matmul configuration. This is still behind the `iree-codegen-llvmgpu-test-tile-and-fuse-matmul=false` flag by default and does not change the default behavior. However, the following PRs that landed in the past month make it possible to relax the guards we originally had on this: #19196, #19307, llvm/llvm-project#117340.

2. Allow fused producers to use the padded TileAndFuse matmul configuration. The following PRs make this possible now: #19399, llvm/llvm-project#119039.

Together this allows us to do padded IGEMM with intrinsics (which we use by default) for shapes unaligned to the intrinsic. [Here](https://docs.google.com/spreadsheets/d/1O-SdUZCn5pHsxx7JTGjIIdH6PWCFnvlfe4XBbjEBaIM/edit?gid=0#gid=0) is the performance difference observed in the conv cases in iree-kernel-benchmark-module that utilize this change; a median speedup of 2.26x was observed.

The numeric changes I observed with this path enabled are the same as those seen for any aligned shape when comparing intrinsic vs. no-intrinsic use. Generally, some differences are noticed for narrow types like f16, but they are within a relative error of 0.001; since our tests use absolute errors, we may have to change some test values to account for this change. The perf differences in CI seem to be within the noise margin compared to main: https://github.com/iree-org/iree/actions/runs/12323399269/attempts/1#summary-34399247902

---------

Signed-off-by: Nirvedh <[email protected]>
1 parent 78ea0ad commit 8ae1b54
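For illustration, the padded path selects lowering configs of the following shape for an unaligned NHWC convolution. This is a sketch assembled from the CHECK lines of the updated config_igemm_tile_and_fuse.mlir test below; the concrete tile and padding sizes depend on the problem shape and target.

#iree_gpu.lowering_config<{
  padding = [2, 1, 32, 64, 32],
  workgroup = [2, 1, 32, 64, 0],
  reduction = [0, 0, 0, 0, 8],
  subgroup = [2, 1, 2, 1, 0],
  mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>,
  promote_operands = [0, 1, 2]
}>

The padding entries mirror the workgroup tile sizes in the parallel dimensions, while the innermost entry (32) covers the reduction tile of 8 steps of the intrinsic K of 4; promoting operands 0, 1, and 2 stages the padded tiles, including the accumulator, in shared memory.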

File tree

4 files changed, +112 -18 lines


compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.cpp

Lines changed: 4 additions & 12 deletions
@@ -182,8 +182,7 @@ static FailureOr<std::pair<LoweringConfigAttr, int64_t>>
 getMatmulLoweringConfigAndWorkgroupSize(SmallVector<int64_t> bounds,
                                         ArrayRef<AffineMap> maps,
                                         ArrayRef<Value> operands,
-                                        IREE::GPU::TargetAttr target,
-                                        bool hasFusedLeadingOp) {
+                                        IREE::GPU::TargetAttr target) {
   if (target.getWgp().getMma().empty())
     return failure();

@@ -253,13 +252,11 @@ getMatmulLoweringConfigAndWorkgroupSize(SmallVector<int64_t> bounds,
   std::optional<GPUMMASchedule> schedule = getMmaScheduleFromProblemAndTarget(
       target, problem, transposedLhs, transposedRhs);
 
-  // TODO (nirvedhmeshram, jerryyin): Support all GEMM types.
-  // TODO (nirvedhmeshram): Support fused leading op.
   // TODO (nirvedhmeshram, qedawkins): The performance with this will be bad if
   // the GEMM is accumulating (i.e doesnt have a zero fill dpsInit) as that
   // buffer currently gets materialized as private memory. We need to add
   // missing patterns to fix that.
-  if (!schedule && !contractionDims.batch.empty() && !hasFusedLeadingOp) {
+  if (!schedule) {
     LDBG("Attempting to deduce unaligned TileAndFuse MMA schedulee");
     mustBeAligned = false;
     doCPromotion = true;
@@ -342,9 +339,6 @@ getMatmulLoweringConfigAndWorkgroupSize(SmallVector<int64_t> bounds,
   } else {
     // TODO (nirvedhmeshram, Max191, jerryyin) : Add support so that unaligned
     // shapes do not require c promotion.
-    // TODO (nirvedhmeshram, jerryyin) : When using c promotion the heuristics
-    // used during finding a schedule need to be updated to account for the
-    // extra shared memory for the result.
     GPU::setPromotedOperandList(context, attrs, {0, 1, 2});
     SmallVector<int64_t> paddingTileSizes = workgroupTileSizes;
     int64_t innerKDim = contractionDims.k.back();
@@ -391,8 +385,7 @@ setIGEMMConvolutionLoweringConfig(IREE::GPU::TargetAttr target,
   SmallVector<int64_t> bounds = igemmLoopBounds.value();
   FailureOr<std::pair<LoweringConfigAttr, int64_t>> configAndWgSize =
       getMatmulLoweringConfigAndWorkgroupSize(
-          bounds, igemmContractionMaps.value(), igemmOperands.value(), target,
-          /*hasFusedLeadingOp=*/true);
+          bounds, igemmContractionMaps.value(), igemmOperands.value(), target);
   if (failed(configAndWgSize)) {
     return failure();
   }
@@ -435,8 +428,7 @@ LogicalResult setMatmulLoweringConfig(IREE::GPU::TargetAttr target,
   LDBG("Matmul TileAndFuse Config");
 
   FailureOr<std::pair<LoweringConfigAttr, int64_t>> configAndWgSize =
-      getMatmulLoweringConfigAndWorkgroupSize(bounds, maps, operands, target,
-                                              hasFusedLeadingOp(linalgOp));
+      getMatmulLoweringConfigAndWorkgroupSize(bounds, maps, operands, target);
   if (failed(configAndWgSize)) {
     return failure();
   }

compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp

Lines changed: 2 additions & 0 deletions
@@ -1033,6 +1033,8 @@ static void addLowerToLLVMGPUPasses(OpPassManager &modulePassManager,
       // Pad allocations with dynamic dimension after linalg lowering but before
       // lowering SCF and affine ops.
       .addPass(createPadDynamicAllocPass)
+      // Hoist any newly static allocations from PadDynamicAlloc.
+      .addPass(createHoistStaticallyBoundAllocationsPass)
 
       .addPass(createLowerAffinePass)
       .addPass(createCanonicalizerPass)

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/config_igemm_tile_and_fuse.mlir

Lines changed: 26 additions & 6 deletions
@@ -59,7 +59,7 @@ func.func @nchw_conv_mfma() {
 
 // -----
 
-func.func @nhwc_conv_no_mfma() {
+func.func @nhwc_conv_unaligned_mfma() {
   %cst = arith.constant 0.000000e+00 : f32
   %c0 = arith.constant 0 : index
   %0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<2x33x33x128xf32>>
@@ -74,12 +74,22 @@ func.func @nhwc_conv_no_mfma() {
   return
 }
 
-// CHECK-LABEL: func.func @nhwc_conv_no_mfma
-// CHECK-NOT: use_igemm_convolution = true
+// CHECK-LABEL: func.func @nhwc_conv_unaligned_mfma
+// CHECK-SAME: #iree_codegen.translation_info<pipeline = LLVMGPUTileAndFuse workgroup_size = [256, 1, 1] subgroup_size = 64
+// CHECK-SAME: #iree_gpu.pipeline_options<prefetch_shared_memory = true, no_reduce_shared_memory_bank_conflicts = false
+// CHECK-SAME: use_igemm_convolution = true
+
+// CHECK: linalg.conv_2d_nhwc_hwcf {{.*}}lowering_config = #iree_gpu.lowering_config
+// CHECK-SAME: mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>
+// CHECK-SAME: padding = [2, 1, 32, 64, 32]
+// CHECK-SAME: promote_operands = [0, 1, 2]
+// CHECK-SAME: reduction = [0, 0, 0, 0, 8]
+// CHECK-SAME: subgroup = [2, 1, 2, 1, 0]
+// CHECK-SAME: workgroup = [2, 1, 32, 64, 0]
 
 // -----
 
-func.func @nchw_conv_no_mfma() {
+func.func @nchw_conv_unaligned_mfma() {
   %cst = arith.constant 0.000000e+00 : f32
   %c0 = arith.constant 0 : index
   %0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<2x128x34x34xf32>>
@@ -94,5 +104,15 @@ func.func @nchw_conv_no_mfma() {
   return
 }
 
-// CHECK-LABEL: func.func @nchw_conv_no_mfma
-// CHECK-NOT: use_igemm_convolution = true
+// CHECK-LABEL: func.func @nchw_conv_unaligned_mfma
+// CHECK-SAME: #iree_codegen.translation_info<pipeline = LLVMGPUTileAndFuse workgroup_size = [256, 1, 1] subgroup_size = 64
+// CHECK-SAME: #iree_gpu.pipeline_options<prefetch_shared_memory = true, no_reduce_shared_memory_bank_conflicts = false
+// CHECK-SAME: use_igemm_convolution = true
+
+// CHECK: linalg.conv_2d_nchw_fchw {{.*}}lowering_config = #iree_gpu.lowering_config
+// CHECK-SAME: mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x4_F32>
+// CHECK-SAME: padding = [1, 64, 2, 32, 32]
+// CHECK-SAME: promote_operands = [0, 1, 2]
+// CHECK-SAME: reduction = [0, 0, 0, 0, 8]
+// CHECK-SAME: subgroup = [1, 2, 2, 1, 0]
+// CHECK-SAME: workgroup = [1, 64, 2, 32, 0]

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_igemm_tile_and_fuse.mlir

Lines changed: 80 additions & 0 deletions
@@ -78,3 +78,83 @@ hal.executable private @main {
 // CHECK: } {mapping = [#iree_codegen.workgroup_mapping<z>, #iree_codegen.workgroup_mapping<y>, #iree_codegen.workgroup_mapping<x>]}
 
 // TODO(Max191): Add tests for more convolution types
+
+// -----
+
+#pipeline_layout = #hal.pipeline.layout<bindings = [
+  #hal.pipeline.binding<storage_buffer, ReadOnly>,
+  #hal.pipeline.binding<storage_buffer, ReadOnly>,
+  #hal.pipeline.binding<storage_buffer>
+]>
+#translation = #iree_codegen.translation_info<pipeline =
+  LLVMGPUTileAndFuse
+  workgroup_size = [256, 1, 1]
+  subgroup_size = 64,
+  {
+    gpu_pipeline_options = #iree_gpu.pipeline_options<
+      prefetch_shared_memory = false,
+      no_reduce_shared_memory_bank_conflicts = false,
+      use_igemm_convolution = true>
+  }>
+#config = #iree_gpu.lowering_config<{
+  padding = [2, 1, 32, 16, 16],
+  workgroup = [2, 1, 32, 16, 0],
+  reduction = [0, 0, 0, 0, 1],
+  subgroup = [1, 1, 1, 1, 0],
+  mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
+  promote_operands = [0, 1, 2]
+}>
+hal.executable private @main {
+  hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb">) {
+    hal.executable.export public @conv_dispatch_0_conv_2d_nhwc_hwcf_2x17x17x1281x3x3x1281_f16xf16xf32 ordinal(0) layout(#pipeline_layout) {
+    ^bb0(%arg0: !hal.device):
+      %x, %y, %z = flow.dispatch.workgroup_count_from_slice
+      hal.return %x, %y, %z : index, index, index
+    }
+    builtin.module {
+      func.func @conv_nhwc_unaligned_stride_2() attributes {translation_info = #translation} {
+        %cst = arith.constant 0.000000e+00 : f32
+        %c0 = arith.constant 0 : index
+        %0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<2x35x35x1281xf16>> %1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<3x3x1281x1281xf16>>
+        %2 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(2) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<2x17x17x1281xf32>>
+        %3 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [2, 35, 35, 1281], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<2x35x35x1281xf16>> -> tensor<2x35x35x1281xf16>
+        %4 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0, 0], sizes = [3, 3, 1281, 1281], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<3x3x1281x1281xf16>> -> tensor<3x3x1281x1281xf16>
+        %5 = tensor.empty() : tensor<2x17x17x1281xf32>
+        %6 = linalg.fill ins(%cst : f32) outs(%5 : tensor<2x17x17x1281xf32>) -> tensor<2x17x17x1281xf32>
+        %7 = linalg.conv_2d_nhwc_hwcf {dilations = dense<1> : tensor<2xi64>, lowering_config = #config, strides = dense<2> : tensor<2xi64>} ins(%3, %4 : tensor<2x35x35x1281xf16>, tensor<3x3x1281x1281xf16>) outs(%6 : tensor<2x17x17x1281xf32>) -> tensor<2x17x17x1281xf32>
+        flow.dispatch.tensor.store %7, %2, offsets = [0, 0, 0, 0], sizes = [2, 17, 17, 1281], strides = [1, 1, 1, 1] : tensor<2x17x17x1281xf32> -> !flow.dispatch.tensor<writeonly:tensor<2x17x17x1281xf32>>
+        return
+      }
+    }
+  }
+}
+
+// CHECK-LABEL: func @conv_nhwc_unaligned
+// CHECK-DAG: %[[B0:.+]] = hal.interface.binding.subspan layout({{.+}}) binding(0)
+// CHECK-DAG: %[[B1:.+]] = hal.interface.binding.subspan layout({{.+}}) binding(1)
+// CHECK-DAG: %[[B2:.+]] = hal.interface.binding.subspan layout({{.+}}) binding(2)
+// CHECK-DAG: memref.alloc() : memref<2x1x2x16x1x16xf32, #gpu.address_space<workgroup>>
+// CHECK-DAG: memref.alloc() : memref<16x20xf16, #gpu.address_space<workgroup>>
+// CHECK-DAG: memref.alloc() : memref<2x1x32x20xf16, #gpu.address_space<workgroup>>
+// CHECK-DAG: %[[C0:.+]] = arith.constant 0 : index
+// CHECK-DAG: %[[C721:.+]] = arith.constant 721 : index
+// CHECK-DAG: %[[C1:.+]] = arith.constant 1 : index
+// CHECK: scf.forall ({{.*}}) in (17, 81) {
+// CHECK: %[[LOOP:.+]] = scf.for %[[IV:.+]] = %[[C0]] to %[[C721]] step %[[C1]] {{.*}} -> (vector<1x1x1x1x4x1xf32>)
+// CHECK: gpu.barrier
+// CHECK-DAG: %[[LHS_RD:.+]] = vector.transfer_read %[[B0]]{{.*}} vector<1xf16>
+// CHECK-DAG: vector.transfer_write %[[LHS_RD]]
+// Note that to simplify the test we are not showing the mapping of the RHS_RD
+// to its buffer as it goes through an scf.if/else control structure
+// involving allocas.
+// CHECK-DAG: %[[RHS_RD:.+]] = vector.transfer_read {{.*}} vector<1xf16>
+// CHECK-DAG: vector.transfer_write %[[RHS_RD]]
+// CHECK: gpu.barrier
+// CHECK-DAG: %[[LHS_MM0:.+]] = vector.transfer_read {{.*}} vector<4xf16>
+// CHECK-DAG: %[[RHS_MM:.+]] = vector.transfer_read {{.*}} vector<4x1x1xf16>
+// CHECK-COUNT-1: amdgpu.mfma {{.*}}blocks = 1 : i32, k = 16 : i32, m = 16 : i32, n = 16 : i32
+// CHECK: %[[LOOP_T:.+]] = vector.shape_cast %[[LOOP]] : vector<1x1x1x1x4x1xf32> to vector<4x1x1xf32>
+// CHECK: vector.transfer_write %[[LOOP_T]]
+// Note there is a writeback loop here that is skipped to simplify the test.
+// CHECK: vector.transfer_write {{.*}}, %[[B2]]
+// CHECK: } {mapping = [#iree_codegen.workgroup_mapping<y>, #iree_codegen.workgroup_mapping<x>]}
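
A note on the %[[C721]] bound checked above, derived from the values in this test: after the IGEMM (im2col) transformation the reduction extent is 3 * 3 * 1281 = 11529, which is not a multiple of the MFMA K of 16. With reduction = [0, 0, 0, 0, 1], each scf.for iteration consumes one intrinsic K-step of 16 elements, so the padded loop runs for

  ceil(11529 / 16) = 721 iterations, i.e. an effective padded K of 721 * 16 = 11536

which is exactly the unaligned-to-intrinsic case this commit enables.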
