
Commit 21234ed

[Codegen][GPU] Infer workgroup size multiples from producers and consumers (#19804)
This PR adds new logic in ConfigUtils.cpp to analyze a dispatch and determine required multiples of workgroup tile sizes for the root operation. This affects dispatches that contain either tensor.pack or tensor.unpack ops, since the pack and unpack ops require the workgroup tile sizes to be a multiple of their inner_tiles in order for them to be fused into the workgroup scf.forall loop. The following example of a gpu set_encoding dispatch illustrates the new constraint imposed by this PR:

```mlir
%in = flow.dispatch.tensor.load ... -> tensor<256x64xi8>
%pack = tensor.pack %in ... inner_tiles = [128, 64] ... tensor<256x64xi8> -> tensor<2x1x128x64xi8>
%expanded = tensor.expand_shape %pack [[0], [1], [2, 3, 4], [5, 6, 7]] : tensor<2x1x128x64xi8> into tensor<2x1x4x8x4x2x4x8xi8>
// linalg.transpose is the root op. The workgroup tile sizes must contain an
// even multiple of the tensor.pack inner_tiles.
%transposed = linalg.transpose ins(%expanded : tensor<2x1x4x8x4x2x4x8xi8>)
                               outs(%empty : tensor<2x1x8x4x4x4x2x8xi8>)
                               permutation = [0, 1, 3, 6, 2, 4, 5, 7]
flow.dispatch.tensor.store %transposed
```

Since the linalg.transpose is the root op, it needs to be aware of its producer chain when selecting tile sizes. With this PR, the lowering config selection logic will walk producers until it hits an unsupported operation or a block argument, and find the LCM of any pack or unpack tiles along the dimensions of their inner_tiles. In the above example, this would look like the following:

1. Walk the producer chain up to the producer of the `tensor.pack`, and stop at the `flow.dispatch.tensor.load`. The initial workgroup tile size multiples will be `[1, 1]` (i.e., no constraint for unsupported ops).
2. The workgroup tile sizes will be propagated through the `tensor.pack`, which updates the workgroup tile size multiples to `[1, 1, 128, 64]`.
3. Then, it will propagate through the `tensor.expand_shape`, which will expand the workgroup size multiples if possible. In this case, they are expanded to `[1, 1, 4, 8, 4, 2, 4, 8]`.
4. Now walk the consumer chain to find the multiples for the workgroup tile slice of the root op result. In this case, the propagation simply stops at the `flow.dispatch.tensor.store`, and the multiples are `[1, 1, 1, ...]`.
5. Now the root op has the required workgroup tile size multiples for the operand and result slices. The multiples for the iteration space of the op are computed from the indexing maps of the operation by taking, for each iteration dimension, the LCM of that dimension's multiples from all operands and results. In this case the final workgroup tile size multiples become `[1, 1, 8, 4, 4, 4, 2, 8]`.

---------

Signed-off-by: Max Dawkins <[email protected]>
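As a companion to step 5 above, the following is a minimal standalone C++ sketch of how per-operand slice multiples can be merged into iteration-space multiples through indexing maps. This is an illustration only, not the code added in TileInferenceUtils.cpp; the function name and the explicit map representation are hypothetical. It reproduces the `[1, 1, 8, 4, 4, 4, 2, 8]` result from the example.

```cpp
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// Each operand's indexing map is modeled as a projected permutation:
// map[d] is the iteration-space dimension accessed by operand dimension d.
using IndexingMap = std::vector<int64_t>;

// For every iteration dimension, take the LCM of the multiples required by
// each operand/result slice that indexes it (step 5 of the description).
std::vector<int64_t> mergeSliceMultiples(
    int64_t numLoops, const std::vector<IndexingMap> &maps,
    const std::vector<std::vector<int64_t>> &sliceMultiples) {
  std::vector<int64_t> loopMultiples(numLoops, 1);
  for (size_t i = 0; i < maps.size(); ++i) {
    for (size_t d = 0; d < maps[i].size(); ++d) {
      int64_t loop = maps[i][d];
      loopMultiples[loop] = std::lcm(loopMultiples[loop], sliceMultiples[i][d]);
    }
  }
  return loopMultiples;
}

int main() {
  // linalg.transpose example: iteration dimensions ordered like the result.
  // The result uses the identity map; the input's map is the inverse of the
  // permutation [0, 1, 3, 6, 2, 4, 5, 7].
  IndexingMap inputMap = {0, 1, 4, 2, 5, 6, 3, 7};
  IndexingMap resultMap = {0, 1, 2, 3, 4, 5, 6, 7};
  std::vector<int64_t> inputMultiples = {1, 1, 4, 8, 4, 2, 4, 8};  // step 3
  std::vector<int64_t> resultMultiples = {1, 1, 1, 1, 1, 1, 1, 1}; // step 4

  std::vector<int64_t> multiples = mergeSliceMultiples(
      8, {inputMap, resultMap}, {inputMultiples, resultMultiples});
  for (int64_t m : multiples)
    std::cout << m << ' ';
  std::cout << '\n'; // 1 1 8 4 4 4 2 8
}
```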
1 parent e4c683f commit 21234ed

File tree

9 files changed: +669 -51 lines changed


compiler/src/iree/compiler/Codegen/Common/BUILD.bazel

Lines changed: 2 additions & 0 deletions
@@ -154,6 +154,7 @@ iree_compiler_cc_library(
         "TileAndDistributeToWorkgroupsPass.cpp",
         "TileDispatchUsingForall.cpp",
         "TileDispatchUsingInterface.cpp",
+        "TileInferenceUtils.cpp",
         "TileLargeTensors.cpp",
         "TileSizeSelection.cpp",
         "Transforms.cpp",
@@ -171,6 +172,7 @@ iree_compiler_cc_library(
         "PassUtils.h",
         "Passes.h",
         "TensorDynamicDimAnalysis.h",
+        "TileInferenceUtils.h",
         "TileSizeSelection.h",
         "Transforms.h",
         "UserConfig.h",

compiler/src/iree/compiler/Codegen/Common/CMakeLists.txt

Lines changed: 2 additions & 0 deletions
@@ -74,6 +74,7 @@ iree_cc_library(
     "PassUtils.h"
     "Passes.h"
     "TensorDynamicDimAnalysis.h"
+    "TileInferenceUtils.h"
     "TileSizeSelection.h"
     "Transforms.h"
     "UserConfig.h"
@@ -146,6 +147,7 @@ iree_cc_library(
     "TileAndDistributeToWorkgroupsPass.cpp"
     "TileDispatchUsingForall.cpp"
     "TileDispatchUsingInterface.cpp"
+    "TileInferenceUtils.cpp"
     "TileLargeTensors.cpp"
     "TileSizeSelection.cpp"
     "Transforms.cpp"

compiler/src/iree/compiler/Codegen/Common/TileInferenceUtils.cpp

Lines changed: 397 additions & 0 deletions
Large diffs are not rendered by default.
compiler/src/iree/compiler/Codegen/Common/TileInferenceUtils.h

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+// Copyright 2025 The IREE Authors
+//
+// Licensed under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#ifndef IREE_COMPILER_CODEGEN_LLVMCPU_TILEINFERENCEUTILS_H_
+#define IREE_COMPILER_CODEGEN_LLVMCPU_TILEINFERENCEUTILS_H_
+
+#include "mlir/Interfaces/TilingInterface.h"
+
+namespace mlir::iree_compiler {
+
+/// Walks the producer and consumer chains of the `tilingOp`, and looks for ops
+/// that require specific workgroup tile size multiples. Right now, the only ops
+/// that require a specific multiple are pack and unpack, since the workgroup
+/// tile sizes need to be multiples of the inner_tiles. After walking the IR and
+/// finding multiples for the slices of the `tilingOp` operands and results, the
+/// function computes and returns the multiples of the `tilingOp` iteration
+/// space. The function may fail to find a valid set of workgroup size
+/// multiples, in which case the function will fallback to returning a list of
+/// all 1, meaning no constraints on the workgroup tile sizes.
+SmallVector<int64_t> getWorkgroupSizeMultiples(TilingInterface tilingOp);
+
+} // namespace mlir::iree_compiler
+
+#endif // IREE_COMPILER_CODEGEN_LLVMCPU_TILEINFERENCEUTILS_H_
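To make the propagation this header describes more concrete, below is a small standalone C++ sketch of how multiples might flow through a `tensor.pack` and a `tensor.expand_shape` (steps 2 and 3 of the commit message). It is a simplification under stated assumptions, not the logic in TileInferenceUtils.cpp: the helper names are hypothetical, and the pack helper assumes the source multiples on the packed dims are 1, as in the example.

```cpp
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// tensor.pack: the packed inner dimensions must be tiled as whole inner_tiles,
// so the result multiples are the source multiples on the outer dims followed
// by the inner_tiles themselves (assuming the source multiples on the packed
// dims are 1, as in the example).
std::vector<int64_t> propagateThroughPack(const std::vector<int64_t> &srcMultiples,
                                          const std::vector<int64_t> &innerTiles) {
  std::vector<int64_t> result(srcMultiples);
  result.insert(result.end(), innerTiles.begin(), innerTiles.end());
  return result;
}

// tensor.expand_shape: factor each source multiple across its expanded dims,
// starting from the innermost expanded dim. Returns an empty vector when a
// multiple does not factor cleanly (the real logic would then fall back to
// all-1 multiples, i.e. no constraint).
std::vector<int64_t> propagateThroughExpandShape(
    const std::vector<int64_t> &srcMultiples,
    const std::vector<std::vector<int64_t>> &expandedSizes) {
  std::vector<int64_t> result;
  for (size_t group = 0; group < srcMultiples.size(); ++group) {
    const std::vector<int64_t> &sizes = expandedSizes[group];
    int64_t remaining = srcMultiples[group];
    std::vector<int64_t> groupMultiples(sizes.size(), 1);
    for (int64_t i = static_cast<int64_t>(sizes.size()) - 1; i >= 0; --i) {
      int64_t take = std::gcd(remaining, sizes[i]);
      groupMultiples[i] = take;
      remaining /= take;
    }
    if (remaining != 1)
      return {}; // Could not factor the multiple through this group.
    result.insert(result.end(), groupMultiples.begin(), groupMultiples.end());
  }
  return result;
}

int main() {
  // tensor<256x64xi8> packed with inner_tiles = [128, 64]:
  //   [1, 1] -> [1, 1, 128, 64]
  std::vector<int64_t> packed = propagateThroughPack({1, 1}, {128, 64});
  // tensor<2x1x128x64xi8> expanded into tensor<2x1x4x8x4x2x4x8xi8>:
  //   [1, 1, 128, 64] -> [1, 1, 4, 8, 4, 2, 4, 8]
  std::vector<int64_t> expanded =
      propagateThroughExpandShape(packed, {{2}, {1}, {4, 8, 4}, {2, 4, 8}});
  for (int64_t m : expanded)
    std::cout << m << ' ';
  std::cout << '\n'; // 1 1 4 8 4 2 4 8
}
```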

compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/BUILD.bazel

Lines changed: 2 additions & 0 deletions
@@ -21,12 +21,14 @@ iree_compiler_cc_library(
         "ConfigUtils.h",
     ],
     deps = [
+        "//compiler/src/iree/compiler/Codegen/Common",
         "//compiler/src/iree/compiler/Codegen/Common/GPU:GPUHeuristics",
         "//compiler/src/iree/compiler/Codegen/Dialect/Codegen/IR:IREECodegenDialect",
         "//compiler/src/iree/compiler/Codegen/Dialect/GPU/IR:IREEGPUDialect",
         "//compiler/src/iree/compiler/Codegen/Utils",
         "//compiler/src/iree/compiler/Dialect/LinalgExt/Utils",
         "@llvm-project//llvm:Support",
+        "@llvm-project//mlir:DialectUtils",
         "@llvm-project//mlir:FunctionInterfaces",
         "@llvm-project//mlir:IR",
         "@llvm-project//mlir:LinalgDialect",

compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ iree_cc_library(
     MLIRLinalgDialect
     MLIRLinalgUtils
     MLIRSupport
+    iree::compiler::Codegen::Common
     iree::compiler::Codegen::Common::GPU::GPUHeuristics
     iree::compiler::Codegen::Dialect::Codegen::IR::IREECodegenDialect
     iree::compiler::Codegen::Dialect::GPU::IR::IREEGPUDialect

compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.cpp

Lines changed: 84 additions & 44 deletions
@@ -7,6 +7,7 @@
 #include "iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.h"
 
 #include "iree/compiler/Codegen/Common/GPU/GPUHeuristics.h"
+#include "iree/compiler/Codegen/Common/TileInferenceUtils.h"
 #include "iree/compiler/Codegen/Dialect/Codegen/IR/IREECodegenAttrs.h"
 #include "iree/compiler/Codegen/Dialect/GPU/IR/GPULoweringConfigUtils.h"
 #include "iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.h"
@@ -20,6 +21,7 @@
 #include "llvm/Support/Debug.h"
 #include "mlir/Dialect/Linalg/IR/LinalgInterfaces.h"
 #include "mlir/Dialect/Linalg/Utils/Utils.h"
+#include "mlir/Dialect/Utils/IndexingUtils.h"
 #include "mlir/IR/BuiltinAttributes.h"
 #include "mlir/IR/TypeUtilities.h"
 #include "mlir/Interfaces/FunctionInterfaces.h"
@@ -34,6 +36,10 @@ namespace mlir::iree_compiler::IREE::GPU {
 constexpr int64_t kCacheLineSizeBits = 128 * 8;
 constexpr int64_t kPreferredCopyNumBits = 128;
 
+//===----------------------------------------------------------------------===//
+// Lowering Config Selection
+//===----------------------------------------------------------------------===//
+
 LogicalResult setDataTiledMultiMmaLoweringConfig(
     IREE::GPU::TargetAttr target, mlir::FunctionOpInterface entryPoint,
     Operation *op, IREE::GPU::UKernelConfigAttr ukernelConfig) {
@@ -529,6 +535,17 @@ LogicalResult setTileAndFuseLoweringConfig(IREE::GPU::TargetAttr target,
   SmallVector<int64_t> workgroupTileSizes(loopDepth, 0);
   SmallVector<int64_t> threadTileSizes(loopDepth, 0);
 
+  // Find constraints on workgroup tile sizes due to pack or unpack ops in the
+  // dispatch. If there are no pack or unpack ops present, then these multiples
+  // will be 1, which means there is no constraint on workgroup tile sizes.
+  //
+  // TODO(Max191): Getting the workgroup size multiples is needed for current
+  // pack and unpack GPU codegen. Ideally, we won't rely on propagating pack
+  // and unpack tile size information during lowering strategy selection, and
+  // this logic should be dropped once we have a better solution.
+  SmallVector<int64_t> workgroupTileSizeMultiples =
+      getWorkgroupSizeMultiples(cast<TilingInterface>(op));
+
   // Common case for all linalg ops.
 
   // The core idea is to distribute the partitioned loops to the workgroup
@@ -566,23 +583,15 @@ LogicalResult setTileAndFuseLoweringConfig(IREE::GPU::TargetAttr target,
     LDBG("Loss factor: " << lossFactor << "\n");
     // Initialize the configuration.
     flatWorkgroupSize = 1;
-    // Initialize tiling along all partitioned loops with size 1.
+    // Initialize thread tiling along all partitioned loops with size 1, and
+    // workgroup tiling with the required tile size multiples. This may lead
+    // to larger workgroup tiles than the number of threads in the workgroup,
+    // but it is unavoidable.
     for (int64_t loopIndex : partitionableLoops) {
-      workgroupTileSizes[loopIndex] = threadTileSizes[loopIndex] = 1;
-    }
-    // Override the innermost dimension to distribute to threads in a subgroup.
-    workgroupTileSizes[partitionableLoops.back()] = subgroupSize;
-
-    // If there are more than 3 parallel dim try to tile the extra higher level
-    // dimensions to 1 for extra dimensions.
-    if (isa<linalg::GenericOp>(linalgOp.getOperation())) {
-      for (auto [i, tileSize] : llvm::enumerate(workgroupTileSizes)) {
-        if (tileSize != 0)
-          break;
-        if (loopBounds[i] != 1)
-          tileSize = 1;
-      }
+      workgroupTileSizes[loopIndex] = workgroupTileSizeMultiples[loopIndex];
+      threadTileSizes[loopIndex] = 1;
     }
+
     // Scan from the innermost shape dimension and try to deduce the
     // configuration for the corresponding GPU workgroup dimension.
     int64_t wgDim = 0;
@@ -592,18 +601,26 @@ LogicalResult setTileAndFuseLoweringConfig(IREE::GPU::TargetAttr target,
       if (ShapedType::isDynamic(loopBound))
        continue;
 
-      // Try to find some power of two that can devide the current shape dim
+      // Try to find some power of two that can divide the current shape dim
       // size. This vector keeps the candidate tile sizes.
       SmallVector<int64_t, 8> candidates;
 
+      // Ensure vectorization works with the `workgroupTileMultiple`.
+      int64_t workgroupTileMultiple = workgroupTileSizeMultiples[shapeDim];
+      vectorizable =
+          vectorizable && 4 * numThreads % workgroupTileMultiple == 0;
       // For the inner most workgroup dim, try to see if we can have 4
       // elements per thread. This enables vectorization.
      if (vectorizable && wgDim == 0 && !lossFactor) {
        candidates.push_back(4 * numThreads);
      }
-      // Try all power of two numbers up to the subgroup size.
-      for (unsigned i = numThreads; i >= 1; i >>= 1) {
-        candidates.push_back(i);
+      // Try all power of two multiples of `workgroupTileMultiple` up to the
+      // subgroup size.
+      uint64_t maxCandidate =
+          std::max<uint64_t>(1, llvm::PowerOf2Ceil(llvm::divideCeil(
+                                    numThreads, workgroupTileMultiple)));
+      for (unsigned i = maxCandidate; i >= 1; i >>= 1) {
+        candidates.push_back(i * workgroupTileMultiple);
      }
      LLVM_DEBUG({
        llvm::dbgs() << "Base candidate tile sizes: [";
@@ -629,13 +646,10 @@ LogicalResult setTileAndFuseLoweringConfig(IREE::GPU::TargetAttr target,
          continue;
        }
 
-        // Found a suitable candidate. Try to let each thread handle 4
-        // elements if this is the workgroup x dimension.
+        // Try to let each thread handle 4 elements if this is the workgroup x
+        // dimension.
        // TODO: Try to take into account element type bit width to get
        // 4xdword reads instead of 4x{elements}.
-        workgroupTileSizes[shapeDim] = scaledTileSize;
-        LLVM_DEBUG(llvm::dbgs()
-                   << "Chosen workgroup tile size: " << scaledTileSize << "\n");
        if (vectorizable && wgDim == 0 && !lossFactor && candidate % 4 == 0) {
          // Use size-1 vectors to increase parallelism if larger ones causes
          // idle threads in the subgroup.
@@ -648,13 +662,29 @@ LogicalResult setTileAndFuseLoweringConfig(IREE::GPU::TargetAttr target,
          assert(numThreads % (candidate / vectorSize) == 0);
          numThreads /= candidate / vectorSize;
        } else {
+          // When the workgroupTileMultiple is not a Po2, then the candidate
+          // may not evenly divide the numThreads. In this case, we get some
+          // idle threads in the last iteration of the workgroup tile. Verify
+          // that the idle threads are within the lossFactor.
+          int64_t maybeCandidateWorkgroupSize = candidate;
+          if (numThreads % candidate != 0) {
+            maybeCandidateWorkgroupSize =
+                std::min<int64_t>(1ll << llvm::Log2_64(candidate), numThreads);
+            int64_t idleThreads = candidate % maybeCandidateWorkgroupSize;
+            if (idleThreads != 0 &&
+                (!lossFactor || idleThreads > candidate / *lossFactor)) {
+              continue;
+            }
+          }
          if (wgDim == 0)
            vectorizable = false;
          threadTileSizes[shapeDim] = scaleToByte;
-          candidateWorkgroupSize = candidate;
-          assert(numThreads % candidate == 0);
-          numThreads /= candidate;
+          candidateWorkgroupSize = maybeCandidateWorkgroupSize;
+          numThreads /= candidateWorkgroupSize;
        }
+        workgroupTileSizes[shapeDim] = scaledTileSize;
+        LLVM_DEBUG(llvm::dbgs()
+                   << "Chosen workgroup tile size: " << scaledTileSize << "\n");
        assert(numThreads >= 1);
        break;
      }
@@ -674,8 +704,17 @@ LogicalResult setTileAndFuseLoweringConfig(IREE::GPU::TargetAttr target,
  if (distributeToThreads(newNumThreads) != 1) {
    // Otherwise, allow larger and larger loss factor.
 
-    // Threads for distribution. Use 32 at least.
-    int64_t numThreads = std::max(subgroupSize, 32);
+    // Threads for distribution. Use `minPreferredNumThreads` at least, but no
+    // more than 4 subgroups.
+    int64_t minPreferredNumThreads = std::reduce(
+        workgroupTileSizeMultiples.begin(), workgroupTileSizeMultiples.end(), 1,
+        std::multiplies<int64_t>());
+    int64_t numThreads =
+        std::min<int64_t>(4 * subgroupSize, minPreferredNumThreads);
+    // If minPreferredNumThreads is small, use at least 32 or subgroupSize
+    // threads, whichever is larger.
+    numThreads =
+        std::max<int64_t>(std::max<int64_t>(subgroupSize, 32), numThreads);
    // We can tolerate (1 / lossFactor) of threads in the workgroup to be idle.
    int64_t lossFactor = 32;
 
@@ -685,21 +724,6 @@ LogicalResult setTileAndFuseLoweringConfig(IREE::GPU::TargetAttr target,
    }
  }
 
-  // Attach the MMA schedule as an attribute to the entry point export function
-  // for later access in the pipeline.
-  MLIRContext *context = linalgOp.getContext();
-  SmallVector<NamedAttribute, 1> attrs;
-  Builder b(context);
-  attrs.emplace_back(StringAttr::get(context, "workgroup"),
-                     b.getI64ArrayAttr(workgroupTileSizes));
-
-  attrs.emplace_back(StringAttr::get(context, "thread"),
-                     b.getI64ArrayAttr(threadTileSizes));
-
-  if (isNonMatvecContraction(linalgOp)) {
-    GPU::setPromotedOperandList(context, attrs, {0, 1});
-  }
-
  // Heuristic value chosen to limit maximum vector sizes when tiling below.
  const unsigned maxVectorSize = 32;
 
@@ -726,6 +750,22 @@ LogicalResult setTileAndFuseLoweringConfig(IREE::GPU::TargetAttr target,
      loopTileSizes[i] = tileSize;
    }
  }
+
+  // Attach the MMA schedule as an attribute to the entry point export function
+  // for later access in the pipeline.
+  MLIRContext *context = linalgOp.getContext();
+  SmallVector<NamedAttribute, 1> attrs;
+  Builder b(context);
+  attrs.emplace_back(StringAttr::get(context, "workgroup"),
+                     b.getI64ArrayAttr(workgroupTileSizes));
+
+  attrs.emplace_back(StringAttr::get(context, "thread"),
+                     b.getI64ArrayAttr(threadTileSizes));
+
+  if (isNonMatvecContraction(linalgOp)) {
+    GPU::setPromotedOperandList(context, attrs, {0, 1});
+  }
+
  if (llvm::any_of(loopTileSizes, [](int64_t s) { return s != 0; })) {
    attrs.emplace_back(StringAttr::get(context, "reduction"),
                       b.getI64ArrayAttr(loopTileSizes));
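As a side illustration of the candidate-size change in the diff above: the heuristic now enumerates power-of-two scalings of the required workgroup tile multiple instead of bare powers of two up to the thread count. The standalone C++ sketch below mirrors that loop; the `divideCeil`/`powerOf2Ceil` helpers are plain C++ stand-ins for the LLVM utilities, and the concrete numbers are made up for demonstration.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Plain stand-ins for llvm::divideCeil and llvm::PowerOf2Ceil.
static uint64_t divideCeil(uint64_t a, uint64_t b) { return (a + b - 1) / b; }
static uint64_t powerOf2Ceil(uint64_t v) {
  uint64_t p = 1;
  while (p < v)
    p <<= 1;
  return p;
}

int main() {
  const int64_t numThreads = 64;            // e.g. flat workgroup size budget
  const int64_t workgroupTileMultiple = 24; // e.g. required multiple from a pack

  // Largest power-of-two scaling such that the scaled tile stays near the
  // thread budget, then walk down by halving.
  uint64_t maxCandidate = std::max<uint64_t>(
      1, powerOf2Ceil(divideCeil(numThreads, workgroupTileMultiple)));
  std::vector<int64_t> candidates;
  for (uint64_t i = maxCandidate; i >= 1; i >>= 1)
    candidates.push_back(i * workgroupTileMultiple);

  // With numThreads = 64 and a multiple of 24, the candidates are 96, 48, 24:
  // each one a power-of-two scaling of the required multiple.
  for (int64_t c : candidates)
    std::cout << c << ' ';
  std::cout << '\n';
}
```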

compiler/src/iree/compiler/Codegen/LLVMGPU/KernelConfig.cpp

Lines changed: 19 additions & 7 deletions
@@ -2520,14 +2520,15 @@ LogicalResult initGPULaunchConfig(FunctionOpInterface funcOp) {
 
   Operation *rootOperation = nullptr;
 
-  // Find the root operation. linalg.generic, linalg.fill, and scatter are not
-  // root operations if there are other compute operations present.
-  // Also, construct a set of generic ops that are to be skipped. These generic
-  // ops that are used to compute scatter indices are not root operations.
+  // Find the root operation. linalg.generic, linalg.fill, tensor.pack,
+  // tensor.unpack, and scatter are not root operations if there are other
+  // compute operations present. Also, construct a set of generic ops that
+  // are to be skipped. These generic ops that are used to compute scatter
+  // indices are not root operations.
   llvm::SmallDenseSet<Operation *, 4> genericToSkip;
   for (Operation *op : llvm::reverse(computeOps)) {
-    if (!isa<linalg::GenericOp, linalg::FillOp, IREE::LinalgExt::ScatterOp>(
-            op)) {
+    if (!isa<linalg::GenericOp, linalg::FillOp, IREE::LinalgExt::ScatterOp,
+             tensor::PackOp, tensor::UnPackOp>(op)) {
      rootOperation = op;
      break;
    }
@@ -2554,7 +2555,8 @@ LogicalResult initGPULaunchConfig(FunctionOpInterface funcOp) {
    }
  }
 
-  // Generic ops take priority over scatter and fill ops as the root op.
+  // Generic ops take priority over pack, unpack, scatter, and fill ops as the
+  // root op.
  if (!rootOperation) {
    for (Operation *op : llvm::reverse(computeOps)) {
      if (isa<linalg::GenericOp>(op) && !genericToSkip.contains(op)) {
@@ -2564,6 +2566,16 @@ LogicalResult initGPULaunchConfig(FunctionOpInterface funcOp) {
    }
  }
 
+  // Pack and unpack ops take priority over scatter and fill ops as the root op.
+  if (!rootOperation) {
+    for (Operation *op : llvm::reverse(computeOps)) {
+      if (isa<tensor::PackOp, tensor::UnPackOp>(op)) {
+        rootOperation = op;
+        break;
+      }
+    }
+  }
+
  if (!rootOperation) {
    for (Operation *op : llvm::reverse(computeOps)) {
      if (isa<IREE::LinalgExt::ScatterOp, linalg::FillOp>(op)) {