@charithaintc
Contributor

moveRegionToNewWarpOpAndAppendReturns implicitly assumes that the WarpOp's yield operands contain no duplicates. It stores the yielded values in a SetVector, which drops duplicates, but collects the corresponding types in a SmallVector, which keeps one entry per result. The two containers therefore diverge when a value is yielded more than once, as in this test case:

func.func @warp_propagate_duplicated_operands_in_yield(%laneid: index)  {
  %r:3 = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>, vector<1xf32>, vector<1xf32>) {
    %0 = "some_def"() : () -> (vector<32xf32>)
    %1 = "some_other_def"() : () -> (vector<32xf32>)
    %2 = math.exp %1 : vector<32xf32>
    gpu.yield %2, %0, %0 : vector<32xf32>, vector<32xf32>, vector<32xf32>
  }
  "some_use"(%r#0) : (vector<1xf32>) -> ()
  return
}

With three yield operands but only two distinct values, yieldValues and types end up with different sizes, which causes a crash.
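
The mismatch can be reproduced outside MLIR. The following is a minimal sketch in standard C++, not the actual implementation: `OrderedSet` is a hypothetical stand-in for `llvm::SetVector`, plain strings stand in for MLIR values and types, and `collectOld` and `Sizes` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for llvm::SetVector: keeps insertion order and
// silently drops values that were already inserted.
struct OrderedSet {
  std::vector<std::string> items;
  bool insert(const std::string &v) {
    if (std::find(items.begin(), items.end(), v) != items.end())
      return false;
    items.push_back(v);
    return true;
  }
};

struct Sizes {
  std::size_t values;
  std::size_t types;
};

// Mimics the pre-fix bookkeeping: values go through a deduplicating set,
// while types are copied one per result without deduplication.
Sizes collectOld(const std::vector<std::string> &yieldOperands,
                 const std::vector<std::string> &resultTypes) {
  OrderedSet yieldValues;
  for (const std::string &v : yieldOperands)
    yieldValues.insert(v);
  std::vector<std::string> types(resultTypes);
  return {yieldValues.items.size(), types.size()};
}
```

Calling `collectOld` with the operands `%2, %0, %0` and three `vector<1xf32>` result types returns two values but three types, which is the inconsistency described above.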

This is a subtle bug. If WarpOpDeadResult runs before WarpOpElementwise, the crash does not occur, because WarpOpDeadResult simplifies the duplicate operands first. However, moveRegionToNewWarpOpAndAppendReturns should not depend on a particular pattern application order.

@llvmbot
Member

llvmbot commented Aug 14, 2025

@llvm/pr-subscribers-mlir-vector
@llvm/pr-subscribers-mlir-gpu

@llvm/pr-subscribers-mlir

Author: Charitha Saumya (charithaintc)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/153656.diff

2 Files Affected:

  • (modified) mlir/lib/Dialect/GPU/Utils/DistributionUtils.cpp (+18-14)
  • (modified) mlir/test/Dialect/Vector/vector-warp-distribute.mlir (+21)
diff --git a/mlir/lib/Dialect/GPU/Utils/DistributionUtils.cpp b/mlir/lib/Dialect/GPU/Utils/DistributionUtils.cpp
index 384d1a0ddccd2..be71bd02fc43b 100644
--- a/mlir/lib/Dialect/GPU/Utils/DistributionUtils.cpp
+++ b/mlir/lib/Dialect/GPU/Utils/DistributionUtils.cpp
@@ -14,6 +14,7 @@
 #include "mlir/Dialect/Affine/IR/AffineOps.h"
 #include "mlir/Dialect/Arith/IR/Arith.h"
 #include "mlir/IR/Value.h"
+#include "llvm/ADT/DenseMap.h"
 
 #include <numeric>
 
@@ -57,26 +58,29 @@ WarpDistributionPattern::moveRegionToNewWarpOpAndAppendReturns(
                           warpOp.getResultTypes().end());
   auto yield = cast<gpu::YieldOp>(
       warpOp.getBodyRegion().getBlocks().begin()->getTerminator());
-  llvm::SmallSetVector<Value, 32> yieldValues(yield.getOperands().begin(),
-                                              yield.getOperands().end());
+  SmallVector<Value> yieldValues(yield.getOperands().begin(),
+                                 yield.getOperands().end());
+  llvm::SmallDenseMap<Value, unsigned> indexLookup;
+  // Record the value -> first index mapping for faster lookup.
+  for (auto [i, v] : llvm::enumerate(yieldValues)) {
+    if (!indexLookup.count(v))
+      indexLookup[v] = i;
+  }
+
   for (auto [value, type] : llvm::zip_equal(newYieldedValues, newReturnTypes)) {
-    if (yieldValues.insert(value)) {
+    // If the value already exists in the yield, don't create a new output.
+    if (indexLookup.count(value)) {
+      indices.push_back(indexLookup[value]);
+    } else {
+      // If the value is new, add it to the yield and to the types.
+      yieldValues.push_back(value);
       types.push_back(type);
       indices.push_back(yieldValues.size() - 1);
-    } else {
-      // If the value already exit the region don't create a new output.
-      for (auto [idx, yieldOperand] :
-           llvm::enumerate(yieldValues.getArrayRef())) {
-        if (yieldOperand == value) {
-          indices.push_back(idx);
-          break;
-        }
-      }
     }
   }
-  yieldValues.insert_range(newYieldedValues);
+
   WarpExecuteOnLane0Op newWarpOp = moveRegionToNewWarpOpAndReplaceReturns(
-      rewriter, warpOp, yieldValues.getArrayRef(), types);
+      rewriter, warpOp, yieldValues, types);
   rewriter.replaceOp(warpOp,
                      newWarpOp.getResults().take_front(warpOp.getNumResults()));
   return newWarpOp;
diff --git a/mlir/test/Dialect/Vector/vector-warp-distribute.mlir b/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
index ae8fce786ee57..c3ce7e9ca7fda 100644
--- a/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
+++ b/mlir/test/Dialect/Vector/vector-warp-distribute.mlir
@@ -1803,3 +1803,24 @@ func.func @warp_propagate_nd_write(%laneid: index, %dest: memref<4x1024xf32>) {
 //       CHECK-DIST-AND-PROP:   %[[IDS:.+]]:2 = affine.delinearize_index %{{.*}} into (4, 8) : index, index
 //       CHECK-DIST-AND-PROP:   %[[INNER_ID:.+]] = affine.apply #map()[%[[IDS]]#1]
 //       CHECK-DIST-AND-PROP:   vector.transfer_write %[[W]], %{{.*}}[%[[IDS]]#0, %[[INNER_ID]]] {{.*}} : vector<1x128xf32>
+
+// -----
+func.func @warp_propagate_duplicated_operands_in_yield(%laneid: index)  {
+  %r:3 = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>, vector<1xf32>, vector<1xf32>) {
+    %0 = "some_def"() : () -> (vector<32xf32>)
+    %1 = "some_other_def"() : () -> (vector<32xf32>)
+    %2 = math.exp %1 : vector<32xf32>
+    gpu.yield %2, %0, %0 : vector<32xf32>, vector<32xf32>, vector<32xf32>
+  }
+  "some_use"(%r#0) : (vector<1xf32>) -> ()
+  return
+}
+
+// CHECK-PROP-LABEL: func.func @warp_propagate_duplicated_operands_in_yield(
+// CHECK-PROP:         %[[W:.*]] = gpu.warp_execute_on_lane_0(%{{.*}})[32] -> (vector<1xf32>) {
+// CHECK-PROP:           %{{.*}} = "some_def"() : () -> vector<32xf32>
+// CHECK-PROP:           %[[T3:.*]] = "some_other_def"() : () -> vector<32xf32>
+// CHECK-PROP:           gpu.yield %[[T3]] : vector<32xf32>
+// CHECK-PROP:         }
+// CHECK-PROP:         %[[T1:.*]] = math.exp %[[W]] : vector<1xf32>
+// CHECK-PROP:         "some_use"(%[[T1]]) : (vector<1xf32>) -> ()
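
The bookkeeping introduced by the diff in DistributionUtils.cpp can be sketched in standard C++ as follows. This is an illustration under stated assumptions rather than the MLIR code itself: strings stand in for `Value` and `Type`, `std::unordered_map` stands in for `llvm::SmallDenseMap`, and `AppendResult` and `appendReturns` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct AppendResult {
  std::vector<std::string> yieldValues; // may contain duplicates
  std::vector<std::string> types;       // always parallel to yieldValues
  std::vector<std::size_t> indices;     // result index for each requested value
};

AppendResult appendReturns(const std::vector<std::string> &yieldOperands,
                           const std::vector<std::string> &resultTypes,
                           const std::vector<std::string> &newValues,
                           const std::vector<std::string> &newTypes) {
  assert(yieldOperands.size() == resultTypes.size());
  assert(newValues.size() == newTypes.size());

  // Keep every existing operand, duplicates included, so that the values
  // and types stay the same length.
  AppendResult r{yieldOperands, resultTypes, {}};

  // Record value -> first index; emplace does not overwrite an existing key.
  std::unordered_map<std::string, std::size_t> indexLookup;
  for (std::size_t i = 0; i < r.yieldValues.size(); ++i)
    indexLookup.emplace(r.yieldValues[i], i);

  for (std::size_t i = 0; i < newValues.size(); ++i) {
    auto it = indexLookup.find(newValues[i]);
    if (it != indexLookup.end()) {
      // Already yielded: reuse the existing result instead of adding one.
      r.indices.push_back(it->second);
      continue;
    }
    // New value: append it together with its type. As in the diff, the
    // lookup table is not updated for values appended here.
    r.yieldValues.push_back(newValues[i]);
    r.types.push_back(newTypes[i]);
    r.indices.push_back(r.yieldValues.size() - 1);
  }
  return r;
}
```

For the yield `%2, %0, %0`, requesting `%0` and `%1` maps `%0` to the existing index 1 and appends `%1` at index 3, so the values and types both have four entries.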

@charithaintc charithaintc changed the title [vector][distribution] Bug fix in moveRegionToNewWarpOpAndAppendReturns [vector][distribution] Bug fix in moveRegionToNewWarpOpAndAppendReturns Aug 14, 2025
Contributor

@Jianhui-Li Jianhui-Li left a comment

LGTM

@adam-smnk
Contributor

adam-smnk commented Aug 15, 2025

What happens when there are users of all three %r:3 results? Is it tracked correctly?

Btw, this example still crashes:

func.func @warp_propagate_duplicated_operands_in_yield(%laneid: index)  {
  %r:3 = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>, vector<1xf32>, vector<1xf32>) {
    %0 = "some_def"() : () -> (vector<32xf32>)
    %1 = "some_other_def"() : () -> (vector<32xf32>)
    %2 = math.exp %1 : vector<32xf32>
    gpu.yield %2, %0, %0 : vector<32xf32>, vector<32xf32>, vector<32xf32>
  }
  "some_use"(%r#2) : (vector<1xf32>) -> () // Note user of the duplicate only
  return
}

@charithaintc
Contributor Author

charithaintc commented Aug 15, 2025

What happens when there are users of all three %r:3 results? Is it tracked correctly?

Yes. The other results get folded away by the WarpOpDeadResult pattern.

Btw, this example still crashes:

func.func @warp_propagate_duplicated_operands_in_yield(%laneid: index)  {
  %r:3 = gpu.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>, vector<1xf32>, vector<1xf32>) {
    %0 = "some_def"() : () -> (vector<32xf32>)
    %1 = "some_other_def"() : () -> (vector<32xf32>)
    %2 = math.exp %1 : vector<32xf32>
    gpu.yield %2, %0, %0 : vector<32xf32>, vector<32xf32>, vector<32xf32>
  }
  "some_use"(%r#2) : (vector<1xf32>) -> () // Note user of the duplicate only
  return
}

Good catch. There is another issue in WarpOpDeadResult, which I think is the cause of that crash. I have a fix in another PR, but I have not merged it yet because that issue is not blocking us at this point. You are right, though: we need to take care of it at some point.

#148067

In any case, I think the issue in this PR is clearly isolated. Storing the yielded values in a SetVector while storing the types in a SmallVector is not correct in the presence of duplicated yielded values.

Contributor

@adam-smnk adam-smnk left a comment

Thanks for clarification 👍
LGTM

@charithaintc charithaintc merged commit 9617ce4 into llvm:main Aug 18, 2025
5 of 8 checks passed