Conversation

@oowekyala
Contributor

Fix #93973. This allows using linalg.reduce to, e.g., reduce several tensors into one. The current implementation requires the same number of inputs and outputs.

@github-actions

github-actions bot commented Sep 2, 2024

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository; in that case, you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by pinging the PR with a comment saying "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Sep 2, 2024

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-linalg

Author: Clément Fournier (oowekyala)

Changes

Fix #93973. This allows using linalg.reduce to, e.g., reduce several tensors into one. The current implementation requires the same number of inputs and outputs.


Full diff: https://github.com/llvm/llvm-project/pull/107005.diff

3 Files Affected:

  • (modified) mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.td (+1-1)
  • (modified) mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp (+11-5)
  • (modified) mlir/test/Dialect/Linalg/roundtrip.mlir (+42)
diff --git a/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.td b/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.td
index ac61117c3d6e36..f20f036d6fe480 100644
--- a/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.td
+++ b/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.td
@@ -311,7 +311,7 @@ def MapOp : LinalgStructuredBase_Op<"map", [
 def ReduceOp : LinalgStructuredBase_Op<"reduce", [
     DeclareOpInterfaceMethods<OpAsmOpInterface, ["getAsmResultNames"]>,
     DeclareOpInterfaceMethods<OpAsmOpInterface, ["getAsmBlockArgumentNames"]>,
-    SameVariadicOperandSize,
+    AttrSizedOperandSegments,
     SingleBlockImplicitTerminator<"YieldOp">]> {
   let summary = "Reduce operator";
   let description = [{
diff --git a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
index 76df3ecf2d2bd4..9c6c36075b55bd 100644
--- a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
+++ b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
@@ -1301,11 +1301,12 @@ LogicalResult GenericOp::fold(FoldAdaptor, SmallVectorImpl<OpFoldResult> &) {
 static ParseResult parseDstStyleOp(
     OpAsmParser &parser, OperationState &result,
     function_ref<ParseResult(OpAsmParser &, NamedAttrList &)> parseAttrsFn =
-        nullptr) {
+        nullptr,
+    bool addOperandSegmentSizes = false) {
   // Parse `ins` and `outs`.
   SmallVector<Type, 4> inputTypes, outputTypes;
   if (parseCommonStructuredOpParts(parser, result, inputTypes, outputTypes,
-                                   /*addOperandSegmentSizes=*/false))
+                                   addOperandSegmentSizes))
     return failure();
 
   // Add result types.
@@ -1646,9 +1647,12 @@ ParseResult ReduceOp::parse(OpAsmParser &parser, OperationState &result) {
   }
 
   if (parseDstStyleOp(
-          parser, result, [&](OpAsmParser &parser, NamedAttrList &attributes) {
+          parser, result,
+          [&](OpAsmParser &parser, NamedAttrList &attributes) {
             return parseDenseI64ArrayAttr(parser, attributes, "dimensions");
-          }))
+          },
+          /*addOperandSegmentSizes=*/true))
+
     return failure();
 
   if (payloadOpName.has_value()) {
@@ -1683,7 +1687,9 @@ void ReduceOp::print(OpAsmPrinter &p) {
 
   printCommonStructuredOpParts(p, getDpsInputs(), getDpsInits());
   printDenseI64ArrayAttr(p, getDimensionsAttrName(), getDimensions());
-  p.printOptionalAttrDict((*this)->getAttrs(), {getDimensionsAttrName()});
+  p.printOptionalAttrDict(
+      (*this)->getAttrs(),
+      {getDimensionsAttrName(), getOperandSegmentSizesAttrName()});
   if (!payloadOp) {
     // Print region if the payload op was not detected.
     p.increaseIndent();
diff --git a/mlir/test/Dialect/Linalg/roundtrip.mlir b/mlir/test/Dialect/Linalg/roundtrip.mlir
index 146e9780b8ebbe..802de7c335d9b1 100644
--- a/mlir/test/Dialect/Linalg/roundtrip.mlir
+++ b/mlir/test/Dialect/Linalg/roundtrip.mlir
@@ -485,6 +485,48 @@ func.func @variadic_reduce_memref(%input1: memref<16x32x64xf32>,
 
 // -----
 
+func.func @reduce_asymmetric(%input: tensor<16x32x64xi32>, %input2: tensor<16x32x64xi32>,
+                  %init: tensor<16x64xi32>) -> tensor<16x64xi32> {
+  %reduce = linalg.reduce
+      ins(%input, %input2:tensor<16x32x64xi32>, tensor<16x32x64xi32>)
+      outs(%init:tensor<16x64xi32>)
+      dimensions = [1]
+      (%in: i32, %in2: i32, %out: i32) {
+        %0 = arith.muli %in, %in2: i32
+        %1 = arith.addi %out, %0: i32
+        linalg.yield %1: i32
+      }
+  func.return %reduce : tensor<16x64xi32>
+}
+// CHECK-LABEL: func @reduce_asymmetric
+//       CHECK:   linalg.reduce ins(%{{.*}}, %{{.*}}: tensor<16x32x64xi32>, tensor<16x32x64xi32>)
+//  CHECK-NOT:    operandSegmentSize
+//  CHECK-SAME:   outs(%{{.*}}: tensor<16x64xi32>)
+//  CHECK-SAME:   dimensions = [1]
+
+// -----
+
+func.func @reduce_asymmetric_memref(%input: memref<16x32x64xi32>, %input2: memref<16x32x64xi32>,
+                  %init: memref<16x64xi32>) {
+  linalg.reduce
+      ins(%input, %input2:memref<16x32x64xi32>, memref<16x32x64xi32>)
+      outs(%init:memref<16x64xi32>)
+      dimensions = [1]
+      (%in: i32, %in2: i32, %out: i32) {
+        %0 = arith.muli %in, %in2: i32
+        %1 = arith.addi %out, %0: i32
+        linalg.yield %1: i32
+      }
+  func.return
+}
+// CHECK-LABEL: func @reduce_asymmetric_memref
+//       CHECK:   linalg.reduce ins(%{{.*}}, %{{.*}}: memref<16x32x64xi32>, memref<16x32x64xi32>)
+//  CHECK-NOT:    operandSegmentSize
+//  CHECK-SAME:   outs(%{{.*}}: memref<16x64xi32>)
+//  CHECK-SAME:   dimensions = [1]
+
+// -----
+
 func.func @transpose(%input: tensor<16x32x64xf32>,
                      %init: tensor<32x64x16xf32>) -> tensor<32x64x16xf32> {
   %transpose = linalg.transpose

@CoTinker
Contributor

Ping~ Is this PR reasonable?

@dcaballe
Contributor

Thanks for the PR!

  linalg.reduce
      ins(%input, %input2:memref<16x32x64xi32>, memref<16x32x64xi32>)
      outs(%init:memref<16x64xi32>)
      dimensions = [1]
      (%in: i32, %in2: i32, %out: i32) {
        %0 = arith.muli %in, %in2: i32
        %1 = arith.addi %out, %0: i32
        linalg.yield %1: i32

Not sure I follow this change. Shouldn't we just combine the two tensors with a mul before the linalg.reduce, and then add-reduce the result of that?

@oowekyala
Contributor Author

oowekyala commented Sep 20, 2024

Thanks for taking a look :)

> Not sure I follow this change. Shouldn't we just combine the two tensors with a mul before the linalg.reduce, and then add-reduce the result of that?

It probably is possible, but I think it would be much more involved. How would you combine the two tensors into one 16x32x64 tensor whose elements are the products, short of using a linalg.generic? And if we have to use a generic, then we could just fuse the reduce into it directly, but then we no longer have nice structured ops.

For a compiler that has to generate this code, it is simpler to just generate one reduce op, and the generated code is higher level than a generic, or a sequence of generic + reduce. To me at least it is very natural to think about the reduce operator this way.
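For concreteness, the two-op alternative under discussion would look roughly like this (an illustrative sketch, not code from the PR; the `%tmp` init tensor and value names are hypothetical):

```mlir
#id = affine_map<(d0, d1, d2) -> (d0, d1, d2)>

// Step 1: elementwise multiply, which needs a linalg.generic
// (or a named elementwise op) to materialize the combined tensor.
%mul = linalg.generic
    {indexing_maps = [#id, #id, #id],
     iterator_types = ["parallel", "parallel", "parallel"]}
    ins(%input, %input2 : tensor<16x32x64xi32>, tensor<16x32x64xi32>)
    outs(%tmp : tensor<16x32x64xi32>) {
^bb0(%a: i32, %b: i32, %out: i32):
  %0 = arith.muli %a, %b : i32
  linalg.yield %0 : i32
} -> tensor<16x32x64xi32>

// Step 2: a unary add-reduction over the combined tensor.
%reduced = linalg.reduce
    ins(%mul : tensor<16x32x64xi32>)
    outs(%init : tensor<16x64xi32>)
    dimensions = [1]
    (%in: i32, %acc: i32) {
      %1 = arith.addi %acc, %in : i32
      linalg.yield %1 : i32
    }
```

Note that this variant materializes the full 16x32x64 intermediate tensor, whereas the single asymmetric reduce does not.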

I did a usage search for linalg.reduce on GH: https://github.com/search?q=linalg.reduce+language%3AMLIR&type=code

It seems nobody is using reduce with several operands (except our compiler, with the semantics that this PR gives it ^^). This hints that the current behavior is not useful to anyone so far. On the other hand, these updated semantics are more general but support the older behavior just fine, so I don't see the harm in them :)

@MaheshRavishankar
Contributor

The PR description is misleading. It says "fixing a crash", but it is actually changing the semantics of the operation.

Is there anything more to reduction than just saying that the iterator types are "reduction"? I don't see it providing any value beyond that.

@oowekyala oowekyala changed the title [mlir] Fix #93973 - linalg::ReduceOp verifier crash [mlir] Fix semantics of linalg::ReduceOp with several inputs Nov 1, 2024
@oowekyala
Contributor Author

> The PR description is misleading. It says "fixing a crash", but it is actually changing the semantics of the operation.

You're right, although there was a logic error in the verifier code that caused a crash, so it does also fix a crash. Since the semantics of this op with several input operands are neither documented nor tested, I interpreted what the semantics should be. In my view, the number of inputs and the number of outputs should not be related; no other linalg op has this restriction, as far as I can tell.

It seems like the only consistent alternative semantics is to say that the op must have exactly one input and one output, which is restrictive for no good reason.

> Is there anything more to reduction than just saying that the iterator types are "reduction"? I don't see it providing any value beyond that.

Sure, but this can be said of every structured linalg op. linalg.map doesn't require the same number of inputs and outputs, which allows some useful patterns to be written (its documentation even shows an example similar to the one in this comment). Giving reduce the same freedom makes the linalg dialect more consistent.
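For reference, a linalg.map with two inputs and one output, in the style of the op's documented syntax, looks roughly like this (an illustrative sketch; the value names and shapes are hypothetical):

```mlir
// Elementwise binary map: two inputs feed a single output.
%add = linalg.map
    ins(%lhs, %rhs : tensor<64xf32>, tensor<64xf32>)
    outs(%init : tensor<64xf32>)
    (%l: f32, %r: f32) {
      %0 = arith.addf %l, %r : f32
      linalg.yield %0 : f32
    }
```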

@MaheshRavishankar
Contributor

> > The PR description is misleading. It says "fixing a crash", but it is actually changing the semantics of the operation.
>
> You're right, although there was a logic error in the verifier code that caused a crash, so it does also fix a crash. Since the semantics of this op with several input operands are neither documented nor tested, I interpreted what the semantics should be. In my view, the number of inputs and the number of outputs should not be related; no other linalg op has this restriction, as far as I can tell.

That is not entirely true. There is a very specific relationship between all the inputs and outputs of a linalg op. They all have indexing maps with the same domain, and there is a relationship between how the different dimensions are constrained based on the ranges of the indexing maps used. Also, a dimension that is marked as reduction is not allowed to show up in the result indexing map. We can have multiple inputs and multiple outputs for a reduction, that's totally OK, but you are saying it's one op running independent reductions to generate different outputs, with the same iteration space and dependences? That works.

> It seems like the only consistent alternative semantics is to say that the op must have exactly one input and one output, which is restrictive for no good reason.
>
> > Is there anything more to reduction than just saying that the iterator types are "reduction"? I don't see it providing any value beyond that.
>
> Sure, but this can be said of every structured linalg op. linalg.map doesn't require the same number of inputs and outputs, which allows some useful patterns to be written (its documentation even shows an example similar to the one in this comment). Giving reduce the same freedom makes the linalg dialect more consistent.

I am fine with that, but we just need to document the semantics that you expect.


func.func @reduce_asymmetric(%input: tensor<16x32x64xi32>, %input2: tensor<16x32x64xi32>,
%init: tensor<16x64xi32>) -> tensor<16x64xi32> {
%reduce = linalg.reduce
Contributor


I don't understand what this is trying to do. You are effectively doing a multiply and a reduce? How is this different from just using a linalg.generic? Why is using linalg.reduce better than linalg.generic here?

Contributor Author


Yes, this is a multiply-accumulate type operation. The difference is just that it's higher-level than a linalg.generic, which is nice when you want to generate this op in a compiler pass, or when you want to pattern-match it. All linalg ops are just sugar over a generic, so I don't think this use case should be disqualified just because it can be implemented with a generic.

@oowekyala
Contributor Author

> We can have multiple inputs and multiple outputs for a reduction. That's totally OK, but you are saying it's one op running independent reductions to generate different outputs, with the same iteration space and dependences? That works.

Could you clarify what you mean here? I'm not sure I follow. The main point of this change is to be able to reduce several tensors into one, as in my examples, so the reductions are not independent; rather, the computation uses all inputs.

@MaheshRavishankar
Contributor

> > We can have multiple inputs and multiple outputs for a reduction. That's totally OK, but you are saying it's one op running independent reductions to generate different outputs, with the same iteration space and dependences? That works.
>
> Could you clarify what you mean here? I'm not sure I follow. The main point of this change is to be able to reduce several tensors into one, as in my examples, so the reductions are not independent; rather, the computation uses all inputs.

Sorry to reverse the question on you. Can you write the loop version or linalg.generic version of what the reduce is doing? What are the constraints on the shapes of the inputs? Should they all be the same shape? Are they all accessed using the same indexing map, and is this checked in the verifier?

@oowekyala
Contributor Author

Here is the affine version:

    affine.for %i = 0 to 16 {
      affine.for %k = 0 to 32 {
        affine.for %j = 0 to 64 {
          %4 = affine.load %1[%i, %k, %j] : memref<16x32x64xi32>
          %5 = affine.load %0[%i, %k, %j] : memref<16x32x64xi32>
          %6 = affine.load %alloc[%i, %j] : memref<16x64xi32>
          %7 = arith.muli %4, %5 : i32
          %8 = arith.addi %6, %7 : i32
          affine.store %8, %alloc[%i, %j] : memref<16x64xi32>
        }
      }
    }

and the generic version:

#map = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d2)>

    %0 = linalg.generic
        {indexing_maps = [#map, #map, #map1],
         iterator_types = ["parallel", "reduction", "parallel"]}
        ins(%arg0, %arg1 : tensor<16x32x64xi32>, tensor<16x32x64xi32>)
        outs(%arg2 : tensor<16x64xi32>) {
    ^bb0(%in: i32, %in_0: i32, %out: i32):
      %1 = arith.muli %in, %in_0 : i32
      %2 = arith.addi %out, %1 : i32
      linalg.yield %2 : i32
    } -> tensor<16x64xi32>
 

> What are the constraints on the shapes of the inputs? Should they all be the same shape? Are they all accessed using the same indexing map, and is this checked in the verifier?

The inputs should all have the same shape. This is because the shape of the output is the shape of the inputs with the reduction dimensions removed: the reduction dimensions must match in extent, and the parallel dimensions must match because they determine the shape of the output, so in total the full shapes must match. That is already checked by the verifier, yes. The code samples above were generated with mlir-opt built with this PR, so the way linalg.reduce maps to linalg.generic is not changed by this PR.
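As an illustration of that check, here is a sketch of an op the verifier would reject under these rules (hypothetical values; the second input's shape deliberately disagrees with the first's):

```mlir
// Invalid: %b has shape 16x8x64 while %a has 16x32x64, so the shared
// iteration space (and thus the output shape) is ill-defined.
%r = linalg.reduce
    ins(%a, %b : tensor<16x32x64xi32>, tensor<16x8x64xi32>)
    outs(%init : tensor<16x64xi32>)
    dimensions = [1]
    (%in: i32, %in2: i32, %out: i32) {
      %0 = arith.muli %in, %in2 : i32
      %1 = arith.addi %out, %0 : i32
      linalg.yield %1 : i32
    }
```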

Member

@rengolin rengolin left a comment


As @MaheshRavishankar said, the name and description of this PR are misleading. This is adding new semantics to an existing operation without proper consideration and discussion in the forum.

I updated your issue to reflect the changes we're making to linalg as a whole and the new contract operation that you could use instead. Please change this PR to make the verifier more restrictive, and we can discuss your use case with the existing operations.

@oowekyala
Contributor Author

oowekyala commented Dec 13, 2024

I wasn't aware of this restructuring effort of linalg. It seems the new contract operation would fit my use case most of the time (although it is binary and not variadic)...

I will restrict the verifier in this PR to make reduce only unary. I think there is disagreement over what the semantics of the reduce with several arguments should be. It seems to me your restructuring of linalg also considers reduce to be unary, so I think we should just disallow variadic reduce until we clear that up in the forum.

I will reopen a PR with the verifier fix.

@oowekyala
Contributor Author

oowekyala commented Dec 13, 2024

A couple more thoughts, to make sure this change is not misunderstood.

I have not actually been proposing to extend the semantics of linalg.reduce. It is already possible to write the MAC code that I presented above like so:

  linalg.reduce
      ins(%input, %input2:memref<16x32x64xi32>, memref<16x32x64xi32>)
      outs(%init, %init2: memref<16x64xi32>, memref<16x64xi32>)
      dimensions = [1]
      (%in: i32, %in2: i32, %out: i32, %out2: i32) {
        %0 = arith.muli %in, %in2: i32
        %1 = arith.addi %out, %0: i32
        linalg.yield %1, %1: i32, i32
     }

But in current linalg, if you have two inputs then you must have two outputs, so here we have to return a second result that we don't care about. To me this restriction is most likely an accident: it makes no sense to require it, unless you restrict each input to flow only to its corresponding output, and that would be better expressed as several unary reduces. It is also not documented why this is the case or what the intended semantics should be.

On the other hand, there is evidence in the code and documentation that this operator was designed to support variadic operands. My change lifts an artificial restriction, but it does not make reduce more powerful; it also fixes the bugs in the verifier, so I believe this is the semantics the original authors intended. All in all, as far as semantic changes go, this one is IMO very minor.

Of course, if reduce must be unary to fit within @rengolin's new linalg design, then the current semantics are indeed too general... So there is a choice to be made.

Anyway I will close this and reopen a PR with just the verifier fix.

@oowekyala oowekyala changed the title [mlir] Fix semantics of linalg::ReduceOp with several inputs [mlir] Stop restricting linalg::ReduceOp to have same number of inputs and outputs Dec 13, 2024
@oowekyala oowekyala closed this Dec 13, 2024
Successfully merging this pull request may close these issues.

MLIR crash in linalg::ReduceOp::verify
