[mlir] Canonicalize tensor.extract_slice (linalg.fill) #112619
Conversation
@llvm/pr-subscribers-mlir

Author: Nithin Meganathan (nithinsubbiah)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/112619.diff

2 Files Affected:
diff --git a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
index 730c478c2883ef..658445f19e91aa 100644
--- a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
+++ b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
@@ -806,6 +806,36 @@ struct FoldFillWithTensorExtract : public OpRewritePattern<tensor::ExtractOp> {
}
};
+/// Fold tensor.extract_slice(linalg.fill(<input>)) into <input>
+struct FoldFillWithTensorExtractSlice
+ : public OpRewritePattern<tensor::ExtractSliceOp> {
+public:
+ using OpRewritePattern<tensor::ExtractSliceOp>::OpRewritePattern;
+
+ LogicalResult matchAndRewrite(tensor::ExtractSliceOp extractSliceOp,
+ PatternRewriter &rewriter) const override {
+ // See if tensor input of tensor.extract_slice op is the result of a
+ // linalg.fill op.
+ auto fillOp = extractSliceOp.getSource().getDefiningOp<linalg::FillOp>();
+ if (!fillOp)
+ return failure();
+
+ Value fillInput = fillOp.getInputs()[0];
+
+ Location loc = extractSliceOp.getLoc();
+ SmallVector<OpFoldResult> mixedSizes = extractSliceOp.getMixedSizes();
+ auto emptyOp = rewriter.create<tensor::EmptyOp>(
+ loc, mixedSizes, extractSliceOp.getType().getElementType());
+
+ // Replace tensor.extract_slice op with new linalg.fillOp (former's result
+ // type and shape).
+ rewriter.replaceOpWithNewOp<linalg::FillOp>(
+ extractSliceOp, extractSliceOp.getResultType(), ValueRange{fillInput},
+ ValueRange{emptyOp});
+ return success();
+ }
+};
+
/// Folds pack(fill) into a single fill op if
/// 1. The pack op does not have padding value, or
/// 2. The filled value and padding value are the same.
@@ -936,7 +966,7 @@ struct FoldConcatsOfFill : public OpRewritePattern<tensor::ConcatOp> {
void FillOp::getCanonicalizationPatterns(RewritePatternSet &results,
MLIRContext *context) {
results.add<FoldConcatsOfFill, FoldFillWithCopy, FoldFillWithTensorExtract,
- FoldFillWithPack, FoldFillWithPad,
+ FoldFillWithTensorExtractSlice, FoldFillWithPack, FoldFillWithPad,
FoldFillWithTensorReshape<tensor::CollapseShapeOp>,
FoldFillWithTensorReshape<tensor::ExpandShapeOp>,
FoldInsertPadIntoFill, FoldFillWithTranspose>(context);
diff --git a/mlir/test/Dialect/Linalg/canonicalize.mlir b/mlir/test/Dialect/Linalg/canonicalize.mlir
index 4bc2ed140da91a..763cf80241fcc0 100644
--- a/mlir/test/Dialect/Linalg/canonicalize.mlir
+++ b/mlir/test/Dialect/Linalg/canonicalize.mlir
@@ -352,6 +352,20 @@ func.func @fold_fill_extract(%arg0 : i1) -> i1 {
// -----
+func.func @fold_fill_extract_slice() -> tensor<2x1920x64x66xf32> {
+ %c0 = arith.constant 0. : f32
+ %0 = tensor.empty() : tensor<2x1920x66x66xf32>
+ %1 = linalg.fill ins(%c0 : f32) outs(%0 : tensor<2x1920x66x66xf32>) -> tensor<2x1920x66x66xf32>
+ %extracted_slice = tensor.extract_slice %1[0, 0, 1, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
+ return %extracted_slice : tensor<2x1920x64x66xf32>
+}
+// CHECK-LABEL: func.func @fold_fill_extract_slice
+// CHECK: %[[EMPTY_TENSOR:.+]] = tensor.empty() : tensor<2x1920x64x66xf32>
+// CHECK: %[[FILL:.+]] = linalg.fill ins(%{{.+}}) outs(%[[EMPTY_TENSOR]]
+// CHECK: return %[[FILL]]
+
+// -----
+
func.func @fill_pack() -> tensor<24x32x16x16xf32> {
%dest = tensor.empty() : tensor<384x512xf32>
%cst = arith.constant 0.000000e+00 : f32
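For reference, the tests in canonicalize.mlir are driven by a FileCheck RUN line along these lines (a sketch; the exact flags in the file may differ), with the // ----- markers splitting the file into independent inputs:

// RUN: mlir-opt %s -canonicalize -split-input-file | FileCheck %s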
// -----

func.func @fold_fill_extract_slice() -> tensor<2x1920x64x66xf32> {
Please add a negative test, e.g. linalg.fill followed by tensor.insert? Provided that doesn't trigger some other canonicalization.
Could you elaborate why we may need such a test here? To test that such a pattern wouldn't get folded?
To test that such a pattern wouldn't get folded?

Indeed. IMHO, we should always try covering "interesting" corner cases (within reason). These tests effectively document the code and its design - a negative test is an example of where the fold should not work. And if anything changes, we will almost immediately know (the test will start failing).
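A sketch of such a negative test (hypothetical, not part of the patch; it assumes tensor.insert does not trigger some other fold): the fill result feeds a tensor.insert rather than a tensor.extract_slice, so the new pattern should leave the IR alone.

// Negative test sketch: fill followed by tensor.insert must not be folded
// by FoldFillWithTensorExtractSlice.
func.func @no_fold_fill_insert(%v : f32) -> tensor<4x4xf32> {
  %c0 = arith.constant 0.0 : f32
  %i = arith.constant 0 : index
  %0 = tensor.empty() : tensor<4x4xf32>
  %1 = linalg.fill ins(%c0 : f32) outs(%0 : tensor<4x4xf32>) -> tensor<4x4xf32>
  %2 = tensor.insert %v into %1[%i, %i] : tensor<4x4xf32>
  return %2 : tensor<4x4xf32>
}
// CHECK-LABEL: func.func @no_fold_fill_insert
// CHECK: linalg.fill
// CHECK: tensor.insert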
}
};

/// Fold tensor.extract_slice(linalg.fill(<input>)) into <input>
Hm, this comment is not quite correct, is it? You are replacing tensor.extract_slice with linalg.fill, right?
What if there are multiple uses of the original linalg.fill? Then this is adding another linalg.fill (and tensor.empty) rather than updating the original linalg.fill? Is that still more canonical?
I think even with multiple users it might be OK to introduce a new fill of the smaller shape. That's an opinion though. I think all such slice propagation patterns need to be kept separate, but fill -> slice might be worth adding to canonicalization.
func.func @fold_fill_extract_slice() -> tensor<2x1920x64x66xf32> {
  %c0 = arith.constant 0. : f32
  %0 = tensor.empty() : tensor<2x1920x66x66xf32>
  %1 = linalg.fill ins(%c0 : f32) outs(%0 : tensor<2x1920x66x66xf32>) -> tensor<2x1920x66x66xf32>
What if this linalg.fill has multiple uses?
It doesn't matter. You will have two fills, one larger and one smaller. Fills are always better, I think (especially in tensor semantics).
We probably want to have a separate set of patterns that "bubble up slices" this way. In IREE we have a pass that does this: https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/DispatchCreation/BubbleUpExtractSlices.cpp. I think we want to just collect such patterns into a single populate* method and not put this in canonicalization.
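For illustration, "bubbling up" a slice means moving a tensor.extract_slice from a producer's result onto the producer's operands, so the producer computes only the slice. A hypothetical sketch (not taken from the PR or from IREE, using linalg.abs as a stand-in elementwise op):

// Before: the op computes the full 8x8 result, then a 4x4 slice is taken.
%r = linalg.abs ins(%in : tensor<8x8xf32>) outs(%init : tensor<8x8xf32>) -> tensor<8x8xf32>
%s = tensor.extract_slice %r[0, 0] [4, 4] [1, 1] : tensor<8x8xf32> to tensor<4x4xf32>

// After bubbling: the operands are sliced first and the op computes only 4x4.
%in_s = tensor.extract_slice %in[0, 0] [4, 4] [1, 1] : tensor<8x8xf32> to tensor<4x4xf32>
%init_s = tensor.extract_slice %init[0, 0] [4, 4] [1, 1] : tensor<8x8xf32> to tensor<4x4xf32>
%s = linalg.abs ins(%in_s : tensor<4x4xf32>) outs(%init_s : tensor<4x4xf32>) -> tensor<4x4xf32>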
Can you elaborate why this isn't a good canonicalization? Is this losing semantics or anything?

Like Andrej said, it depends on whether, for a compiler stack, a linalg.fill -> extract_slice is better or two linalg.fills are better. I think it is an opinionated decision, so I can see reasons not to do it as a canonicalization. Fill is a bit of a corner case though. IMO it is always better to do this, but that is also based on my use case. I am not necessarily pushing back against this as a canonicalization, but if someone has an issue with it being one, then I'd be fine moving it out of canonicalization.

Sorry, but that's not a principled way to think about canonicalization: this is actually a perfect example of why we need canonicalization. If the two forms can be equally well matched, the canonicalization is here so that a compiler does not have to match both but can match a single one.

Sure, I'd want to hear a principled reason why it's not a good one (that is, "it is not always possible to match the new canonical form", or "we lost some important information from the original program"). This should always be possible to express with an example of input IR and the intended lowering.
First, a note on what the pattern does, schematically:

BEFORE (1 use):

%empty = tensor.empty
%fill = linalg.fill %empty
%res = tensor.extract_slice %fill

AFTER (1 use):

%empty = tensor.empty
%res = linalg.fill %empty

BEFORE (2 uses):

%empty = tensor.empty
%fill = linalg.fill %empty
%res_1 = tensor.extract_slice %fill
%res_2 = tensor.extract_slice %fill

AFTER (2 uses):

%empty_1 = tensor.empty
%res_1 = linalg.fill %empty_1
%empty_2 = tensor.empty
%res_2 = linalg.fill %empty_2

Looking at the case with 1 use, this feels like a canonicalization to me: the op count goes down and no information is lost. This is no longer so clear to me in the case with 2 uses. Perhaps this should be limited to cases with 1 use?

On a related note, going by the Tensor Op docs, tensor.extract_slice can bufferize to a view (memref.subview) into the source buffer. So, with 2 uses, this would potentially replace one allocation plus two subviews with two separate allocations (after bufferization).

It's not immediately obvious to me whether one is more "canonical" or "better" than the other. If aliasing is an issue, separate allocations would help. If aliasing is not a problem, re-using one buffer is probably better? No strong opinion here, mostly just analysing out loud.
nithinsubbiah force-pushed from 9d24d35 to 1c93e3d
Thanks for the comments. I agree with @banach-space that this might not fit into a canonicalization pattern when the linalg.fill has multiple uses. I've changed the pattern to fold only when there's one consumer of the linalg.fill result.
I really don't want to add to this thread, but I am skeptical that this qualifies if restricted to a single consumer: that doesn't actually simplify any downstream matching because now a compiler still has to pattern match all of the variants. I could imagine arguments why this might be the case, but having run into a lot of these fragile single-consumer optimizations trying to fly as a canonicalization, I am default skeptical. If the multi use case is a canonical form, then so be it. But I expect that Mahesh is right that the actual optimization being sought here is a full graph transformation where the graph is put into a form such that some of these ops are maximally "bubbled". That is an optimization. The effect can be approximated by the canonicalizer but it is typically not the right tool for the job, ime.
Let me try to summarize based on the discussions we've had so far. Folding tensor.extract_slice into linalg.fill is clearly beneficial when the fill has a single consumer. When there's multiple consumers, an additional tensor.empty and linalg.fill get created for each folded slice. I see two options here: keep this as a canonicalization restricted to the single-use case, or move the pattern out of canonicalization into a separate populate*-style set of slice-propagation patterns.
I think Andrej's point is good. If we want to add a canonicalization, we add it for a single user of the linalg.fill.
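With that restriction in place, a multi-use negative test could look like this (hypothetical sketch, names invented; not part of the patch): two extract_slice consumers of one fill, which the revised pattern should leave untouched.

// Negative test sketch: the fill has two consumers, so the single-use
// restriction prevents folding either extract_slice.
func.func @no_fold_fill_extract_slice_multi_use() -> (tensor<2x4xf32>, tensor<2x4xf32>) {
  %c0 = arith.constant 0.0 : f32
  %0 = tensor.empty() : tensor<4x4xf32>
  %1 = linalg.fill ins(%c0 : f32) outs(%0 : tensor<4x4xf32>) -> tensor<4x4xf32>
  %s0 = tensor.extract_slice %1[0, 0] [2, 4] [1, 1] : tensor<4x4xf32> to tensor<2x4xf32>
  %s1 = tensor.extract_slice %1[2, 0] [2, 4] [1, 1] : tensor<4x4xf32> to tensor<2x4xf32>
  return %s0, %s1 : tensor<2x4xf32>, tensor<2x4xf32>
}
// CHECK-LABEL: func.func @no_fold_fill_extract_slice_multi_use
// CHECK: linalg.fill
// CHECK-COUNT-2: tensor.extract_slice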
Signed-off-by: nithinsubbiah <[email protected]>

nithinsubbiah force-pushed from f017af6 to 52e7b53

…o FillOp
Signed-off-by: nithinsubbiah <[email protected]>
MaheshRavishankar left a comment:
I think you should avoid using a tensor.empty for the destination of the fill op. It is better to just use a slice of the original destination.
Come to think of it, that way you could drop the multiple-use constraint too.
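A sketch of what that suggestion could look like on the test case above (hypothetical rewrite output; %c0 and the tensor.empty result %0 are taken from the existing @fold_fill_extract_slice test): instead of creating a new tensor.empty, the rewritten fill writes into a slice of the original destination.

// The new fill's destination is a slice of the original fill's destination,
// not a fresh tensor.empty.
%slice = tensor.extract_slice %0[0, 0, 1, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
%res = linalg.fill ins(%c0 : f32) outs(%slice : tensor<2x1920x64x66xf32>) -> tensor<2x1920x64x66xf32>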
@banach-space, how about this change for this example? That seems like it should fall in the "always good to do" category. Again, I am not advocating one way or another, just thinking aloud.
I see 4 Ops before and 5 Ops after, so it's not immediately obvious to me that AFTER is better :) What metric do you use to decide what's better/worse? Not claiming my metric is perfect, just sharing how I reason about these things. I tried this simple experiment to better understand the tradeoffs (from my biased viewpoint):

// 4 Ops - empty + fill + extract_slice + extract_slice
func.func @foo() -> (tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>) {
%c0 = arith.constant 0. : f32
%0 = tensor.empty() : tensor<2x1920x66x66xf32>
%1 = linalg.fill ins(%c0 : f32) outs(%0 : tensor<2x1920x66x66xf32>) -> tensor<2x1920x66x66xf32>
%extracted_slice = tensor.extract_slice %1[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
%extracted_slice_2 = tensor.extract_slice %1[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
return %extracted_slice,%extracted_slice_2 : tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>
}
// 5 Ops - empty + extract_slice + extract_slice + fill + fill
func.func @bar() -> (tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>) {
%c0 = arith.constant 0. : f32
%0 = tensor.empty() : tensor<2x1920x66x66xf32>
%extracted_slice = tensor.extract_slice %0[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
%extracted_slice_2 = tensor.extract_slice %0[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
%res_1 = linalg.fill ins(%c0 : f32) outs(%extracted_slice : tensor<2x1920x64x66xf32>) -> tensor<2x1920x64x66xf32>
%res_2 = linalg.fill ins(%c0 : f32) outs(%extracted_slice_2 : tensor<2x1920x64x66xf32>) -> tensor<2x1920x64x66xf32>
return %res_1, %res_2 : tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>
}

After bufferization (…):

func.func @foo() -> (memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>) {
%cst = arith.constant 0.000000e+00 : f32
%alloc = memref.alloc() {alignment = 64 : i64} : memref<2x1920x66x66xf32>
linalg.fill ins(%cst : f32) outs(%alloc : memref<2x1920x66x66xf32>)
%subview = memref.subview %alloc[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>
%subview_0 = memref.subview %alloc[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
%cast = memref.cast %subview : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
%cast_1 = memref.cast %subview_0 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
return %subview, %subview_0 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
}
func.func @bar() -> (memref<2x1920x64x66xf32>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>) {
%cst = arith.constant 0.000000e+00 : f32
%alloc = memref.alloc() {alignment = 64 : i64} : memref<2x1920x66x66xf32>
%subview = memref.subview %alloc[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>
%alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<2x1920x64x66xf32>
%subview_1 = memref.subview %alloc[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
linalg.fill ins(%cst : f32) outs(%alloc_0 : memref<2x1920x64x66xf32>)
linalg.fill ins(%cst : f32) outs(%subview_1 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>)
%cast = memref.cast %alloc_0 : memref<2x1920x64x66xf32> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
%cast_2 = memref.cast %subview_1 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
return %alloc_0, %subview_1 : memref<2x1920x64x66xf32>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
}

So, with 5 Ops there's one additional big allocation and an additional FillOp, which is not ideal. However, I appreciate that a lot can happen between the proposed canonicalization and bufferization (i.e. my example is a bit academic).
Could you elaborate? I'd like to better follow the rationale here. Just to clarify, I am not against this change, and it should be OK to keep it as canonicalization - no semantic context is lost, matching should be similar. IIUC, you are trying to get rid of the tensor.empty?
Thanks for checking... I had things not connected correctly. After seeing the output you have, I think it makes more sense to me to just do this for the single-use case for now as a canonicalization.
banach-space left a comment:
LGTM % my final (minor) comments
Great technical discussion, thank you everyone!
}
};

/// Fold tensor.extract_slice(linalg.fill(<input>)) into linalg.fill(<input>)
[nit] Add a note that this triggers only when there's only one use "by design" (with one use this is obviously beneficial, with multiple uses it is not so clear)
This patch adds a canonicalization pattern to fold away tensor.extract_slice preceded by a linalg.fill if the latter op has only one consumer.