
Conversation

@nithinsubbiah commented Oct 16, 2024:

This patch adds a canonicalization pattern that folds away a tensor.extract_slice whose source is produced by a linalg.fill, provided the linalg.fill has only one consumer.


@llvmbot commented Oct 16, 2024:

@llvm/pr-subscribers-mlir

Author: Nithin Meganathan (nithinsubbiah)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/112619.diff

2 Files Affected:

  • (modified) mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp (+31-1)
  • (modified) mlir/test/Dialect/Linalg/canonicalize.mlir (+14)
diff --git a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
index 730c478c2883ef..658445f19e91aa 100644
--- a/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
+++ b/mlir/lib/Dialect/Linalg/IR/LinalgOps.cpp
@@ -806,6 +806,36 @@ struct FoldFillWithTensorExtract : public OpRewritePattern<tensor::ExtractOp> {
   }
 };
 
+/// Fold tensor.extract_slice(linalg.fill(<input>)) into <input>
+struct FoldFillWithTensorExtractSlice
+    : public OpRewritePattern<tensor::ExtractSliceOp> {
+public:
+  using OpRewritePattern<tensor::ExtractSliceOp>::OpRewritePattern;
+
+  LogicalResult matchAndRewrite(tensor::ExtractSliceOp extractSliceOp,
+                                PatternRewriter &rewriter) const override {
+    // See if tensor input of tensor.extract_slice op is the result of a
+    // linalg.fill op.
+    auto fillOp = extractSliceOp.getSource().getDefiningOp<linalg::FillOp>();
+    if (!fillOp)
+      return failure();
+
+    Value fillInput = fillOp.getInputs()[0];
+
+    Location loc = extractSliceOp.getLoc();
+    SmallVector<OpFoldResult> mixedSizes = extractSliceOp.getMixedSizes();
+    auto emptyOp = rewriter.create<tensor::EmptyOp>(
+        loc, mixedSizes, extractSliceOp.getType().getElementType());
+
+    // Replace tensor.extract_slice op with new linalg.fillOp (former's result
+    // type and shape).
+    rewriter.replaceOpWithNewOp<linalg::FillOp>(
+        extractSliceOp, extractSliceOp.getResultType(), ValueRange{fillInput},
+        ValueRange{emptyOp});
+    return success();
+  }
+};
+
 /// Folds pack(fill) into a single fill op if
 ///   1. The pack op does not have padding value, or
 ///   2. The filled value and padding value are the same.
@@ -936,7 +966,7 @@ struct FoldConcatsOfFill : public OpRewritePattern<tensor::ConcatOp> {
 void FillOp::getCanonicalizationPatterns(RewritePatternSet &results,
                                          MLIRContext *context) {
   results.add<FoldConcatsOfFill, FoldFillWithCopy, FoldFillWithTensorExtract,
-              FoldFillWithPack, FoldFillWithPad,
+              FoldFillWithTensorExtractSlice, FoldFillWithPack, FoldFillWithPad,
               FoldFillWithTensorReshape<tensor::CollapseShapeOp>,
               FoldFillWithTensorReshape<tensor::ExpandShapeOp>,
               FoldInsertPadIntoFill, FoldFillWithTranspose>(context);
diff --git a/mlir/test/Dialect/Linalg/canonicalize.mlir b/mlir/test/Dialect/Linalg/canonicalize.mlir
index 4bc2ed140da91a..763cf80241fcc0 100644
--- a/mlir/test/Dialect/Linalg/canonicalize.mlir
+++ b/mlir/test/Dialect/Linalg/canonicalize.mlir
@@ -352,6 +352,20 @@ func.func @fold_fill_extract(%arg0 : i1) -> i1 {
 
 // -----
 
+func.func @fold_fill_extract_slice() -> tensor<2x1920x64x66xf32> {
+  %c0 = arith.constant 0. : f32
+  %0 = tensor.empty() : tensor<2x1920x66x66xf32>
+  %1 = linalg.fill ins(%c0 : f32) outs(%0 : tensor<2x1920x66x66xf32>) -> tensor<2x1920x66x66xf32>
+  %extracted_slice = tensor.extract_slice %1[0, 0, 1, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
+  return %extracted_slice : tensor<2x1920x64x66xf32>
+}
+// CHECK-LABEL: func.func @fold_fill_extract_slice
+// CHECK:         %[[EMPTY_TENSOR:.+]] = tensor.empty() : tensor<2x1920x64x66xf32>
+// CHECK:         %[[FILL:.+]] = linalg.fill ins(%{{.+}}) outs(%[[EMPTY_TENSOR]]
+// CHECK:         return %[[FILL]]
+
+// -----
+
 func.func @fill_pack() -> tensor<24x32x16x16xf32> {
   %dest = tensor.empty() : tensor<384x512xf32>
   %cst = arith.constant 0.000000e+00 : f32


// -----

func.func @fold_fill_extract_slice() -> tensor<2x1920x64x66xf32> {

Contributor:

Please add a negative test, e.g. linalg.fill followed by tensor.insert? Provided that doesn't trigger some other canonicalization.

Author:

Could you elaborate why we may need such a test here? To test that such a pattern wouldn't get folded?

Contributor:

> To test that such a pattern wouldn't get folded?

Indeed. IMHO, we should always try covering "interesting" corner cases (within reason). These tests effectively document the code and its design - a negative test is an example of where the fold should not work. And if anything changes, we will know almost immediately (the test will start failing).

}
};

/// Fold tensor.extract_slice(linalg.fill(<input>)) into <input>

Contributor:

Hm, this comment is not quite correct, is it? You are replacing tensor.extract_slice with linalg.fill, right?

What if there are multiple uses of the original linalg.fill? Then this is adding another linalg.fill (and tensor.empty) rather than updating the original linalg.fill? Is that still more canonical?

Contributor:

I think even with multiple users it might be OK to introduce a new fill of the smaller shape. That's an opinion though. I think all such slice propagation patterns need to be kept separate, but fill -> slice might be worth adding to canonicalization.

func.func @fold_fill_extract_slice() -> tensor<2x1920x64x66xf32> {
%c0 = arith.constant 0. : f32
%0 = tensor.empty() : tensor<2x1920x66x66xf32>
%1 = linalg.fill ins(%c0 : f32) outs(%0 : tensor<2x1920x66x66xf32>) -> tensor<2x1920x66x66xf32>

Contributor:

What if this linalg.fill has multiple uses?

Contributor:

It doesn't matter. You will have two fills, one larger and one smaller. Fills are always better, I think (especially in tensor semantics).

@MaheshRavishankar:

We probably want to have a separate set of patterns that "bubble up slices" this way. In IREE we have a pass that does this: https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/DispatchCreation/BubbleUpExtractSlices.cpp . I think we want to just collect such patterns into a single populate* method and not put this in canonicalization.
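
For illustration, a minimal sketch of what such a dedicated entry point could look like (the function name and pattern grouping are hypothetical, not existing MLIR API); a pass that wants the bubbling behavior would opt in explicitly instead of getting it from the canonicalizer:

// Hypothetical populate* entry point collecting slice-bubbling patterns.
void populateBubbleUpExtractSliceOfFillPatterns(RewritePatternSet &patterns) {
  patterns.add<FoldFillWithTensorExtractSlice>(patterns.getContext());
}

// Typical use from a pass that opts in:
//   RewritePatternSet patterns(&getContext());
//   populateBubbleUpExtractSliceOfFillPatterns(patterns);
//   if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))
//     signalPassFailure();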

@joker-eph:

Can you elaborate why this isn’t a good canonicalization? Is this losing semantics or anything?

@MaheshRavishankar:

> Can you elaborate why this isn’t a good canonicalization? Is this losing semantics or anything?

Like Andrej said, it depends on whether for a compiler stack a linalg.fill -> extract_slice is better or two linalg.fills are better. I think it is an opinionated decision. So I can see reasons not to do it as a canonicalization. Fill is a bit of a corner case though. IMO it is always better to do this, but that is also based on my use case. I am not necessarily pushing back against this as a canonicalization, but if someone has an issue with it being one, then I'd be fine moving it out of canonicalization.

@joker-eph:

> Like Andrej said, it depends on whether for a compiler stack a linalg.fill -> extract_slice is better or two linalg.fills are better. I think it is an opinionated decision.

Sorry but that's not a principled way to think about canonicalization: this is actually a perfect example of why we need canonicalization. If the two forms can be equally well matched, the canonicalization is here so that a compiler does not have to match both but can match a single one.
A "compiler stack" cannot just expect something to be "better" without a principled argument, hence the question about semantic loss.

> if someone has an issue with it being one, then I'd be fine moving it out of canonicalization.

Sure, I'd want to hear a principled reason why it's not a good one (that is, "it is not always possible to match the new canonical form", or "we lost some important information from the original program"). This should always be possible to express with an example of input IR and the intended lowering.

@banach-space:

First, a note:

BEFORE (1 use):

%empty = tensor.empty
%fill = linalg.fill %empty
%res = tensor.extract_slice %fill

AFTER (1 use):

%empty = tensor.empty
%res = linalg.fill %empty

BEFORE (2 uses):

%empty = tensor.empty
%fill = linalg.fill %empty
%res_1 = tensor.extract_slice %fill
%res_2 = tensor.extract_slice %fill

AFTER (2 uses):

%empty_1 = tensor.empty
%res_1 = linalg.fill %empty_1
%empty_2 = tensor.empty
%res_2 = linalg.fill %empty_2

Looking at the case with 1 use, this feels like a canonicalization to me:

  • no information is lost and, with fewer Ops, the AFTER variant is basically simpler.

This is no longer so clear to me in the case with 2 uses. Perhaps this should be limited to cases with 1 use?

On a related note, going by the Tensor Op docs

[tensor.empty]
This op can be lowered to a bufferization.alloc_tensor, at which point it turns into an explicit buffer allocation.

[tensor.extract_slice]
After buffer allocation, the “extract_slice” op is expected to lower into a memref.subview op.

So, with 2 uses, this would potentially replace (after bufferization):

  • 1 allocation (bufferization.alloc_tensor) followed by 2 subviews (memref.subview), with
  • 2 allocations.

It's not immediately obvious to me whether one is more "canonical" or "better" than the other. If aliasing is an issue, separate allocations would help. If aliasing is not a problem, re-using one buffer is probably better?

No strong opinion here, mostly just analysing out loud.

@nithinsubbiah:

Thanks for the comments. I agree with @banach-space that this might not fit into a canonicalization pattern when the FillOp has multiple consumers, since that introduces an additional buffer allocation.

I've changed the pattern to fold only when there's one consumer of the FillOp.
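
For reference, the diff quoted above predates that change, so here is a minimal sketch (not the author's actual updated code) of how such a single-consumer guard is typically added to the pattern:

LogicalResult matchAndRewrite(tensor::ExtractSliceOp extractSliceOp,
                              PatternRewriter &rewriter) const override {
  auto fillOp = extractSliceOp.getSource().getDefiningOp<linalg::FillOp>();
  if (!fillOp)
    return failure();

  // Bail out unless the fill result has exactly one consumer; with multiple
  // consumers the rewrite introduces an extra tensor.empty (and, after
  // bufferization, potentially an extra allocation).
  if (!fillOp->hasOneUse())
    return failure();

  // Remainder as in the diff above: create a tensor.empty of the slice's
  // shape and fill it with the original fill value.
  Location loc = extractSliceOp.getLoc();
  auto emptyOp = rewriter.create<tensor::EmptyOp>(
      loc, extractSliceOp.getMixedSizes(),
      extractSliceOp.getType().getElementType());
  rewriter.replaceOpWithNewOp<linalg::FillOp>(
      extractSliceOp, extractSliceOp.getResultType(),
      ValueRange{fillOp.getInputs()[0]}, ValueRange{emptyOp});
  return success();
}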

@stellaraccident commented Oct 21, 2024:

> Thanks for the comments. I agree with @banach-space that this might not fit into a canonicalization pattern when the FillOp has multiple consumers, since that introduces an additional buffer allocation.
>
> I've changed the pattern to fold only when there's one consumer of the FillOp.

I really don't want to add to this thread, but I am skeptical that this qualifies if restricted to a single consumer: that doesn't actually simplify any downstream matching, because a compiler still has to pattern match all of the variants. I could imagine arguments why this might be the case, but having run into a lot of these fragile single-consumer optimizations trying to fly as a canonicalization, I am skeptical by default.

If the multi-use case is a canonical form, then so be it. But I expect that Mahesh is right that the actual optimization being sought here is a full graph transformation where the graph is put into a form such that some of these ops are maximally "bubbled". That is an optimization. The effect can be approximated by the canonicalizer, but it is typically not the right tool for the job, in my experience.

@nithinsubbiah:

Let me try to summarize based on the discussions we've had so far.

Folding of extract_slice into linalg.fill when the latter op has a single vs. multiple consumers. Do we consider either/both of them as part of canonicalization? I think there's agreement that when there's a single consumer of the FillOp, this pattern produces a simpler representation of the same set of input instructions.

When there are multiple consumers, an additional tensor.empty needs to be introduced, since replacing the old FillOp with a new sliced FillOp might not always work. An example of where it will not work:

%1 = linalg.fill
%2 = tensor.extract_slice %1
%3 = scf.for [%1]

I see two options here:

  1. Don't enforce the single-consumer constraint and introduce an additional tensor.empty.
  2. I can also create a separate pattern like https://github.com/llvm/llvm-project/blob/78e026f845fb4d924673a9d534cc36cf7b55473c/mlir/lib/Dialect/Linalg/Transforms/SwapExtractSliceWithFillPatterns.cpp

@MaheshRavishankar:

> Let me try to summarize based on the discussions we've had so far. [...] I see two options here: [...]

I think Andrej's point is good. If we want to add a canonicalization, we add it for a single user of the linalg.fill. But mainly I think we want to add a separate pattern that does the "bubbling up" of extract_slice irrespective of number of uses and that is a separate pattern + populate method.

@MaheshRavishankar left a comment:

I think you should avoid using a tensor.empty for the destination of the fill op. It is better to just use a slice of the destination.

Come to think of it, that way you could drop the multiple-use constraint too.
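
A minimal sketch of that suggestion (illustrative only, not code from this PR): instead of materializing a fresh tensor.empty, take the same slice of the original fill's destination and use it as the outs of the new fill:

// Inside matchAndRewrite, replacing the tensor.empty-based rewrite from the
// diff: slice the original fill's destination and fill that slice instead.
Location loc = extractSliceOp.getLoc();
Value fillDest = fillOp.getOutputs()[0];
Value slicedDest = rewriter.create<tensor::ExtractSliceOp>(
    loc, fillDest, extractSliceOp.getMixedOffsets(),
    extractSliceOp.getMixedSizes(), extractSliceOp.getMixedStrides());
rewriter.replaceOpWithNewOp<linalg::FillOp>(
    extractSliceOp, extractSliceOp.getResultType(),
    ValueRange{fillOp.getInputs()[0]}, ValueRange{slicedDest});

Since the new fill writes into a slice of the original destination rather than a new tensor.empty, the rewrite would no longer hinge on the fill having a single use; after bufferization the slice is expected to become a subview of the same buffer (per the tensor.extract_slice docs quoted earlier).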

@MaheshRavishankar:

> BEFORE (2 uses):
>
> %empty = tensor.empty
> %fill = linalg.fill %empty
> %res_1 = tensor.extract_slice %fill
> %res_2 = tensor.extract_slice %fill
>
> AFTER (2 uses):
>
> %empty_1 = tensor.empty
> %res_1 = linalg.fill %empty_1
> %empty_2 = tensor.empty
> %res_2 = linalg.fill %empty_2

@banach-space, how about this change for this example:

%empty = tensor.empty
%extract_slice1 = tensor.extract_slice %empty
%extract_slice2 = tensor.extract_slice %empty
%res1 = linalg.fill (...) outs(%extract_slice1)
%res2 = linalg.fill (...) outs(%extract_slice2)

That seems like it should fall in the "always good to do" category. Again, I am not advocating one way or another, just thinking aloud.

@banach-space:

> @banach-space, how about this change for this example:
>
> %empty = tensor.empty
> %extract_slice1 = tensor.extract_slice %empty
> %extract_slice2 = tensor.extract_slice %empty
> %res1 = linalg.fill (...) outs(%extract_slice1)
> %res2 = linalg.fill (...) outs(%extract_slice2)
>
> That seems like it should fall in the "always good to do" category. Again, I am not advocating one way or another, just thinking aloud.

I see 4 Ops before and 5 Ops after, so it's not immediately obvious to me that AFTER is better :) What metric do you use to decide what's better/worse? Not claiming my metric is perfect, just sharing how I reason about these things.

I tried this simple experiment to better understand the tradeoffs (from my biased viewpoint):

// 4 Ops - empty + fill + extract_slice + extract_slice
func.func @foo() -> (tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>) {
  %c0 = arith.constant 0. : f32
  %0 = tensor.empty() : tensor<2x1920x66x66xf32>
  %1 = linalg.fill ins(%c0 : f32) outs(%0 : tensor<2x1920x66x66xf32>) -> tensor<2x1920x66x66xf32>
  %extracted_slice = tensor.extract_slice %1[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
  %extracted_slice_2 = tensor.extract_slice %1[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
  return %extracted_slice,%extracted_slice_2 : tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>
}

// 5 Ops - empty + extract_slice + extract_slice + fill + fill
func.func @bar() -> (tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>) {
  %c0 = arith.constant 0. : f32
  %0 = tensor.empty() : tensor<2x1920x66x66xf32>
  %extracted_slice = tensor.extract_slice %0[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
  %extracted_slice_2 = tensor.extract_slice %0[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
  %res_1 = linalg.fill ins(%c0 : f32) outs(%extracted_slice : tensor<2x1920x64x66xf32>) -> tensor<2x1920x64x66xf32>
  %res_2 = linalg.fill ins(%c0 : f32) outs(%extracted_slice_2 : tensor<2x1920x64x66xf32>) -> tensor<2x1920x64x66xf32>
  return %res_1, %res_2 : tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>
}

After bufferization (-one-shot-bufferize="bufferize-function-boundaries"):

  func.func @foo() -> (memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>) {
    %cst = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<2x1920x66x66xf32>
    linalg.fill ins(%cst : f32) outs(%alloc : memref<2x1920x66x66xf32>)
    %subview = memref.subview %alloc[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>
    %subview_0 = memref.subview %alloc[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
    %cast = memref.cast %subview : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
    %cast_1 = memref.cast %subview_0 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
    return %subview, %subview_0 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
  }
  func.func @bar() -> (memref<2x1920x64x66xf32>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>) {
    %cst = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<2x1920x66x66xf32>
    %subview = memref.subview %alloc[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<2x1920x64x66xf32>
    %subview_1 = memref.subview %alloc[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
    linalg.fill ins(%cst : f32) outs(%alloc_0 : memref<2x1920x64x66xf32>)
    linalg.fill ins(%cst : f32) outs(%subview_1 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>)
    %cast = memref.cast %alloc_0 : memref<2x1920x64x66xf32> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
    %cast_2 = memref.cast %subview_1 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
    return %alloc_0, %subview_1 : memref<2x1920x64x66xf32>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
  }

So, with 5 Ops there's one additional big allocation and an additional FillOp, which is not ideal. However, I appreciate that a lot can happen between the proposed canonicalization and bufferization (i.e. my example is a bit academic).

> I think you should avoid using a tensor.empty for the destination of the fill op. It is better to just use a slice of the destination.

Could you elaborate? I'd like to better follow the rationale here.

Just to clarify, I am not against this change. And it should be OK to keep it as a canonicalization - no semantic context is lost, and matching should be similar. IIUC, you are trying to get rid of tensor.extract_slice? I am trying to better grasp what it is that we are aiming for 😅

@MaheshRavishankar:

Thanks for checking... I hadn't connected things correctly. After seeing the output you have, I think it makes more sense to me to just do this for the single-use case for now as a canonicalization.

@banach-space left a comment:

LGTM modulo my final (minor) comments.

Great technical discussion, thank you everyone!

}
};

/// Fold tensor.extract_slice(linalg.fill(<input>)) into linalg.fill(<input>)

Contributor:

[nit] Add a note that this triggers, by design, only when there is a single use (with one use this is obviously beneficial; with multiple uses it is not so clear).
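
For instance, the requested note could read something like this (wording is only a suggestion, not the author's):

/// Fold tensor.extract_slice(linalg.fill(<input>)) into linalg.fill(<input>).
/// By design this only applies when the linalg.fill result has a single use:
/// with one use the rewrite is clearly simpler, while with multiple uses it
/// would introduce an extra tensor.empty + linalg.fill pair, which is not
/// obviously more canonical.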

