-
Notifications
You must be signed in to change notification settings - Fork 15.3k
[mlir][scf]: Avoid inserting affine.min when tiling dynamic operation sizes if possible #113819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@llvm/pr-subscribers-mlir-scf @llvm/pr-subscribers-mlir Author: Aviad Cohen (AviadCo) Changes
Full diff: https://github.com/llvm/llvm-project/pull/113819.diff 2 Files Affected:
diff --git a/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp b/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
index e2feb10b314540..54afbe4b032bd4 100644
--- a/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
+++ b/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
@@ -186,11 +186,35 @@ static void checkSafeToTileToForall(TilingInterface op,
}
}
+/// Returns true if `size` is dynamic multiplication of `stride`.
+/// i.e. , `size = stride * k` where stride may be a constant or a dynamic.
+static bool dynamiclyDivisible(OpFoldResult size, OpFoldResult stride) {
+ Value dynamicSize = dyn_cast_if_present<Value>(size);
+ if (!dynamicSize)
+ return false;
+ auto mulOp = dynamicSize.getDefiningOp<arith::MulIOp>();
+ if (!mulOp)
+ return false;
+ if (Value dynamicStride = dyn_cast_if_present<Value>(stride))
+ return mulOp.getLhs() == dynamicStride || mulOp.getRhs() == dynamicStride;
+ std::optional<int64_t> strideAsInt = getConstantIntValue(stride);
+ std::optional<int64_t> lhsAsInt = getConstantIntValue(mulOp.getLhs());
+ std::optional<int64_t> rhsAsInt = getConstantIntValue(mulOp.getRhs());
+ if (strideAsInt && lhsAsInt && *strideAsInt == *lhsAsInt)
+ return true;
+ if (strideAsInt && rhsAsInt && *strideAsInt == *rhsAsInt)
+ return true;
+
+ return false;
+}
+
/// Check if `stride` evenly divides the trip count `size - offset`.
static bool tileDividesIterationDomain(Range loopRange) {
std::optional<int64_t> offsetAsInt = getConstantIntValue(loopRange.offset);
if (!offsetAsInt)
return false;
+ if (*offsetAsInt == 0 && dynamiclyDivisible(loopRange.size, loopRange.stride))
+ return true;
std::optional<int64_t> sizeAsInt = getConstantIntValue(loopRange.size);
if (!sizeAsInt)
return false;
diff --git a/mlir/test/Dialect/Linalg/transform-op-tile.mlir b/mlir/test/Dialect/Linalg/transform-op-tile.mlir
index 7bac850d0b7fe9..ade523ef378f36 100644
--- a/mlir/test/Dialect/Linalg/transform-op-tile.mlir
+++ b/mlir/test/Dialect/Linalg/transform-op-tile.mlir
@@ -266,3 +266,50 @@ func.func @tile_linalg_matmul(
-> tensor<128x128xf32>
return %0 : tensor<128x128xf32>
}
+
+// -----
+
+#map = affine_map<(d0) -> (d0)>
+
+// CHECK-LABEL: splited_dynamic_linalg_generic
+func.func @splited_dynamic_linalg_generic(%arg0: tensor<?xi16>, %arg1: tensor<?xi16>) -> tensor<?xi16> {
+ %c80 = arith.constant 80 : index
+ %c0 = arith.constant 0 : index
+ %dim = tensor.dim %arg1, %c0 : tensor<?xi16>
+ %0 = tensor.empty(%dim) : tensor<?xi16>
+ %1 = arith.divui %dim, %c80 : index
+ %2 = arith.muli %1, %c80 : index
+ %3 = arith.remui %dim, %c80 : index
+ %extracted_slice = tensor.extract_slice %arg0[0] [%2] [1] : tensor<?xi16> to tensor<?xi16>
+ %extracted_slice_0 = tensor.extract_slice %arg1[0] [%2] [1] : tensor<?xi16> to tensor<?xi16>
+ %extracted_slice_1 = tensor.extract_slice %0[0] [%2] [1] : tensor<?xi16> to tensor<?xi16>
+ // CHECK: scf.for
+ // CHECK-NOT: affine.min
+ %4 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%extracted_slice, %extracted_slice_0 : tensor<?xi16>, tensor<?xi16>) outs(%extracted_slice_1 : tensor<?xi16>) {
+ ^bb0(%in_1: i16, %in_2: i16, %out: i16):
+ %6 = arith.addi %in_1, %in_2 : i16
+ linalg.yield %6 : i16
+ } -> tensor<?xi16>
+ %inserted_slice = tensor.insert_slice %4 into %0[%2] [%2] [1] : tensor<?xi16> into tensor<?xi16>
+ %extracted_slice_2 = tensor.extract_slice %arg0[%2] [%3] [1] : tensor<?xi16> to tensor<?xi16>
+ %extracted_slice_3 = tensor.extract_slice %arg1[%2] [%3] [1] : tensor<?xi16> to tensor<?xi16>
+ %extracted_slice_4 = tensor.extract_slice %0[%2] [%3] [1] : tensor<?xi16> to tensor<?xi16>
+ // CHECK-NOT: scf.for
+ %5 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%extracted_slice_2, %extracted_slice_3 : tensor<?xi16>, tensor<?xi16>) outs(%extracted_slice_4 : tensor<?xi16>) {
+ ^bb0(%in_1: i16, %in_2: i16, %out: i16):
+ %7 = arith.addi %in_1, %in_2 : i16
+ linalg.yield %7 : i16
+ } -> tensor<?xi16>
+ %inserted_slice_0 = tensor.insert_slice %5 into %inserted_slice[%2] [%3] [1] : tensor<?xi16> into tensor<?xi16>
+ return %inserted_slice_0 : tensor<?xi16>
+}
+
+
+module attributes {transform.with_named_sequence} {
+transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
+ %0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!transform.any_op) -> !transform.any_op
+ %const = transform.structured.match ops{["arith.constant"]} in %arg1 : (!transform.any_op) -> !transform.any_op
+ %1, %loop = transform.structured.tile_using_for %0 tile_sizes [%const] : (!transform.any_op, !transform.any_op) -> (!transform.any_op, !transform.any_op)
+ transform.yield
+}
+}
|
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
d88f749 to
33ba8d6
Compare
MaheshRavishankar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to this as a cleanup on `
| } | ||
| } | ||
|
|
||
| /// Returns true if `size` is dynamic multiplication of `stride`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow. The tileDividesIterationDomain is checking for (ub - lb) % tilesize = 0. Why doesnt that catch the case you are trying to cover here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaheshRavishankar
you are right, my change adds dynamic support for (ub - lb) % tilesize where lb is zero, and ub is dynamicly defined by possibly_dynamic_ub = possibly_dynamic_tilesize * possibly_dynamic_stride.
Some background for this change - I am using split on dynamic operations (i.e. linalg::generic) where the I use arith.divui to calculate a tile which fits my cache and another part of arith.remui if the dynamic size won't be perfectly divided.
After the split, the first part size is defined by arith.muli where one of its operands is the tile size I am using to tile the operation - but the the ub is now considered dynamic although we can know the tile size divides it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see what you are trying. I am not a fan of this kind of piecemeal approaches. I tried something. I changed your input to use affine maps. Still didnt work. I get something like this
%1 = affine.apply affine_map<()[s0] -> ((s0 floordiv 80) * 80)>()[%dim]
%3 = scf.for %arg2 = %c0 to %1 step %c80 iter_args(%arg3 = %extracted_slice_1) -> (tensor<?xi16>) {
%5 = affine.min affine_map<(d0)[s0] -> (-d0 + (s0 floordiv 80) * 80, 80)>(%arg2)[%dim]
IIRC adding --scf-for-loop-canonicalization should have fixed the problem, but it didnt
cc @matthias-springer I thought this was fixed at some time.
Anyway, I am OK with this patch overall given where things are, but this needs to account for the lower bound of the loop as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked into this pass for a long time. What I vaguely remember is that -scf-for-loop-canonicalization has always been a bit fragile due to limitations around semi-affine expressions in FlatAffineValueConstraints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaheshRavishankar thanks for the replay!
I tried to take a look at scf-for-loop-canonicalization which I was unfamiliar with, as for now I didn't manage to find the issue why it does not manage to remove the affine.min in this case. I updated the MR to also handle none 0 offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaheshRavishankar can you please take a look on the latest patch?
33ba8d6 to
d90fa69
Compare
… sizes if possible * During operation tiling using scf, we may avoid inserting affine.min to handle the last tile where `upper_bound = step * k` where stride may be a constant or a dynamic.
d90fa69 to
7a5ab7c
Compare
MaheshRavishankar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this seems to be doing much more than what the original PR set out to do. I am fine with it, but I'd really like this logic simpler. I cant gaurantee that as written this is maintainable.
| if (!strideAsInt) | ||
| return false; | ||
| return ((sizeAsInt.value() - offsetAsInt.value()) % strideAsInt.value() == 0); | ||
| if (strideAsInt && offsetAsInt && sizeAsInt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please add { } around multi-line statements.
| return; | ||
|
|
||
| // Given `ofr` = `x` * `y`, all dividers of `x` and `y` are dividers of `ofr`. | ||
| collectDividers(mulOp.getLhs(), dividers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recursions are generally discouraged.
| std::optional<int64_t> dividerAsInt = getConstantIntValue(divider); | ||
| return dividerAsInt && *dividerAsInt % *strideAsInt == 0; | ||
| }; | ||
| return llvm::any_of(dividersOfSize, isStrideDividesDivider) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dont you think this logic is more complicated that you want it to be. I'd like to scope this a bit more narrowly. Just look for the immediate operations being a arith.mul and avoid the recursion.
The more general solution should be using something the IntegerDivisibilityAnalysis (here) that is very similar to the range inference analysis that can be used to fold this away. Doing this in transformations like this becomes unmaintainable in the long run.
|
@MaheshRavishankar per the discussion with you I decided to abandon the approach of searching patterns in arith operations as it may be too fragile like you mentioned. Instead I am avoiding the "split" before tiling and handling the |
upper_bound = step * kwhere stride may be a constant or a dynamic.