Skip to content

Conversation

@cferry-AMD
Copy link
Collaborator

@cferry-AMD cferry-AMD commented Apr 25, 2025

This PR gives the possibility to change the order in which tiling happens along with fusion.

@cferry-AMD cferry-AMD marked this pull request as ready for review April 29, 2025 13:10
@cferry-AMD
Copy link
Collaborator Author

I tried to use the result op order within the slice as a reference instead of counting remarks, but it turns out it does not match:

within split at mlir/test/Dialect/Linalg/tile-sort.mlir:1 offset :13:8: remark: Fused op in position 0
  %0 = linalg.ceil ins(%arg: tensor<256xf32>) outs(%empty: tensor<256xf32>) -> tensor<256xf32>
       ^
within split at mlir/test/Dialect/Linalg/tile-sort.mlir:1 offset :13:8: note: see current operation: %1 = linalg.ceil ins(%arg0 : tensor<256xf32>) outs(%0 : tensor<256xf32>) -> tensor<256xf32>
within split at mlir/test/Dialect/Linalg/tile-sort.mlir:1 offset :18:8: remark: Fused op in position 1
  %1 = linalg.negf ins(%0 : tensor<256xf32>) outs(%empty1: tensor<256xf32>) -> tensor<256xf32>
       ^
within split at mlir/test/Dialect/Linalg/tile-sort.mlir:1 offset :18:8: note: see current operation: %3 = linalg.negf ins(%1 : tensor<256xf32>) outs(%2 : tensor<256xf32>) -> tensor<256xf32>
within split at mlir/test/Dialect/Linalg/tile-sort.mlir:1 offset :13:8: remark: Fused op in position 2
  %0 = linalg.ceil ins(%arg: tensor<256xf32>) outs(%empty: tensor<256xf32>) -> tensor<256xf32>
       ^
within split at mlir/test/Dialect/Linalg/tile-sort.mlir:1 offset :13:8: note: see current operation: %1 = linalg.ceil ins(%arg0 : tensor<256xf32>) outs(%0 : tensor<256xf32>) -> tensor<256xf32>

and the output is:

    %4 = scf.for %arg1 = %c0 to %c256 step %c32 iter_args(%arg2 = %3) -> (tensor<256xf32>) {
      [...]
      %6 = linalg.ceil ins(%extracted_slice : tensor<32xf32>) outs(%extracted_slice_1 : tensor<32xf32>) -> tensor<32xf32>
      [...]
      %8 = linalg.ceil ins(%extracted_slice_2 : tensor<32xf32>) outs(%extracted_slice_4 : tensor<32xf32>) -> tensor<32xf32>
      [...]
      %10 = linalg.negf ins(%8 : tensor<32xf32>) outs(%extracted_slice_6 : tensor<32xf32>) -> tensor<32xf32>
      [...]
      %12 = linalg.powf {tile} ins(%6, %10 : tensor<32xf32>, tensor<32xf32>) outs(%extracted_slice_8 : tensor<32xf32>) -> tensor<32xf32>
      %inserted_slice = tensor.insert_slice %12 into %arg2[%arg1] [32] [1] : tensor<32xf32> into tensor<256xf32>
      scf.yield %inserted_slice : tensor<256xf32>
    }

so ceil, ceil, negf, powf instead of ceil, negf, ceil, powf... so I have to rely on the remark for now. I don't really like abusing the remarks for that...

@mgehre-amd
Copy link
Collaborator

I tried to use the result op order within the slice as a reference instead of counting remarks, but it turns out it does not match:

within split at mlir/test/Dialect/Linalg/tile-sort.mlir:1 offset :13:8: remark: Fused op in position 0
  %0 = linalg.ceil ins(%arg: tensor<256xf32>) outs(%empty: tensor<256xf32>) -> tensor<256xf32>
       ^
within split at mlir/test/Dialect/Linalg/tile-sort.mlir:1 offset :13:8: note: see current operation: %1 = linalg.ceil ins(%arg0 : tensor<256xf32>) outs(%0 : tensor<256xf32>) -> tensor<256xf32>
within split at mlir/test/Dialect/Linalg/tile-sort.mlir:1 offset :18:8: remark: Fused op in position 1
  %1 = linalg.negf ins(%0 : tensor<256xf32>) outs(%empty1: tensor<256xf32>) -> tensor<256xf32>
       ^
within split at mlir/test/Dialect/Linalg/tile-sort.mlir:1 offset :18:8: note: see current operation: %3 = linalg.negf ins(%1 : tensor<256xf32>) outs(%2 : tensor<256xf32>) -> tensor<256xf32>
within split at mlir/test/Dialect/Linalg/tile-sort.mlir:1 offset :13:8: remark: Fused op in position 2
  %0 = linalg.ceil ins(%arg: tensor<256xf32>) outs(%empty: tensor<256xf32>) -> tensor<256xf32>
       ^
within split at mlir/test/Dialect/Linalg/tile-sort.mlir:1 offset :13:8: note: see current operation: %1 = linalg.ceil ins(%arg0 : tensor<256xf32>) outs(%0 : tensor<256xf32>) -> tensor<256xf32>

and the output is:

    %4 = scf.for %arg1 = %c0 to %c256 step %c32 iter_args(%arg2 = %3) -> (tensor<256xf32>) {
      [...]
      %6 = linalg.ceil ins(%extracted_slice : tensor<32xf32>) outs(%extracted_slice_1 : tensor<32xf32>) -> tensor<32xf32>
      [...]
      %8 = linalg.ceil ins(%extracted_slice_2 : tensor<32xf32>) outs(%extracted_slice_4 : tensor<32xf32>) -> tensor<32xf32>
      [...]
      %10 = linalg.negf ins(%8 : tensor<32xf32>) outs(%extracted_slice_6 : tensor<32xf32>) -> tensor<32xf32>
      [...]
      %12 = linalg.powf {tile} ins(%6, %10 : tensor<32xf32>, tensor<32xf32>) outs(%extracted_slice_8 : tensor<32xf32>) -> tensor<32xf32>
      %inserted_slice = tensor.insert_slice %12 into %arg2[%arg1] [32] [1] : tensor<32xf32> into tensor<256xf32>
      scf.yield %inserted_slice : tensor<256xf32>
    }

so ceil, ceil, negf, powf instead of ceil, negf, ceil, powf... so I have to rely on the remark for now. I don't really like abusing the remarks for that...

As the output is only for testing, I suggest to use a pass option + output on llvm::errs().

@cferry-AMD
Copy link
Collaborator Author

Instead of using the error output, I went for an attribute. I guess it's a bit neater than resorting to a remark that then gets caught back, and is likely not getting used in production contexts.

I'm all for a pass option, unfortunately TileUsingInterface isn't a pass but rather just a standalone function passes are expected to call. How does the use of an extra attribute debug_worklist on the test op for tiling in the transform dialect, as is currently done, look like to you? I'm no big fan of that, and open to anything less invasive...

@mgehre-amd
Copy link
Collaborator

Instead of using the error output, I went for an attribute. I guess it's a bit neater than resorting to a remark that then gets caught back, and is likely not getting used in production contexts.

I'm all for a pass option, unfortunately TileUsingInterface isn't a pass but rather just a standalone function passes are expected to call. How does the use of an extra attribute debug_worklist on the test op for tiling in the transform dialect, as is currently done, look like to you? I'm no big fan of that, and open to anything less invasive...

There are a few tests that use -debug-only=name + llvm::dbgs() + FileCheck. Maybe we should follow that?

@mgehre-amd
Copy link
Collaborator

Can you add a test case that uses setWorklistInsertFn to show that we can affect the order?

@cferry-AMD cferry-AMD requested a review from mgehre-amd May 5, 2025 12:55
Co-authored-by: Matthias Gehre <[email protected]>
@cferry-AMD cferry-AMD enabled auto-merge (squash) May 6, 2025 05:55
@cferry-AMD cferry-AMD merged commit acc5603 into feature/fused-ops May 6, 2025
4 checks passed
@cferry-AMD cferry-AMD deleted the corentin.tiling_order branch May 6, 2025 06:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants