Linalg in Triton #1842
Replies: 11 comments 10 replies
-
@ingomueller-net sorry for misspelling your screenname.
-
@manbearian Sure, that would be great! We're a bit confused about the program you mentioned in #1797 . Could you please explain it in more detail?
-
I think both of your questions are related to the model we're proposing. In the proposed model we differentiate between Triton pointers and Triton values/tensors/blocks. We translate the pointers into unranked memrefs and the values into proper tensors. We haven't encountered any function signature mismatch: unranked memrefs are just lowered as raw pointers, so they exactly match the original kernel. Our tiling and fusion code runs on the linalg operations and ignores the loads/stores. Since the loads/stores represent the boundary between data of unknown size in shared memory and data of known size in local memory, this works out for us.
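A rough way to picture the split described above (this is an illustrative sketch only — the names and the function are hypothetical, not the actual lowering):

```python
# Illustrative sketch, not the actual pass: a kernel argument is a buffer
# of unknown size (the unranked-memref side of the model), while each load
# produces a block of statically known size (the ranked-tensor side) that
# tiling/fusion can reason about.
def load_block(buf, offset, block_size):
    """Cross the boundary: unknown-size memory -> known-size value."""
    block = buf[offset:offset + block_size]
    # Masked/partial loads at the buffer edge would need extra handling.
    assert len(block) == block_size
    return block
```

Everything downstream of such a load sees only fixed-shape values, which is why the tiling/fusion code can ignore the loads/stores themselves.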
-
Yes, thank you for your explanations, that makes sense, but I have a minor question regarding the function signature. In the memref-to-llvm conversion pass, an unranked memref is converted to {rank, raw_ptr}, which is why I asked about the function signature; I don't know if I missed something.
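For reference, the layout MLIR's memref-to-llvm lowering uses for an unranked memref can be pictured like this (a simplified Python stand-in for the LLVM struct, for illustration only):

```python
from dataclasses import dataclass

# Simplified picture of MLIR's unranked-memref descriptor after
# memref-to-llvm conversion: a (rank, pointer) pair, where the pointer
# refers to a ranked descriptor whose shape is only known at runtime.
@dataclass
class UnrankedMemRefDescriptor:
    rank: int        # number of dimensions, known only at runtime
    descriptor: int  # opaque pointer to the ranked memref descriptor
```

So with the default conversion, a function taking an unranked memref carries two values rather than a single bare pointer — which is the signature difference being asked about here.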
-
Regarding the pointer continuity analysis: we obtained the analysis algorithm by expanding Triton's own
The meaning of
We use
By doing this, we think there are the following advantages:
I'm really looking forward to your feedback or suggestions on the proposal.
-
Hi: Thanks Ian, Nhat and folks at Microsoft for this triton-to-linalg contribution. Very useful.
MEMREF
For instance, in MLIR it is possible to lower a ranked memref to a bare pointer using e.g.
MEMCOPY/ALLOC
-
Hi, folks. @javedabsar1 , let me address your first point:
Is it that the model doesn't make sense for your HW target? Or that it doesn't work with your optimizer (e.g., copies might be inserted later in your optimizer flow)? Thanks! And please keep the conversation going!
-
We have three things planned to update "soon"-ish
i don't have a time-frame for these yet, but i hope to land something in August with some of this.
-
@manbearian How are the Microsoft folks getting Triton IR to feed into the
My solution is a bit general purpose, so if the Microsoft folks have a more elegant integration of Triton IR compilation to
-
With what is in this branch i think working from the Triton IR (ttir) makes the most sense. If you look at the tests we've checked in, this is what they're doing (none of them invoke the Python code path). In order to generate the TTIR we do something like this:

```python
ret = triton.compile(kernel, signature="*fp32,i32,*fp32,i32", constants={"BLOCK_M": 64, "BLOCK_N": 64})
print(ret.asm["ttir"])
```

But, as you mentioned, i believe this invokes the full pipeline. We have an end-to-end test compiler that we're using for this work internally and it creates its own pass pipeline. We pass an extra argument to the
-
FYI: we've republished the work, with the PR redone as a plug-in: #2374
-
@sethbrin and @ingomueller-net and @nhat-nguyen
Hi folks,
Nhat and i work on ML compilation at Microsoft. Our team is very interested in adding a lowering from Triton IR to Linalg IR to the Triton compiler. Our goal is to create a lowering to a common MLIR dialect that teams can use to build, or leverage existing, code generators and analyses for Triton. Our goal is not to use the Linalg dialect in cases where the TritonGPU dialect is used, but rather to offer it as an alternative path.
i'd like to use this thread to talk about how to converge our efforts to add the linalg dialect to Triton.
For reference here are three approaches from our respective groups:
#1797
#1542
https://github.com/iree-org/iree-llvm-sandbox/tree/main/lib/Conversion/TritonToLLVM
The Microsoft contribution in PR #1797 has two parts: first, a pointer analysis pass that identifies contiguous loads/stores, and second, the actual conversion between dialects. The pointer analysis is somewhat orthogonal to the Linalg dialect lowering, but i believe it is beneficial to non-GPU architectures. This pointer analysis can fail (since not all loads/stores are contiguous); in that case a loop or scatter/gather is required, but that is not yet implemented in our approach.
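To make the contiguity idea concrete, here is a hypothetical sketch (not the actual analysis, which reasons symbolically over the IR rather than over concrete offsets): a block of pointer offsets of the form base + stride * arange(0, N) describes a contiguous load/store exactly when the stride is 1.

```python
# Hypothetical illustration of the property the pointer analysis looks
# for -- the real pass works symbolically on Triton IR, not on concrete
# offset lists.
def block_offsets(base, stride, n):
    """Offsets of the form base + stride * arange(0, n)."""
    return [base + stride * i for i in range(n)]

def is_contiguous(offsets):
    """A block access is contiguous when consecutive offsets differ by 1."""
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))
```

A unit-stride block passes the check and can become a single bulk transfer; a strided block fails it and would need the loop or scatter/gather fallback mentioned above.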
Can we collaborate using this branch: https://github.com/openai/triton/tree/triton-to-linalg ?