[examples][xegpu-matmul] Add XeGPU matrix multiplication example #28
Conversation
Force-pushed from 2ba2713 to 58e3f28.
rolfmorel left a comment:
Nice! Left some comments inline.
For the failing CI, have a look at adding both a
For the non-runnable files, add a file
one = arith.constant(index_t, 1)
nwarmup_cst = arith.constant(index_t, nwarmup)
for i in scf.for_(zero, nwarmup_cst, one):
    # FIXME(upstream): func.call is broken for this use case?
Can confirm that CamelCaseOp is subclassed (and that this subclass shadows the autogen-ed CamelCaseOp version) while the autogen-ed snake_case wrapper is not shadowed. Hence using func.call returns the autogen-ed CallOp and not its subclass.
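As a hedged sketch of a workaround (assuming the upstream MLIR Python bindings; the toy `callee`/`caller` functions below are only for illustration, not code from this PR), constructing the subclassed `func.CallOp` directly sidesteps the unshadowed snake_case wrapper:

```python
# Hedged sketch, not the PR's code: call via the hand-written func.CallOp
# subclass instead of the autogen-ed snake_case func.call wrapper.
from mlir import ir
from mlir.dialects import func

with ir.Context(), ir.Location.unknown():
    module = ir.Module.create()
    with ir.InsertionPoint(module.body):
        f32 = ir.F32Type.get()

        @func.FuncOp.from_py_func(f32)
        def callee(x):
            return x

        @func.FuncOp.from_py_func(f32)
        def caller(x):
            # The subclassed CallOp accepts a FuncOp plus the operand list;
            # func.call would return the plain autogen-ed CallOp instead.
            call = func.CallOp(callee.func_op, [x])
            return call.results[0]

    print(module)
```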
fargs.append(memref_c_t)


@func.func(*fargs, name=func_name)
def payload(*args):
To demonstrate the full end-to-end flow, we should look at how easy it is to automatically convert IR in the following form into a form that works with your schedule:
module {
func.func @main(%arg0: tensor<2048x8192xf32>, %arg1: tensor<8192x4096xf32>) -> tensor<2048x4096xf32> {
%cst = arith.constant 0.000000e+00 : f32
%0 = tensor.empty() : tensor<2048x4096xf32>
%1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<2048x4096xf32>) -> tensor<2048x4096xf32>
%2 = linalg.matmul ins(%arg0, %arg1 : tensor<2048x8192xf32>, tensor<8192x4096xf32>) outs(%1 : tensor<2048x4096xf32>) -> tensor<2048x4096xf32>
return %2 : tensor<2048x4096xf32>
}
}
That is, the above is the IR we get from torch-mlir from a basic matmul in Torch: https://github.com/ScalingIntelligence/KernelBench/blob/5c88b2319076e8d44b9901914de7b45d220944e9/KernelBench/level1/2_Standard_matrix_multiplication_.py
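For reference, a hedged sketch of how such IR can be produced (the exact entry point depends on the torch-mlir version; older releases expose `torch_mlir.compile`, newer ones use `torch_mlir.fx.export_and_import`):

```python
# Hedged sketch: lower a plain torch.matmul to linalg-on-tensors with the
# classic torch-mlir API; newer releases use torch_mlir.fx.export_and_import.
import torch
import torch_mlir


class Matmul(torch.nn.Module):
    def forward(self, a, b):
        return torch.matmul(a, b)


module = torch_mlir.compile(
    Matmul(),
    [torch.ones(2048, 8192), torch.ones(8192, 4096)],
    output_type="linalg-on-tensors",
)
print(module)  # should resemble the fill + matmul module above
```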
That would make the demonstration "fully upstream" and "fully end-to-end."
Yes. The above kernel bufferizes to a kernel that allocates the output memref. In this case, we'd need a mechanism to convert the alloc, e.g., to a gpu alloc if necessary. And we'd need to track the allocated buffers and deallocate them later.
If the kernel updates a tensor in-place, it's a little trickier. At the tensor level, the function must return the updated tensor. This return value becomes redundant after bufferization. In fact, the return value can cause a copy, so the input and output memrefs end up being different, breaking the semantics. We could drop the return value after bufferization, but that changes the function signature, which is often undesirable.
This matmul example demonstrates an update-in-place kernel. In this case it's easiest if we define the function boundary with memrefs and keep it fixed.
For autotuning, we cannot use kernels that allocate the output buffer. So yes, we'd need to find a way to convert, say, a torch-mlir kernel to in-place-update semantics. Should not be too hard actually, as the input/output roles of the arguments are clear.
For simple kernels like this, there are bufferization passes that can help with that. However, we might have to also explore more robust solutions long-term.
I'd keep the current example as is, and iterate later.
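As a hedged illustration of that kind of pass (upstream One-Shot Bufferize; whether it suffices for this pipeline is an assumption), bufferizing the tensor-level matmul across function boundaries looks roughly like this:

```python
# Hedged sketch: bufferize the tensor-level kernel from above across
# function boundaries with upstream One-Shot Bufferize. Deallocation and
# any gpu alloc conversion would still have to be handled separately.
from mlir.ir import Context, Module
from mlir.passmanager import PassManager

ASM = """
func.func @main(%arg0: tensor<2048x8192xf32>, %arg1: tensor<8192x4096xf32>) -> tensor<2048x4096xf32> {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = tensor.empty() : tensor<2048x4096xf32>
  %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<2048x4096xf32>) -> tensor<2048x4096xf32>
  %2 = linalg.matmul ins(%arg0, %arg1 : tensor<2048x8192xf32>, tensor<8192x4096xf32>) outs(%1 : tensor<2048x4096xf32>) -> tensor<2048x4096xf32>
  return %2 : tensor<2048x4096xf32>
}
"""

with Context():
    module = Module.parse(ASM)
    pm = PassManager.parse(
        "builtin.module(one-shot-bufferize{bufferize-function-boundaries})"
    )
    pm.run(module.operation)
    # The function now takes and returns memrefs; note it still *returns*
    # the result buffer, i.e. this alone does not give in-place-update
    # semantics at the function boundary.
    print(module)
```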
Co-authored-by: Adam Siemieniuk <[email protected]>
Force-pushed from 2c0acab to 163e4a7, then from 163e4a7 to fbcc453.
I've set up CI such that it just dumps the IR at the XeGPU WG level. This does not require a custom LLVM build and can thus be executed with the standard install.
Adds an XeGPU matrix multiplication example that runs the payload, checks correctness, and measures performance. matmul.py is the main script with a CLI. README.md has installation instructions and usage examples.