
Conversation

@tkarna (Contributor) commented Nov 25, 2025

Adds an XeGPU matrix multiplication example that runs the payload, checks correctness, and measures performance.

  • matmul.py is the main script with CLI.
  • README.md has installation instructions and usage examples.

@tkarna force-pushed the xegpu-matmul-example branch from 2ba2713 to 58e3f28 on November 25, 2025 21:08
@rolfmorel (Contributor) left a comment

Nice! Left some comments inline.

@rolfmorel (Contributor) commented

For the failing CI, have a look at adding both a RUN: ... and a REQUIRES: ... line to the runnable files: https://llvm.org/docs/TestingGuide.html#constraining-test-execution and https://llvm.org/docs/TestingGuide.html#tips-for-writing-constraints

For the non-runnable files, add a file lit.local.cfg alongside with an excludes clause, e.g. https://github.com/llvm/lighthouse/blob/2f3e92f88ebb98ddddd29239b3aacc8d07e9ef28/python/examples/ingress/torch/lit.local.cfg
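
For illustration, a minimal sketch of what that could look like; the %PYTHON substitution and the "xegpu-runner" feature keyword are assumptions modeled on MLIR's own Python tests, not taken from this repository:

# At the top of a runnable example file (e.g. matmul.py):
# RUN: %PYTHON %s
# REQUIRES: xegpu-runner
# ("xegpu-runner" is a hypothetical feature keyword; use whatever feature the
#  lit config defines for the GPU runtime. %PYTHON mirrors MLIR's Python tests.)

# lit.local.cfg placed alongside non-runnable helper modules, excluding them
# from test discovery (the file name below is illustrative):
config.excludes = ["common.py"]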

one = arith.constant(index_t, 1)
nwarmup_cst = arith.constant(index_t, nwarmup)
for i in scf.for_(zero, nwarmup_cst, one):
    # FIXME(upstream): func.call is broken for this use case?
@rolfmorel (Contributor) commented Nov 26, 2025

Can confirm that the CamelCase op class is subclassed by hand-written code (and that this subclass shadows the autogen-ed CamelCase version), while the autogen-ed snake_case wrapper is not overridden. Hence using func.call returns the autogen-ed CallOp and not its hand-written subclass.
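
A minimal sketch of a workaround, assuming the upstream MLIR Python bindings (the function names below are illustrative): constructing func.CallOp directly picks up the hand-written subclass, whereas the snake_case func.call wrapper builds the autogen-ed op.

from mlir.ir import Context, Location, Module, InsertionPoint, F32Type
from mlir.dialects import func

with Context(), Location.unknown():
    module = Module.create()
    f32 = F32Type.get()
    with InsertionPoint(module.body):

        @func.FuncOp.from_py_func(f32, name="payload")
        def payload(x):
            return x

        @func.FuncOp.from_py_func(f32, name="driver")
        def driver(x):
            # Construct the CallOp subclass directly rather than going through
            # the autogen-ed func.call wrapper.
            # (from_py_func attaches the created FuncOp as .func_op)
            call = func.CallOp(payload.func_op, [x])
            return call.results[0]

    print(module)

The hand-written CallOp constructor accepts a FuncOp followed by an argument list, which is what the direct form relies on above.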

fargs.append(memref_c_t)

@func.func(*fargs, name=func_name)
def payload(*args):
Contributor commented

To demonstrate the full end-to-end flow, we should have a look at how easy it is to get IR in the following form automatically converted into a form that works with your schedule:

module {
  func.func @main(%arg0: tensor<2048x8192xf32>, %arg1: tensor<8192x4096xf32>) -> tensor<2048x4096xf32> {
    %cst = arith.constant 0.000000e+00 : f32
    %0 = tensor.empty() : tensor<2048x4096xf32>
    %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<2048x4096xf32>) -> tensor<2048x4096xf32>
    %2 = linalg.matmul ins(%arg0, %arg1 : tensor<2048x8192xf32>, tensor<8192x4096xf32>) outs(%1 : tensor<2048x4096xf32>) -> tensor<2048x4096xf32>
    return %2 : tensor<2048x4096xf32>
  }
}

That is, the above is the IR we get from torch-mlir from a basic matmul in Torch: https://github.com/ScalingIntelligence/KernelBench/blob/5c88b2319076e8d44b9901914de7b45d220944e9/KernelBench/level1/2_Standard_matrix_multiplication_.py
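
For reference, a minimal sketch of how IR like the above can be obtained in-process; this assumes a torch-mlir build whose fx.export_and_import accepts an output_type argument, and the shapes follow the KernelBench example:

import torch
from torch_mlir import fx

class Matmul(torch.nn.Module):
    def forward(self, a, b):
        return torch.matmul(a, b)

# Export through the FX importer and lower to linalg-on-tensors.
module = fx.export_and_import(
    Matmul(),
    torch.randn(2048, 8192),
    torch.randn(8192, 4096),
    output_type="linalg-on-tensors",
)
print(module)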

@rolfmorel (Contributor) commented Nov 26, 2025
That would make the demonstration "fully upstream" and "fully end-to-end."

@tkarna (Contributor Author) commented

Yes. The above kernel bufferizes to a kernel that allocates the output memref. In this case, we'd need a mechanism to convert the alloc, e.g., to a gpu alloc if necessary. And we'd need to track the allocated buffers and deallocate them later.

If the kernel updates a tensor in-place, it's a little trickier. At the tensor level the function must return the updated tensor. This return value becomes redundant after bufferization. In fact, the return value can cause a copy, in which case the input and output memrefs are different, breaking the semantics. We could drop the return value after bufferization, but that changes the function signature, which is often not desirable.

This matmul example demonstrates an update-in-place kernel. In this case it's easiest if we define the function boundary with memrefs and keep it fixed.
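
For concreteness, a minimal sketch of such a fixed memref function boundary via the upstream Python bindings; the shapes and the payload name are illustrative, not taken from matmul.py:

from mlir.ir import Context, Location, Module, InsertionPoint, F32Type, MemRefType
from mlir.dialects import func, linalg

with Context(), Location.unknown():
    module = Module.create()
    f32 = F32Type.get()
    a_t = MemRefType.get((2048, 8192), f32)
    b_t = MemRefType.get((8192, 4096), f32)
    c_t = MemRefType.get((2048, 4096), f32)
    with InsertionPoint(module.body):

        @func.FuncOp.from_py_func(a_t, b_t, c_t, name="payload")
        def payload(a, b, c):
            # C is updated in place: no tensor result, fixed memref boundary.
            linalg.matmul(a, b, outs=[c])

    print(module)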

@tkarna (Contributor Author) commented

For autotuning, we cannot use kernels that allocate the output buffer. So yes, we'd need to find a way to convert, say, a torch-mlir kernel to in-place-update semantics. Should not be too hard actually, as the input/output roles of the arguments are clear.

Contributor commented

For simple kernels like this, there are bufferization passes that can help with that. However, we might also have to explore more robust solutions long-term.
I'd keep the current example as-is and iterate later.
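
As a rough, untested sketch of that direction, upstream passes like one-shot-bufferize and buffer-results-to-out-params could be chained from Python; the pass options below are illustrative, not what this PR uses:

from mlir.ir import Context, Module
from mlir.passmanager import PassManager

def to_out_params(ir_text: str):
    """Bufferize a returned-tensor kernel and turn its result into an
    out-parameter, i.e. in-place-update semantics."""
    with Context():
        module = Module.parse(ir_text)
        pm = PassManager.parse(
            "builtin.module("
            "one-shot-bufferize{bufferize-function-boundaries=1},"
            "buffer-results-to-out-params,"
            "canonicalize,cse)"
        )
        pm.run(module.operation)
        return module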

@tkarna force-pushed the xegpu-matmul-example branch 3 times, most recently from 2c0acab to 163e4a7 on November 27, 2025 14:08
@tkarna force-pushed the xegpu-matmul-example branch from 163e4a7 to fbcc453 on November 27, 2025 14:10
@tkarna (Contributor Author) commented Nov 27, 2025

> For the failing CI, have a look at adding both a RUN: ... and a REQUIRES: ... line to the runnable files: https://llvm.org/docs/TestingGuide.html#constraining-test-execution and https://llvm.org/docs/TestingGuide.html#tips-for-writing-constraints
>
> For the non-runnable files, add a file lit.local.cfg alongside with an excludes clause, e.g. https://github.com/llvm/lighthouse/blob/2f3e92f88ebb98ddddd29239b3aacc8d07e9ef28/python/examples/ingress/torch/lit.local.cfg

I've set up CI such that it just dumps the IR at the XeGPU WG level. This does not require a custom LLVM build and can thus be executed with the standard install.
