
Conversation

rjoursler
Contributor

@rjoursler rjoursler commented Aug 6, 2025

Introduces a prototype IR-based GEMM kernel. Currently, this has only been vetted on well-aligned problems of the form:

--matmul --engine=gpu --dt=bf16:bf16:f32 --stag=ab --wtag=ab --dtag=ab 16x48x272:16x272x8208

Many features are currently missing, most notably data type conversions, dequantization, most post-ops, and handling of "poorly" aligned cases. This PR does implement support for batched GEMM and (common) bias. The prototype implementation can be executed by setting the environment variable enable_gemm_ir=1. Furthermore, problem-specific specialization can be enabled by setting gemm_ir_specialize=1 (although the kernel cache will not work correctly with this knob). For those unfamiliar with IR development, running with DNNL_VERBOSE=debuginfo=200 will log the IR as it is transformed into the final output.
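
For example, combining the knobs above with the vetted shape (assuming the environment variables are read as described), a debug run might look like:

enable_gemm_ir=1 DNNL_VERBOSE=debuginfo=200 ./benchdnn --matmul --engine=gpu --dt=bf16:bf16:f32 --stag=ab --wtag=ab --dtag=ab 16x48x272:16x272x8208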

@rjoursler rjoursler requested a review from a team as a code owner August 6, 2025 20:09
@github-actions github-actions bot added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Aug 6, 2025
@rjoursler rjoursler force-pushed the rjoursle/further_simplify branch 5 times, most recently from bc7e93c to 8d5d831 on August 7, 2025 13:00
Base automatically changed from rjoursle/further_simplify to main August 7, 2025 13:32
@rjoursler rjoursler force-pushed the rjoursle/gemm_ir branch 2 times, most recently from 71c5a97 to a371e67 on August 7, 2025 13:44
Contributor

@echeresh echeresh left a comment

@rjoursler The changes mostly look good to me. I have a few comments about the DSL-based kernel design.

But I think it's better to merge as is and make incremental changes in the future. If you have a specific plan for future changes, please share.

gemm_ir(const gemm_ir_desc_t &desc)
: problem(desc.problem), strategy(desc.strategy) {}

ir::stmt_t build(ir::kernel_iface_t iface, ir::ir_context_t &ctx) {
Contributor

It would be nice to fully move away from the IR "building" paradigm to the "generation" paradigm via the DSL, that is, drop explicit stmt_t usage, eliminate direct IR usage, and use proper naming.

Contributor Author

Do you have a general idea of the interfaces you are intending here? Would it just be to add interfaces like

cl_kernel dsl::build(cl_context ctx, cl_device)

and modify the code to use those interfaces?

Contributor

@echeresh echeresh Aug 12, 2025

I was thinking about an nGEN-style interface (and generator/kernel abstraction):

namespace dsl {
class kernel_t {
public:
    void generate() {
        auto A = alloc(...);
        multiply(A, B, ...);
    }

...
#if WITH_OPENCL
    cl_kernel get_kernel(cl_context, cl_device_id);
#endif
#if WITH_SYCL
    sycl::kernel get_kernel(sycl::context, sycl::device);
#endif
};
}
  • We drop the engine-based dispatching to avoid dependency on oneDNN
  • We drop templates (compared with nGEN)

But other than that, we can have similar configuration capabilities as in nGEN, like dsl::kernel_t::requireGRF(...), etc. What do you think?
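
As an illustration, usage of such a kernel abstraction might look like this (a sketch only; the concrete kernel, context, and device names are hypothetical):

my_gemm_kernel_t k(desc);               // hypothetical concrete kernel built on dsl::kernel_t
k.requireGRF(256);                      // nGEN-style configuration knob
cl_kernel clk = k.get_kernel(ctx, dev); // OpenCL path from the sketch above (cl_context ctx, cl_device_id dev)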

Contributor Author

@rjoursler rjoursler Aug 12, 2025

I think this is a good idea, although I suggest we make the get_kernel functions standalone functions. By doing this, we can consolidate runtime-specific functionality into dedicated headers (if we accumulate multiple runtime-dependent interfaces).
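
As an illustration, the standalone variants might live in dedicated runtime headers along these lines (a sketch; names are hypothetical):

// e.g. a dedicated OpenCL header
#if WITH_OPENCL
cl_kernel get_kernel(const dsl::kernel_t &k, cl_context ctx, cl_device_id dev);
#endif
// e.g. a dedicated SYCL header
#if WITH_SYCL
sycl::kernel get_kernel(const dsl::kernel_t &k, sycl::context ctx, sycl::device dev);
#endif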

Contributor

Sounds good. Just a general comment: it would be nice to have less forwarding and fewer extra functions around this place. This is a relatively simple interface that should just forward arguments to the underlying nGEN generator, so it can be kept simple.

Contributor

@echeresh echeresh Aug 14, 2025

@rjoursler Thanks. I noticed there is still code that could be part of a common DSL kernel abstraction, e.g. this make_kernel includes some steps that should be shared between all DSL kernels:

kernel_t make_kernel(
        const generator_dsl_desc_t &desc, ir::constraint_set_t cset) {
    ir::ir_context_t ctx(desc.exec_cfg, cset);
    ir::trace_start();
    auto k = generator_dsl_t(desc).build(desc.kernel_iface(), ctx);
    ir::trace_pass("build generator_dsl_t", k.body, ctx);
    k.body = ir::simplify(k.body, ctx);
    k.body = ir::inject_send(k.body, ctx);
    // TODO: This should be unnecessary as it could happen at codegen
    k.body = ir::fixup_if_conditions(k.body, ctx);
    k.body = ir::eliminate_common_subexprs(
            k.body, ctx, desc.strategy.GRFs * ctx.hw().grf_size());
    return k;
}

What do you think about following the nGEN approach where a generator/kernel is derived from the base generator/kernel and we end up with most things hidden and implemented in the base class:

class dsl_kernel_t {
public:
    virtual void generate() = 0;

    // OpenCL here as an example.
    cl_kernel get_ocl_kernel() {
        declare_kernel(...);
        generate();
        auto stmt = end_kernel();
        stmt = ir_pass(stmt);
        ...
        return make_ocl_kernel(stmt, ...);
    }
};

class my_kernel_t : public dsl_kernel_t {
public:
    my_kernel_t(my_desc_t desc): desc(desc) {}
    
    void generate() override {
        // kernel code
    }
private:
    my_desc_t desc;
};

// Usage:
my_kernel_t kernel(desc);
auto cl_kernel = kernel.get_ocl_kernel();

Why I think this is important: I anticipate we will try to implement fusions via DSL and having such interfaces would be a great help to quickly start implementing the algorithm without dealing with unnecessary boilerplate.

I'm fine to merge as is and adjust in following iterations. But I think it'd be good to agree on how it should look in the final version.

Contributor Author

@rjoursler rjoursler Aug 14, 2025

@echeresh, I was thinking about this some yesterday and came away against using an abstract interface for this purpose. The purpose of the abstract interface is to gain access to a callable object, but it will be more flexible and require less boilerplate to directly target callable objects. For example, I think it would be much better to implement your sample function as

template <typename F>
cl_kernel get_ocl_kernel(F&& generate) {
    declare_kernel(...);
    generate();
    auto stmt = end_kernel();
    ...

By doing this, usage can be as simple as implementing a lambda:

cl_kernel get_ocl_kernel([]{
      A = dsl::arg("A");
      ...
});

Just to give an overview of how I was designing this interface: I have been treating it similarly to the compile -> link paradigm. While this is an inexact analogy, under this paradigm we have source code + compiler options -> object file, and object file(s) + link options -> executable. This creates the following mapping:

source code -> kernel_iface + generate
compile options -> exec_config_t + ir::ctx()
object file -> dsl::kernel_t
link options -> ngen::DebugInfo
executable -> cl_kernel, sycl::kernel, etc

I believe we should wrap the compiler options and link options into their own structures as we work to finalize these interfaces. As part of this, I think the standard IR transformations you mentioned would largely be relegated to a list of passes within the compile options and applied during the call to end_kernel().
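
A rough sketch of what these option structures might look like under the mapping above (all names are illustrative, not final interfaces):

// Hypothetical sketch of the compile -> link split described above.
struct compile_options_t {
    exec_config_t exec_cfg;        // execution configuration (with ir::ctx())
    std::vector<ir_pass_t> passes; // standard IR transformations, applied in end_kernel()
};
struct link_options_t {
    ngen::DebugInfo debug_info;    // per the mapping above
};
// "source code" (kernel_iface + generate) + "compile options" -> "object file"
dsl::kernel_t compile(const kernel_iface_t &iface, const std::function<void()> &generate,
        const compile_options_t &options);
// "object file" + "link options" -> executable
cl_kernel link(const dsl::kernel_t &kernel, const link_options_t &options,
        cl_context ctx, cl_device_id dev);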

Contributor

@echeresh echeresh Aug 14, 2025

but it will be more flexible and require less boilerplate to directly target callable objects

Can you please be more specific here: what do we gain? My snippet is quite simple - no boilerplate and pretty clear where to put kernel code. I read your suggestion as moving away from the object model to lambdas, but IMO it's mostly a style choice. Having nGEN-style kernel generators in oneDNN is already idiomatic, so why steer away from it?

Moreover, for example, in your snippet:

cl_kernel get_ocl_kernel([]{
      A = dsl::arg("A");
      ...
});

The kernel is implemented in a lambda, while in practice we probably won't be writing kernels this way, as any moderately complex kernel would require some internal state to implement its logic - so we would probably continue implementing kernel generators in dedicated classes. In this case, introducing a base DSL kernel class seems reasonable to me.

I think the standard ir transformations you mentioned would largely be relegated to a list of passes within the compile options

As a side comment: I believe we should focus on a default set of passes, to keep things simple from the beginning. I don't like the idea of allowing too much customization, due to complexity. At the same time, customization can be useful if it is kept under tighter control. My main concern is that development won't scale well if we assume every kernel will have its own customization of passes - so I would assume the opposite.

Contributor Author

@rjoursler rjoursler Aug 14, 2025

My snippet is quite simple - no boilerplate and pretty clear where to put kernel code.

All the class setup is unnecessary boilerplate. Consider a simple fusion kernel: the callable-object-based approach is just 4 lines of code, while the structure-based approach is more like 12.

// Structure based approach
class simple_fused_kernel_t : public dsl_kernel_t {
public:
    simple_fused_kernel_t(kernel1_desc_t desc1, kernel2_desc_t desc2)
        : kernel1(desc1), kernel2(desc2) {}

    void generate() override {
        kernel1.generate();
        kernel2.generate();
    }
private:
    kernel1_t kernel1;
    kernel2_t kernel2;
};

// Callable object based API
void generate_simple_fused(kernel1_desc_t desc1, kernel2_desc_t desc2) {
    generate_kernel1(desc1);
    generate_kernel2(desc2);
}

I read your suggestion as moving away from the object model to lambdas but IMO, it's mostly a style choice.

I mostly agree, but the purpose of the DSL is to be more like a higher-level language. Up to now the DSL does not use inheritance in order to align with this, so injecting it at this point is not aligned with the DSL style. In addition, the structure-based approach is not how OpenCL or SYCL programming works, whereas a function-based API is very similar to SYCL programming.
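
For reference, this is how standard SYCL expresses a kernel as a callable handed to the runtime (shown purely for comparison; here q is a sycl::queue and a, b, c are USM pointers):

q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { c[i] = a[i] + b[i]; });
});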

Contributor

@echeresh echeresh Aug 14, 2025

All the class setup is unnecessary boilerplate. Consider a simple fusion kernel: the callable-object-based approach is just 4 lines of code, while the structure-based approach is more like 12.

I have two comments here:

  • Are such kernels our main interest? oneDNN kernels are usually more complex; I would target them. We can't implement complex kernels in simple lambdas.
  • This example uses fusion - I agree that for fusion we need an internal API for composability (I think we discussed that). But it's not kernel-level reuse.
    • It won't be simple stacking like in the example above - we have to forward arguments properly, etc. Rather, something like gemm(A, B, C). We usually can't reuse whole kernels.

I mostly agree, but the purpose of the DSL is to be more like a higher-level language. Up to now the DSL does not use inheritance, so injecting it at this point is not aligned with the DSL style. In addition, the structure-based approach is not how OpenCL or SYCL programming works, whereas a function-based API is very similar to SYCL programming.

Well, it's a style choice to me. I wouldn't say SYCL is closer here; to me the DSL is closer to nGEN, as both are runtime-configurable and JITted at runtime (i.e. every nGEN instruction or DSL construct execution results in some code being produced). In contrast, any high-level SYCL/OpenCL kernel is parsed and compiled by the compiler at once. Having a compiler means, for example, that in SYCL we have an implicit kernel ABI or can rely on compiler-assisted attributes to control kernel behavior. DSL/nGEN kernels are strictly API-configurable.

I'm fine to start with lambdas and check after a while. If generators start re-implementing the same things, I suggest switching to a base class.

@rjoursler rjoursler force-pushed the rjoursle/gemm_ir branch 5 times, most recently from afbac73 to 75a8caf on August 13, 2025 21:42
The default initializer prevents aggregate initialization in C++11.
gemm_ir(const gemm_ir_desc_t &desc)
: problem(desc.problem), strategy(desc.strategy) {}

ir::stmt_t build(ir::kernel_iface_t iface, ir::ir_context_t &ctx) {

@rjoursler
Contributor Author

make test
disable test_device_cpu

}
auto k = make_kernel(gemm_desc, cset);
return engine->create_kernel(kernel, k);
}
Contributor

Let's refactor this a bit and put the DSL kernel generation code in one method (get_kernel_dsl, say) and the "classic" code generation in another (get_kernel_classic, say; not sure if that's a good name).
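
Roughly along these lines (a sketch only; the surrounding signatures are illustrative):

// DSL-based path (sketch, using the code from this change)
status_t get_kernel_dsl(...) {
    auto k = make_kernel(gemm_desc, cset);
    return engine->create_kernel(kernel, k);
}
// Existing nGEN-based path (sketch)
status_t get_kernel_classic(...) { ... }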

strategy_.kParallelLocal = false;
strategy_.kInterleave = false;
}
}
Contributor

@petercad petercad Aug 14, 2025

Maybe add a quick comment here to explain ("these strategies not supported yet via DSL, automatically disable for the benefit of code below" or something like that).

It might be nice to factor this out into its own routine so you can update it as DSL capabilities evolve.
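
For example, something like the following (a sketch; the strategy type name is hypothetical, members taken from the context above):

// These strategies are not supported yet via the DSL; disable them so the
// code below can assume they are off. Update as DSL capabilities evolve.
void restrict_strategy_for_dsl(strategy_t &strategy) {
    strategy.kParallelLocal = false;
    strategy.kInterleave = false;
}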

auto b_size = types::data_type_size(pd()->eff_b_type());
gpu_assert(stride_b * b_size % base_alignment == 0)
<< "Unimplemented load transform";
}
Contributor

Shouldn't this check go in pd init?

Contributor Author

@rjoursler rjoursler Aug 14, 2025

Once this starts being used in production, yes. My current goal here is that executing

enable_generator_dsl=1 ./benchdnn --matmul ...

will result in a test failure, so that I can accurately track what is functionally enabled rather than silently falling back to and testing the existing implementation.

Contributor

Why not move it to pd initialization and either:

  1. Error when dsl fails instead of returning unimplemented (i.e. move this code as-is into pd init)
  2. Add a new env variable like require_generator_dsl that throws an error if we're not using the dsl implementation - to separate using the dsl implementation from testing it for coverage (see the sketch below).
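
A rough sketch of option 2 (the env var name is from the comment above; the using_dsl_impl flag is hypothetical):

const char *env = std::getenv("require_generator_dsl");
const bool require_dsl = env && atoi(env) != 0;
gpu_assert(!require_dsl || using_dsl_impl)
        << "generator_dsl was required but a different implementation was selected";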

Contributor

@echeresh echeresh Aug 15, 2025

@rjoursler As an option, we could align this with the existing approach for v2 convolution and rely on --impl=dsl to test this implementation explicitly. For that we also need to change the implementation name in the DSL case (which can also be helpful when looking at verbose logs).

This approach is more explicit - we'll know at PD creation time what implementation is picked up.

However, for initial prototyping and enabling, any approach is fine with me.
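
For example, once the DSL implementation gets its own name, explicit testing might look like this (reusing the shape from the PR description):

./benchdnn --matmul --engine=gpu --impl=dsl --dt=bf16:bf16:f32 --stag=ab --wtag=ab --dtag=ab 16x48x272:16x272x8208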

Contributor Author

@rjoursler rjoursler Aug 15, 2025

@Simonsays095 It looks like your suggestion doesn't work: oneDNN just catches the error and falls back to the next implementation. I am going to leave this as is, and we can remove it in the future.

@echeresh Agreed, although I plan to finalize the Gemmstone API for this feature before doing that work. In particular, I plan to finalize the API during upstreaming to ensure compatibility with Gemmstone requirements.

@kealan-barbieri
Contributor

kealan-barbieri commented Aug 18, 2025

It may be worth having a small DSL README like we do for IR.

@rjoursler
Contributor Author

make test
disable test_device_cpu

@rjoursler
Contributor Author

It may be worth having a small DSL README like we do for IR.

@kealan-barbieri Agreed, but I think this will be reserved for future work.

@rjoursler rjoursler merged commit 4d8ab88 into main Aug 19, 2025
9 of 10 checks passed
@rjoursler rjoursler deleted the rjoursle/gemm_ir branch August 19, 2025 20:49