
Conversation

rjoursler
Contributor

@rjoursler rjoursler commented Aug 6, 2025

Introduces a prototype IR-based GEMM kernel. Currently, this has only been vetted on well-aligned problems of the form:

--matmul --engine=gpu --dt=bf16:bf16:f32 --stag=ab --wtag=ab --dtag=ab 16x48x272:16x272x8208

Many features are currently missing, most notably data type conversions, dequantization, most post-ops, and handling of "poorly" aligned cases. This PR does implement support for batched GEMM and (common) bias. The prototype implementation can be executed by setting the environment variable enable_gemm_ir=1. Furthermore, problem-specific specialization can be enabled by setting gemm_ir_specialize=1 (although the kernel cache will not work correctly with this knob). For those unfamiliar with IR development, running with DNNL_VERBOSE=debuginfo=200 will log the IR as it is transformed into the final output.
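
For example, combining the knobs above with the vetted shape (assuming the environment variables are read as described), a debug run might look like:

enable_gemm_ir=1 DNNL_VERBOSE=debuginfo=200 ./benchdnn --matmul --engine=gpu --dt=bf16:bf16:f32 --stag=ab --wtag=ab --dtag=ab 16x48x272:16x272x8208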

@rjoursler rjoursler requested a review from a team as a code owner August 6, 2025 20:09
@github-actions github-actions bot added the platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel label Aug 6, 2025
@rjoursler rjoursler force-pushed the rjoursle/further_simplify branch 5 times, most recently from bc7e93c to 8d5d831 on August 7, 2025 13:00
Base automatically changed from rjoursle/further_simplify to main August 7, 2025 13:32
@rjoursler rjoursler force-pushed the rjoursle/gemm_ir branch 2 times, most recently from 71c5a97 to a371e67 on August 7, 2025 13:44
Contributor

@echeresh echeresh left a comment

@rjoursler The changes mostly look good to me. I have a few comments about the DSL-based kernel design.

But I think it's better to merge as is and make incremental changes in the future. If you have a specific plan for future changes, please share.

gemm_ir(const gemm_ir_desc_t &desc)
: problem(desc.problem), strategy(desc.strategy) {}

ir::stmt_t build(ir::kernel_iface_t iface, ir::ir_context_t &ctx) {
Contributor

It would be nice to fully move away from the IR "building" paradigm to the "generation" paradigm via the DSL, that is, drop explicit stmt_t usage, eliminate direct IR usage, and use proper naming.

Contributor Author

Do you have a general idea of the interfaces you are intending here? Would it just be to add interfaces like

cl_kernel dsl::build(cl_context ctx, cl_device)

and modify the code to use those interfaces?

Contributor

@echeresh echeresh Aug 12, 2025

I was thinking about an nGEN-style interface (and generator/kernel abstraction):

namespace dsl {
class kernel_t {
public:
    void generate() {
        auto A = alloc(...);
        multiply(A, B, ...);
    }

...
#if WITH_OPENCL
    cl_kernel get_kernel(cl_context, cl_device_id);
#endif
#if WITH_SYCL
    sycl::kernel get_kernel(sycl::context, sycl::device);
#endif
};
}
  • We drop the engine-based dispatching to avoid dependency on oneDNN
  • We drop templates (compared with nGEN)

But other than that, we can have similar configuration capabilities as in nGEN, like dsl::kernel_t::requireGRF(...), etc. What do you think?
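
As an illustration, usage of such a kernel abstraction might look like this (a sketch only; the concrete kernel, context, and device names are hypothetical):

my_gemm_kernel_t k(desc);               // hypothetical concrete kernel built on dsl::kernel_t
k.requireGRF(256);                      // nGEN-style configuration knob
cl_kernel clk = k.get_kernel(ctx, dev); // OpenCL path from the sketch above (cl_context ctx, cl_device_id dev)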

Contributor Author

@rjoursler rjoursler Aug 12, 2025

I think this is a good idea, although I suggest we make the get_kernel functions standalone functions. By doing this, we can consolidate runtime-specific functionality into dedicated headers (if we accumulate multiple runtime-dependent interfaces).
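
As an illustration, the standalone variants might live in dedicated runtime headers along these lines (a sketch; names are hypothetical):

// e.g. a dedicated OpenCL header
#if WITH_OPENCL
cl_kernel get_kernel(const dsl::kernel_t &k, cl_context ctx, cl_device_id dev);
#endif
// e.g. a dedicated SYCL header
#if WITH_SYCL
sycl::kernel get_kernel(const dsl::kernel_t &k, sycl::context ctx, sycl::device dev);
#endif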

Contributor

Sounds good. Just a general comment: it would be nice to have less forwarding and fewer extra functions around this place. This is a relatively simple interface that should just forward arguments to the underlying nGEN generator, so it can be kept simple.

Contributor

@echeresh echeresh Aug 14, 2025

@rjoursler Thanks. I noticed there is still code that could be part of a common DSL kernel abstraction, e.g. this make_kernel includes some steps that should be shared between all DSL kernels:

kernel_t make_kernel(
        const generator_dsl_desc_t &desc, ir::constraint_set_t cset) {
    ir::ir_context_t ctx(desc.exec_cfg, cset);
    ir::trace_start();
    auto k = generator_dsl_t(desc).build(desc.kernel_iface(), ctx);
    ir::trace_pass("build generator_dsl_t", k.body, ctx);
    k.body = ir::simplify(k.body, ctx);
    k.body = ir::inject_send(k.body, ctx);
    // TODO: This should be unnecessary as it could happen at codegen
    k.body = ir::fixup_if_conditions(k.body, ctx);
    k.body = ir::eliminate_common_subexprs(
            k.body, ctx, desc.strategy.GRFs * ctx.hw().grf_size());
    return k;
}

What do you think about following the nGEN approach where a generator/kernel is derived from the base generator/kernel and we end up with most things hidden and implemented in the base class:

class dsl_kernel_t {
public:
    virtual void generate() = 0;

    // OpenCL here as an example.
    cl_kernel get_ocl_kernel() {
        declare_kernel(...);
        generate();
        auto stmt = end_kernel();
        stmt = ir_pass(stmt);
        ...
        return make_ocl_kernel(stmt, ...);
    }
};

class my_kernel_t : public dsl_kernel_t {
public:
    my_kernel_t(my_desc_t desc): desc(desc) {}
    
    void generate() override {
        // kernel code
    }
private:
    my_desc_t desc;
};

// Usage:
my_kernel_t kernel(desc);
auto cl_kernel = kernel.get_ocl_kernel();

Why I think this is important: I anticipate we will try to implement fusions via DSL and having such interfaces would be a great help to quickly start implementing the algorithm without dealing with unnecessary boilerplate.

I'm fine to merge as is and adjust in following iterations. But I think it'd be good to agree on how it should look in the final version.

Contributor Author

@rjoursler rjoursler Aug 14, 2025

@echeresh, I was thinking about this some yesterday and came away against using an abstract interface for this purpose. The purpose of the abstract interface is to gain access to a callable object, but it will be more flexible and require less boilerplate to directly target callable objects. For example, I think it would be much better to implement your sample function as

template <typename F>
cl_kernel get_ocl_kernel(F&& generate) {
    declare_kernel(...);
    generate();
    auto stmt = end_kernel();
    ...

By doing this, usage can be as simple as implementing a lambda:

cl_kernel get_ocl_kernel([]{
      A = dsl::arg("A");
      ...
});

Just to give an overview of how I was designing this interface: I have been treating it similarly to the compile -> link paradigm. While this is an inexact analogy, under this paradigm we have source code + compiler options -> object file, and object file(s) + link options -> executable. This creates the following mapping:

source code -> kernel_iface + generate
compile options -> exec_config_t + ir::ctx()
object file -> dsl::kernel_t
link options -> ngen::DebugInfo
executable -> cl_kernel, sycl::kernel, etc

I believe we should wrap the compiler options and link options into their own structures as we work to finalize these interfaces. As part of this, I think the standard IR transformations you mentioned would largely be relegated to a list of passes within the compile options and applied during the call to end_kernel().
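
A rough sketch of what these option structures might look like under the mapping above (all names are illustrative, not final interfaces):

// Hypothetical sketch of the compile -> link split described above.
struct compile_options_t {
    exec_config_t exec_cfg;        // execution configuration (with ir::ctx())
    std::vector<ir_pass_t> passes; // standard IR transformations, applied in end_kernel()
};
struct link_options_t {
    ngen::DebugInfo debug_info;    // per the mapping above
};
// "source code" (kernel_iface + generate) + "compile options" -> "object file"
dsl::kernel_t compile(const kernel_iface_t &iface, const std::function<void()> &generate,
        const compile_options_t &options);
// "object file" + "link options" -> executable
cl_kernel link(const dsl::kernel_t &kernel, const link_options_t &options,
        cl_context ctx, cl_device_id dev);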

Contributor

@echeresh echeresh Aug 14, 2025

but it will be more flexible and require less boilerplate to directly target callable objects

Can you please be more specific here: what do we gain? My snippet is quite simple - no boilerplate and pretty clear where to put kernel code. I read your suggestion as moving away from the object model to lambdas, but IMO it's mostly a style choice. Having nGEN-style kernel generators in oneDNN is already idiomatic, so why steer away from it?

Moreover, for example, in your snippet:

cl_kernel get_ocl_kernel([]{
      A = dsl::arg("A");
      ...
});

The kernel is implemented in a lambda, while in practice we probably won't be writing kernels this way, as any moderately complex kernel would require some internal state to implement its logic - so we would probably continue implementing kernel generators in dedicated classes. In this case, introducing a base DSL kernel class seems reasonable to me.

I think the standard ir transformations you mentioned would largely be relegated to a list of passes within the compile options

As a side comment: I believe we should focus on a default set of passes, to keep things simple from the beginning. I don't like the idea of allowing too much customization, due to complexity. At the same time, customization can be useful if it is kept under tighter control. My main concern is that development won't scale well if we assume every kernel will have its own customization of passes - so I would assume the opposite.

Contributor Author

@rjoursler rjoursler Aug 14, 2025

My snippet is quite simple - no boilerplate and pretty clear where to put kernel code.

All the class setup is unnecessary boilerplate. Consider a simple fusion kernel: the callable-object-based approach is just 4 lines of code, while the structure-based approach is more like 12.

// Structure based approach
class simple_fused_kernel_t : public dsl_kernel_t {
public:
    simple_fused_kernel_t(kernel1_desc_t desc1, kernel2_desc_t desc2)
        : kernel1(desc1), kernel2(desc2) {}

    void generate() override {
        kernel1.generate();
        kernel2.generate();
    }
private:
    kernel1_t kernel1;
    kernel2_t kernel2;
};

// Callable object based API
void generate_simple_fused(kernel1_desc_t desc1, kernel2_desc_t desc2) {
    generate_kernel1(desc1);
    generate_kernel2(desc2);
}

I read your suggestion as moving away from the object model to lambdas but IMO, it's mostly a style choice.

I mostly agree, but the purpose of the DSL is to be more like a higher-level language. Up to now the DSL does not use inheritance in order to align with this, so injecting it at this point is not aligned with the DSL style. In addition, the structure-based approach is not how OpenCL or SYCL programming works, whereas a function-based API is very similar to SYCL programming.
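
For reference, this is how standard SYCL expresses a kernel as a callable handed to the runtime (shown purely for comparison; here q is a sycl::queue and a, b, c are USM pointers):

q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { c[i] = a[i] + b[i]; });
});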

Contributor

@echeresh echeresh Aug 14, 2025

All the class setup is unnecessary boilerplate. Consider a simple fusion kernel: the callable-object-based approach is just 4 lines of code, while the structure-based approach is more like 12.

I have two comments here:

  • Are such kernels our main interest? oneDNN kernels are usually more complex; I would target them. We can't implement complex kernels in simple lambdas.
  • This example uses fusion - I agree that for fusion we need an internal API for composability (I think we discussed that). But it's not kernel-level reuse.
    • It won't be simple stacking like in the example above - we have to forward arguments properly, etc. Rather, something like gemm(A, B, C). We usually can't reuse whole kernels.

I mostly agree, but the purpose of the DSL is to be more like a higher-level language. Up to now the DSL does not use inheritance, so injecting it at this point is not aligned with the DSL style. In addition, the structure-based approach is not how OpenCL or SYCL programming works, whereas a function-based API is very similar to SYCL programming.

Well, it's a style choice to me. I wouldn't say SYCL is closer here; to me the DSL is closer to nGEN, as both are runtime-configurable and JITted at runtime (i.e. every nGEN instruction or DSL construct execution results in some code being produced). In contrast, any high-level SYCL/OpenCL kernel is parsed and compiled by the compiler at once. Having a compiler means, for example, that in SYCL we have an implicit kernel ABI or can rely on compiler-assisted attributes to control kernel behavior. DSL/nGEN kernels are strictly API-configurable.

I'm fine to start with lambdas and check after a while. If generators start re-implementing the same things, I suggest switching to a base class.

@rjoursler rjoursler force-pushed the rjoursle/gemm_ir branch 5 times, most recently from afbac73 to 75a8caf on August 13, 2025 21:42
The default initializer prevents aggregate initialization in C++11.
gemm_ir(const gemm_ir_desc_t &desc)
: problem(desc.problem), strategy(desc.strategy) {}

ir::stmt_t build(ir::kernel_iface_t iface, ir::ir_context_t &ctx) {

@rjoursler
Contributor Author

make test
disable test_device_cpu

}
auto k = make_kernel(gemm_desc, cset);
return engine->create_kernel(kernel, k);
}
Contributor

Let's refactor this a bit and put the DSL kernel generation code in one method (get_kernel_dsl, say) and the "classic" code generation in another (get_kernel_classic, say; not sure if that's a good name).
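
Roughly along these lines (a sketch only; the surrounding signatures are illustrative):

// DSL-based path (sketch, using the code from this change)
status_t get_kernel_dsl(...) {
    auto k = make_kernel(gemm_desc, cset);
    return engine->create_kernel(kernel, k);
}
// Existing nGEN-based path (sketch)
status_t get_kernel_classic(...) { ... }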

strategy_.kParallelLocal = false;
strategy_.kInterleave = false;
}
}
Contributor

@petercad petercad Aug 14, 2025

Maybe add a quick comment here to explain ("these strategies not supported yet via DSL, automatically disable for the benefit of code below" or something like that).

It might be nice to factor this out into its own routine so you can update it as DSL capabilities evolve.
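
For example, something like the following (a sketch; the strategy type name is hypothetical, members taken from the context above):

// These strategies are not supported yet via the DSL; disable them so the
// code below can assume they are off. Update as DSL capabilities evolve.
void restrict_strategy_for_dsl(strategy_t &strategy) {
    strategy.kParallelLocal = false;
    strategy.kInterleave = false;
}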

auto b_size = types::data_type_size(pd()->eff_b_type());
gpu_assert(stride_b * b_size % base_alignment == 0)
<< "Unimplemented load transform";
}
Contributor

Shouldn't this check go in pd init?

Contributor Author

@rjoursler rjoursler Aug 14, 2025

Once this starts being used in production, yes. My current goal here is that executing

enable_generator_dsl=1 ./benchdnn --matmul ...

will result in a test failure, so that I can accurately track what is functionally enabled rather than silently falling back to and testing the existing implementation.

Contributor

Why not move it to pd initialization and either:

  1. Error when dsl fails instead of returning unimplemented (i.e. move this code as-is into pd init)
  2. Add a new env variable like require_generator_dsl that throws an error if we're not using the dsl implementation - to separate using the dsl implementation from testing it for coverage (see the sketch below).
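
A rough sketch of option 2 (the env var name is from the comment above; the using_dsl_impl flag is hypothetical):

const char *env = std::getenv("require_generator_dsl");
const bool require_dsl = env && atoi(env) != 0;
gpu_assert(!require_dsl || using_dsl_impl)
        << "generator_dsl was required but a different implementation was selected";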

Contributor

@echeresh echeresh Aug 15, 2025

@rjoursler As an option, we could align this with the existing approach for v2 convolution and rely on --impl=dsl to test this implementation explicitly. For that we also need to change the implementation name in the DSL case (which can also be helpful when looking at verbose logs).

This approach is more explicit - we'll know at PD creation time what implementation is picked up.

However, for initial prototyping and enabling, any approach is fine with me.
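
For example, once the DSL implementation gets its own name, explicit testing might look like this (reusing the shape from the PR description):

./benchdnn --matmul --engine=gpu --impl=dsl --dt=bf16:bf16:f32 --stag=ab --wtag=ab --dtag=ab 16x48x272:16x272x8208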

Contributor Author

@rjoursler rjoursler Aug 15, 2025

@Simonsays095 It looks like your suggestion doesn't work: oneDNN just catches the error and falls back to the next implementation. I am going to leave this as is, and we can remove it in the future.

@echeresh Agreed, although I plan to finalize the Gemmstone API for this feature before doing that work. In particular, I plan to finalize the API during upstreaming to ensure compatibility with Gemmstone requirements.

@kealan-barbieri
Contributor

kealan-barbieri commented Aug 18, 2025

It may be worth having a small DSL README like we do for IR.

@rjoursler
Contributor Author

make test
disable test_device_cpu

@rjoursler
Contributor Author

It may be worth having a small DSL README like we do for IR.

@kealan-barbieri Agreed, but I think this will be reserved for future work.

@rjoursler rjoursler merged commit 4d8ab88 into main Aug 19, 2025
9 of 10 checks passed
@rjoursler rjoursler deleted the rjoursle/gemm_ir branch August 19, 2025 20:49