xe: jit: gemm: introduce prototype IR based gemm #3729
Conversation
Force-pushed from bc7e93c to 8d5d831
Force-pushed from 71c5a97 to a371e67
@rjoursler The changes mostly look good to me. I have a few comments about the DSL-based kernel design.
But I think it's better to merge as is, and do incremental changes in the future. If you have a specific plan for the future changes, please share.
gemm_ir(const gemm_ir_desc_t &desc)
    : problem(desc.problem), strategy(desc.strategy) {}

ir::stmt_t build(ir::kernel_iface_t iface, ir::ir_context_t &ctx) {
It would be nice to fully move away from the IR "building" to the "generation" paradigm via DSL, that is, drop explicit stmt_t usage, eliminate direct IR usage, and use proper naming.
Do you have a general idea of the interfaces you intend here? Would it just be to add interfaces like
cl_kernel dsl::build(cl_context ctx, cl_device)
and modify the code to use those interfaces?
I was thinking about an nGEN-style interface (and generator/kernel abstraction):
namespace dsl {
class kernel_t {
public:
    void generate() {
        auto A = alloc(...);
        multiply(A, B, ...);
    }
    ...
#if WITH_OPENCL
    cl_kernel get_kernel(cl_context, cl_device_id);
#endif
#if WITH_SYCL
    sycl::kernel get_kernel(sycl::context, sycl::device);
#endif
};
}
- We drop the engine-based dispatching to avoid a dependency on oneDNN
- We drop templates (compared with nGEN)

But other than that, we can have similar configuration capabilities as in nGEN, like dsl::kernel_t::requireGRF(...), etc. What do you think?
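For illustration, a usage sketch of such a configurable kernel object; only requireGRF(...) is named above, the remaining calls, values, and handles are assumptions:

// Hypothetical usage; only requireGRF(...) appears in the discussion above,
// the rest is an illustrative assumption.
dsl::kernel_t kernel;
kernel.requireGRF(256); // nGEN-style configuration call
kernel.generate();      // emit the kernel body via DSL constructs
#if WITH_OPENCL
cl_kernel k = kernel.get_kernel(context, device); // handles from the OpenCL runtime
#endif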
I think this is a good idea, although I suggest we make the get_kernel functions standalone functions. By doing this, we can consolidate runtime-specific functionality into dedicated headers (if we accumulate multiple runtime-dependent interfaces).
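As a rough sketch of that idea (the header name and function signature below are assumptions, not existing interfaces):

// Hypothetical runtime-specific header, e.g. "dsl/ocl_kernel.hpp"; the name
// and signature are illustrative only.
#if WITH_OPENCL
#include <CL/cl.h>

namespace dsl {
class kernel_t; // generated DSL kernel object

// Standalone builder instead of a member function on kernel_t.
cl_kernel make_ocl_kernel(
        const kernel_t &kernel, cl_context context, cl_device_id device);
} // namespace dsl
#endif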
Sounds good. Just a general comment: it would be nice to have less forwarding and fewer extra functions around this place. This is a relatively simple interface which should just forward arguments to the underlying nGEN generator, so it can be kept simple.
@rjoursler Thanks. I noticed there is still code that could be part of a common DSL kernel abstraction; e.g., this make_kernel includes some steps that should be shared between all DSL kernels:
oneDNN/src/gpu/intel/gemm/jit/generator_dsl/builder.cpp
Lines 768 to 784 in 18e3682
kernel_t make_kernel(
        const generator_dsl_desc_t &desc, ir::constraint_set_t cset) {
    ir::ir_context_t ctx(desc.exec_cfg, cset);
    ir::trace_start();
    auto k = generator_dsl_t(desc).build(desc.kernel_iface(), ctx);
    ir::trace_pass("build generator_dsl_t", k.body, ctx);
    k.body = ir::simplify(k.body, ctx);
    k.body = ir::inject_send(k.body, ctx);
    // TODO: This should be unnecessary as it could happen at codegen
    k.body = ir::fixup_if_conditions(k.body, ctx);
    k.body = ir::eliminate_common_subexprs(
            k.body, ctx, desc.strategy.GRFs * ctx.hw().grf_size());
    return k;
}
What do you think about following the nGEN approach where a generator/kernel is derived from the base generator/kernel and we end up with most things hidden and implemented in the base class:
class dsl_kernel_t {
public:
    virtual void generate() = 0;

    // OpenCL here as an example.
    cl_kernel get_ocl_kernel() {
        declare_kernel(...);
        generate();
        auto stmt = end_kernel();
        stmt = ir_pass(stmt);
        ...
        return make_ocl_kernel(stmt, ...);
    }
};

class my_kernel_t : public dsl_kernel_t {
public:
    my_kernel_t(my_desc_t desc) : desc(desc) {}
    void generate() override {
        // kernel code
    }

private:
    my_desc_t desc;
};

// Usage:
my_kernel_t kernel;
auto cl_kernel = kernel.get_ocl_kernel();
Why I think this is important: I anticipate we will try to implement fusions via DSL and having such interfaces would be a great help to quickly start implementing the algorithm without dealing with unnecessary boilerplate.
I'm fine to merge as is and adjust in following iterations. But I think it'd be good to agree on how it should look in the final version.
@echeresh, I was thinking about this yesterday and came away against using an abstract interface for this purpose. The purpose of the abstract interface is to gain access to a callable object, but it will be more flexible and require less boilerplate to directly target callable objects. For example, I think it would be much better to implement your sample function as
template <typename F>
cl_kernel get_ocl_kernel(F &&generate) {
    declare_kernel(...);
    generate();
    auto stmt = end_kernel();
    ...
By doing this, usage can be as simple as implementing a lambda:
cl_kernel get_ocl_kernel([] {
    A = dsl::arg("A");
    ...
});
Just to give an overview of how I was designing this interface, I have been treating it similar to the compile -> link paradigm. While this is an inexact analogy, under this paradigm we have source code + compiler options -> object file and object file(s) + link options -> executable. This creates the following mapping:
source code -> kernel_iface + generate
compile options -> exec_config_t + ir::ctx()
object file -> dsl::kernel_t
link options -> ngen::DebugInfo
executable -> cl_kernel, sycl::kernel, etc.
I believe we should wrap the compile options and link options into their own structures as we work to finalize these interfaces. As part of this, I think the standard IR transformations you mentioned would largely be relegated to a list of passes within the compile options and applied during the call to end_kernel().
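A rough sketch of what those option structures could look like under this mapping; every name below is an assumption for illustration rather than an existing interface:

#include <functional>
#include <vector>

namespace dsl {

// "Compiler options" analogue: settings that shape kernel_iface + generate()
// into a dsl::kernel_t (the "object file").
struct compile_options_t {
    // exec_config_t exec_cfg;  // execution configuration from the discussion
    // Standard IR passes (simplify, inject_send, ...) applied during end_kernel().
    std::vector<std::function<void()>> passes;
};

// "Link options" analogue: settings used when lowering dsl::kernel_t to a
// runtime kernel (cl_kernel, sycl::kernel, ...), e.g. ngen::DebugInfo.
struct link_options_t {
    bool emit_debug_info = false;
};

} // namespace dsl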
but it will be more flexible and require less boilerplate to directly target callable objects
Can you please be more specific here: what do we gain? My snippet is quite simple - no boilerplate and it's pretty clear where to put kernel code. I read your suggestion as moving away from the object model to lambdas, but IMO it's mostly a style choice. Having nGEN-style kernel generators in oneDNN is already idiomatic, so why steer away from it?
Moreover, in your snippet:
cl_kernel get_ocl_kernel([] {
    A = dsl::arg("A");
    ...
});
The kernel is implemented in a lambda, while in practice we probably won't be writing kernels this way, as even a moderately complex kernel would require some internal state to implement its logic - so we would probably continue implementing kernel generators in dedicated classes. In this case, introducing a base DSL kernel class is reasonable to me.
I think the standard ir transformations you mentioned would largely be relegated to a list of passes within the compile options
As a side comment: I believe we should focus on a default set of passes - to keep things simple from the beginning. I don't like the idea of allowing too much customization, due to complexity. At the same time, customization can be useful, but it should be kept under tighter control. My main concern is that development won't scale well if we assume every kernel will have its own customization for passes - so I would assume the opposite.
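For example, a single shared default pipeline could look roughly like this; the enum and function are hypothetical, with the pass names mirroring the make_kernel listing above:

#include <vector>

// Sketch only: one default pass pipeline shared by all DSL kernels.
enum class pass_kind_t { simplify, inject_send, fixup_if_conditions, cse };

std::vector<pass_kind_t> default_passes() {
    return {pass_kind_t::simplify, pass_kind_t::inject_send,
            pass_kind_t::fixup_if_conditions, pass_kind_t::cse};
}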
My snippet is quite simple - no boilerplate and pretty clear where to put kernel code.
All the class setup is unnecessary boilerplate. Consider a simple fusion kernel: using a callable-object-based approach is just 4 lines of code, while the structure-based approach is more like 12.
// Structure based approach
class simple_fused_kernel_t : public dsl_kernel_t {
public:
    simple_fused_kernel_t(kernel1_desc_t desc1, kernel2_desc_t desc2)
        : kernel1(desc1), kernel2(desc2) {}
    void generate() override {
        kernel1.generate();
        kernel2.generate();
    }

private:
    kernel1_t kernel1;
    kernel2_t kernel2;
};

// Callable object based API
void generate_simple_fused(kernel1_desc_t desc1, kernel2_desc_t desc2) {
    generate_kernel1(desc1);
    generate_kernel2(desc2);
}
I read your suggestion as moving away from the object model to lambdas but IMO, it's mostly a style choice.
I mostly agree, but the purpose of the DSL is to be more like a higher-level language. Up to now the DSL has not used inheritance, in line with this, so injecting it at this point is not aligned with the DSL style. In addition, the structure approach is not how OpenCL or SYCL programming works, whereas using a function-based API is very similar to SYCL programming.
All the class setup is unnecessary boilerplate. Consider a simple fusion kernel: using a callable-object-based approach is just 4 lines of code, while the structure-based approach is more like 12.
I have two comments here:
- Are such kernels our main interest? oneDNN kernels are usually more complex; I would target them. We can't implement complex kernels in simple lambdas.
- This example uses fusion - I agree that for fusion we need an internal API for composability (I think we discussed that). But it's not kernel-level reuse.
  - It won't be simple stacking like in the example above - we have to forward arguments properly, etc. Rather, it would be something like gemm(A, B, C). We usually can't reuse whole kernels.
I mostly agree, but the purpose of the DSL is to be more like a higher level language. Up to now the DSL does not use inheritance, so injecting it at this point is not aligned with the DSL style. In addition, the structure approach is not how OpenCL or SYCL programming works, whereas using a function-based API is very similar to SYCL programming.
Well, it's a style choice to me. I wouldn't say SYCL is closer here; to me, DSL is closer to nGEN, as both are runtime-configurable and JITted at runtime (i.e. every nGEN instruction or DSL construct execution results in some code being produced). By contrast, any high-level SYCL/OpenCL kernel is parsed and compiled by the compiler at once. Having a compiler, for example, means that in SYCL we have an implicit kernel ABI or can rely on compiler-assisted attributes to control kernel behavior. DSL/nGEN kernels are strictly API-configurable.
I'm fine to start with lambdas and check after a while. If generators start re-implementing the same things, I suggest switching to a base class.
Force-pushed from afbac73 to 75a8caf
The default initializer prevents aggregate initialization in C++11.
Force-pushed from 75a8caf to 18e3682
Force-pushed from 18e3682 to e209fee
Force-pushed from e209fee to a2c5725
make test
    }
    auto k = make_kernel(gemm_desc, cset);
    return engine->create_kernel(kernel, k);
}
Let's refactor this a bit and put the DSL kernel generation code in one method (get_kernel_dsl, say) and the "classic" code generation in another (get_kernel_classic, say; not sure if that's a good name).
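Roughly, such a split could look like the following sketch; the method names come from this comment, while the signatures and the surrounding members (gemm_desc, cset) are assumptions based on the diff above:

// Sketch only, not the actual code.
status_t get_kernel_dsl(impl::engine_t *engine, compute::kernel_t &kernel) {
    // IR/DSL-based generation path.
    auto k = make_kernel(gemm_desc, cset);
    return engine->create_kernel(kernel, k);
}

status_t get_kernel_classic(impl::engine_t *engine, compute::kernel_t &kernel) {
    // Existing nGEN-based ("classic") generation path moves here unchanged.
    return status::unimplemented; // placeholder in this sketch
}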
        strategy_.kParallelLocal = false;
        strategy_.kInterleave = false;
    }
}
Maybe add a quick comment here to explain ("these strategies not supported yet via DSL, automatically disable for the benefit of code below" or something like that).
It might be nice to factor this out into its own routine so you can update it as DSL capabilities evolve.
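One possible shape for that helper, as a sketch; the function name is illustrative, and the strategy type is templated since its exact definition is not shown here:

// Sketch of the suggested helper.
template <typename strategy_t>
void disable_unsupported_dsl_features(strategy_t &strategy) {
    // These strategies are not supported yet via DSL; disable them so the code
    // below can assume they are off.
    strategy.kParallelLocal = false;
    strategy.kInterleave = false;
}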
auto b_size = types::data_type_size(pd()->eff_b_type());
gpu_assert(stride_b * b_size % base_alignment == 0)
        << "Unimplemented load transform";
}
Shouldn't this check go in pd init?
Once this starts being used in production, yes. My current goal here is that executing
enable_generator_dsl=1 ./benchdnn --matmul ...
will result in a test failure so that I can accurately track what is functionally enabled rather than silently testing the current implementation.
Why not move it to pd initialization and either:
- Error when the DSL fails instead of returning unimplemented (i.e. move this code as-is into pd init), or
- Add a new env variable like require_generator_dsl that throws an error if we're not using the DSL implementation - to separate use of the DSL implementation from testing it for coverage (see the sketch after this list).
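The second option could be a simple gate, sketched here with std::getenv; require_generator_dsl is the name proposed above, everything else is illustrative:

#include <cstdlib>
#include <cstring>

// Sketch only: fail loudly when the DSL path is requested but not taken.
inline bool require_generator_dsl() {
    const char *v = std::getenv("require_generator_dsl");
    return v && std::strcmp(v, "1") == 0;
}

// Hypothetical use at pd init:
//   if (require_generator_dsl() && !using_dsl_impl) return status::runtime_error;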
@rjoursler As an option, we could align this with the existing approach for v2 convolution - rely on --impl=dsl to test this implementation explicitly. For that we also need to change the implementation name in the DSL case (which can also be helpful when looking at verbose logs).
This approach is more explicit - we'll know at PD creation time which implementation is picked up.
However, for initial prototyping and enabling, any approach is fine with me.
@Simonsays095 It looks like your suggestion doesn't work; oneDNN is just catching the error and falling back to the next implementation. I am going to leave this and we can remove it in the future.
@echeresh Agreed, though I plan to work on finalizing the Gemmstone API for this feature before doing that work. In particular, I plan to finalize the API during upstreaming to ensure compatibility with Gemmstone requirements.
Force-pushed from a2c5725 to 0ee5f92
Force-pushed from 0ee5f92 to 69c8f96
It may be worth having a small DSL README like we do for IR.
Force-pushed from 69c8f96 to 3b8db30
make test
@kealan-barbieri Agreed, but I think this will be reserved for future work.
Introduces a prototype IR-based GEMM kernel. Currently, this has only been vetted on well-aligned problems of the form:
Currently, many features are missing, most notably data type conversions, dequantization, most post-ops, and handling of "poorly" aligned cases. This PR does implement support for batched GEMM and (common) bias. The prototype implementation can be executed by setting the environment variable enable_gemm_ir=1. Furthermore, problem-specific specialization can be enabled by setting gemm_ir_specialize=1 (although the kernel cache will not work correctly with this knob). For those unfamiliar with IR development, running with DNNL_VERBOSE=debuginfo=200 will log IR as it is transformed into the final output.