Add unaligned_line_read and unaligned_line_write as cpu-only cubecl extensions #1052
Conversation
There is a way to reinterpret the line sizes of slices; you can call `with_line_size`.
Hmm, I had not found that method - probably due to it only being available on `Slice<Line<_>>` and not `Slice<_>`. For context, this is my kernel:

```rust
#[cube(launch_unchecked)]
pub fn convolution(
    input: Tensor<f32>,
    mut output: Tensor<Line<f32>>,
    weights: Array<f32>,
    offsets: Array<i32>,
    #[comptime] n_weights: u32,
    #[comptime] ilp: u32,
) {
    let line_size = output.line_size();
    let row = ABSOLUTE_POS;
    let rows = output.shape(0);
    let cols = output.shape(1);
    if row >= rows {
        terminate!()
    }
    let padding_rows = (input.shape(0) - output.shape(0)) / 2;
    let padding_cols = (input.shape(1) - output.shape(1) * line_size) / 2;
    let inp_flat_row = (row + padding_rows) * input.shape(1);
    let out_flat_row = row * output.shape(1);
    let mut col = 0;
    let mut accumulators = Array::<Line<f32>>::vectorized(ilp, line_size);
    while col < cols {
        #[unroll]
        for ilp_idx in 0..ilp {
            accumulators[ilp_idx] = Line::<f32>::empty(line_size);
        }
        let idx = inp_flat_row + col * line_size + padding_cols;
        let weight_line = Line::<f32>::empty(line_size);
        #[unroll]
        for weight_idx in 0..n_weights {
            let weight = weight_line.fill(weights[weight_idx]);
            let offset = offsets[weight_idx];
            #[unroll]
            for ilp_idx in 0..ilp {
                let index = (idx + ilp_idx * line_size) as i32 + offset;
                let inp = input.unaligned_line_read(index as u32, line_size);
                accumulators[ilp_idx] = fma(inp, weight, accumulators[ilp_idx]);
            }
        }
        let output_idx = out_flat_row + col;
        #[unroll]
        for ilp_idx in 0..ilp {
            output[output_idx + ilp_idx] = accumulators[ilp_idx];
        }
        col += ilp;
    }
}
```

I am doing direct convolution / filtering, which is why I need the unaligned line read. The offsets in the column direction are (of course) in pixel units, not in units of whatever line size I decide to use for the kernel. I think that with what you propose, I could change

```rust
// From
input: Tensor<f32>;
// To
input: Tensor<Line<f32>>;
// From
let inp = input.unaligned_line_read(index as u32, line_size);
// To
let inp = input
    .slice(index as u32, line_size)
    .with_line_size(line_size)[0];
```

I'd then pass … However, with this setup, the cubecl-cpu backend panics. This is an assert that was present in the cubecl-cpu compiler before I made any changes:

```rust
fn visit_index(&mut self, index: &IndexOperator, out: Variable) -> Value<'a, 'a> {
    assert!(index.line_size == 0);
    let mut index_value = self.get_index(
        index.index,
        out.ty,
        index.list.ty.is_vectorized(),
    );
    if !self.is_memory(index.list) {
        // ...
```

It could probably also be worked around, if the above syntax is desirable. I think a nice side benefit is also opening up the frontend for extensions - I can imagine e.g. SIMD scatter / gather operations and masks could be extensions as well.
Since we're on the topic of syntax, I'd also prefer if there were a method for splatting a scalar to a Line of some size, rather than having to do the `Line::empty(line_size)` + `fill(value)` combination used in the kernel above.
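For illustration, a minimal sketch of what such a helper could look like, just wrapping the `Line::empty` / `fill` pattern from the kernel above. The function name `splat`, the import path, and the exact bounds are assumptions, not an existing cubecl API:

```rust
use cubecl::prelude::*;

// Hypothetical convenience helper: splat a scalar into a Line of a given
// comptime size. This only repackages the empty-then-fill pattern used in
// the convolution kernel above.
#[cube]
pub fn splat<E: CubePrimitive>(value: E, #[comptime] line_size: u32) -> Line<E> {
    Line::<E>::empty(line_size).fill(value)
}
```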
This is about unaligned loads, not unlined loads I think - loading a full line (i.e. 512 bits) from an address that is not aligned to that same width. An unlined load would load individual elements from an address aligned to the element size. But I don't know if this is the right way to do it because I don't fully understand the internals of the CPU backends.

```rust
let slice = &values_f32[1..];
let vector = read_unaligned(slice.as_ptr() as *const f32x8);
```

We have a lot of these in the ndarray SIMD functions because …
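For reference, a self-contained sketch of the same idea on stable Rust, with a plain `[f32; 8]` standing in for a real `f32x8` SIMD type; the data and variable names are only for illustration:

```rust
// Minimal stand-alone example of an unaligned 8-lane load.
fn main() {
    let values_f32: Vec<f32> = (0..16).map(|i| i as f32).collect();
    // Starting one element in means the pointer is only 4-byte aligned,
    // not 32-byte aligned as a full 8-lane vector load would normally want.
    let slice = &values_f32[1..];
    // read_unaligned may load through a pointer of any alignment.
    let vector: [f32; 8] =
        unsafe { std::ptr::read_unaligned(slice.as_ptr() as *const [f32; 8]) };
    assert_eq!(vector, [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
}
```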
@wingertge You're hitting the nail on the head with this being about unaligned line reads of full width.

```rust
pub fn get_index(
    &self,
    variable: Variable,
    target_item: ir::Type,
    list_is_vectorized: bool,
) -> Value<'a, 'a> {
    let index = self.get_variable(variable);
    let mut index = self.append_operation_with_result(index::casts(
        index,
        Type::index(self.context),
        self.location,
    ));
    // Here we manually multiply the index to go from an index in vector units
    // to an index in scalar units, since the underlying buffer is always in
    // scalar units.
    if target_item.is_vectorized() && list_is_vectorized {
        let vectorization = target_item.line_size() as i64;
        // The line size is apparently assumed to always be a power of 2,
        // meaning we can do the multiplication with a left shift instead.
        let shift = vectorization.ilog2() as i64;
        let constant = self.append_operation_with_result(arith::constant(
            self.context,
            Type::index(self.context),
            IntegerAttribute::new(Type::index(self.context), shift).into(),
            self.location,
        ));
        index = self.append_operation_with_result(arith::shli(
            self.context,
            index,
            constant,
            self.location,
        ));
    }
    index
}
```

All this to say that, as far as I can tell, the cubecl-cpu backend has always been emitting unaligned vector reads, even when the buffer is supposedly aligned. Ultimately, with this change I just want to create a way to actually request this capability explicitly from the frontend. There might be more elegant / better ways to do this. Though again, this does seem to work, at least in my use case.
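To make that index fix-up concrete, here is a small plain-Rust illustration (not cubecl code; the function and names are mine) of what the computation in `get_index` amounts to, and why skipping it yields an element-granular, unaligned read:

```rust
// Illustration only: mirrors the arithmetic in get_index above. When the
// buffer is treated as vectorized, a line index is scaled to a scalar
// (element) index by multiplying with the line size, done as a left shift
// because line sizes are powers of two. Otherwise the index is already in
// element units, which is exactly what an unaligned line read needs.
fn element_index(index: usize, line_size: usize, list_is_vectorized: bool) -> usize {
    assert!(line_size.is_power_of_two());
    if list_is_vectorized {
        index << line_size.ilog2()
    } else {
        index
    }
}

fn main() {
    // With line_size = 8, line index 3 starts at element 24 in the lined case.
    assert_eq!(element_index(3, 8, true), 24);
    // An unaligned read uses the element index as-is, e.g. element 3.
    assert_eq!(element_index(3, 8, false), 3);
}
```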
nathanielsimard left a comment
I understand the problem better now, and the solution is fine. I would add more documentation, but also a device properties flag (unaligned IO supported) with a runtime test in cubecl-core.
```rust
#[cube]
pub trait UnalignedLine<E: CubePrimitive>:
    CubeType<ExpandType = ExpandElementTyped<Self>> + Sized
{
    fn unaligned_line_read(&self, index: u32, #[comptime] line_size: u32) -> Line<E>;

    fn unaligned_line_write(&mut self, index: u32, value: Line<E>);
}

macro_rules! impl_unaligned_line {
    ($type:ident) => {
        paste::paste! {
            type [<$type Expand>]<E> = ExpandElementTyped<$type<E>>;
        }
        #[cube]
        impl<E: CubePrimitive> UnalignedLine<E> for $type<E> {
            fn unaligned_line_read(&self, index: u32, #[comptime] line_size: u32) -> Line<E> {
                unaligned_line_read::<$type<E>, E>(self, index, line_size)
            }

            fn unaligned_line_write(&mut self, index: u32, value: Line<E>) {
                unaligned_line_write::<$type<E>, E>(self, index, value)
            }
        }
    };
}
```
Since this is now a user-facing API, I would add more docs here, maybe with an example.
Makes sense - I've tried adding some comments and an "unaligned_io" device feature. I am not sure how to actually wire it up to a check or anything like that, though - perhaps you can point me in the right direction?
Unaligned line reads and writes
When interacting with Line, cubecl generally expects you to have all of your data laid out with the Line chunking in mind, i.e. Array<Line>. This makes a lot of sense in GPU-land, where unaligned line reads aren't really possible in any of the APIs, to the best of my knowledge.
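As a plain-Rust illustration of the distinction (the indices and sizes here are made up), a lined layout means every line starts at a multiple of the line size, while an unaligned line read starts at an arbitrary element offset:

```rust
// Illustration only, using plain slices instead of cubecl types.
fn main() {
    let data: Vec<f32> = (0..16).map(|i| i as f32).collect();
    let line_size = 4;
    // Lined / aligned access: line index 1 covers elements 4..8.
    let aligned_line = &data[1 * line_size..2 * line_size];
    // Unaligned line read: a full line starting at element 3, i.e. 3..7.
    let unaligned_line = &data[3..3 + line_size];
    assert_eq!(aligned_line, &[4.0f32, 5.0, 6.0, 7.0][..]);
    assert_eq!(unaligned_line, &[3.0f32, 4.0, 5.0, 6.0][..]);
}
```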
However, I think that the new cubecl-cpu backend represents a very exciting way to write high-performance CPU kernels in a very configurable and quite portable manner. And in many cases the penalty for unaligned vector reads on CPU isn't very high, so it can be a very useful algorithmic tool. I've recently used it to play around with a simple direct convolution kernel.
API considerations
However, this presents a problem - unaligned reads / writes make no sense on the GPU backends (as far as I know), but they can be very useful on CPU. So how do we present an interface that best communicates that you shouldn't be trying to do unaligned line reads on GPU?
I propose that we add a `frontend` module to cubecl-cpu and add an extension trait to the relevant data structures in there, forcing users to depend on `cubecl-cpu` and import its `frontend` module to make use of this functionality. A sketch of the pattern follows below. If you don't think this is the right decision, I'd love to hear other ideas for how this use case can be accommodated.
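Here is a minimal stand-alone sketch of that extension-trait pattern in plain Rust; the module, trait, and method names are hypothetical stand-ins, not the actual cubecl types, and the body just slices a Vec for illustration:

```rust
// Sketch: `cpu_frontend` stands in for the proposed cubecl-cpu frontend module.
mod cpu_frontend {
    pub trait UnalignedReadExt {
        fn unaligned_line_read(&self, index: usize, line_size: usize) -> &[f32];
    }

    impl UnalignedReadExt for Vec<f32> {
        fn unaligned_line_read(&self, index: usize, line_size: usize) -> &[f32] {
            // A "line" of line_size elements starting at an arbitrary element index.
            &self[index..index + line_size]
        }
    }
}

fn main() {
    // The method only resolves once the extension trait is in scope, which is
    // what forces users to opt in via the cpu-only crate / module.
    use cpu_frontend::UnalignedReadExt;
    let data: Vec<f32> = (0..16).map(|i| i as f32).collect();
    assert_eq!(data.unaligned_line_read(3, 4), &[3.0f32, 4.0, 5.0, 6.0][..]);
}
```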
In order to actually do this, I was forced to make the ExpandType of ExpandElementTyped public, so that I could use the `intrinsic` macro outside of cubecl-frontend. I'd argue that this is a good thing, as it will allow people to make their own extensions to cubecl more easily and have access to the same tools that cubecl itself uses. In the same vein, I think the `ArrayExpand`, `TensorExpand` and so on that the container modules themselves define should probably also be made public. For now I left them private, but that forced me to redefine them here to get the macros to work.

Implementation details
I found that the IndexOperator and IndexAssignOperator actually already supported this use case, so no new IR is added. They both carry a line_size, but this seems to be unused in favor of looking at the return / input value's type - the current cpu backend goes so far as to assert that line_size == 0.
Furthermore, it seems like all buffers are treated as consisting of scalars at the MLIR level, with some manual instructions added for proper indexing if the underlying buffer is actually vectorized in cubecl land. This makes it easy to do unaligned reads / writes - no change in the MLIR representation of the memory is required, since that was already scalar.
When pulling a vector from scalar memory, we should simply *not* fix up the index beforehand. To facilitate this, I added another argument to `get_index`, which toggles whether the underlying buffer should be treated as having the same vectorization as the output variable.

Tests pass, and it works for my convolution kernel, but I'd like some eyes on whether the new argument is passed correctly in all cases.