
Conversation

@amfaber (Contributor) commented Nov 16, 2025

Unaligned line reads and writes

When interacting with Line, cubecl generally expects you to have all of your data laid out with the Line chunking in mind, i.e. Array<Line>. This makes a lot of sense in GPU-land, where unaligned line reads aren't really possible in any of the APIs, to the best of my knowledge.
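
As a toy illustration of that assumption (not code from this PR): indexing a lined container always lands on a line boundary.

#[cube]
fn lined_copy(input: Tensor<Line<f32>>, mut output: Tensor<Line<f32>>) {
    // With e.g. line_size = 4, input[ABSOLUTE_POS] always touches scalars
    // 4 * ABSOLUTE_POS .. 4 * ABSOLUTE_POS + 4, i.e. every read and write here
    // starts on a line boundary.
    output[ABSOLUTE_POS] = input[ABSOLUTE_POS];
}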

However, I think the new cubecl-cpu backend represents a very exciting way to write high-performance CPU kernels in a configurable and quite portable manner. And in many cases the penalty for unaligned vector reads on CPU isn't very high, so they can be a very useful algorithmic tool. I've recently used this to play around with a simple direct convolution kernel.

API considerations

This presents a problem, though: unaligned reads / writes make no sense on the GPU backends (as far as I know), but they can be very useful on CPU. So how do we present an interface that best communicates that you shouldn't be trying to do unaligned line reads on GPU?
I propose that we add a frontend module to cubecl-cpu and add an extension trait to the relevant data structures here, forcing users to depend on cubecl-cpu and import its frontend module to make use of this functionality.
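
For concreteness, a rough sketch of what user code could look like under this proposal (the module path and kernel body are illustrative; the method name matches the extension trait shown further down the thread):

use cubecl_cpu::frontend::UnalignedLine;

#[cube(launch_unchecked)]
pub fn shift_by_three(input: Tensor<f32>, mut output: Tensor<Line<f32>>, #[comptime] line_size: u32) {
    // Read a full line starting three scalars past the lined position; the
    // starting offset does not have to be a multiple of line_size.
    let line = input.unaligned_line_read(ABSOLUTE_POS * line_size + 3, line_size);
    output[ABSOLUTE_POS] = line;
}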

If you don't think this is the right decision, I'd love to hear other ideas for how this use case can be accommodated.

In order to actually do this, I was forced to make the ExpandType of ExpandElementTyped public, so that I could use the intrinsic macro outside of cubecl-frontend. I'd argue that this is a good thing, as it will allow people to make their own extensions to cubecl more easily and have access to the same tools as cubecl itself uses. In the same vein, I think the ArrayExpand, TensorExpand and so on that the container modules themselves define should probably also be made public. For now I left them private, but that forced me to redefine them here to get the macros to work.

Implementation details

I found that IndexOperator and IndexAssignOperator actually already supported this use case, so no new IR is added. They both carry a line_size, but it seems to be unused in favor of the type of the return / input value - the current cpu backend goes so far as to assert that line_size == 0.

Furthermore, it seems like all buffers are treated as consisting of scalars at the MLIR level, with some manual instructions added for proper indexing if the underlying buffer is actually vectorized in cubecl land. This makes it easy to do unaligned reads / writes - no change in the MLIR representation of the memory is required, since that was already scalar.
When pulling a vector from scalar memory, we should simply /not/ fix up the index beforehand. To facilitate this, I added another argument to get_index, which toggles whether the underlying buffer should be treated as having the same vectorization as the output variable.

Tests pass, and it works for my convolution kernel, but I'd like some eyes on whether the new argument is passed correctly in all cases.

@nathanielsimard (Member) commented

There is a way to reinterpret the line size of slices: you can call let slice = slice.with_line_size(1), which gives you the ability to do unaligned line reads and writes. Let me know if that works for your use case. In that case, we wouldn't need to change the API.

@amfaber (Contributor, Author) commented Nov 24, 2025

Hmm, I had not found that method - probably due to it only being available on Slice<Line<_>> and not Slice<_>. For context, this is my kernel:

#[cube(launch_unchecked)]
pub fn convolution(
    input: Tensor<f32>,
    mut output: Tensor<Line<f32>>,
    weights: Array<f32>,
    offsets: Array<i32>,
    #[comptime] n_weights: u32,
    #[comptime] ilp: u32,
) {
    let line_size = output.line_size();
    let row = ABSOLUTE_POS;
    let rows = output.shape(0);
    let cols = output.shape(1);
    if row >= rows {
        terminate!()
    }

    let padding_rows = (input.shape(0) - output.shape(0)) / 2;
    let padding_cols = (input.shape(1) - output.shape(1) * line_size) / 2;
    let inp_flat_row = (row + padding_rows) * input.shape(1);
    let out_flat_row = row * output.shape(1);
    let mut col = 0;

    let mut accumulators = Array::<Line<f32>>::vectorized(ilp, line_size);

    while col < cols {
        #[unroll]
        for ilp_idx in 0..ilp {
            accumulators[ilp_idx] = Line::<f32>::empty(line_size);
        }

        let idx = inp_flat_row + col * line_size + padding_cols;
        let weight_line = Line::<f32>::empty(line_size);

        #[unroll]
        for weight_idx in 0..n_weights {
            let weight = weight_line.fill(weights[weight_idx]);
            let offset = offsets[weight_idx];
            #[unroll]
            for ilp_idx in 0..ilp {
                let index = (idx + ilp_idx * line_size) as i32 + offset;
                let inp = input.unaligned_line_read(index as u32, line_size);
                accumulators[ilp_idx] = fma(inp, weight, accumulators[ilp_idx]);
            }
        }
        let output_idx = out_flat_row + col;

        #[unroll]
        for ilp_idx in 0..ilp {
            output[output_idx + ilp_idx] = accumulators[ilp_idx];
        }
        col += ilp;
    }
}

I am doing direct convolution / filtering, which is why I need the unaligned line read. The offsets in the column direction are (of course) in pixel units, not in units of whatever line size I decide to use for the kernel. For example, with line_size = 8, an offset of +1 puts the read one scalar past a line boundary, which can't be expressed as an aligned Line index.

I think that with what you propose, I could change

// From
input: Tensor<f32>,
// To
input: Tensor<Line<f32>>,

// From
let inp = input.unaligned_line_read(index as u32, line_size);
// To
let inp = input
    .slice(index as u32, line_size)
    .with_line_size(line_size)[0];

I'd then pass input with a line_size of 1 always, and output with my actual line_size. I suppose that's possible - I haven't checked if the generated assembly is as nice, but I suspect it wouldn't be too bad, and if it were, there'd probably be ways to fix it.

However, with this setup, the cubecl-cpu backend panics:

thread 'main' panicked at ...\cubecl\crates\cubecl-cpu\src\compiler\visitor\operation\operator.rs:156:9:
assertion failed: index.line_size == 0

This is an assert that was present in the cubecl-cpu compiler before I made any changes:

    fn visit_index(&mut self, index: &IndexOperator, out: Variable) -> Value<'a, 'a> {
        assert!(index.line_size == 0);
        let mut index_value = self.get_index(
            index.index,
            out.ty,
            index.list.ty.is_vectorized(),
        );
        if !self.is_memory(index.list) {
    // ...

It could probably also be worked around if the above syntax is desirable.
I'll go ahead and say that the above syntax does not spark joy for me, though. I like the explicit method call a lot more - it's not a wholly uncommon operation, and having an explicit method for it would be better in my view. I am in no way married to the names I've proposed here, but I like having a method where I can explicitly say "take a SIMD vector of length N from this offset into the buffer".

I think a nice side-benefit is also opening up the frontend for extensions - I can imagine e.g. SIMD scatter / gather operations and masks could be extensions as well.

@amfaber (Contributor, Author) commented Nov 24, 2025

Since we're on the topic of syntax, I'd also prefer it if there were a method for splatting a scalar to a Line of some size, rather than having to do Line::empty(_) and then Line::fill(_), but that's not as bad. It also feels weird that Line::new(_) is forced to a length of 1 - any new function seems like it should be a natural entry point to a struct, but off the top of my head I don't know when I'd want to use Line::new.
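
For illustration, such a splat could just be a thin wrapper over the existing two-step dance (a hypothetical helper, not something cubecl provides today):

#[cube]
fn splat<E: CubePrimitive>(value: E, #[comptime] line_size: u32) -> Line<E> {
    // Hypothetical convenience: build a Line of the requested size and fill
    // every lane with the scalar, exactly as done manually in the kernel above.
    let line = Line::<E>::empty(line_size);
    line.fill(value)
}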

@wingertge (Collaborator) commented Nov 25, 2025

> There is a way to reinterpret the line sizes of slices, you can call: let slice = slice.with_line_size(1) which gives you the ability to have unaligned line read and unaligned line write. Let me know if that works for your use case. In that case, we wouldn't need to change the API.

This is about unaligned loads, not unlined loads, I think: loading a full line (e.g. 512 bits) from an address that is not aligned to that same width. An unlined load would load individual elements from an address aligned to the element size. But I don't know if this is the right way to do it because I don't fully understand the internals of the CPU backends.
This is about a case like this:

let slice = &values_f32[1..];
let vector = unsafe { std::ptr::read_unaligned(slice.as_ptr() as *const f32x8) };

We have a lot of these in the ndarray SIMD functions because Vec<T> is only aligned to align_of::<T>(), not necessarily align_of::<Simd<T, N>>(). So when executing vector loads on unvectorized data you need to do that. Same with the morphology ops in vision: you need to iterate pixel by pixel, but on each pixel you load an entire SIMD vector starting from that pixel. So unaligned loads are fairly important on CPU.
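
A plain-Rust sketch of that pattern (x86-64 with AVX assumed; the function and its bounds handling are illustrative, not code from ndarray or the vision crate):

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn shift_left_one_pixel(src: &[f32], dst: &mut [f32]) {
    use std::arch::x86_64::{_mm256_loadu_ps, _mm256_storeu_ps};
    // Assumes src.len() >= dst.len() + 1; the tail (< 8 pixels) is ignored.
    for i in (0..dst.len().saturating_sub(7)).step_by(8) {
        // Loading 8 floats starting at src[i + 1] is one element past an
        // 8-float boundary, so it has to be the unaligned `loadu` variant.
        let v = _mm256_loadu_ps(src.as_ptr().add(i + 1));
        _mm256_storeu_ps(dst.as_mut_ptr().add(i), v);
    }
}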

@amfaber (Contributor, Author) commented Nov 25, 2025

@wingertge You're hitting the nail on the head with this being about unaligned line reads of full width.
There might very well be a better way to implement this, but I will note two things:

  1. These changes actually work with the kernel I posted above, and I am able to do image filtering / convolution with it, with quite nice assembly output too.
  2. The CPU backend works by first lowering the cubecl IR to MLIR, which is then optimized and handed off to an LLVM execution engine, which JIT compiles it. The generated MLIR can be inspected, and the cubecl-cpu backend currently always emits unaligned read instructions, as best I can tell, even when the buffer is actually aligned on the cubecl frontend side. This is because the underlying buffer is always represented as a memref<?xf32> (a runtime-sized buffer of scalar f32 values), never as e.g. memref<?x vector<16xf32>>. The cubecl IR -> MLIR compiler manually multiplies the index we're supposed to read from by the line size, if the memory is semantically Lined on the cubecl side. The MLIR is not "told" about this alignment. This can be seen in crates/cubecl-cpu/src/compiler/visitor/variables.rs:
pub fn get_index(
    &self,
    variable: Variable,
    target_item: ir::Type,
    list_is_vectorized: bool,
) -> Value<'a, 'a> {
    let index = self.get_variable(variable);
    let mut index = self.append_operation_with_result(index::casts(
        index,
        Type::index(self.context),
        self.location,
    ));
    // Here we manually multiply the index to go from an index in vector units
    // to an index in scalar units, since the underlying buffer is always in
    // scalar units.
    if target_item.is_vectorized() && list_is_vectorized {
        let vectorization = target_item.line_size() as i64;
        // The line size is apparently assumed to always be a power of 2,
        // meaning we can do the multiplication with a left shift instead.
        let shift = vectorization.ilog2() as i64;
        let constant = self.append_operation_with_result(arith::constant(
            self.context,
            Type::index(self.context),
            IntegerAttribute::new(Type::index(self.context), shift).into(),
            self.location,
        ));
        index = self.append_operation_with_result(arith::shli(
            self.context,
            index,
            constant,
            self.location,
        ));
    }
    index
}

All this to say that, as far as I can tell, the cubecl-cpu backend has always been emitting unaligned vector reads, even when the buffer is supposedly aligned. Ultimately with this change I just want to create a way to actually request this capability explicitly from the frontend.

There might be more elegant / better ways to do this, though again, this does seem to work, at least in my use case.

@nathanielsimard (Member) left a comment

I understand the problem better now, and the solution is fine. I would add more documentation, but also a device properties flag (unaligned IO supported) with a runtime test in cubecl-core.
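
For illustration, a rough sketch of the kind of guarded runtime test being suggested (Feature::UnalignedIo is a made-up placeholder for the new flag, and the signature is only an approximation of the existing runtime_tests helpers):

pub fn test_unaligned_line_read<R: Runtime>(client: ComputeClient<R::Server, R::Channel>) {
    // Skip (rather than fail) on runtimes that don't advertise unaligned line IO.
    if !client.properties().feature_enabled(Feature::UnalignedIo) {
        return;
    }
    // ... launch a kernel that does an unaligned_line_read at an odd offset and
    // compare the result against a scalar reference ...
}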

Comment on lines 6 to 32
#[cube]
pub trait UnalignedLine<E: CubePrimitive>:
    CubeType<ExpandType = ExpandElementTyped<Self>> + Sized
{
    fn unaligned_line_read(&self, index: u32, #[comptime] line_size: u32) -> Line<E>;

    fn unaligned_line_write(&mut self, index: u32, value: Line<E>);
}

macro_rules! impl_unaligned_line {
    ($type:ident) => {
        paste::paste! {
            type [<$type Expand>]<E> = ExpandElementTyped<$type<E>>;
        }
        #[cube]
        impl<E: CubePrimitive> UnalignedLine<E> for $type<E> {
            fn unaligned_line_read(&self, index: u32, #[comptime] line_size: u32) -> Line<E> {
                unaligned_line_read::<$type<E>, E>(self, index, line_size)
            }

            fn unaligned_line_write(&mut self, index: u32, value: Line<E>) {
                unaligned_line_write::<$type<E>, E>(self, index, value)
            }
        }
    };
}


Since this is now a user-facing API, I would add more docs here, maybe with an example.

@amfaber (Contributor, Author) commented Dec 4, 2025

Makes sense - I've tried adding some comments and an "unaligned_io" device feature. I am not sure how to actually wire it up to a check or anything like that, though - perhaps you can point me in the right direction?
As for a check in cubecl-core/src/runtime_tests, I think that's complicated by this cpu-only extension currently living in the cubecl-cpu crate, which depends on the cubecl-core crate already, of course.
Any thoughts on dealing with that? Perhaps it would be better for all this to live in cubecl-frontend proper?
Are there other frontend methods that should only be called when the device has certain capabilities? How are those handled today?
