
Conversation

@amfaber (Contributor) commented Nov 16, 2025

Unaligned line reads and writes

When interacting with Line, cubecl generally expects you to have all of your data laid out with the Line chunking in mind, i.e. Array<Line>. This makes a lot of sense in GPU-land, where unaligned line reads aren't really possible in any of the APIs, to the best of my knowledge.
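
As a toy illustration of that assumption (not code from this PR): indexing a lined container always lands on a line boundary.

#[cube]
fn lined_copy(input: Tensor<Line<f32>>, mut output: Tensor<Line<f32>>) {
    // With e.g. line_size = 4, input[ABSOLUTE_POS] always touches scalars
    // 4 * ABSOLUTE_POS .. 4 * ABSOLUTE_POS + 4, i.e. every read and write here
    // starts on a line boundary.
    output[ABSOLUTE_POS] = input[ABSOLUTE_POS];
}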

However, I think the new cubecl-cpu backend represents a very exciting way to write high-performance CPU kernels in a configurable and quite portable manner. And in many cases the penalty for unaligned vector reads on CPU isn't very high, so they can be a very useful algorithmic tool. I've recently used this to play around with a simple direct convolution kernel.

API considerations

This presents a problem, though: unaligned reads / writes make no sense on the GPU backends (as far as I know), but they can be very useful on CPU. So how do we present an interface that best communicates that you shouldn't be trying to do unaligned line reads on GPU?
I propose that we add a frontend module to cubecl-cpu and add an extension trait to the relevant data structures here, forcing users to depend on cubecl-cpu and import its frontend module to make use of this functionality.
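
For concreteness, a rough sketch of what user code could look like under this proposal (the module path and kernel body are illustrative; the method name matches the extension trait shown further down the thread):

use cubecl_cpu::frontend::UnalignedLine;

#[cube(launch_unchecked)]
pub fn shift_by_three(input: Tensor<f32>, mut output: Tensor<Line<f32>>, #[comptime] line_size: u32) {
    // Read a full line starting three scalars past the lined position; the
    // starting offset does not have to be a multiple of line_size.
    let line = input.unaligned_line_read(ABSOLUTE_POS * line_size + 3, line_size);
    output[ABSOLUTE_POS] = line;
}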

If you don't think this is the right decision, I'd love to hear other ideas for how this use case can be accommodated.

In order to actually do this, I was forced to make the ExpandType of ExpandElementTyped public, so that I could use the intrinsic macro outside of cubecl-frontend. I'd argue that this is a good thing, as it will allow people to make their own extensions to cubecl more easily and have access to the same tools as cubecl itself uses. In the same vein, I think the ArrayExpand, TensorExpand and so on that the container modules themselves define should probably also be made public. For now I left them private, but that forced me to redefine them here to get the macros to work.

Implementation details

I found that IndexOperator and IndexAssignOperator actually already supported this use case, so no new IR is added. They both carry a line_size, but it seems to be unused in favor of the type of the return / input value - the current cpu backend goes so far as to assert that line_size == 0.

Furthermore, it seems like all buffers are treated as consisting of scalars at the MLIR level, with some manual instructions added for proper indexing if the underlying buffer is actually vectorized in cubecl land. This makes it easy to do unaligned reads / writes - no change in the MLIR representation of the memory is required, since that was already scalar.
When pulling a vector from scalar memory, we should simply /not/ fix up the index beforehand. To facilitate this, I added another argument to get_index, which toggles whether the underlying buffer should be treated as having the same vectorization as the output variable.

Tests pass, and it works for my convolution kernel, but I'd like some eyes on whether the new argument is passed correctly in all cases.

@nathanielsimard (Member) commented

There is a way to reinterpret the line size of slices: you can call let slice = slice.with_line_size(1), which gives you the ability to do unaligned line reads and writes. Let me know if that works for your use case. In that case, we wouldn't need to change the API.

@amfaber (Contributor, Author) commented Nov 24, 2025

Hmm, I had not found that method - probably due to it only being available on Slice<Line<_>> and not Slice<_>. For context, this is my kernel:

#[cube(launch_unchecked)]
pub fn convolution(
    input: Tensor<f32>,
    mut output: Tensor<Line<f32>>,
    weights: Array<f32>,
    offsets: Array<i32>,
    #[comptime] n_weights: u32,
    #[comptime] ilp: u32,
) {
    let line_size = output.line_size();
    let row = ABSOLUTE_POS;
    let rows = output.shape(0);
    let cols = output.shape(1);
    if row >= rows {
        terminate!()
    }

    let padding_rows = (input.shape(0) - output.shape(0)) / 2;
    let padding_cols = (input.shape(1) - output.shape(1) * line_size) / 2;
    let inp_flat_row = (row + padding_rows) * input.shape(1);
    let out_flat_row = row * output.shape(1);
    let mut col = 0;

    let mut accumulators = Array::<Line<f32>>::vectorized(ilp, line_size);

    while col < cols {
        #[unroll]
        for ilp_idx in 0..ilp {
            accumulators[ilp_idx] = Line::<f32>::empty(line_size);
        }

        let idx = inp_flat_row + col * line_size + padding_cols;
        let weight_line = Line::<f32>::empty(line_size);

        #[unroll]
        for weight_idx in 0..n_weights {
            let weight = weight_line.fill(weights[weight_idx]);
            let offset = offsets[weight_idx];
            #[unroll]
            for ilp_idx in 0..ilp {
                let index = (idx + ilp_idx * line_size) as i32 + offset;
                let inp = input.unaligned_line_read(index as u32, line_size);
                accumulators[ilp_idx] = fma(inp, weight, accumulators[ilp_idx]);
            }
        }
        let output_idx = out_flat_row + col;

        #[unroll]
        for ilp_idx in 0..ilp {
            output[output_idx + ilp_idx] = accumulators[ilp_idx];
        }
        col += ilp;
    }
}

I am doing direct convolution / filtering, which is why I need the unaligned line read. The offsets in the column direction are (of course) in pixel units, not in units of whatever line size I decide to use for the kernel. For example, with line_size = 8, an offset of +1 puts the read one scalar past a line boundary, which can't be expressed as an aligned Line index.

I think that with what you propose, I could change

// From
input: Tensor<f32>,
// To
input: Tensor<Line<f32>>,

// From
let inp = input.unaligned_line_read(index as u32, line_size);
// To
let inp = input
    .slice(index as u32, line_size)
    .with_line_size(line_size)[0];

I'd then pass input with a line_size of 1 always, and output with my actual line_size. I suppose that's possible - I haven't checked if the generated assembly is as nice, but I suspect it wouldn't be too bad, and if it were, there'd probably be ways to fix it.

However, with this setup, the cubecl-cpu backend panics:

thread 'main' panicked at ...\cubecl\crates\cubecl-cpu\src\compiler\visitor\operation\operator.rs:156:9:
assertion failed: index.line_size == 0

This is an assert that was present in the cubecl-cpu compiler before I made any changes:

    fn visit_index(&mut self, index: &IndexOperator, out: Variable) -> Value<'a, 'a> {
        assert!(index.line_size == 0);
        let mut index_value = self.get_index(
            index.index,
            out.ty,
            index.list.ty.is_vectorized(),
        );
        if !self.is_memory(index.list) {
    // ...

It could probably also be worked around if the above syntax is desirable.
I'll go ahead and say that the above syntax does not spark joy for me, though. I like the explicit method call a lot more - it's not a wholly uncommon operation, and having an explicit method for it would be better in my view. I am in no way married to the names I've proposed here, but I like having a method where I can explicitly say "take a SIMD vector of length N from this offset into the buffer".

I think a nice side-benefit is also opening up the frontend for extensions - I can imagine e.g. SIMD scatter / gather operations and masks could be extensions as well.

@amfaber (Contributor, Author) commented Nov 24, 2025

Since we're on the topic of syntax, I'd also prefer it if there were a method for splatting a scalar to a Line of some size, rather than having to do Line::empty(_) and then Line::fill(_), but that's not as bad. It also feels weird that Line::new(_) is forced to a length of 1 - any new function seems like it should be a natural entry point to a struct, but off the top of my head I don't know when I'd want to use Line::new.
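
For illustration, such a splat could just be a thin wrapper over the existing two-step dance (a hypothetical helper, not something cubecl provides today):

#[cube]
fn splat<E: CubePrimitive>(value: E, #[comptime] line_size: u32) -> Line<E> {
    // Hypothetical convenience: build a Line of the requested size and fill
    // every lane with the scalar, exactly as done manually in the kernel above.
    let line = Line::<E>::empty(line_size);
    line.fill(value)
}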

@wingertge (Collaborator) commented Nov 25, 2025

> There is a way to reinterpret the line sizes of slices, you can call: let slice = slice.with_line_size(1) which gives you the ability to have unaligned line read and unaligned line write. Let me know if that works for your use case. In that case, we wouldn't need to change the API.

This is about unaligned loads, not unlined loads, I think: loading a full line (e.g. 512 bits) from an address that is not aligned to that same width. An unlined load would load individual elements from an address aligned to the element size. But I don't know if this is the right way to do it because I don't fully understand the internals of the CPU backends.
This is about a case like this:

let slice = &values_f32[1..];
let vector = unsafe { std::ptr::read_unaligned(slice.as_ptr() as *const f32x8) };

We have a lot of these in the ndarray SIMD functions because Vec<T> is only aligned to align_of::<T>(), not necessarily align_of::<Simd<T, N>>(). So when executing vector loads on unvectorized data you need to do that. Same with the morphology ops in vision: you need to iterate pixel by pixel, but on each pixel you load an entire SIMD vector starting from that pixel. So unaligned loads are fairly important on CPU.
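
A plain-Rust sketch of that pattern (x86-64 with AVX assumed; the function and its bounds handling are illustrative, not code from ndarray or the vision crate):

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn shift_left_one_pixel(src: &[f32], dst: &mut [f32]) {
    use std::arch::x86_64::{_mm256_loadu_ps, _mm256_storeu_ps};
    // Assumes src.len() >= dst.len() + 1; the tail (< 8 pixels) is ignored.
    for i in (0..dst.len().saturating_sub(7)).step_by(8) {
        // Loading 8 floats starting at src[i + 1] is one element past an
        // 8-float boundary, so it has to be the unaligned `loadu` variant.
        let v = _mm256_loadu_ps(src.as_ptr().add(i + 1));
        _mm256_storeu_ps(dst.as_mut_ptr().add(i), v);
    }
}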

@amfaber (Contributor, Author) commented Nov 25, 2025

@wingertge You're hitting the nail on the head with this being about unaligned line reads of full width.
There might very well be a better way to implement this, but I will note two things:

  1. These changes actually work with the kernel I posted above, and I am able to do image filtering / convolution with it, with quite nice assembly output too.
  2. The CPU backend works by first lowering the cubecl IR to MLIR, which is then optimized and handed off to an LLVM execution engine, which JIT compiles it. The generated MLIR can be inspected, and the cubecl-cpu backend currently always emits unaligned read instructions, as best I can tell, even when the buffer is actually aligned on the cubecl frontend side. This is because the underlying buffer is always represented as a memref<?xf32> (a runtime-sized buffer of scalar f32 values), never as e.g. memref<?x vector<16xf32>>. The cubecl IR -> MLIR compiler manually multiplies the index we're supposed to read from by the line size, if the memory is semantically Lined on the cubecl side. The MLIR is not "told" about this alignment. This can be seen in crates/cubecl-cpu/src/compiler/visitor/variables.rs:
pub fn get_index(
    &self,
    variable: Variable,
    target_item: ir::Type,
    list_is_vectorized: bool,
) -> Value<'a, 'a> {
    let index = self.get_variable(variable);
    let mut index = self.append_operation_with_result(index::casts(
        index,
        Type::index(self.context),
        self.location,
    ));
    // Here we manually multiply the index to go from an index in vector units
    // to an index in scalar units, since the underlying buffer is always in
    // scalar units.
    if target_item.is_vectorized() && list_is_vectorized {
        let vectorization = target_item.line_size() as i64;
        // The line size is apparently assumed to always be a power of 2,
        // meaning we can do the multiplication with a left shift instead.
        let shift = vectorization.ilog2() as i64;
        let constant = self.append_operation_with_result(arith::constant(
            self.context,
            Type::index(self.context),
            IntegerAttribute::new(Type::index(self.context), shift).into(),
            self.location,
        ));
        index = self.append_operation_with_result(arith::shli(
            self.context,
            index,
            constant,
            self.location,
        ));
    }
    index
}

All this to say that, as far as I can tell, the cubecl-cpu backend has always been emitting unaligned vector reads, even when the buffer is supposedly aligned. Ultimately with this change I just want to create a way to actually request this capability explicitly from the frontend.

There might be more elegant / better ways to do this, though again, this does seem to work, at least in my use case.

@nathanielsimard (Member) left a comment

I understand the problem better now, and the solution is fine. I would add more documentation, but also a device properties flag (unaligned IO supported) with a runtime test in cubecl-core.
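
For illustration, a rough sketch of the kind of guarded runtime test being suggested (Feature::UnalignedIo is a made-up placeholder for the new flag, and the signature is only an approximation of the existing runtime_tests helpers):

pub fn test_unaligned_line_read<R: Runtime>(client: ComputeClient<R::Server, R::Channel>) {
    // Skip (rather than fail) on runtimes that don't advertise unaligned line IO.
    if !client.properties().feature_enabled(Feature::UnalignedIo) {
        return;
    }
    // ... launch a kernel that does an unaligned_line_read at an odd offset and
    // compare the result against a scalar reference ...
}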

Comment on lines 6 to 32
#[cube]
pub trait UnalignedLine<E: CubePrimitive>:
    CubeType<ExpandType = ExpandElementTyped<Self>> + Sized
{
    fn unaligned_line_read(&self, index: u32, #[comptime] line_size: u32) -> Line<E>;

    fn unaligned_line_write(&mut self, index: u32, value: Line<E>);
}

macro_rules! impl_unaligned_line {
    ($type:ident) => {
        paste::paste! {
            type [<$type Expand>]<E> = ExpandElementTyped<$type<E>>;
        }
        #[cube]
        impl<E: CubePrimitive> UnalignedLine<E> for $type<E> {
            fn unaligned_line_read(&self, index: u32, #[comptime] line_size: u32) -> Line<E> {
                unaligned_line_read::<$type<E>, E>(self, index, line_size)
            }

            fn unaligned_line_write(&mut self, index: u32, value: Line<E>) {
                unaligned_line_write::<$type<E>, E>(self, index, value)
            }
        }
    };
}


Since this is now a user-facing API, I would add more docs here, maybe with an example.

@amfaber (Contributor, Author) commented Dec 4, 2025

Makes sense - I've tried adding some comments and an "unaligned_io" device feature. I am not sure how to actually wire it up to a check or anything like that, though - perhaps you can point me in the right direction?
As for a check in cubecl-core/src/runtime_tests, I think that's complicated by this cpu-only extension currently living in the cubecl-cpu crate, which depends on the cubecl-core crate already, of course.
Any thoughts on dealing with that? Perhaps it would be better for all this to live in cubecl-frontend proper?
Are there other frontend methods that should only be called when the device has certain capabilities? How are those handled today?
