The current ByteStreamSplit decoder in arrow-rs processes one value at a time, which may result in many small loads and poor cache locality, and may prevent vectorization.
I wonder if a blocked transpose approach that processes 8 values per iteration (as columnar engines do) would help performance here?
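
Here is a rough sketch of the blocked idea, not the arrow-rs code. The function names `decode_scalar` and `decode_blocked` are hypothetical, and it assumes the usual BYTE_STREAM_SPLIT layout: for `n` values of `K` bytes each, stream `j` holds byte `j` of every value, so `out[i * K + j] = src[j * n + i]`. Whether the compiler actually vectorizes the inner transpose, and whether it is faster, would need to be measured.

```rust
/// Scalar reference decoder: gathers one byte from each of the K streams
/// for every value. Hypothetical helper, shown for comparison only.
fn decode_scalar<const K: usize>(src: &[u8], out: &mut [u8]) {
    assert_eq!(src.len() % K, 0);
    assert_eq!(src.len(), out.len());
    let n = src.len() / K;
    for i in 0..n {
        for j in 0..K {
            out[i * K + j] = src[j * n + i];
        }
    }
}

/// Blocked decoder: handles 8 values per iteration by loading 8 contiguous
/// bytes from each stream into a small K x 8 tile, then transposing the tile
/// into 8 contiguous output values.
fn decode_blocked<const K: usize>(src: &[u8], out: &mut [u8]) {
    const BLOCK: usize = 8;
    assert_eq!(src.len() % K, 0);
    assert_eq!(src.len(), out.len());
    let n = src.len() / K;
    let full = n / BLOCK * BLOCK; // number of values covered by whole blocks

    let mut i = 0;
    while i < full {
        // One contiguous 8-byte load per stream instead of 8 single-byte loads.
        let mut tile = [[0u8; BLOCK]; K];
        for j in 0..K {
            let start = j * n + i;
            tile[j].copy_from_slice(&src[start..start + BLOCK]);
        }
        // Transpose the tile: fixed trip counts give the optimizer a chance
        // to unroll or vectorize this loop nest.
        let dst = &mut out[i * K..(i + BLOCK) * K];
        for v in 0..BLOCK {
            for j in 0..K {
                dst[v * K + j] = tile[j][v];
            }
        }
        i += BLOCK;
    }

    // Tail: fewer than 8 values left, fall back to the scalar gather.
    for i in full..n {
        for j in 0..K {
            out[i * K + j] = src[j * n + i];
        }
    }
}
```

For `f32` data this would be called as `decode_blocked::<4>(&encoded, &mut decoded_bytes)`. The block size of 8 is just the number from the question above; 16 or 32 may map better onto the SIMD width of a given target.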