[parquet] reader perf - Optimize ByteStreamSplit decoding by batching 8 values

The current ByteStreamSplit decoder in arrow-rs processes one value at a time, which may result in many small loads and poor cache locality, and may prevent vectorization.

I wonder whether a blocked transpose that processes 8 values per iteration (something columnar engines do) would help with performance here?
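
To make the idea concrete, here is a rough sketch of what I have in mind. It is not the arrow-rs code: it assumes `f32` values (4 byte streams), and the function names `encode_bss_f32`, `decode_bss_f32_scalar`, and `decode_bss_f32_blocked` are made up for illustration. The blocked version walks all four streams in 8-byte chunks so each inner loop has a fixed trip count; whether the compiler actually vectorizes it would need to be checked with a benchmark.

```rust
/// Hypothetical helper (not an arrow-rs API): split each f32 into its 4
/// little-endian bytes and write byte j of value i to stream j, position i.
fn encode_bss_f32(values: &[f32]) -> Vec<u8> {
    let n = values.len();
    let mut out = vec![0u8; n * 4];
    for (i, v) in values.iter().enumerate() {
        for (j, b) in v.to_le_bytes().iter().enumerate() {
            out[j * n + i] = *b;
        }
    }
    out
}

/// Baseline: gather the 4 bytes of one value per iteration.
fn decode_bss_f32_scalar(src: &[u8], n: usize) -> Vec<f32> {
    assert_eq!(src.len(), n * 4);
    (0..n)
        .map(|i| f32::from_le_bytes([src[i], src[n + i], src[2 * n + i], src[3 * n + i]]))
        .collect()
}

/// Blocked variant: reassemble 8 values per outer iteration.
fn decode_bss_f32_blocked(src: &[u8], n: usize) -> Vec<f32> {
    const BLOCK: usize = 8; // assumed block size, taken from the proposal
    assert_eq!(src.len(), n * 4);

    // One slice per byte stream.
    let (s0, rest) = src.split_at(n);
    let (s1, rest) = rest.split_at(n);
    let (s2, s3) = rest.split_at(n);

    let mut out = vec![0f32; n];
    let full = n / BLOCK * BLOCK; // number of values covered by whole blocks

    // Each chunk is exactly BLOCK long, so the inner loop has a constant
    // trip count and reads 8 contiguous bytes from each stream.
    for (((o, a), b), (c, d)) in out[..full]
        .chunks_exact_mut(BLOCK)
        .zip(s0.chunks_exact(BLOCK))
        .zip(s1.chunks_exact(BLOCK))
        .zip(s2.chunks_exact(BLOCK).zip(s3.chunks_exact(BLOCK)))
    {
        for k in 0..BLOCK {
            o[k] = f32::from_le_bytes([a[k], b[k], c[k], d[k]]);
        }
    }

    // Tail: fewer than BLOCK values left, decode them one at a time.
    for i in full..n {
        out[i] = f32::from_le_bytes([s0[i], s1[i], s2[i], s3[i]]);
    }
    out
}
```

Both decoders should return identical output for any input length, including lengths that are not a multiple of 8, so the scalar one can serve as a reference when testing the blocked one.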