This repository was archived by the owner on Feb 18, 2024. It is now read-only.

Arrow2 read parquet file did not reuse the page decoder buffer to array #1324

@sundy-li

Let's look at this code in
https://github.com/jorgecarleitao/arrow2/blob/main/src/io/parquet/read/deserialize/primitive/basic.rs#L219-L226

State::Required(page) => {
    values.extend(
        page.values
            .by_ref()
            .map(decode)
            .map(self.op)
            .take(remaining),
    );
}

This path does an extra memcpy in values.extend and decode. I think we could optimize it by cloning a Buffer instead, which shares the underlying allocation rather than copying the bytes.

The first step would be to change

#[derive(Debug, Clone)]
pub struct DataPage {
    pub(super) header: DataPageHeader,
    pub(super) buffer: Vec<u8>,
    ...
}

to

#[derive(Debug, Clone)]
pub struct DataPage {
    pub(super) header: DataPageHeader,
    pub(super) buffer: Buffer<u8>,
    ...
}
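
A minimal sketch of the difference, using only the standard library. `SharedBytes` is a hypothetical stand-in for a reference-counted buffer, not arrow2's actual `Buffer<u8>`, and `decode_copy` is an illustrative name for the per-value copying path. It also assumes plain-encoded little-endian i32 values; other encodings would still need a real decode step.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a shared, immutable byte buffer.
// Cloning or slicing it bumps a reference count instead of copying bytes.
#[derive(Debug, Clone)]
struct SharedBytes {
    data: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl SharedBytes {
    fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        Self { data: Arc::new(data), offset: 0, len }
    }

    /// Returns a view over `len` bytes starting at `offset`, sharing the allocation.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            data: Arc::clone(&self.data),
            offset: self.offset + offset,
            len,
        }
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[self.offset..self.offset + self.len]
    }
}

/// Copying path: decode each little-endian i32 and write it into a new Vec.
fn decode_copy(page: &[u8]) -> Vec<i32> {
    page.chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    let page = SharedBytes::new(vec![1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0]);

    // Every value is read from the page and written to a fresh allocation.
    let copied = decode_copy(page.as_slice());
    assert_eq!(copied, vec![1, 2, 3]);

    // The view points into the same allocation as the page.
    let view = page.slice(4, 8);
    assert!(std::ptr::eq(
        view.as_slice().as_ptr(),
        page.as_slice()[4..].as_ptr()
    ));
    println!("copied = {:?}, shared view = {:?}", copied, view.as_slice());
}
```

Whether the array can really reuse the page bytes this way depends on the encoding, alignment and endianness of the page, so this only illustrates the cost difference between the two paths.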

@jorgecarleitao what do you think about this?

I found that arrow-rs has already addressed this in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader/byte_array.rs#L115-L138
