"colsize" subcommand #60

@asayers

I have a small but handy tool called parquet-colsize:

$ parquet-colsize stream-1.parquet
order_id        1.2 GiB  27%  (RLE, DELTA_BINARY_PACKED, ZSTD(1))
recv_time       1.1 GiB  25%  (RLE, DELTA_BINARY_PACKED, ZSTD(1))
proc_time       1.1 GiB  24%  (RLE, DELTA_BINARY_PACKED, ZSTD(1))
price         480.5 MiB  10%  (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
seq           277.6 MiB   6%  (RLE, DELTA_BINARY_PACKED, ZSTD(1))
qty           165.2 MiB   4%  (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
msg_type       84.2 MiB   2%  (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
dir            53.6 MiB   1%  (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
counterparty   22.9 MiB   0%  (RLE, DELTA_BINARY_PACKED, ZSTD(1))
origin         22.5 MiB   0%  (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
isin            1.5 MiB   0%  (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
date            1.2 MiB   0%  (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
inst_id         1.2 MiB   0%  (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
stream          1.2 MiB   0%  (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))

I find it really useful when optimising the encoding. AFAIK this functionality doesn't exist in any other parquet tools.

Would you be interested in rolling this functionality into pqrs? It's 70 LOC, and there are some deps.
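
For readers who want a feel for what such a report involves, here is a minimal sketch of the aggregation and formatting step only. It is not the parquet-colsize source, which is not shown in this issue. It assumes the per-chunk sizes have already been read from the file, for example the total compressed size that Parquet records in each column chunk's metadata, and are passed in as `(column_name, size_in_bytes)` pairs. The function names `human_size`, `column_sizes` and `format_report`, and the rounding of the percentage, are hypothetical choices made for illustration.

```python
from collections import defaultdict

UNITS = ["B", "KiB", "MiB", "GiB", "TiB"]


def human_size(n_bytes):
    """Format a byte count with binary (1024-based) units, one decimal place."""
    size = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


def column_sizes(chunks):
    """Sum sizes per column over (name, bytes) pairs, largest column first.

    Assumption: `chunks` holds one pair per column chunk per row group.
    """
    totals = defaultdict(int)
    for name, size in chunks:
        totals[name] += size
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def format_report(chunks):
    """Return one line per column: name, human-readable size, share of total."""
    rows = column_sizes(chunks)
    grand_total = sum(size for _, size in rows) or 1  # avoid dividing by zero
    width = max((len(name) for name, _ in rows), default=0)
    lines = []
    for name, size in rows:
        # Integer percentage, rounded half up (an assumed rounding rule).
        pct = (100 * size + grand_total // 2) // grand_total
        lines.append(f"{name:<{width}}  {human_size(size):>10}  {pct:>3}%")
    return lines
```

Calling `format_report` with pairs collected from every row group yields lines in the same general shape as the output above, minus the encoding and compression details, which would need to be read from the column chunk metadata as well.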
