Open
I have a small but handy tool called parquet-colsize:
$ parquet-colsize stream-1.parquet
order_id 1.2 GiB 27% (RLE, DELTA_BINARY_PACKED, ZSTD(1))
recv_time 1.1 GiB 25% (RLE, DELTA_BINARY_PACKED, ZSTD(1))
proc_time 1.1 GiB 24% (RLE, DELTA_BINARY_PACKED, ZSTD(1))
price 480.5 MiB 10% (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
seq 277.6 MiB 6% (RLE, DELTA_BINARY_PACKED, ZSTD(1))
qty 165.2 MiB 4% (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
msg_type 84.2 MiB 2% (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
dir 53.6 MiB 1% (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
counterparty 22.9 MiB 0% (RLE, DELTA_BINARY_PACKED, ZSTD(1))
origin 22.5 MiB 0% (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
isin 1.5 MiB 0% (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
date 1.2 MiB 0% (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
inst_id 1.2 MiB 0% (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
stream 1.2 MiB 0% (PLAIN, RLE, RLE_DICTIONARY, ZSTD(1))
I find it really useful when optimising the encoding. AFAIK this functionality doesn't exist in any other parquet tools.
Would you be interested in rolling this functionality into pqrs? It's 70 LOC, and there are some deps.
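To give a rough idea of what such a report involves, here is a minimal sketch of the aggregation step. It is an illustration under assumptions, not the actual parquet-colsize code: `ChunkMeta` is a hypothetical stand-in for the per-row-group column chunk metadata that a Parquet reader exposes (compressed size, encodings, codec), and `summarize`, `human_size` and `format_row` are names made up for this example. Reading the file footer itself is left out so the sketch stays dependency-free.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for one column chunk's metadata in one row group.
struct ChunkMeta {
    column: String,
    compressed_bytes: u64,
    encodings: Vec<String>,
    codec: String,
}

/// One row of the report: a column's totals across all row groups.
struct ColumnReport {
    column: String,
    compressed_bytes: u64,
    percent: f64,
    encodings: BTreeSet<String>,
    codecs: BTreeSet<String>,
}

/// Sum compressed sizes per column, collect the distinct encodings and
/// codecs seen, compute each column's share, and sort largest first.
fn summarize(chunks: &[ChunkMeta]) -> Vec<ColumnReport> {
    let mut reports: Vec<ColumnReport> = Vec::new();
    for chunk in chunks {
        let found = reports.iter().position(|r| r.column == chunk.column);
        let idx = match found {
            Some(i) => i,
            None => {
                reports.push(ColumnReport {
                    column: chunk.column.clone(),
                    compressed_bytes: 0,
                    percent: 0.0,
                    encodings: BTreeSet::new(),
                    codecs: BTreeSet::new(),
                });
                reports.len() - 1
            }
        };
        let report = &mut reports[idx];
        report.compressed_bytes += chunk.compressed_bytes;
        report.encodings.extend(chunk.encodings.iter().cloned());
        report.codecs.insert(chunk.codec.clone());
    }

    let total: u64 = reports.iter().map(|r| r.compressed_bytes).sum();
    for report in &mut reports {
        report.percent = if total == 0 {
            0.0
        } else {
            100.0 * report.compressed_bytes as f64 / total as f64
        };
    }
    // Stable sort: columns of equal size keep their first-seen order.
    reports.sort_by(|a, b| b.compressed_bytes.cmp(&a.compressed_bytes));
    reports
}

/// Format a byte count using binary units (KiB, MiB, GiB, TiB).
fn human_size(bytes: u64) -> String {
    const UNITS: [&str; 5] = ["B", "KiB", "MiB", "GiB", "TiB"];
    let mut value = bytes as f64;
    let mut unit = 0;
    while value >= 1024.0 && unit < UNITS.len() - 1 {
        value /= 1024.0;
        unit += 1;
    }
    if unit == 0 {
        format!("{} B", bytes)
    } else {
        format!("{:.1} {}", value, UNITS[unit])
    }
}

/// Render one report line: name, size, share, then encodings and codecs.
fn format_row(report: &ColumnReport) -> String {
    let details: Vec<&str> = report
        .encodings
        .iter()
        .chain(report.codecs.iter())
        .map(String::as_str)
        .collect();
    format!(
        "{} {} {:.0}% ({})",
        report.column,
        human_size(report.compressed_bytes),
        report.percent,
        details.join(", ")
    )
}
```

In this sketch the encodings are listed alphabetically because they are held in a `BTreeSet`; a real tool might prefer to keep the order in which the file reports them.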