Commit 677f3ee
Compact Compression (#3765)
Added a CompactCompression strategy that generally uses Zstd for
binary/string data and Pco for numeric data types.
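A rough sketch of the per-type dispatch described above. The `ColumnKind` and `Codec` enums and the `choose_codec` function are hypothetical names made up for illustration, not the Vortex API, and the passthrough arm for bool/decimal is an assumption based on the "future work" list in this message:

```rust
/// Hypothetical, simplified stand-in for a column's logical type.
/// (Not Vortex's real DType; used only for this illustration.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnKind {
    Int,
    Float,
    Utf8,
    Binary,
    Bool,
    Decimal,
}

/// Hypothetical codec choice made by a "compact"-style strategy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Codec {
    /// Numeric-specific compression.
    Pco,
    /// General-purpose byte compression.
    Zstd,
    /// Assumption: types not covered yet are left as-is in this sketch.
    Passthrough,
}

/// Pick a codec by column kind: Pco for numeric data, Zstd for
/// binary/string data, passthrough for everything not handled yet.
pub fn choose_codec(kind: ColumnKind) -> Codec {
    match kind {
        ColumnKind::Int | ColumnKind::Float => Codec::Pco,
        ColumnKind::Utf8 | ColumnKind::Binary => Codec::Zstd,
        ColumnKind::Bool | ColumnKind::Decimal => Codec::Passthrough,
    }
}
```

The word "generally" in the description matters: the real strategy may make exceptions that a flat `match` like this does not capture.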
Size comparison on the NYC Taxi dataset, with files produced by running
e.g. `cargo run --release -p vortex-tui convert
fhvhv_tripdata_2023-04.parquet --strategy compact`:
```
433M fhvhv_tripdata_2023-04_btrblocks.vortex
334M fhvhv_tripdata_2023-04_compact_8192.vortex
321M fhvhv_tripdata_2023-04_compact_inf.vortex
469M fhvhv_tripdata_2023-04.parquet (zstd compressed)
```
Here the two compact strategies use up to 8192 values per page versus
"inf" (as many values as possible per page). Using 8192 (the default I
put in the code) slightly increases size but allows for faster access
into slices and can (in the non-null, non-list case) line up nicely with
statistics for potential pushdown filters.
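To illustrate the slice-access tradeoff, here is a minimal sketch of the arithmetic, assuming each page must be decompressed in full before any of its values can be read. The function names are hypothetical and `usize::MAX` stands in for "inf"; none of this is taken from the Vortex code:

```rust
use std::ops::Range;

/// Number of pages needed to hold `len` values.
pub fn page_count(len: usize, values_per_page: usize) -> usize {
    assert!(values_per_page > 0);
    len.div_ceil(values_per_page)
}

/// Half-open range of page indices overlapping the row slice `[start, end)`.
pub fn pages_for_slice(start: usize, end: usize, values_per_page: usize) -> Range<usize> {
    assert!(values_per_page > 0 && start <= end);
    if start == end {
        return 0..0; // empty slice touches no pages
    }
    (start / values_per_page)..((end - 1) / values_per_page + 1)
}

/// Total values that must be decoded to serve `[start, end)` out of `len`
/// values, assuming whole pages are decompressed at once.
pub fn values_decoded(start: usize, end: usize, len: usize, values_per_page: usize) -> usize {
    assert!(end <= len);
    pages_for_slice(start, end, values_per_page)
        .map(|page| {
            let lo = page * values_per_page;
            // saturating_add keeps the "inf" (usize::MAX) case from overflowing
            let hi = len.min(lo.saturating_add(values_per_page));
            hi - lo
        })
        .sum()
}
```

Under these assumptions, reading rows 10,000..10,100 of a 20,000-value column decodes one 8192-value page with the default setting, but all 20,000 values with "inf".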
NOT HAPPENING IN THIS PR (leaving this to future work):
* compression for bool arrays
* compression for decimal arrays
* dict encoding for variable-length types
Other changes this incurred:
* Added --strategy arg to convert command (and simplified the flag
handling a bit)
* Added VarBinView support for Zstd encoding
* Added unit tests for all the new functionality
Fixes #3611.
---------
Signed-off-by: mwlon <m.w.loncaric@gmail.com>
1 parent 6564035 commit 677f3ee
File tree
14 files changed (+762, -98 lines)
- encodings/zstd/src
- vortex-file/src
- vortex-layout/src/layouts
- vortex-tui/src
- vortex/src