feat: add experimental onpair string encoding #8144
Adds the vortex-onpair array encoding under encodings/experimental/onpair, sourcing the compression algorithm from the standalone `onpair` crate (local path dependency) rather than vendored code.

- vortex-onpair: Vortex array wrapping, serialisation, and cast/filter pushdown only; train/encode/decode live in the onpair crate.
- btrblocks: register OnPairScheme alongside FSSTScheme so the sample-based selector keeps the smaller per column; delta-encode the monotonic dict_offsets/codes_offsets children (>= 2048 rows) when it wins.
- vortex-file: register the OnPair encoding and allow it in the write strategy.

Note: onpair is a local path dependency for now (to be published to crates.io).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
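To illustrate the "delta-encode the monotonic offsets when it wins" idea from the btrblocks bullet, here is a minimal std-only sketch. It is not the PR's implementation: the function names, the `MIN_ROWS` constant, and the bit-width size estimate are assumptions for illustration only (the PR relies on the sample-based selector to decide what wins).

```rust
/// Hypothetical threshold mirroring the ">= 2048 rows" condition in the description.
const MIN_ROWS: usize = 2048;

/// Delta-encode a non-decreasing offsets slice. The first output is the first
/// offset itself; every later output is the difference from its predecessor.
/// Returns None if the input is not monotonic (a difference would underflow).
fn delta_encode(offsets: &[u32]) -> Option<Vec<u32>> {
    let mut prev = 0u32;
    offsets
        .iter()
        .map(|&o| {
            let d = o.checked_sub(prev)?;
            prev = o;
            Some(d)
        })
        .collect()
}

/// Invert delta_encode with a running sum.
fn delta_decode(deltas: &[u32]) -> Vec<u32> {
    let mut acc = 0u32;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc
        })
        .collect()
}

/// Bits needed to represent the largest value in the slice (0 for empty or all-zero).
fn bit_width(values: &[u32]) -> u32 {
    values
        .iter()
        .copied()
        .max()
        .map_or(0, |m| 32 - m.leading_zeros())
}

/// Assumed stand-in for "when it wins": delta encoding is chosen only if the
/// array is long enough and the deltas need fewer bits per value than the
/// raw offsets would.
fn delta_wins(offsets: &[u32]) -> bool {
    if offsets.len() < MIN_ROWS {
        return false;
    }
    match delta_encode(offsets) {
        Some(deltas) => bit_width(&deltas) < bit_width(offsets),
        None => false,
    }
}
```

Offsets for variable-length strings grow with the total byte length, while their deltas are bounded by the longest single entry, which is why the deltas usually pack into far fewer bits on long columns.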
- Remove unused public DEFAULT_BITS const and config_with_bits fn (plus their public-api.lock entries).
- Drop stale "C++" references in comments; the algorithm is the pure-Rust onpair crate, not the old FFI shim.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
onpair string encoding
Merging this PR will not alter performance
Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.114x ❌
datafusion / vortex-file-compressed (1.114x ❌, 0↑ 5↓)
File Size Changes (1 files changed, +0.5% overall, 1↑ 0↓)
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.035x ➖, 0↑ 2↓)
datafusion / vortex-compact (1.039x ➖, 0↑ 2↓)
datafusion / parquet (0.988x ➖, 1↑ 0↓)
datafusion / arrow (1.006x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.004x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.006x ➖, 0↑ 0↓)
duckdb / parquet (1.026x ➖, 0↑ 3↓)
duckdb / duckdb (1.002x ➖, 0↑ 0↓)
File Size Changes (18 files changed, -5.3% overall, 9↑ 9↓)
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.997x ➖, 1↑ 1↓)
datafusion / vortex-compact (1.000x ➖, 3↑ 2↓)
datafusion / parquet (0.991x ➖, 2↑ 1↓)
duckdb / vortex-file-compressed (0.997x ➖, 1↑ 2↓)
duckdb / vortex-compact (0.989x ➖, 0↑ 1↓)
duckdb / parquet (1.000x ➖, 1↑ 1↓)
duckdb / duckdb (0.992x ➖, 0↑ 1↓)
File Size Changes (48 files changed, -0.2% overall, 43↑ 5↓)
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (1.011x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.009x ➖, 0↑ 0↓)
duckdb / parquet (1.006x ➖, 0↑ 0↓)
File Size Changes (2 files changed, -0.2% overall, 1↑ 1↓)
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.045x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.031x ➖, 0↑ 0↓)
datafusion / parquet (1.022x ➖, 0↑ 0↓)
datafusion / arrow (1.055x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.043x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.037x ➖, 0↑ 0↓)
duckdb / parquet (1.014x ➖, 0↑ 0↓)
duckdb / duckdb (1.020x ➖, 0↑ 0↓)
File Size Changes (48 files changed, -5.2% overall, 22↑ 26↓)
Replace the hand-rolled SIMD decoder (DecodeView::decode_rows_unchecked, build_dict_table) with onpair::decompress_into / decompress_row_into / decompressed_len. OwnedDecodeInputs is now just four flat host buffers plus a Parts<'_, u32> view; the hot loop lives upstream where the aarch64 NEON intrinsic path also lives.

Bench (UrlLog, 1M rows): decompress_into median 8.4 ms, canonicalize_to_varbinview 14.7 ms.

Adds num-traits as a direct dep to support the generic widen helpers (AsPrimitive::as_() side-steps clippy::cast_* lints on the match_each_integer_ptype! arms).

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.935x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.967x ➖, 0↑ 0↓)
datafusion / parquet (0.918x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.883x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.850x ➖, 1↑ 0↓)
duckdb / parquet (0.875x ➖, 0↑ 0↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.828x ➖, 2↑ 0↓)
datafusion / vortex-compact (0.809x ➖, 3↑ 0↓)
datafusion / parquet (0.924x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.861x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.859x ➖, 1↑ 0↓)
duckdb / parquet (0.792x ➖, 2↑ 0↓)
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.075x ➖, 1↑ 3↓)
datafusion / vortex-compact (0.984x ➖, 0↑ 0↓)
datafusion / parquet (0.991x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.171x ❌, 0↑ 4↓)
duckdb / vortex-compact (0.988x ➖, 0↑ 0↓)
duckdb / parquet (0.997x ➖, 0↑ 0↓)
File Size Changes (2 files changed, -12.0% overall, 1↑ 1↓)
Benchmarks: FineWeb S3
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.789x ➖, 1↑ 0↓)
datafusion / vortex-compact (0.947x ➖, 1↑ 0↓)
datafusion / parquet (0.907x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.775x ➖, 2↑ 0↓)
duckdb / vortex-compact (0.914x ➖, 0↑ 0↓)
duckdb / parquet (0.858x ➖, 0↑ 0↓)
The FineWeb regressions are quite small and are due to the LIKE contains and prefix queries.
Adds the `vortex-onpair` array encoding under `encodings/experimental/onpair`, sourcing the compression algorithm from the standalone `onpair` crate (local path dependency) rather than vendored code.