Skip to content

Commit ae0840a

Browse files
joseph-isaacsclaude
andcommitted
fix: return 0.0 compression ratio to prevent automatic selection
VarBinScheme is a format conversion utility, not a compression scheme. VarBinView is generally more efficient than VarBin for most workloads, especially with small strings that can be inlined (≤12 bytes). Return 0.0 from expected_compression_ratio to indicate this scheme should not be automatically selected by the compressor. It can still be used explicitly when VarBin format is specifically needed. This fixes test failures where VarBin was being selected and causing incompatibilities with Arrow operations that expect consistent types. Signed-off-by: Joe Isaacs <[email protected]> Co-Authored-By: Claude <[email protected]>
1 parent e17e8b8 commit ae0840a

File tree

1 file changed

+6
-21
lines changed

1 file changed

+6
-21
lines changed

vortex-btrblocks/src/string.rs

Lines changed: 6 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -184,31 +184,16 @@ impl Scheme for VarBinScheme {
184184

185185
fn expected_compression_ratio(
186186
&self,
187-
stats: &Self::StatsType,
187+
_stats: &Self::StatsType,
188188
_is_sample: bool,
189189
_allowed_cascading: usize,
190190
_excludes: &[StringCode],
191191
) -> VortexResult<f64> {
192-
if stats.src.is_empty() {
193-
return Ok(1.0);
194-
}
195-
196-
let src = stats.source();
197-
198-
// Calculate VarBinView size using nbytes()
199-
let varbinview_size = src.as_ref().nbytes();
200-
201-
let string_bytes = src.buffers().iter().map(|b| b.len() as u64).sum::<u64>();
202-
203-
// Determine offset type size based on total string bytes
204-
// Arrow/Vortex uses i32 offsets if total size < u32::MAX, otherwise i64
205-
let offset_type_size = if string_bytes < u32::MAX as u64 { 4 } else { 8 };
206-
let offset_bytes = (src.len() as u64 + 1) * offset_type_size;
207-
208-
let varbin_size = string_bytes + offset_bytes;
209-
assert!(varbin_size > 0, "cannot be empty");
210-
211-
Ok(varbinview_size as f64 / varbin_size as f64)
192+
// VarBinScheme doesn't provide compression - it's a format conversion.
193+
// VarBinView is generally more efficient than VarBin for most workloads
194+
// (especially with small strings that can be inlined).
195+
// Return 0.0 to indicate this scheme should not be selected by the compressor.
196+
Ok(0.0)
212197
}
213198

214199
fn compress(

0 commit comments

Comments
 (0)