List hashing fast-path — opteryx compiled/table_ops/hash_ops
Purpose
- Document when the list hashing implementation takes a buffer-aware fast path (no Python object allocation) and when it falls back to per-element Python hashing.
Where it lives
- Implementation: `opteryx/compiled/table_ops/hash_ops.pyx` — function `process_list_chunk`.
- The list handler will use buffer-aware, zero-Python-object inner loops when the list child type is one of:
- integer types (signed/unsigned, fixed-width)
- floating point types
- temporal types (timestamps/dates)
- string or binary child types (string buffers + offsets)
- For these child types, the code reads the child buffers directly and computes element hashes without creating Python objects. Skipping per-element object allocation is what makes this path much faster for dense numeric and string lists.
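
As a rough illustration of the index arithmetic involved, here is a minimal pure-Python sketch of hashing a list column of 64-bit integers straight from its offsets and values buffers. It is not the code in `process_list_chunk`: the names `mix` and `hash_int64_lists`, the seed, and the FNV-style mixing step are all assumptions made for the example, and Python itself still allocates objects here. Only the buffer layout (int32 offsets plus a contiguous child values buffer) follows the Arrow list format.

```python
import struct

MASK64 = (1 << 64) - 1
SEED = 0x9E3779B97F4A7C15   # hypothetical seed, not taken from hash_ops.pyx
PRIME = 0x100000001B3       # FNV-1a 64-bit prime, used here only as a stand-in mixer


def mix(h, value):
    """Fold one 64-bit element value into a running 64-bit hash."""
    return ((h ^ (value & MASK64)) * PRIME) & MASK64


def hash_int64_lists(offsets_buf, values_buf, list_offset, length, child_offset=0):
    """Hash `length` lists starting at logical row `list_offset`.

    offsets_buf: little-endian int32 list offsets (Arrow list layout)
    values_buf:  little-endian int64 child values
    child_offset: logical offset of the child array into values_buf
    """
    hashes = []
    for row in range(list_offset, list_offset + length):
        # offsets[row] and offsets[row + 1] bound this row's elements
        start, end = struct.unpack_from("<ii", offsets_buf, row * 4)
        h = SEED
        for j in range(start, end):
            (v,) = struct.unpack_from("<q", values_buf, (child_offset + j) * 8)
            h = mix(h, v)
        hashes.append(h)
    return hashes


# The lists [[1, 2], [], [3]] in Arrow layout
offsets = struct.pack("<4i", 0, 2, 2, 3)
values = struct.pack("<3q", 1, 2, 3)

full = hash_int64_lists(offsets, values, list_offset=0, length=3)
sliced = hash_int64_lists(offsets, values, list_offset=1, length=2)
```

Hashing a slice that starts at row 1 gives the same values as the tail of the full result, which is the property the offset handling has to preserve.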
Fallback cases
- If the list child type is a complex/unrecognized Arrow type (for example, structs, maps, or arbitrary Python objects), the implementation falls back to slicing the child array and calling Python-level hashing for each element. This is correct but slower.
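
The fallback can be pictured with the following sketch, which assumes the child elements have already been materialised as Python objects (tuples stand in for struct rows). The names `FAST_PATH_KINDS`, `choose_path`, and `hash_lists_fallback` are hypothetical and exist only for this example; the real dispatch in `hash_ops.pyx` works on Arrow types.

```python
# Child kinds that the section above lists as eligible for the fast path.
# The string tags are an assumption of this sketch.
FAST_PATH_KINDS = {"int", "float", "temporal", "string", "binary"}


def choose_path(child_kind):
    """Pick the hashing strategy for a list column from its child kind."""
    return "fast" if child_kind in FAST_PATH_KINDS else "fallback"


def hash_lists_fallback(offsets, child_values):
    """Per-element Python hashing: slice the child, then hash each object."""
    out = []
    for row in range(len(offsets) - 1):
        # The slice allocates a new sequence, and every element is a Python object.
        elements = child_values[offsets[row]:offsets[row + 1]]
        out.append(hash(tuple(hash(e) for e in elements)))
    return out


# Three list rows of struct-like elements: [[(1, "a")], [], [(1, "a")]]
struct_rows = [(1, "a"), (1, "a")]
row_offsets = [0, 1, 1, 2]
fallback_hashes = hash_lists_fallback(row_offsets, struct_rows)
```

Equal lists still hash equally on this path; the cost is one slice and one Python-level hash call per element.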
Correctness notes
- All paths account for Arrow `chunk.offset` on both the parent list array and on the child array. Validity bitmaps are checked with proper bit/byte arithmetic.
- 8-byte primitive loads are done via `memcpy` into a local `uint64_t` to avoid unaligned memory reads.
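
Both notes can be shown in a few lines of Python. This is a sketch with hypothetical helper names (`is_valid`, `load_u64`), not the Cython code. It relies on two facts about the Arrow format: validity bitmaps are least-significant-bit first, and an array's logical index must be shifted by its offset before indexing the bitmap. `struct.unpack_from` plays the role of the `memcpy` into a `uint64_t`, since it reads from any byte position regardless of alignment.

```python
import struct


def is_valid(validity, offset, i):
    """Return True if logical element i is non-null.

    Bit (offset + i) lives in byte (offset + i) >> 3 at bit position
    (offset + i) & 7. A missing bitmap (None) means every element is valid.
    """
    if validity is None:
        return True
    bit = offset + i
    return bool((validity[bit >> 3] >> (bit & 7)) & 1)


def load_u64(buf, byte_pos):
    """Read 8 little-endian bytes at any byte position, aligned or not."""
    (value,) = struct.unpack_from("<Q", buf, byte_pos)
    return value


# Ten physical elements with nulls at physical indexes 1 and 9.
bitmap = bytes([0b11111101, 0b00000001])

# A 64-bit value stored one byte past the start of the buffer (unaligned).
unaligned = b"\x00" + struct.pack("<Q", 0x0102030405060708)
```

With `offset=1`, logical element 0 maps to physical bit 1 and is therefore null, which is the case that goes wrong when the offset is ignored.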
Testing and benchmarks
- Unit tests in `tests/unit/diagnostic/test_list_fast_paths.py` validate parity between flat and chunked arrays and basic correctness for nested and boolean lists.
- Benchmarks live in `tests/performance/benchmarks/bench_hash_ops.py`.
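
The shape of a flat-versus-chunked parity check might look like the sketch below. It does not reproduce the tests in `test_list_fast_paths.py`: `hash_rows` is a stand-in hasher built on Python's `hash`, and all function names are assumptions. The point is only the invariant being tested, namely that splitting the same rows into chunks must not change any row's hash.

```python
def freeze(value):
    """Turn (possibly nested) lists into tuples so they can be hashed."""
    if isinstance(value, list):
        return tuple(freeze(v) for v in value)
    return value


def hash_rows(rows):
    """Stand-in row hasher: one hash per list row, None for a null row."""
    return [None if r is None else hash(freeze(r)) for r in rows]


def hash_chunked(chunks):
    """Hash each chunk independently and concatenate the results."""
    out = []
    for chunk in chunks:
        out.extend(hash_rows(chunk))
    return out


def check_parity(rows, split_points):
    """True if hashing the chunked rows matches hashing the flat rows."""
    bounds = [0, *split_points, len(rows)]
    chunks = [rows[a:b] for a, b in zip(bounds, bounds[1:])]
    return hash_chunked(chunks) == hash_rows(rows)


# Integer, empty, null, boolean, and nested list rows
rows = [[1, 2], [], None, [True, False], [[1], [2, 3]]]
parity_ok = check_parity(rows, split_points=[2, 3])
```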
When to extend
- If nested lists of primitives are common in your workloads, consider implementing a dedicated stack-based fast path for nested lists to avoid repeated `slice()` allocations.
- If child types are frequently small fixed-width types, additional micro-optimizations (incremental bit/byte pointers rather than recomputing shifts) can pay off.
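
To make the second suggestion concrete, the sketch below contrasts recomputing the byte index and shift for every element with carrying a byte index and a bit mask forward. Both functions are hypothetical and written in Python purely for illustration; any gain would come from applying the same pattern inside the Cython inner loop, and would need to be measured.

```python
def count_valid_recompute(validity, offset, length):
    """Count set validity bits, recomputing shift and byte index per element."""
    count = 0
    for i in range(length):
        bit = offset + i
        count += (validity[bit >> 3] >> (bit & 7)) & 1
    return count


def count_valid_incremental(validity, offset, length):
    """Count set validity bits using an incrementally advanced byte/bit cursor."""
    byte_index = offset >> 3
    mask = 1 << (offset & 7)
    count = 0
    for _ in range(length):
        if validity[byte_index] & mask:
            count += 1
        mask <<= 1
        if mask == 0x100:  # walked past bit 7: move to the next byte
            mask = 1
            byte_index += 1
    return count


bitmap = bytes([0b10110101, 0b00001111, 0b11110000])
a = count_valid_recompute(bitmap, 3, 13)
b = count_valid_incremental(bitmap, 3, 13)
```

The two functions must agree for every offset and length, which makes the recomputing version a convenient reference when testing an optimized loop.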
"Why not always buffer-aware?"
- Some Arrow child types are not stored as simple contiguous buffers accessible by offset arithmetic (e.g., structs or other nested variable-width complex types). In those cases, the safe and correct approach is to create Python objects and hash them.
Contact
- If you have a representative large dataset that still performs poorly, attach it or a small reproducer and I'll benchmark and iterate.