Commit 7a38744

AntoinePrv and Antoine Pitrou authored
GH-47572: [C++][Parquet] Uniform unpack interface (#47573)
### Rationale for this change

Refactor the unpack function code generator and dispatcher to accommodate more use cases:

- ~`uint16_t`~ and `uint64_t` as new sizes
- ~A *dispatch once* function returning a function pointer for the correct bit width~

*Update 2025-10-13*: the dispatch-once mechanism and the `uint16_t` implementation were removed, as they turned out to be slower.

### What changes are included in this PR?

The diff is hard to read, but the important files are:

- The two Python code-generation files accommodate the new sizes (~with the exception of `uint16` on AVX-512, for which the algorithm's assumptions break~).
- The two code generators give a uniform structure to the "batch unpackers" they generate: each one is a specialization of a struct template, e.g. `unpack29_32` → `Unpacker<uint32_t, 29>::unpack`.
  - Using a specialization instead of a hard-coded number in the function name makes the unpackers usable from generic code.
  - Wrapping the functions in a struct makes it possible to carry information along with the function (such as the number of values it unpacks) and leaves the door open to partial specialization for future improvements.
- The public functions in `bpacking_internal.h` are also template specializations: `unpack32` → `unpack<uint32_t>`.
- The large `switch` statements and `for` loops used to dispatch a bit width to its implementation are now all generic, driven by a `constexpr`-generated jump table. The logic is in `bpacking_dispatch_internal.h`.

From a performance perspective:

- There is no improvement to the individual generated "batch unpackers".
- The SIMD code actually does worse than scalar on SSE4.2 and for `uint16_t`.
- The new sizes can bring improvements:
  - `unpack<uint64_t>` has a SIMD implementation that should benefit `DeltaBitPackDecoder`;
  - ~the novel `unpack<uint16_t>` should benefit the level decoders;~
  - ~the dispatch-once mechanism should benefit all repeated calls to unpack functions (it still needs to be combined with dynamic dispatch, but see `get_unpack_fn` for the building block).~

*Update 2025-10-13*:

- Added unpack implementations for `uint8_t` and `uint16_t` that call the `uint32_t` version on a local buffer combined with a `static_cast` loop (what was previously done at the call site).
- Performance should remain on par with the previous implementations.

The PR as it stands mainly changes the interface of the unpack functions, for future iterations and cleaner use.

### Are these changes tested?

Very much.

### Are there any user-facing changes?

* GitHub Issue: #47572

Lead-authored-by: AntoinePrv <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
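The specialization-plus-jump-table design described above can be illustrated with a small self-contained sketch. Everything below is hypothetical (the names, the naive scalar loop, the table builder) and only mirrors the shape the description gives — it is not Arrow's actual generated code:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical sketch: a struct template replaces hard-coded names
// (`unpack29_32` becomes `Unpacker<uint32_t, 29>::unpack`) and can carry
// metadata such as the number of values one call unpacks.
template <typename T, int kBitWidth>
struct Unpacker {
  static constexpr int kValuesPerBatch = 32;

  // Naive scalar reference loop: read kBitWidth bits per value, LSB first.
  static const uint8_t* unpack(const uint8_t* in, T* out) {
    uint64_t acc = 0;
    int bits_in_acc = 0;
    const uint64_t mask =
        kBitWidth >= 64 ? ~uint64_t{0} : (uint64_t{1} << kBitWidth) - 1;
    for (int i = 0; i < kValuesPerBatch; ++i) {
      while (bits_in_acc < kBitWidth) {
        acc |= static_cast<uint64_t>(*in++) << bits_in_acc;
        bits_in_acc += 8;
      }
      out[i] = static_cast<T>(acc & mask);
      acc >>= kBitWidth;
      bits_in_acc -= kBitWidth;
    }
    return in;
  }
};

// The dispatch `switch` becomes a constexpr-generated jump table indexed by
// bit width (0 .. 8 * sizeof(T)).
template <typename T>
using UnpackFn = const uint8_t* (*)(const uint8_t*, T*);

template <typename T, int... Ws>
constexpr auto MakeTable(std::integer_sequence<int, Ws...>) {
  return std::array<UnpackFn<T>, sizeof...(Ws)>{{&Unpacker<T, Ws>::unpack...}};
}

template <typename T>
const uint8_t* unpack_one_batch(const uint8_t* in, T* out, int bit_width) {
  static constexpr auto kTable = MakeTable<T>(
      std::make_integer_sequence<int, static_cast<int>(8 * sizeof(T)) + 1>{});
  return kTable[bit_width](in, out);
}
```

Because the bit width is a template parameter, generic code can instantiate every width in one pack expansion instead of naming each generated function; the real generators additionally emit per-architecture (scalar, AVX2, AVX-512) specializations.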
1 parent fc5fd48 commit 7a38744

26 files changed: +58822 −14978 lines

.gitattributes

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-cpp/src/arrow/util/bpacking_*_generated.h linguist-generated=true
+cpp/src/arrow/util/bpacking_*_generated_internal.h linguist-generated=true
 cpp/src/parquet/chunker_*_generated.h linguist-generated=true
 cpp/src/generated/*.cpp linguist-generated=true
 cpp/src/generated/*.h linguist-generated=true

cpp/apidoc/Doxyfile

Lines changed: 1 addition & 0 deletions
@@ -1095,6 +1095,7 @@ EXCLUDE_PATTERNS = *-test.cc \
                     *test* \
                     *_generated.h \
                     *-benchmark.cc \
+                    *_codegen.py \
                     *internal*

 # The EXCLUDE_SYMBOLS tag can be used to specify one or more symbol names

cpp/src/arrow/CMakeLists.txt

Lines changed: 4 additions & 5 deletions
@@ -490,6 +490,8 @@ set(ARROW_UTIL_SRCS
     util/bitmap_builders.cc
     util/bitmap_ops.cc
     util/bpacking.cc
+    util/bpacking_scalar.cc
+    util/bpacking_simd_default.cc
     util/byte_size.cc
     util/byte_stream_split_internal.cc
     util/cancel.cc
@@ -533,11 +535,8 @@ set(ARROW_UTIL_SRCS

 append_runtime_avx2_src(ARROW_UTIL_SRCS util/byte_stream_split_internal_avx2.cc)

-append_runtime_avx2_src(ARROW_UTIL_SRCS util/bpacking_avx2.cc)
-append_runtime_avx512_src(ARROW_UTIL_SRCS util/bpacking_avx512.cc)
-if(ARROW_HAVE_NEON)
-  list(APPEND ARROW_UTIL_SRCS util/bpacking_neon.cc)
-endif()
+append_runtime_avx2_src(ARROW_UTIL_SRCS util/bpacking_simd_avx2.cc)
+append_runtime_avx512_src(ARROW_UTIL_SRCS util/bpacking_simd_avx512.cc)

 if(ARROW_WITH_BROTLI)
   list(APPEND ARROW_UTIL_SRCS util/compression_brotli.cc)

cpp/src/arrow/util/bit_stream_utils_internal.h

Lines changed: 18 additions & 41 deletions
@@ -296,6 +296,18 @@ inline bool BitReader::GetValue(int num_bits, T* v) {
   return GetBatch(num_bits, v, 1) == 1;
 }

+namespace internal_bit_reader {
+template <typename T>
+struct unpack_detect {
+  using type = std::make_unsigned_t<T>;
+};
+
+template <>
+struct unpack_detect<bool> {
+  using type = bool;
+};
+}  // namespace internal_bit_reader
+
 template <typename T>
 inline int BitReader::GetBatch(int num_bits, T* v, int batch_size) {
   ARROW_DCHECK(buffer_ != NULL);
@@ -323,47 +335,12 @@ inline int BitReader::GetBatch(int num_bits, T* v, int batch_size) {
     }
   }

-  if (sizeof(T) == 4) {
-    int num_unpacked =
-        internal::unpack32(buffer + byte_offset, reinterpret_cast<uint32_t*>(v + i),
-                           batch_size - i, num_bits);
-    i += num_unpacked;
-    byte_offset += num_unpacked * num_bits / 8;
-  } else if (sizeof(T) == 8 && num_bits > 32) {
-    // Use unpack64 only if num_bits is larger than 32
-    // TODO (ARROW-13677): improve the performance of internal::unpack64
-    // and remove the restriction of num_bits
-    int num_unpacked =
-        internal::unpack64(buffer + byte_offset, reinterpret_cast<uint64_t*>(v + i),
-                           batch_size - i, num_bits);
-    i += num_unpacked;
-    byte_offset += num_unpacked * num_bits / 8;
-  } else {
-    // TODO: revisit this limit if necessary
-    ARROW_DCHECK_LE(num_bits, 32);
-    const int buffer_size = 1024;
-    uint32_t unpack_buffer[buffer_size];
-    while (i < batch_size) {
-      int unpack_size = std::min(buffer_size, batch_size - i);
-      int num_unpacked =
-          internal::unpack32(buffer + byte_offset, unpack_buffer, unpack_size, num_bits);
-      if (num_unpacked == 0) {
-        break;
-      }
-      for (int k = 0; k < num_unpacked; ++k) {
-#ifdef _MSC_VER
-#  pragma warning(push)
-#  pragma warning(disable : 4800)
-#endif
-        v[i + k] = static_cast<T>(unpack_buffer[k]);
-#ifdef _MSC_VER
-#  pragma warning(pop)
-#endif
-      }
-      i += num_unpacked;
-      byte_offset += num_unpacked * num_bits / 8;
-    }
-  }
+  using unpack_t = typename internal_bit_reader::unpack_detect<T>::type;
+
+  int num_unpacked = ::arrow::internal::unpack(
+      buffer + byte_offset, reinterpret_cast<unpack_t*>(v + i), batch_size - i, num_bits);
+  i += num_unpacked;
+  byte_offset += num_unpacked * num_bits / 8;

   buffered_values =
       detail::ReadLittleEndianWord(buffer + byte_offset, max_bytes - byte_offset);