Add the support of dictionaries #91

Open
Alex-PLACET wants to merge 16 commits into sparrow-org:main from Alex-PLACET:support_dict
@Alex-PLACET (Collaborator)

No description provided.

@Alex-PLACET linked an issue on Feb 16, 2026 that may be closed by this pull request

codecov bot commented Feb 16, 2026

Codecov Report

❌ Patch coverage is 86.59218% with 48 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/dictionary_cache.cpp | 67.74% | 30 Missing ⚠️ |
| src/deserialize.cpp | 90.80% | 8 Missing ⚠️ |
| src/flatbuffer_utils.cpp | 89.85% | 7 Missing ⚠️ |
| src/dictionary_tracker.cpp | 93.87% | 3 Missing ⚠️ |

Copilot AI left a comment

Pull request overview

This PR adds Apache Arrow IPC dictionary-encoding support to sparrow-ipc, covering dictionary batch emission during serialization, dictionary reconstruction during deserialization, and validation via Arrow integration fixtures.

Changes:

  • Emit DictionaryBatch messages during stream/file serialization and include dictionary blocks in the IPC file footer.
  • Deserialize DictionaryBatch messages and reattach dictionaries to index arrays to reconstruct dictionary-encoded arrays.
  • Extend integration tests/fixtures to exercise dictionary streams and file footers.
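To make the deserialization side of the changes above concrete, here is a minimal sketch of how a reader could cache dictionaries by id and then resolve an index array against the cached values. It is not the code from this PR: `dictionary_batch`, `dictionary_cache`, and `decode` are hypothetical names, values are plain strings rather than Arrow arrays, and the delta handling assumes the Arrow IPC convention that a delta dictionary appends to the existing one while a non-delta dictionary replaces it.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a decoded DictionaryBatch message.
struct dictionary_batch
{
    std::int64_t id;
    std::vector<std::string> values;
    bool is_delta = false;
};

// Hypothetical cache: keeps the current dictionary values for each dictionary id.
class dictionary_cache
{
public:
    // A non-delta batch replaces the stored dictionary; a delta batch appends to it.
    void store(const dictionary_batch& batch)
    {
        auto& values = m_dictionaries[batch.id];
        if (!batch.is_delta)
        {
            values.clear();
        }
        values.insert(values.end(), batch.values.begin(), batch.values.end());
    }

    // Throws if no dictionary with this id has been seen yet.
    const std::vector<std::string>& get(std::int64_t id) const
    {
        const auto it = m_dictionaries.find(id);
        if (it == m_dictionaries.end())
        {
            throw std::runtime_error("no dictionary cached for id " + std::to_string(id));
        }
        return it->second;
    }

private:
    std::unordered_map<std::int64_t, std::vector<std::string>> m_dictionaries;
};

// Resolves an index array against the cached dictionary with the given id.
// Out-of-range indices (including negative ones) throw std::out_of_range.
std::vector<std::string>
decode(const dictionary_cache& cache, std::int64_t id, const std::vector<std::int32_t>& indices)
{
    const auto& dictionary = cache.get(id);
    std::vector<std::string> decoded;
    decoded.reserve(indices.size());
    for (const auto index : indices)
    {
        decoded.push_back(dictionary.at(static_cast<std::size_t>(index)));
    }
    return decoded;
}
```

With a dictionary `{"a", "b"}` stored under id 7 and a delta `{"c"}` appended, decoding the indices `{2, 0, 1, 0}` yields `{"c", "a", "b", "a"}`. A real implementation would keep the index array and attach the dictionary to it instead of materializing the values.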

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| src/deserialize.cpp | Adds DictionaryBatch handling, dictionary caching, and dictionary application to decoded arrays. |
| src/serialize.cpp | Adds serialize_dictionary_batch to write DictionaryBatch messages. |
| src/flatbuffer_utils.cpp | Writes dictionary encoding metadata into schema fields and builds DictionaryBatch messages. |
| src/stream_file_serializer.cpp | Writes footer with dictionary blocks tracked during file serialization. |
| src/dictionary_utils.cpp | Adds helpers for dictionary metadata parsing and fallback dictionary id generation. |
| src/dictionary_tracker.cpp | Scans record batches for dictionary-encoded columns to emit dictionaries ahead of record batches. |
| src/dictionary_cache.cpp | Caches dictionaries during deserialization for later attachment to index arrays. |
| include/sparrow_ipc/serialize.hpp | Exposes serialize_dictionary_batch API. |
| include/sparrow_ipc/flatbuffer_utils.hpp | Exposes DictionaryBatch message builder and extends create_field. |
| include/sparrow_ipc/serializer.hpp | Emits dictionary batches before record batches; tracks dictionaries. |
| include/sparrow_ipc/stream_file_serializer.hpp | Emits dictionary batches, tracks dictionary footer blocks. |
| include/sparrow_ipc/dictionary_utils.hpp | Declares dictionary metadata parsing + fallback id utility. |
| include/sparrow_ipc/dictionary_tracker.hpp | Declares dictionary discovery/tracking for serialization. |
| include/sparrow_ipc/dictionary_cache.hpp | Declares dictionary cache used by deserializer. |
| tests/test_dictionary.cpp | Adds targeted dictionary stream/file round-trip tests. |
| tests/test_de_serialization_with_files.cpp | Adds dictionary fixtures and adjusts raw-buffer comparison for dictionary arrays. |
| tests/CMakeLists.txt | Registers new dictionary test. |
| CMakeLists.txt | Adds new dictionary headers/sources to the build. |
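The ordering rule described for the tracker and serializers (dictionaries are emitted ahead of the record batches that reference them) can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `batch_info`, `dictionary_tracker::pending`, and `plan_stream` are hypothetical names, messages are represented as strings, and the sketch assumes a dictionary never changes after it is first emitted, so each id is written once.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical summary of a record batch: the ids of the dictionaries its columns reference.
struct batch_info
{
    std::vector<std::int64_t> dictionary_ids;
};

// Hypothetical tracker: remembers which dictionary ids were already emitted.
class dictionary_tracker
{
public:
    // Returns the ids referenced by the batch that were not emitted yet,
    // in first-seen order, and marks them as emitted.
    std::vector<std::int64_t> pending(const batch_info& batch)
    {
        std::vector<std::int64_t> result;
        for (const auto id : batch.dictionary_ids)
        {
            if (m_emitted.insert(id).second)
            {
                result.push_back(id);
            }
        }
        return result;
    }

private:
    std::unordered_set<std::int64_t> m_emitted;
};

// Builds the message order for a stream: schema first, then for every record
// batch its not-yet-emitted dictionaries followed by the record batch itself.
std::vector<std::string> plan_stream(const std::vector<batch_info>& batches)
{
    dictionary_tracker tracker;
    std::vector<std::string> messages{"schema"};
    for (const auto& batch : batches)
    {
        for (const auto id : tracker.pending(batch))
        {
            messages.push_back("dictionary:" + std::to_string(id));
        }
        messages.push_back("record_batch");
    }
    return messages;
}
```

For two batches referencing dictionaries `{1, 2}` and `{2, 3}`, the planned order is `schema`, `dictionary:1`, `dictionary:2`, `record_batch`, `dictionary:3`, `record_batch`. In the file format, the position of each emitted dictionary message would also be recorded so it can be listed as a dictionary block in the footer.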

@Alex-PLACET marked this pull request as ready for review on February 17, 2026 at 13:23

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 6 comments.

Development

Successfully merging this pull request may close these issues.

Support of dictionary array

1 participant