Handle lz4 compression #30
base: main
Conversation
Codecov Report
@@ Coverage Diff @@
##           main      #30   +/- ##
=======================================
  Coverage       ?   76.96%
=======================================
  Files          ?       32
  Lines          ?     1502
  Branches       ?        0
=======================================
  Hits           ?     1156
  Misses         ?      346
  Partials       ?        0
=======================================
Force-pushed from 0d8c498 to c5eb667.
There are still a few things missing and some room for improvement, but I suggest merging this ASAP to avoid further conflicts. (I just resolved the conflicts that appeared after merging #29 and had to rework the compression in the serialization part. For now, this PR is just about getting something working.)
Can you add tests that only test the compression/decompression of a buffer?
* @param compression Optional: The compression type to use for record batch bodies.
*/
chunk_serializer(chunked_memory_output_stream<std::vector<std::vector<uint8_t>>>& stream);
chunk_serializer(chunked_memory_output_stream<std::vector<std::vector<uint8_t>>>& stream, std::optional<org::apache::arrow::flatbuf::CompressionType> compression = std::nullopt);
Create a sparrow-ipc enum to keep public signatures free from flatbuffers
This will be done in a follow-up PR (added a TODO for now).
src/serialize_utils.cpp (outdated)
if (compression.has_value())
{
    // If compressed, the body size is the sum of compressed buffer sizes + original size prefixes + padding
    auto [compressed_body, compressed_buffers] = generate_compressed_body_and_buffers(record_batch, compression.value());
We don't want to compress the data to do the calculation of the size of the message.
I saw that LZ4F_compressFrameBound can give the maximum size of the compressed buffer. I think this should be used instead.
No, you want the exact size, not the maximum size (which is going to be some trivial calculation such as uncompressed size + K).
This function is used for the memory reservation, not for the message header.
Can you know the compressed size without compressing the data first?
No, you can't.
We will have an issue with the fill_buffers function in flatbuffer_utils.cpp.
In this function we create the flatbuffer::Buffer entries, which are the offset and size of each buffer in the body.
As the sizes of the buffers are unknown before the data is compressed, you can't create the record_batch message.
It means that we have to compress the buffers before creating the message and keep the compressed buffers in memory.
Once all the buffers are compressed, we can finally create and send the record_batch message.
BTW, a test where we try to deserialize what we serialized with compression is missing. It should not work because of what I said in the previous message.
I'm wondering if we should split the code execution into two different branches for compressed vs uncompressed buffers when we create record batch messages.
Trying to keep the same code path seems to lead to code complexity without much benefit.
> I'm wondering if we should split the code execution into two different branches for compressed vs uncompressed buffers when we create record batch messages.

No, this is really a bad idea, because you don't want to write buffers as compressed when they are not compressible (see my other comments about this).
> We will have an issue with the fill_buffers function in flatbuffer_utils.cpp

fill_buffers is only used in get_buffers, which is called when there is no compression; otherwise, we use generate_compressed_buffers (see get_record_batch_message_builder with the new changes).

> BTW, a test where we try to deserialize what we serialized with compression is missing. It should not work because of what I said in the previous message.

There are tests doing that (see TEST_CASE("Compare record_batch serialization with stream file using LZ4 compression")), are you thinking about something else?
On another note, I think we should eventually add tests that write streams and compare them...
#include <sparrow/record_batch.hpp>

#include "Message_generated.h"
For the record, in Arrow C++ we ensure that flatbuffers headers (and any other dependency) are not exposed through public Arrow headers.
Yes, this will be done in a follow-up PR.
memory_output_stream stream(buffer);
any_output_stream astream(stream);
serialize_record_batch(rb, astream);
serialize_record_batch(rb, astream, m_compression);
Side note: this concatenates all output buffers into a single chunk even though we have chunked_memory_output_stream, which would avoid such copies. It is a bit of a waste.
if (data.empty())
{
    return {};
}
Hmm, this should never happen according to the Flatbuffers spec. Did you encounter this situation somewhere?
Actually, I realized we are compressing/decompressing all buffers (validity buffers and data buffers), and we need this empty-buffer case when there is a validity bitmap with no nulls.
The spec doesn't really say anything about compressing these...
I'm wondering whether we should not do it, or leave it as is, since discriminating between buffers could make things more complex...
I mean, if the buffer was compressed, then the compressed buffer cannot be empty as all compressors (LZ4, ZSTD) add a header of their own, even if the original buffer was empty:
>>> import lz4.frame
>>> lz4.frame.compress(b"")
b'\x04"M\x18`@\x82\x00\x00\x00\x00'
>>> import zstandard
>>> zstandard.compress(b"")
b'(\xb5/\xfd \x00\x01\x00\x00'
(but of course, this also means that the writer should not have compressed an empty buffer, because doing so increases the data size instead of decreasing it :))
So getting an empty data span here is an error (you may still want to detect it, because its data pointer will be null and the underlying decompressor may not like that).
Well, if you mean that a function called decompress, designed to do only that, intrinsically shouldn't get empty data, then this check could be moved up. Is that what you meant?
Yes, an empty data buffer here would mean an invalid/corrupt IPC stream. So it's a matter of whether you want to be resilient against that.
I would find it more helpful to check the validity of the buffer when we get it from the record batch (because the error message can explicitly state which buffer is empty). This can be done in a dedicated PR where we rework the error handling policy, but that would deserve a TODO.
{
public:
    using optionally_owned_buffer = std::variant<std::vector<uint8_t>, std::span<const uint8_t>>;
    explicit arrow_array_private_data(std::vector<optionally_owned_buffer>&& buffers);
Do we mix owned and non-owned buffers? Otherwise a variant of vectors might be more suitable here.
As I previously commented elsewhere, since some buffers can be compressed and others not, you need this flexibility; otherwise you'll end up copying non-compressed buffers into an owned version.
(and I think at some point sparrow may want a more versatile buffer facility, such that ownership needn't take the form of a std::vector?).
> As I previously commented elsewhere, since some buffers can be compressed and others not, you need this flexibility; otherwise you'll end up copying non-compressed buffers into an owned version.

Ah indeed, I totally forgot it when writing my comment.
}
delete array->dictionary;
array->dictionary = nullptr;
}
It would be nice to find a way to factorize this implementation and https://github.com/man-group/sparrow/blob/c7feca8ca0dd87c2ead88cfce425ebd121b18c54/include/sparrow/arrow_interface/arrow_array_schema_common_release.hpp#L31 to avoid code duplication. This requires changes in sparrow and should not block this PR, but I think a TODO would be nice here.
if (data.empty())
{
    return {};
}
I would find it more helpful to check the validity of the buffer when we get it from the record batch (because the error message can explicitly state which buffer is empty). This can be done in a dedicated PR where we rework the error handling policy, but that would deserve a TODO.
Only handling the lz4 codec for now (zstd will be handled in a next PR).