Improve byte slice encoding #30

Merged: frankmcsherry merged 1 commit into master from encode_neu on Feb 9, 2025

@frankmcsherry
Owner

This PR introduces a new encoding of a sequence of byte slices, intended to provide more efficient access to the slices when one does not plan to examine all of them. Specifically, to iterate over the byte slices one only has to touch a compact prefix of the memory, in which we record the offset of each slice into the larger memory. This contrasts with the current encoding, a sequence of length-delimited byte sequences, where one must traverse the preceding entries to find any one slice, and traverse all of them to count the slices.
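The difference between the two layouts can be sketched as follows. This is a minimal illustration, not the crate's actual wire format: the function names, the use of little-endian `u64` values, the leading count, and the choice to store cumulative end offsets are all assumptions made for the example.

```rust
/// Length-delimited layout (assumed): [len0][bytes0][len1][bytes1]...
fn encode_sequence(slices: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for s in slices {
        out.extend_from_slice(&(s.len() as u64).to_le_bytes());
        out.extend_from_slice(s);
    }
    out
}

/// Indexed layout (assumed): [count][end0][end1]...[bytes0][bytes1]...
/// Each `end` is the cumulative end offset of a slice within the payload.
fn encode_indexed(slices: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(slices.len() as u64).to_le_bytes());
    let mut end = 0u64;
    for s in slices {
        end += s.len() as u64;
        out.extend_from_slice(&end.to_le_bytes());
    }
    for s in slices {
        out.extend_from_slice(s);
    }
    out
}

/// Reads a little-endian u64 at byte position `pos`, if in bounds.
fn read_u64(buf: &[u8], pos: usize) -> Option<u64> {
    let bytes = buf.get(pos..pos.checked_add(8)?)?;
    let mut word = [0u8; 8];
    word.copy_from_slice(bytes);
    Some(u64::from_le_bytes(word))
}

/// Length-delimited lookup: must hop over every earlier entry.
fn get_sequence(buf: &[u8], index: usize) -> Option<&[u8]> {
    let mut pos = 0usize;
    for _ in 0..index {
        let len = read_u64(buf, pos)? as usize;
        pos = pos.checked_add(8)?.checked_add(len)?;
    }
    let len = read_u64(buf, pos)? as usize;
    let start = pos.checked_add(8)?;
    buf.get(start..start.checked_add(len)?)
}

/// Indexed lookup: reads the count and at most two offsets from the prefix.
fn get_indexed(buf: &[u8], index: usize) -> Option<&[u8]> {
    let count = read_u64(buf, 0)? as usize;
    if index >= count {
        return None;
    }
    // The payload begins after the count and `count` offsets.
    let payload = count.checked_mul(8)?.checked_add(8)?;
    // Slice `i` starts where slice `i - 1` ends; slice 0 starts at 0.
    let start = if index == 0 { 0 } else { read_u64(buf, 8 * index)? as usize };
    let end = read_u64(buf, 8 * (index + 1))? as usize;
    buf.get(payload.checked_add(start)?..payload.checked_add(end)?)
}
```

In this sketch, `get_indexed` does a constant amount of work per lookup and the number of slices is available from the first word, whereas `get_sequence` reads one length per preceding entry.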

This is a breaking change, as the encoding only works for `A: AsBytes`, rather than for an iterator over slices. We could extend it to an iterator that supports `clone()`, but we would need to perform three passes over it to extract the information (the number of offsets, the values of the offsets, and the bytes themselves).
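To illustrate why a cloneable iterator would need three passes, here is a rough sketch. It is not part of this PR: the function name `encode_indexed_from_iter` is hypothetical, and the layout (a little-endian `u64` count, then cumulative end offsets, then the bytes) is an assumption made for the example.

```rust
/// Hypothetical three-pass encoder over a cloneable iterator of slices.
/// Assumed layout: [count][end0][end1]...[bytes0][bytes1]...
fn encode_indexed_from_iter<'a, I>(slices: I) -> Vec<u8>
where
    I: Iterator<Item = &'a [u8]> + Clone,
{
    let mut out = Vec::new();

    // Pass 1: count the slices, so the reader knows how many offsets follow.
    let count = slices.clone().count() as u64;
    out.extend_from_slice(&count.to_le_bytes());

    // Pass 2: write the cumulative end offset of each slice.
    let mut end = 0u64;
    for s in slices.clone() {
        end += s.len() as u64;
        out.extend_from_slice(&end.to_le_bytes());
    }

    // Pass 3: write the bytes themselves, consuming the iterator.
    for s in slices {
        out.extend_from_slice(s);
    }
    out
}
```

A caller holding a `Vec<Vec<u8>>` named `data` could pass `data.iter().map(|v| v.as_slice())`, which is cloneable because the closure captures nothing.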

More testing and integration is needed before this is published.

@frankmcsherry
Owner Author

One integration observation is that timely has hard-coded the prior serialization approach, in part because without #28 there is no other way to write into a W: Write.

@frankmcsherry force-pushed the encode_neu branch 2 times, most recently from 3c8930d to 535c63b, February 9, 2025 16:03
@frankmcsherry merged commit d91bd07 into master, Feb 9, 2025
6 checks passed
antiguru added a commit to antiguru/materialize that referenced this pull request Sep 4, 2025
Switch column from sequence to indexed serialization. The difference is
that sequence stores data as a sequence of (length, data) and indexed puts
all lengths first, followed by all data. We expect this to be more
efficient if we do not read all data, or repeatedly borrow its serialized
representation.

For details, see frankmcsherry/columnar#30

Signed-off-by: Moritz Hoffmann <mh@materialize.com>
antiguru added a commit to MaterializeInc/materialize that referenced this pull request Sep 4, 2025
Switch column from sequence to indexed serialization. The difference is
that sequence stores data as a sequence of (length, data) and indexed
puts all lengths first, followed by all data. We expect this to be more
efficient if we do not read all data, or repeatedly borrow its
serialized representation.

For details, see frankmcsherry/columnar#30

---------

Signed-off-by: Moritz Hoffmann <mh@materialize.com>