Improve byte slice encoding by frankmcsherry · Pull Request #30 · frankmcsherry/columnar

frankmcsherry · 2025-02-08T20:03:27Z

This PR introduces a new encoding for a sequence of byte slices, intended to make access more efficient when one does not plan to examine all of them. Specifically, to iterate over the byte slices one only has to touch a compact prefix of the memory, which records the offset of each slice into the larger memory. This contrasts with the current encoding, a sequence of length-delimited byte sequences, where one must traverse the slices in order to find any one of them, or even to count them.
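To make the contrast concrete, here is a minimal sketch of the two layouts. It is not the code in this PR: the function names, the little-endian `u64` framing, and the choice to store end offsets are all assumptions made for illustration, and the actual layout in columnar may differ.

```rust
/// Hypothetical helper: read a little-endian u64 at byte position `pos`.
fn read_u64(bytes: &[u8], pos: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[pos..pos + 8]);
    u64::from_le_bytes(buf)
}

/// "Sequence" layout (illustrative): each slice is written as (length, data).
fn encode_sequence(slices: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for s in slices {
        out.extend_from_slice(&(s.len() as u64).to_le_bytes());
        out.extend_from_slice(s);
    }
    out
}

/// "Indexed" layout (illustrative): count, then all end offsets, then all data.
fn encode_indexed(slices: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(slices.len() as u64).to_le_bytes());
    let mut end = 0u64;
    for s in slices {
        end += s.len() as u64;
        out.extend_from_slice(&end.to_le_bytes());
    }
    for s in slices {
        out.extend_from_slice(s);
    }
    out
}

/// Finding slice `index` in the sequence layout walks every earlier slice.
fn get_sequence(bytes: &[u8], index: usize) -> Option<&[u8]> {
    let (mut pos, mut i) = (0usize, 0usize);
    while pos + 8 <= bytes.len() {
        let len = read_u64(bytes, pos) as usize;
        pos += 8;
        if i == index {
            return bytes.get(pos..pos + len);
        }
        pos += len;
        i += 1;
    }
    None
}

/// Finding slice `index` in the indexed layout reads only the offset prefix.
fn get_indexed(bytes: &[u8], index: usize) -> Option<&[u8]> {
    let count = read_u64(bytes, 0) as usize;
    if index >= count {
        return None;
    }
    let data_start = 8 * (1 + count);
    // The start of slice `index` is the end offset of the previous slice.
    let lower = if index == 0 { 0 } else { read_u64(bytes, 8 * index) as usize };
    let upper = read_u64(bytes, 8 * (index + 1)) as usize;
    bytes.get(data_start + lower..data_start + upper)
}
```

In this sketch, `get_indexed` reads the count and at most two offsets regardless of `index`, while `get_sequence` reads one length per preceding slice and skips over its data.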

This is a breaking change, as the encoding only works for A: AsBytes, rather than for an iterator over slices. We could extend it to an iterator that supports clone(), but we would need to perform three passes to extract the information (the number of offsets, the values of the offsets, and the bytes themselves).
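As a rough illustration of the three passes a cloneable iterator would require, the following sketch writes the count, then the offsets, then the bytes. The function name `encode_indexed_iter` and the `u64` little-endian framing are hypothetical and not part of this PR.

```rust
/// Hypothetical sketch: indexed encoding driven by a cloneable iterator of slices.
fn encode_indexed_iter<'a, I>(iter: I) -> Vec<u8>
where
    I: Iterator<Item = &'a [u8]> + Clone,
{
    let mut out = Vec::new();

    // Pass 1: count the slices, so the reader knows how many offsets follow.
    let count = iter.clone().count() as u64;
    out.extend_from_slice(&count.to_le_bytes());

    // Pass 2: write the running end offset of each slice.
    let mut end = 0u64;
    for s in iter.clone() {
        end += s.len() as u64;
        out.extend_from_slice(&end.to_le_bytes());
    }

    // Pass 3: write the bytes themselves, back to back.
    for s in iter {
        out.extend_from_slice(s);
    }
    out
}
```

Each pass consumes one clone of the iterator, which is why `Clone` is the minimum requirement in this sketch; an iterator that is expensive to re-run would pay that cost three times.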

More testing and integration is needed before this is published.

frankmcsherry · 2025-02-08T22:28:21Z

One integration observation is that timely has hard-coded the prior serialization approach, in part because without #28 there is no other way to write into a W: Write.

Switch column from sequence to indexed serialization.

The difference is that sequence stores data as a sequence of (length, data) and indexed puts all lengths first, followed by all data. We expect this to be more efficient if we do not read all data, or repeatedly borrow its serialized representation.

For details, see frankmcsherry/columnar#30

Signed-off-by: Moritz Hoffmann <mh@materialize.com>

frankmcsherry force-pushed the encode_neu branch 2 times, most recently from 3c8930d to 535c63b Compare February 9, 2025 16:03

Improve byte slice encoding

6f0899b

frankmcsherry force-pushed the encode_neu branch from 535c63b to 6f0899b Compare February 9, 2025 20:02

frankmcsherry merged commit d91bd07 into master Feb 9, 2025
6 checks passed

antiguru mentioned this pull request Sep 4, 2025

Switch to indexed serialization MaterializeInc/materialize#33512

Merged
