
Commit 7200efa

Merge pull request #64 from triblespace/codex/evaluate-features-for-bytesarena
docs: add ByteArea metadata examples
2 parents f49b6b1 + 1c52cc2 commit 7200efa

25 files changed (+1375, -705 lines)

CHANGELOG.md

Lines changed: 69 additions & 8 deletions
@@ -1,9 +1,54 @@
 # Changelog
 
 ## Unreleased
-- Added `WaveletMatrix::to_bytes` and `WaveletMatrix::from_bytes` returning metadata and bytes for zero-copy persistence.
-- Documented the serialized `WaveletMatrix` layout with ASCII art.
-- Added `CompactVector::to_bytes` and `from_bytes` for zero-copy serialization.
+- Introduced a `Serializable` trait for metadata-based reconstruction and
+  implemented it for `CompactVector`, `DacsByte`, and `WaveletMatrix`.
+- Audited `DacsByte` and `WaveletMatrix` to leverage `SectionHandle::view`
+  during deserialization, removing legacy `slice_to_bytes` helpers and fully
+  adopting the `ByteArea`-backed reconstruction path.
+- Switched internal bit-vector words and handles from `usize` to `u64`, removing
+  unsafe handle transmutes in `WaveletMatrixBuilder` and fixing word size to
+  64-bit.
+- Reversed remaining layers and popped in `WaveletMatrixBuilder::freeze`
+  to avoid repeated vector shifts.
+- `WaveletMatrixMeta` now stores a handle slice of per-layer handles, and
+  `WaveletMatrixBuilder` allocates that slice from the `SectionWriter`.
+- `WaveletMatrixBuilder::with_capacity` records each layer's handle up front,
+  eliminating handle assignment during `freeze`.
+- Switched to the zerocopy `SectionHandle` from `anybytes`, removing the
+  interim `HandleRepr` shim.
+- Added `WaveletMatrixBuilder` for fixed-size construction, writing raw bits per
+  layer and stably partitioning them on `freeze`; `WaveletMatrix::from_iter`
+  now builds via this builder without requiring iterator cloning.
+- `WaveletMatrix` construction now goes through `from_iter`, which allocates
+  layer bitvectors from a `SectionWriter` and consumes a single
+  `ExactSizeIterator` without temporary `CompactVector` partitions.
+- `CompactVector::iter` now implements `ExactSizeIterator` to support the new
+  constructor.
+- `WaveletMatrixBuilder::freeze` partitions layers in place, removing the
+  temporary bit buffer previously used during construction.
+- Removed `order` and `next_order` buffers by sorting remaining layers in
+  place during each `freeze` step.
+- Optimized `WaveletMatrixBuilder::freeze` using stable per-layer partitions
+  and cycle-based permutations, reducing layer processing to linear time.
+- Replaced the `perm` array with a scratch `visited` bitmap and cycle
+  rotations so each level permutes lower layers in place with only `O(n)`
+  extra bits.
+- Stored row suffix bits in a `usize` during cycle rotations, removing the
+  temporary `Vec<bool>` from `rotate_cycle_over_lower_levels`.
+- Reused `BitVectorBuilder` as the scratch `visited` bitmap for
+  wavelet-matrix construction, eliminating the separate `BitArrayBuilder`.
+- Added `swap_bits` helper to `BitVectorBuilder` for in-place bit exchanges.
+- Reworked `WaveletMatrix::from_iter` to require a cloneable iterator and
+  build layers in two passes without temporary buffers.
+- Rewrote `CompactVectorBuilder` to use fixed-size `set_int` and `set_ints`
+  APIs, removing `push_int`/`extend` and updating builders and examples.
+- Added `with_capacity` constructor on `BitVectorBuilder` and honored capacity in
+  `CompactVectorBuilder::with_capacity` to pre-allocate bit storage.
+- Replaced `BitVectorBuilder::new` with `with_capacity` that allocates from an
+  `anybytes::ByteArea` section and plumbed `SectionWriter` through
+  `CompactVectorBuilder` and wavelet matrix builders.
+- Builders now track capacity and error when pushes exceed the reserved size.
 - Made `DacsByte` generic over its flag index type with a default of `Rank9SelIndex`.
 - `DacsByte::from_slice` now accepts a generic index type, removing `from_slice_with_index`.
 - Added `BitVectorBuilder` and zero-copy `BitVectorData` backed by `anybytes::View`.
@@ -12,11 +57,15 @@
 - Rename crate from `succdisk` to `jerky`.
 - Replaced the old `BitVector` with the generic `BitVector<I>` and renamed the
   mutable variant to `RawBitVector`.
-- Extended `BitVectorBuilder` with `push_bits` and `set_bit` APIs.
+- Replaced the push-based `BitVectorBuilder` with fixed-size `set_bit` and `set_bits` APIs and updated builders accordingly.
+- Added `set_bits_from_iter` to `BitVectorBuilder` and later revised it to take a
+  start offset and consume bits until the iterator ends or the builder is
+  full, leaving any unconsumed items to the caller.
 - Added `from_bit` constructor on `BitVectorBuilder` for repeating a single bit.
 - `DacsByte` now stores level data as zero-copy `View<[u8]>` values.
-- Added `to_bytes` and `from_bytes` on `DacsByte` for zero-copy serialization.
-- Documented the byte layout produced by `DacsByte::to_bytes` with ASCII art.
+- Replaced `to_bytes` helpers with `metadata` methods returning `SectionHandle`s
+  so structures can be reconstructed zero-copy via `from_bytes`.
+- Documented the byte layout for `DacsByte` sequences with ASCII art.
 - Switched `anybytes` dependency to track the upstream Git repository for the
   latest changes.
 - Removed internal byte buffers from data structures; `WaveletMatrix`,
@@ -31,13 +80,15 @@
 - `Rank9Sel` now stores a `BitVector<Rank9SelIndex>` built via `BitVectorBuilder`.
 - Replaced `DArrayFullIndex` with new `DArrayIndex` that uses const generics
   to optionally include `select1` and `select0` support.
-- Introduced `CompactVectorBuilder` mutable APIs `push_int`, `set_int`, and `extend`.
+- Introduced `CompactVectorBuilder` mutable APIs `set_int` and `set_ints`.
 - Simplified bit vector imports by re-exporting `BitVectorBuilder` and `Rank9SelIndex` and updating examples.
 - Moved the `bit_vector::bit_vector` module contents directly into `bit_vector` for cleaner paths.
+- Recorded future work items for a metadata serialization trait and
+  ByteArea-backed documentation examples.
 - Added README usage example demonstrating basic bit vector operations.
 - Removed `bit_vector::prelude`; import traits directly with `use jerky::bit_vector::*`.
 - Added `freeze()` on `CompactVectorBuilder` yielding an immutable `CompactVector` backed by `BitVector<NoIndex>`.
-- `CompactVector::new` and `with_capacity` now return builders; other constructors build via the builder pattern.
+- Removed `CompactVector::new`; use `with_capacity` to construct builders.
 - Wavelet matrix and DACs builders now use `BitVectorBuilder` for temporary bit
   vectors, storing only immutable `BitVector` data after construction.
 - Removed obsolete `RawBitVector` type.
@@ -66,3 +117,13 @@
 - Documented `WaveletMatrix` usage in `README.md`.
 - Moved README usage examples to runnable files in `examples/`.
 - Added `compact_vector` example showing construction and retrieval.
+- Serialized `WaveletMatrix` and `DacsByte` directly into a `ByteArea` to
+  avoid intermediate copies and guarantee contiguous layout.
+- Enabled doctests for `WaveletMatrix` by removing `ignore` fences from its
+  documentation examples.
+- `DacsByte::from_slice` now writes level bytes and flags directly into
+  `SectionWriter` buffers, removing the intermediate `Vec` allocations.
+- Stored per-level `DacsByte` handles in the byte arena, allowing
+  `DacsByteMeta` to reference a single handle slice like `WaveletMatrixMeta`.
+- Expanded examples and README with `ByteArea`/`SectionHandle` metadata
+  reconstruction for set-based APIs, adding a `dacs_byte` usage demo.
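
The changelog entries on `WaveletMatrixBuilder::freeze` describe a stable per-layer partition applied in place through cycle rotations, with a `visited` bitmap as the only scratch space. The following is a minimal standalone sketch of that general technique, not the crate's code: the names `stable_partition_dest` and `permute_in_place` are hypothetical, and the sketch materializes the destination table explicitly for readability, whereas the changelog suggests the crate avoids a separate `perm` array.

```rust
/// Hypothetical helper: for each index `i`, computes where element `i` lands
/// under a stable partition that places all 0-bits before all 1-bits.
fn stable_partition_dest(bits: &[bool]) -> Vec<usize> {
    let zeros = bits.iter().filter(|&&b| !b).count();
    let (mut next_zero, mut next_one) = (0, zeros);
    bits.iter()
        .map(|&b| {
            let slot = if b { &mut next_one } else { &mut next_zero };
            let d = *slot;
            *slot += 1;
            d
        })
        .collect()
}

/// Hypothetical helper: moves the element at index `i` to index `dest[i]`,
/// in place. `dest` must be a permutation of `0..data.len()`. The only extra
/// storage is one visited bit per element, packed into `u64` words.
fn permute_in_place<T>(data: &mut [T], dest: &[usize]) {
    assert_eq!(data.len(), dest.len());
    let n = data.len();
    let mut visited = vec![0u64; (n + 63) / 64];
    for start in 0..n {
        if (visited[start / 64] >> (start % 64)) & 1 == 1 {
            continue; // already placed as part of an earlier cycle
        }
        visited[start / 64] |= 1u64 << (start % 64);
        // Follow the cycle through `start`. Slot `start` always holds the
        // element that still has to be moved to position `j`.
        let mut j = dest[start];
        while j != start {
            data.swap(start, j);
            visited[j / 64] |= 1u64 << (j % 64);
            j = dest[j];
        }
    }
}
```

Each element is swapped at most once into its final slot, so the permutation step runs in linear time regardless of how the cycles are shaped.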

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ rust-version = "1.61.0"
 [dependencies]
 anyhow = "1.0"
 num-traits = "0.2.15"
-anybytes = { git = "https://github.com/triblespace/anybytes", features = ["zerocopy"] }
+anybytes = { git = "https://github.com/triblespace/anybytes", features = ["zerocopy", "mmap"] }
 zerocopy = "0.8"
 
 [features]

INVENTORY.md

Lines changed: 38 additions & 3 deletions
@@ -9,9 +9,44 @@
 - Investigate alternative dense-select index strategies to replace removed `DArrayIndex`.
 - Explore additional index implementations leveraging the new generic `DacsByte<I>`.
 - Demonstrate the generic `from_slice` usage in examples and docs.
-- Showcase `DacsByte` byte serialization in an example.
-- Provide serialization helpers for additional structures beyond `WaveletMatrix`.
-- Show `CompactVector::to_bytes` and `from_bytes` in examples.
+- Apply `with_capacity` constructors across builders to avoid intermediate reallocations.
+- Transition builders to fixed-size APIs, removing growable variants.
+- Refactor builders and serializers to operate on `ByteArea` sections, enabling
+  zero-copy persistence across all structures.
+- Move `DacsByte` metadata arrays into the arena and store per-level handles
+  similar to `WaveletMatrixMeta`.
+- Add slice-based range setters for integer builders to minimize manual index
+  tracking during construction.
+- Provide bulk bit setters like `set_bits_from_slice` for `BitVectorBuilder`
+  to copy from packed data efficiently.
+- Provide convenience helpers to manage `ByteArea` and `SectionWriter` setup for
+  common builder use cases.
+- Audit remaining constructors for zero-capacity variants and decide whether to
+  offer explicit `empty` helpers instead of `with_capacity(0)`.
+- Allocate temporary wavelet-matrix buffers from `ByteArea` to avoid
+  intermediate `Vec` copies and ensure fully contiguous construction.
+- Provide a derive or macro to reduce boilerplate when implementing the
+  `Serializable` trait.
+- Consider a slice-based `WaveletMatrix` constructor to avoid requiring
+  cloneable iterators.
+- Benchmark the cycle-based partitioning in `WaveletMatrixBuilder::freeze`
+  and explore more efficient permutation strategies.
+- Explore specialized rotation helpers for `BitVectorBuilder` to speed up
+  recursive partitioning without extra buffers.
+- Explore using `BitVectorBuilder` for other temporary bitmaps to reduce
+  scattered `Vec<bool>` allocations.
+- Review documentation examples across modules and convert remaining ignored
+  snippets into runnable doctests.
+- Explore iterating layer indices instead of reversing `remaining` to avoid
+  the upfront `reverse` cost in `WaveletMatrixBuilder::freeze`.
+- Audit integer-vector constructors for opportunities to allocate directly
+  from `SectionWriter` without temporary `Vec`s.
+- Document the fixed 64-bit word assumption across structures now that bit
+  vectors use `u64` internally.
+- Provide helpers on `SectionHandle` to derive typed sub-handles, reducing
+  manual offset math in complex `from_bytes` implementations like `DacsByte`.
+- Investigate slimming `DacsByte` per-level metadata to avoid storing unused
+  flag handles for the last level.
 
 ## Discovered Issues
 - `katex.html` performs manual string replacements; consider DOM-based manipulation.
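
Several inventory items above revolve around describing arena sections through handles instead of fixed offsets, so that metadata plus one shared byte region is enough to locate every layer or level. The sketch below illustrates that idea with plain standard-library types only. `Handle`, `write_section`, and `read_word` are hypothetical names and are not the `anybytes` API; the sketch also stores words little-endian for portability, which is an assumption of the example rather than the crate's documented layout.

```rust
/// Hypothetical stand-in for a section handle: where a run of `u64` words
/// starts inside a shared byte buffer, and how many words it holds.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Handle {
    offset: usize, // byte offset into the arena
    words: usize,  // number of u64 words in the section
}

/// Appends `words` to the arena and returns a handle describing the section.
fn write_section(arena: &mut Vec<u8>, words: &[u64]) -> Handle {
    let offset = arena.len();
    for w in words {
        arena.extend_from_slice(&w.to_le_bytes());
    }
    Handle { offset, words: words.len() }
}

/// Reads word `i` of the section described by `h` straight out of `bytes`,
/// without copying the section. Returns `None` when `i` is out of range or
/// the handle does not fit inside `bytes`.
fn read_word(bytes: &[u8], h: Handle, i: usize) -> Option<u64> {
    if i >= h.words {
        return None;
    }
    let start = h.offset + i * 8;
    let chunk = bytes.get(start..start + 8)?;
    let mut buf = [0u8; 8];
    buf.copy_from_slice(chunk);
    Some(u64::from_le_bytes(buf))
}
```

Because every section is located through its own handle, sections can sit anywhere in the buffer; a slice of such handles plays the role of the per-layer or per-level handle slice mentioned in the inventory.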

README.md

Lines changed: 36 additions & 27 deletions
@@ -25,48 +25,57 @@ RUSTDOCFLAGS="--html-in-header katex.html" cargo doc --no-deps
 ## Zero-copy bit vectors
 
 `BitVectorBuilder` can build a bit vector whose underlying `BitVectorData`
-is backed by `anybytes::View`. The data can be serialized with
-`BitVectorData::to_bytes` and reconstructed using `BitVectorData::from_bytes`,
-allowing zero-copy loading from an mmap or any other source by passing the
-byte region to `Bytes::from_source`.
+is backed by `anybytes::View`. Metadata describing a stored sequence includes
+[`SectionHandle`](anybytes::area::SectionHandle)s so the raw
+`Bytes` returned by `ByteArea::freeze` can be handed to
+`BitVectorData::from_bytes` for zero‑copy reconstruction.
 
-`DacsByte` sequences support a similar interface with `to_bytes` returning
-metadata alongside the byte slice and `from_bytes` rebuilding the sequence
-using that metadata.
+Types following this pattern implement the [`Serializable`](src/serialization.rs) trait,
+which exposes a `metadata` accessor and a `from_bytes` constructor.
+
+`DacsByte` sequences expose a `metadata` method returning a descriptor with a
+handle to a slice of per-level handles. Each entry stores the flag bitvector
+handle (if any), its bit length, and the payload byte handle. `from_bytes`
+rebuilds the sequence using that metadata.
 
 ```text
-Bytes layout from `DacsByte::to_bytes`:
+Bytes layout for a `DacsByte` sequence (current builders place sections
+contiguously, though layout is fully described by the stored handles):
 
 | flag[0] words | flag[1] words | ... | flag[n-2] words | level[0] data | level[1] data | ... | level[n-1] data |
 
 The flag vectors come first and store native-endian `usize` words. The level
 data immediately follows without any padding.
 ```
 
-`CompactVector` offers similar helpers: `CompactVector::to_bytes` returns a
-metadata struct along with the raw bytes, and `CompactVector::from_bytes`
-reconstructs the vector from that information.
-
-`WaveletMatrix` sequences share this layout and can be serialized with
-`WaveletMatrix::to_bytes` (returning metadata and bytes) and reconstructed
-using `WaveletMatrix::from_bytes`.
-
-The byte buffer returned by `to_bytes` stores each bit-vector layer
-contiguously. Given `num_words = ceil(len / WORD_LEN)`, the layout is:
-
-```
-bytes:
-+------------+------------+-----+
-|  layer 0   |  layer 1   | ... |
-| num_words  | num_words  |     |
-+------------+------------+-----+
+`CompactVector` and `WaveletMatrix` provide the same pattern: call `metadata`
+to obtain a descriptor with the required `SectionHandle`s, then hand both the
+metadata and the full `Bytes` region to `from_bytes`.
+
+For a wavelet matrix the metadata stores a handle to a slice of per-layer
+handles. Each handle in that slice points to the native-endian `usize` words
+forming a single layer. Layers may reside anywhere in the arena and no longer
+need to be contiguous.
+
+```rust
+use anybytes::ByteArea;
+use jerky::int_vectors::{CompactVector, CompactVectorBuilder};
+
+let mut area = ByteArea::new()?;
+let mut sections = area.sections();
+let mut builder = CompactVectorBuilder::with_capacity(3, 3, &mut sections)?;
+builder.set_ints(0..3, [7, 2, 5])?;
+let cv = builder.freeze();
+let meta = cv.metadata();
+let bytes = area.freeze()?;
+let view = CompactVector::from_bytes(meta, bytes.clone())?;
+assert_eq!(view.get_int(1), Some(2));
 ```
-where each segment contains `num_words` consecutive `usize` words for a layer.
 
 ## Examples
 
 See the [examples](examples/) directory for runnable usage demos, including
-`bit_vector`, `wavelet_matrix`, and `compact_vector`.
+`bit_vector`, `compact_vector`, `dacs_byte`, and `wavelet_matrix`.
 
 ## Licensing
 
bench/benches/timing_bitvec_rank.rs

Lines changed: 13 additions & 4 deletions
@@ -7,6 +7,7 @@ use criterion::{
     criterion_group, criterion_main, measurement::WallTime, BenchmarkGroup, Criterion, SamplingMode,
 };
 
+use anybytes::ByteArea;
 use jerky::bit_vector::{BitVector, BitVectorBuilder, NoIndex, Rank, Rank9SelIndex};
 
 const SAMPLE_SIZE: usize = 30;
@@ -39,15 +40,23 @@ fn run_queries<R: Rank>(idx: &R, queries: &[usize]) {
 
 fn perform_bitvec_rank(group: &mut BenchmarkGroup<WallTime>, bits: &[bool], queries: &[usize]) {
     group.bench_function("jerky/BitVector", |b| {
-        let mut builder = BitVectorBuilder::new();
-        builder.extend_bits(bits.iter().cloned());
+        let mut area = ByteArea::new().unwrap();
+        let mut sections = area.sections();
+        let mut builder = BitVectorBuilder::with_capacity(bits.len(), &mut sections).unwrap();
+        for (i, &bval) in bits.iter().enumerate() {
+            builder.set_bit(i, bval).unwrap();
+        }
         let idx: BitVector<NoIndex> = builder.freeze();
         b.iter(|| run_queries(&idx, &queries));
     });
 
     group.bench_function("jerky/BitVector<Rank9SelIndex>", |b| {
-        let mut builder = BitVectorBuilder::new();
-        builder.extend_bits(bits.iter().cloned());
+        let mut area = ByteArea::new().unwrap();
+        let mut sections = area.sections();
+        let mut builder = BitVectorBuilder::with_capacity(bits.len(), &mut sections).unwrap();
+        for (i, &bval) in bits.iter().enumerate() {
+            builder.set_bit(i, bval).unwrap();
+        }
         let idx = builder.freeze::<Rank9SelIndex>();
         b.iter(|| run_queries(&idx, &queries));
     });
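
The benchmark setup above replaces a growable builder with one that is sized up front and filled through `set_bit`, which returns an error for out-of-range positions. A minimal sketch of that fixed-capacity pattern over `u64` words follows. `FixedBits` and its methods are hypothetical illustrations using only the standard library; they are not the crate's `BitVectorBuilder`, and the naive `rank1` here scans words rather than using an index such as `Rank9SelIndex`.

```rust
/// Hypothetical fixed-size bit buffer backed by 64-bit words.
struct FixedBits {
    words: Vec<u64>,
    len: usize, // number of addressable bits
}

impl FixedBits {
    /// Allocates room for exactly `len` bits, all initially zero.
    fn with_capacity(len: usize) -> Self {
        Self { words: vec![0; (len + 63) / 64], len }
    }

    /// Sets bit `pos`, or reports an error instead of growing the buffer.
    fn set_bit(&mut self, pos: usize, bit: bool) -> Result<(), String> {
        if pos >= self.len {
            return Err(format!("position {} out of range for length {}", pos, self.len));
        }
        let mask = 1u64 << (pos % 64);
        if bit {
            self.words[pos / 64] |= mask;
        } else {
            self.words[pos / 64] &= !mask;
        }
        Ok(())
    }

    /// Returns bit `pos`, or `None` when out of range.
    fn get_bit(&self, pos: usize) -> Option<bool> {
        if pos >= self.len {
            return None;
        }
        Some((self.words[pos / 64] >> (pos % 64)) & 1 == 1)
    }

    /// Counts set bits in `[0, pos)` by scanning whole words plus a masked tail.
    fn rank1(&self, pos: usize) -> Option<usize> {
        if pos > self.len {
            return None;
        }
        let full = pos / 64;
        let mut count: usize = self.words[..full].iter().map(|w| w.count_ones() as usize).sum();
        let rem = pos % 64;
        if rem > 0 {
            count += (self.words[full] & ((1u64 << rem) - 1)).count_ones() as usize;
        }
        Some(count)
    }
}
```

Reserving the full length first is what lets a builder of this shape write into a pre-allocated region, at the cost of callers tracking positions themselves.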

bench/benches/timing_bitvec_select.rs

Lines changed: 13 additions & 4 deletions
@@ -3,6 +3,7 @@ use std::time::Duration;
 use rand::{Rng, SeedableRng};
 use rand_chacha::ChaChaRng;
 
+use anybytes::ByteArea;
 use criterion::{
     criterion_group, criterion_main, measurement::WallTime, BenchmarkGroup, Criterion, SamplingMode,
 };
@@ -42,15 +43,23 @@ fn run_queries<S: Select>(idx: &S, queries: &[usize]) {
 
 fn perform_bitvec_select(group: &mut BenchmarkGroup<WallTime>, bits: &[bool], queries: &[usize]) {
     group.bench_function("jerky/BitVector", |b| {
-        let mut builder = BitVectorBuilder::new();
-        builder.extend_bits(bits.iter().cloned());
+        let mut area = ByteArea::new().unwrap();
+        let mut sections = area.sections();
+        let mut builder = BitVectorBuilder::with_capacity(bits.len(), &mut sections).unwrap();
+        for (i, &bval) in bits.iter().enumerate() {
+            builder.set_bit(i, bval).unwrap();
+        }
         let idx: BitVector<NoIndex> = builder.freeze();
         b.iter(|| run_queries(&idx, &queries));
     });
 
     group.bench_function("jerky/BitVector<Rank9SelIndex>", |b| {
-        let mut builder = BitVectorBuilder::new();
-        builder.extend_bits(bits.iter().cloned());
+        let mut area = ByteArea::new().unwrap();
+        let mut sections = area.sections();
+        let mut builder = BitVectorBuilder::with_capacity(bits.len(), &mut sections).unwrap();
+        for (i, &bval) in bits.iter().enumerate() {
+            builder.set_bit(i, bval).unwrap();
+        }
         let idx = builder.freeze::<Rank9SelIndex>();
         b.iter(|| run_queries(&idx, &queries));
     });

bench/benches/timing_chrseq_access.rs

Lines changed: 9 additions & 3 deletions
@@ -3,6 +3,7 @@ use std::time::Duration;
 use rand::{Rng, SeedableRng};
 use rand_chacha::ChaChaRng;
 
+use anybytes::ByteArea;
 use jerky::bit_vector::*;
 use jerky::char_sequences::WaveletMatrix;
 use jerky::int_vectors::CompactVector;
@@ -25,8 +26,9 @@ const PROTEINS_PSEF_STR: &str = include_str!("../data/texts/proteins.1MiB.txt");
 // In effective alphabet
 fn load_text(s: &str) -> CompactVector {
     let mut text = s.as_bytes().to_vec();
-    let mut builder = BitVectorBuilder::new();
-    builder.extend_bits(core::iter::repeat(false).take(256));
+    let mut area = ByteArea::new().unwrap();
+    let mut sections = area.sections();
+    let mut builder = BitVectorBuilder::with_capacity(256, &mut sections).unwrap();
     for &c in &text {
         builder.set_bit(c as usize, true).unwrap();
     }
@@ -92,7 +94,11 @@ fn perform_chrseq_access(group: &mut BenchmarkGroup<WallTime>, text: &CompactVec
     let queries = gen_random_ints(NUM_QUERIES, 0, text.len(), SEED_QUERIES);
 
     group.bench_function("jerky/WaveletMatrix<Rank9SelIndex>", |b| {
-        let idx = WaveletMatrix::<Rank9SelIndex>::new(text.clone()).unwrap();
+        let alph_size = text.iter().max().unwrap() + 1;
+        let mut area = ByteArea::new().unwrap();
+        let mut sections = area.sections();
+        let idx = WaveletMatrix::<Rank9SelIndex>::from_iter(alph_size, text.iter(), &mut sections)
+            .unwrap();
         b.iter(|| run_queries(&idx, &queries));
     });
 }

bench/benches/timing_intvec_access.rs

Lines changed: 5 additions & 1 deletion
@@ -7,6 +7,7 @@ use criterion::{
     criterion_group, criterion_main, measurement::WallTime, BenchmarkGroup, Criterion, SamplingMode,
 };
 
+use anybytes::ByteArea;
 use jerky::int_vectors::Access;
 
 const SAMPLE_SIZE: usize = 30;
@@ -87,7 +88,10 @@ fn perform_intvec_access(group: &mut BenchmarkGroup<WallTime>, vals: &[u32]) {
     });
 
     group.bench_function("jerky/DacsByte", |b| {
-        let idx = jerky::int_vectors::DacsByte::from_slice(vals).unwrap();
+        let mut area = ByteArea::new().unwrap();
+        let mut writer = area.sections();
+        let idx = jerky::int_vectors::DacsByte::from_slice(vals, &mut writer).unwrap();
+        area.freeze().unwrap();
         b.iter(|| run_queries(&idx, &queries));
     });
 }
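
The `DacsByte` structure benchmarked above stores integers as per-level payload bytes plus per-level flag bits, which is the layout the README diff describes. As a rough illustration of how such a byte-oriented directly addressable code works in general, here is a simplified standalone sketch. The `Dacs`, `build`, and `access` names are hypothetical; the sketch keeps flags as `Vec<bool>` with a linear-scan rank and stores flags for the last level too, whereas the real structure uses indexed bit vectors and, per the layout above, omits the final level's flags.

```rust
/// Hypothetical simplified DACs layout.
struct Dacs {
    /// levels[k][j]: k-th least significant byte of the j-th value reaching level k.
    levels: Vec<Vec<u8>>,
    /// flags[k][j]: whether that value continues into level k + 1.
    flags: Vec<Vec<bool>>,
}

/// Splits each value into bytes, pushing only the non-zero remainder upward.
fn build(vals: &[u32]) -> Dacs {
    let mut levels = Vec::new();
    let mut flags = Vec::new();
    let mut current: Vec<u32> = vals.to_vec();
    while !current.is_empty() {
        let mut bytes = Vec::with_capacity(current.len());
        let mut cont = Vec::with_capacity(current.len());
        let mut next = Vec::new();
        for &v in &current {
            bytes.push((v & 0xFF) as u8);
            let rest = v >> 8;
            cont.push(rest != 0);
            if rest != 0 {
                next.push(rest);
            }
        }
        levels.push(bytes);
        flags.push(cont);
        current = next;
    }
    Dacs { levels, flags }
}

/// Reassembles the value at `pos` by following flag bits level by level.
fn access(d: &Dacs, mut pos: usize) -> Option<u32> {
    let mut value = 0u32;
    for (k, level) in d.levels.iter().enumerate() {
        let byte = *level.get(pos)?;
        value |= (byte as u32) << (8 * k);
        if !d.flags[k][pos] {
            return Some(value);
        }
        // rank1(pos): index of this value among those continuing to level k + 1.
        pos = d.flags[k][..pos].iter().filter(|&&f| f).count();
    }
    None
}
```

Small values occupy a single byte in level 0, and only values with higher non-zero bytes contribute to later levels, which is where the space saving comes from.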
