log k variants for grow_n, shrink_n, and grow_k#115
Open
aneubeck wants to merge 51 commits into
Open
Conversation
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… into aneubeck/sampling
Co-authored-by: Luke Francl <look@github.com>
Co-authored-by: Luke Francl <look@github.com>
Co-authored-by: Luke Francl <look@github.com>
Co-authored-by: Luke Francl <look@github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a new consistent-choose-k crate implementing stateless and stateful consistent choose‑k hashing primitives (replication/failover/bounded-load building blocks), plus associated examples and benchmarks; updates workspace membership and a few benchmark/test dependencies.
Changes:
- Introduces
crates/consistent-choose-k(hash iterators, fast shrink/grow variants, node map, supporting segment-tree/treap data structures) with docs, examples, and Criterion benchmarks. - Wires the new benchmark crate into the workspace and updates the repo README listing.
- Updates some benchmark/test dependencies (Criterion,
tiktoken-rs,tokenizers) and adjusts sparse-ngrams benchmarks to usestd::hint::black_box.
Show a summary per file
| File | Description |
|---|---|
| README.md | Adds consistent-choose-k to the workspace crate list. |
| Cargo.toml | Adds crates/consistent-choose-k/benchmarks to workspace members. |
| crates/sparse-ngrams/Cargo.toml | Bumps Criterion dev-dependency to 0.8. |
| crates/sparse-ngrams/benchmarks/performance.rs | Switches black_box import to std::hint::black_box. |
| crates/consistent-choose-k/Cargo.toml | New crate manifest for consistent-choose-k. |
| crates/consistent-choose-k/README.md | New crate-level documentation and design/complexity discussion. |
| crates/consistent-choose-k/examples/bounded_load.rs | Example demonstrating bounded-load assignment and comparisons vs hash ring. |
| crates/consistent-choose-k/src/lib.rs | New crate module wiring and public re-exports (+ bench internals feature gate). |
| crates/consistent-choose-k/src/consistent_hash.rs | Implements consistent hash iterators, builders, and bucket iteration primitives. |
| crates/consistent-choose-k/src/choose_k.rs | Stateless consistent choose‑k hasher implementation + tests. |
| crates/consistent-choose-k/src/fast_choose_k.rs | Fast shrink_n-optimized choose‑k variant using a compact segment tree. |
| crates/consistent-choose-k/src/fast_grow_n.rs | Fast “grow” variant using heap + SampleTreap life tracking. |
| crates/consistent-choose-k/src/node_map.rs | ConsistentNodeMap abstraction over choose‑k ranking with deletions. |
| crates/consistent-choose-k/src/min_seg_tree.rs | Full min-segment-tree implementation with relative-offset encoding. |
| crates/consistent-choose-k/src/compact_min_seg_tree.rs | Compact min-segment-tree variant used by fast shrink. |
| crates/consistent-choose-k/src/live_min_seg_tree.rs | Segment tree variant supporting live-only logical indexing + tombstones. |
| crates/consistent-choose-k/src/sample_treap.rs | Implicit treap with relative-offset encoding to support fast-grow operations. |
| crates/consistent-choose-k/benchmarks/Cargo.toml | New benchmark crate for consistent-choose-k. |
| crates/consistent-choose-k/benchmarks/performance.rs | Criterion benchmarks for choose-k, fast variants, and internal structures. |
| crates/consistent-choose-k/benchmarks/criterion.toml | Criterion output/chart configuration for benchmarks. |
| crates/bpe/Cargo.toml | Bumps tiktoken-rs dev-dependency to 0.11. |
| crates/bpe/tests/Cargo.toml | Bumps tiktoken-rs to 0.11 for test crate. |
| crates/bpe/benchmarks/Cargo.toml | Bumps tiktoken-rs to 0.11 and tokenizers to 0.23. |
| crates/bpe-openai/Cargo.toml | Bumps tiktoken-rs dev-dependency to 0.11. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 24/24 changed files
- Comments generated: 7
Comment on lines
+139
to
+148
| /// Grow `n` by one and update the choose-k set accordingly. Returns | ||
| /// `Some(new_sample)` if a sequence fired (i.e. some sample changed). | ||
| /// | ||
| /// Time: O(log k) expected. | ||
| pub fn grow_n(&mut self) -> Option<usize> { | ||
| loop { | ||
| let Reverse((next, packed_seq)) = self | ||
| .next_heap | ||
| .pop() | ||
| .expect("there are always entries in the heap!"); |
| self.k | ||
| } | ||
|
|
||
| /// Returns the currently-selected samples sorted by value. |
Comment on lines
+6
to
+10
| //! a single representation: `shrink_n` keeps `samples` sorted by value and | ||
| //! tracks per-position "block counts" for each *position-bound* sequence | ||
| //! id; `grow_n` keeps `samples` in insertion order and tracks per-sample | ||
| //! "life" = `seq_id - position`. | ||
| //! |
Comment on lines
+169
to
+173
| /// Push the next active bucket of `seq` into the heap. Skips | ||
| /// buckets whose samples are all below `self.n` (would be stale on | ||
| /// arrival) and any remaining samples within a straddling bucket | ||
| /// that are below `self.n`. Marks the first (largest) pushed sample | ||
| /// as the new owner for `seq`. |
Comment on lines
+175
to
+181
| let bld = &self.builders[seq]; | ||
| let bits = &mut self.bits[seq]; | ||
| let bit = *bits & bits.wrapping_neg(); | ||
| *bits ^= bit; | ||
| let iter = BucketIterator::new(bit as usize * 2, bit, bld.hash_seq(bit)); | ||
| let mut is_owner = true; | ||
| for l in iter { |
| smaller samples block this slot". `shrink_n` is then a logarithmic | ||
| segment-tree descent to the displaced slot. | ||
| * **`ConsistentChooseKFastGrowHasher`** (grow): stores samples in | ||
| insertion order, with a per-sample "life" `= seq_id - position`. |
| /// The hashes are consistent when `n` changes and when `k` changes! | ||
| /// I.e. one hash changes with probability `k/(n+1)` when `n` increases by one, | ||
| /// resp. one hash gets added when `k` is increased. Additionally, the returned `k` tuple | ||
| /// is guaranteed to be uniformely chosen from all possible `n-choose-k` tuples. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Even though these theoretically faster algorithms are slower in practice for k <100, it is an important part from a computer science/mathematical point of view. And maybe someone finds ways to speed it up further.