Skip to content

log k variants for grow_n, shrink_n, and grow_k#115

Open
aneubeck wants to merge 51 commits into
mainfrom
aneubeck/fast_choose_k
Open

log k variants for grow_n, shrink_n, and grow_k#115
aneubeck wants to merge 51 commits into
mainfrom
aneubeck/fast_choose_k

Conversation

@aneubeck
Copy link
Copy Markdown
Collaborator

@aneubeck aneubeck commented May 22, 2026

Even though these theoretically faster algorithms are slower in practice for k <100, it is an important part from a computer science/mathematical point of view. And maybe someone finds ways to speed it up further.

@aneubeck aneubeck requested a review from a team as a code owner May 22, 2026 13:18
Copilot AI review requested due to automatic review settings May 22, 2026 13:18
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new consistent-choose-k crate implementing stateless and stateful consistent choose‑k hashing primitives (replication/failover/bounded-load building blocks), plus associated examples and benchmarks; updates workspace membership and a few benchmark/test dependencies.

Changes:

  • Introduces crates/consistent-choose-k (hash iterators, fast shrink/grow variants, node map, supporting segment-tree/treap data structures) with docs, examples, and Criterion benchmarks.
  • Wires the new benchmark crate into the workspace and updates the repo README listing.
  • Updates some benchmark/test dependencies (Criterion, tiktoken-rs, tokenizers) and adjusts sparse-ngrams benchmarks to use std::hint::black_box.
Show a summary per file
File Description
README.md Adds consistent-choose-k to the workspace crate list.
Cargo.toml Adds crates/consistent-choose-k/benchmarks to workspace members.
crates/sparse-ngrams/Cargo.toml Bumps Criterion dev-dependency to 0.8.
crates/sparse-ngrams/benchmarks/performance.rs Switches black_box import to std::hint::black_box.
crates/consistent-choose-k/Cargo.toml New crate manifest for consistent-choose-k.
crates/consistent-choose-k/README.md New crate-level documentation and design/complexity discussion.
crates/consistent-choose-k/examples/bounded_load.rs Example demonstrating bounded-load assignment and comparisons vs hash ring.
crates/consistent-choose-k/src/lib.rs New crate module wiring and public re-exports (+ bench internals feature gate).
crates/consistent-choose-k/src/consistent_hash.rs Implements consistent hash iterators, builders, and bucket iteration primitives.
crates/consistent-choose-k/src/choose_k.rs Stateless consistent choose‑k hasher implementation + tests.
crates/consistent-choose-k/src/fast_choose_k.rs Fast shrink_n-optimized choose‑k variant using a compact segment tree.
crates/consistent-choose-k/src/fast_grow_n.rs Fast “grow” variant using heap + SampleTreap life tracking.
crates/consistent-choose-k/src/node_map.rs ConsistentNodeMap abstraction over choose‑k ranking with deletions.
crates/consistent-choose-k/src/min_seg_tree.rs Full min-segment-tree implementation with relative-offset encoding.
crates/consistent-choose-k/src/compact_min_seg_tree.rs Compact min-segment-tree variant used by fast shrink.
crates/consistent-choose-k/src/live_min_seg_tree.rs Segment tree variant supporting live-only logical indexing + tombstones.
crates/consistent-choose-k/src/sample_treap.rs Implicit treap with relative-offset encoding to support fast-grow operations.
crates/consistent-choose-k/benchmarks/Cargo.toml New benchmark crate for consistent-choose-k.
crates/consistent-choose-k/benchmarks/performance.rs Criterion benchmarks for choose-k, fast variants, and internal structures.
crates/consistent-choose-k/benchmarks/criterion.toml Criterion output/chart configuration for benchmarks.
crates/bpe/Cargo.toml Bumps tiktoken-rs dev-dependency to 0.11.
crates/bpe/tests/Cargo.toml Bumps tiktoken-rs to 0.11 for test crate.
crates/bpe/benchmarks/Cargo.toml Bumps tiktoken-rs to 0.11 and tokenizers to 0.23.
crates/bpe-openai/Cargo.toml Bumps tiktoken-rs dev-dependency to 0.11.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 24/24 changed files
  • Comments generated: 7

Comment on lines +139 to +148
/// Grow `n` by one and update the choose-k set accordingly. Returns
/// `Some(new_sample)` if a sequence fired (i.e. some sample changed).
///
/// Time: O(log k) expected.
pub fn grow_n(&mut self) -> Option<usize> {
loop {
let Reverse((next, packed_seq)) = self
.next_heap
.pop()
.expect("there are always entries in the heap!");
self.k
}

/// Returns the currently-selected samples sorted by value.
Comment on lines +6 to +10
//! a single representation: `shrink_n` keeps `samples` sorted by value and
//! tracks per-position "block counts" for each *position-bound* sequence
//! id; `grow_n` keeps `samples` in insertion order and tracks per-sample
//! "life" = `seq_id - position`.
//!
Comment on lines +169 to +173
/// Push the next active bucket of `seq` into the heap. Skips
/// buckets whose samples are all below `self.n` (would be stale on
/// arrival) and any remaining samples within a straddling bucket
/// that are below `self.n`. Marks the first (largest) pushed sample
/// as the new owner for `seq`.
Comment on lines +175 to +181
let bld = &self.builders[seq];
let bits = &mut self.bits[seq];
let bit = *bits & bits.wrapping_neg();
*bits ^= bit;
let iter = BucketIterator::new(bit as usize * 2, bit, bld.hash_seq(bit));
let mut is_owner = true;
for l in iter {
smaller samples block this slot". `shrink_n` is then a logarithmic
segment-tree descent to the displaced slot.
* **`ConsistentChooseKFastGrowHasher`** (grow): stores samples in
insertion order, with a per-sample "life" `= seq_id - position`.
/// The hashes are consistent when `n` changes and when `k` changes!
/// I.e. one hash changes with probability `k/(n+1)` when `n` increases by one,
/// resp. one hash gets added when `k` is increased. Additionally, the returned `k` tuple
/// is guaranteed to be uniformely chosen from all possible `n-choose-k` tuples.
@aneubeck aneubeck changed the title Aneubeck/fast choose k log k variants for grow_n, shrink_n, and grow_k May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants