Merged
Commits
46 commits
4426de9
most basic elias fano
aneubeck Oct 7, 2025
9bb6a3c
simple operation
aneubeck Oct 20, 2025
4c70132
benchmark + separate file
aneubeck Nov 24, 2025
07bf579
refine batch processing
aneubeck Nov 24, 2025
929ea0a
back to decoding a full batch of 32 values
aneubeck Nov 24, 2025
6c6f387
loop + don't read ahead
aneubeck Nov 24, 2025
85cfdf3
fix base implementation and try generics
aneubeck Nov 24, 2025
1cd9cbb
try unaligned words
aneubeck Nov 24, 2025
710102b
tune batch size
aneubeck Nov 24, 2025
2521d57
decode a multiple of 32
aneubeck Nov 24, 2025
8d62ad4
add avx version
aneubeck Nov 25, 2025
e36a92a
add some buffer
aneubeck Nov 25, 2025
04b8cff
Almost 1GB/sec!
aneubeck Nov 25, 2025
f57f67a
1.4 billion values/sec version
aneubeck Nov 25, 2025
4a56922
Add intersecting iterator
aneubeck Dec 1, 2025
5e41049
Create helper.rs
aneubeck Dec 5, 2025
79f028e
run encoding on real data
aneubeck Dec 10, 2025
e51ca63
add vbyte for comparison
aneubeck Dec 10, 2025
16973f1
add bitpacking
aneubeck Dec 10, 2025
ba02311
remove bin
aneubeck Dec 10, 2025
fa563f7
mapping
aneubeck Dec 10, 2025
0cbd36f
Update reorder_docids.rs
aneubeck Dec 10, 2025
50307a8
Update encode_pisa.rs
aneubeck Dec 10, 2025
6dd118c
Update encode_pisa.rs
aneubeck Dec 10, 2025
e05a326
Update encode_pisa.rs
aneubeck Dec 10, 2025
570f681
novel mst sorting
aneubeck Dec 27, 2025
696c48a
speed up transformation
aneubeck Dec 27, 2025
674c816
multiple ngram implementation
aneubeck Apr 25, 2026
a27a5b9
avx512
aneubeck Apr 25, 2026
95bb71c
Optimize masked AVX extraction path
aneubeck Apr 25, 2026
647dee1
inline and simplify scan
aneubeck Apr 25, 2026
3272bb1
wide_avx attempt
aneubeck Apr 27, 2026
295816d
change priority to u8 and make priority strict in both directions
aneubeck Apr 27, 2026
feceb04
move benches to different folder
aneubeck May 8, 2026
2521d02
increase priority
aneubeck May 8, 2026
d6bd09f
remove the slower implementations
aneubeck May 8, 2026
a540549
refactor tests
aneubeck May 8, 2026
a0b80ea
add readme
aneubeck May 8, 2026
48bfbbb
growth by max n-gram len
aneubeck May 8, 2026
03cb4a7
remove pef crate
aneubeck May 8, 2026
cc73c69
Update README.md
aneubeck May 8, 2026
8505846
review comments
aneubeck May 12, 2026
6f750ac
update readme + graphs
aneubeck May 12, 2026
2b57af8
Merge branch 'main' into aneubeck/sparse_ngrams
aneubeck May 12, 2026
703db7c
fix linter
aneubeck May 12, 2026
f7c1877
fix npm build
aneubeck May 12, 2026
1 change: 0 additions & 1 deletion Cargo.toml
@@ -4,7 +4,6 @@ members = [
"crates/*",
"crates/bpe/benchmarks",
"crates/bpe/tests",
"crates/hash-sorted-map/benchmarks",
]
Comment thread
aneubeck marked this conversation as resolved.
resolver = "2"

1 change: 1 addition & 0 deletions Makefile
@@ -24,6 +24,7 @@ build:

.PHONY: build-js
build-js:
which wasm-pack || cargo install wasm-pack
npm --prefix crates/string-offsets/js install
npm --prefix crates/string-offsets/js run compile

1 change: 1 addition & 0 deletions README.md
@@ -5,6 +5,7 @@ A collection of useful algorithms written in Rust. Currently contains:
- [`geo_filters`](crates/geo_filters): probabilistic data structures that solve the [Distinct Count Problem](https://en.wikipedia.org/wiki/Count-distinct_problem) using geometric filters.
- [`bpe`](crates/bpe): fast, correct, and novel algorithms for the [Byte Pair Encoding Algorithm](https://en.wikipedia.org/wiki/Large_language_model#BPE) which are particularly useful for chunking of documents.
- [`bpe-openai`](crates/bpe-openai): Fast tokenizers for OpenAI token sets based on the `bpe` crate.
- [`sparse-ngrams`](crates/sparse-ngrams): fast sparse n-gram extraction from byte slices. Selects variable-length n-grams (2–8 bytes) deterministically using bigram frequency priorities, suitable for substring search indexes.
- [`string-offsets`](crates/string-offsets): converts string positions between bytes, chars, UTF-16 code units, and line numbers. Useful when sending string indices across language boundaries.

## Background
20 changes: 20 additions & 0 deletions crates/sparse-ngrams/Cargo.toml
@@ -0,0 +1,20 @@
[package]
name = "sparse-ngrams"
version = "0.1.0"
edition = "2021"
description = "Fast sparse n-gram extraction from byte slices."
repository = "https://github.com/github/rust-gems"
license = "MIT"
keywords = ["ngram", "algorithm", "search", "index"]
categories = ["algorithms", "data-structures", "text-processing"]

[lib]
bench = false

[[bench]]
name = "performance"
path = "benchmarks/performance.rs"
harness = false

[dev-dependencies]
criterion = "0.7"
88 changes: 88 additions & 0 deletions crates/sparse-ngrams/README.md
@@ -0,0 +1,88 @@
# sparse-ngrams

Fast sparse n-gram extraction from byte slices.

Sparse grams select variable-length n-grams (2–8 bytes) without extracting all possible substrings. The algorithm is deterministic: the same extraction logic applies to every substring, making it suitable for substring search indexes.

For background, see:
- [The technology behind GitHub's new code search](https://github.blog/engineering/architecture-optimization/the-technology-behind-githubs-new-code-search/#fn-69904-bignote)
- [Sparse n-grams: smarter trigram selection](https://cursor.com/blog/fast-regex-search#sparse-n-grams-smarter-trigram-selection)

## Caveats

The integrated bigram table contains only lowercase ASCII bigrams. Callers should lowercase and normalize input before extraction (e.g. fold uppercase to lowercase, map non-ASCII bytes to a single sentinel value). This makes the implementation suitable for case-insensitive search indexes.
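
The normalization described above could look roughly like the following sketch. The function name `normalize` and the sentinel value `0x7f` are assumptions made for illustration; the crate does not prescribe either.

```rust
/// Hypothetical sentinel for all non-ASCII bytes. The value is an assumption,
/// not something the crate defines.
const NON_ASCII_SENTINEL: u8 = 0x7f;

/// Folds ASCII uppercase to lowercase and maps every non-ASCII byte
/// (0x80 and above) to a single sentinel value.
fn normalize(input: &[u8]) -> Vec<u8> {
    input
        .iter()
        .map(|&b| {
            if b.is_ascii() {
                b.to_ascii_lowercase()
            } else {
                NON_ASCII_SENTINEL
            }
        })
        .collect()
}
```

The normalized bytes would then be passed to the extraction function in place of the raw input. The same normalization has to be applied to both indexed documents and queries.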

## How it works

Each consecutive byte pair (bigram) is assigned a frequency-based priority from a precomputed table. An n-gram boundary occurs wherever a bigram has lower priority than all bigrams between it and the previous boundary. This is computed efficiently using a monotone deque or a scan-based approach.

For a document of N bytes, this produces at most 3(N−1) n-grams: N−1 bigrams, plus up to 2(N−1) algorithmically selected longer n-grams (up to 8 bytes).
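
As a generic illustration of the monotone deque technique mentioned above (not the crate's implementation), the following sketch computes the minimum of every sliding window. The function name and the use of `VecDeque` are assumptions for the example.

```rust
use std::collections::VecDeque;

/// Returns the minimum of every window of `width` consecutive values.
/// The deque holds indices whose values strictly increase from front to
/// back, so the front is always the minimum of the current window.
fn sliding_window_min(values: &[u16], width: usize) -> Vec<u16> {
    assert!(width > 0);
    let mut deque: VecDeque<usize> = VecDeque::new();
    let mut out = Vec::new();
    for (i, &v) in values.iter().enumerate() {
        // Drop candidates that can never be a window minimum again.
        while let Some(&j) = deque.back() {
            if values[j] >= v {
                deque.pop_back();
            } else {
                break;
            }
        }
        deque.push_back(i);
        // Drop the front once it has left the window [i + 1 - width, i].
        if let Some(&j) = deque.front() {
            if j + width <= i {
                deque.pop_front();
            }
        }
        if i + 1 >= width {
            out.push(values[deque[0]]);
        }
    }
    out
}
```

Each index is pushed and popped at most once, so the total work is linear in the input length regardless of the window width.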

### Selection criterion

A substring of length 3–8 is emitted as a sparse n-gram if and only if every interior bigram priority is strictly greater than the maximum of the left and right boundary bigram priorities.
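
Read literally, the criterion can be checked by brute force, which is useful as a reference when testing a faster implementation. The sketch below is an assumption-laden illustration: `select_brute_force` is a hypothetical name, and priorities are passed in as a plain slice instead of being looked up in the crate's bigram table.

```rust
/// Brute-force check of the stated criterion.
/// `prio[i]` is the priority of the bigram formed by bytes `i` and `i + 1`.
/// Returns `(start, len)` byte ranges of every selected n-gram with
/// `3 <= len <= max_len`.
fn select_brute_force(prio: &[u8], max_len: usize) -> Vec<(usize, usize)> {
    let n_bytes = prio.len() + 1;
    let mut out = Vec::new();
    for start in 0..n_bytes {
        for len in 3..=max_len {
            if start + len > n_bytes {
                break;
            }
            // The n-gram covers bigrams start ..= start + len - 2.
            let bound = prio[start].max(prio[start + len - 2]);
            let interior = &prio[start + 1..start + len - 2];
            // A trigram has no interior bigram, so it passes vacuously.
            if interior.iter().all(|&p| p > bound) {
                out.push((start, len));
            }
        }
    }
    out
}
```

For priorities `[1, 5, 2]` (a 4-byte input), both trigrams are selected, and the 4-byte n-gram is selected as well because its single interior priority 5 exceeds max(1, 2).
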
Comment on lines +21 to +23
Contributor

This section seems redundant.

There was a GitHub blog post about this that I think is worth linking. There's also the comment in sparse_grams.rs, which is a gentler introduction than this, but I won't insist.

Collaborator Author

I kept it as a short introduction and added two links (to our own blog post and to Cursor's reimplementation).


## Usage

```rust
use sparse_ngrams::{collect_sparse_grams, NGram, MAX_SPARSE_GRAM_SIZE};

let input = b"hello world";
let grams = collect_sparse_grams(input);
for gram in &grams {
    assert!(gram.len() >= 2);
    assert!(gram.len() <= MAX_SPARSE_GRAM_SIZE as usize);
}
```

## Performance

Benchmarks on an Apple M1 (15 KB input, `lib.rs` source file):

| Variant | Throughput |
|---------|-----------|
| `deque` | ~3.5 GB/s |
| `scan` | ~4.9 GB/s |

The `scan` variant is ~40% faster than the `deque` variant because it replaces the monotone deque with a fixed-size circular buffer and a suffix-minimum scan.
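
One way to see why a scan can stand in for the deque: the entries a monotone deque retains for a window are exactly that window's strict suffix minima, which a single backward pass can recover. The sketch below illustrates that relation only; whether the crate's `scan` variant is structured this way is an assumption, and `strict_suffix_minima` is a hypothetical name.

```rust
/// Returns the indices (ascending) of all strict suffix minima of `window`:
/// positions whose value is smaller than every value after them.
fn strict_suffix_minima(window: &[u8]) -> Vec<usize> {
    let mut kept = Vec::new();
    let mut running_min: Option<u8> = None;
    // Walk from the newest element back to the oldest.
    for (i, &v) in window.iter().enumerate().rev() {
        if running_min.map_or(true, |m| v < m) {
            kept.push(i);
            running_min = Some(v);
        }
    }
    kept.reverse();
    kept
}
```

With a small fixed window, such a pass is a short branch-light loop over contiguous memory, which is one plausible reason for the measured speedup.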

## Bigram table size

The priority table maps byte pairs to frequency-based priorities. Increasing the table size (number of ranked bigrams) produces more distinct longer n-grams, but the gain saturates quickly:

![Unique n-grams vs. table size](images/unique_ngrams_vs_table_size.png)

| Table size | Unique n-grams | % of max |
|-----------|-----------------|----------|
| 100 | 5.8M | 77.0% |
| 200 | 6.4M | 84.4% |
| 400 | 6.8M | 90.2% |
| 800 | 7.3M | 96.0% |
| 1,600 | 7.5M | 99.2% |
| 3,200 | 7.6M | 99.9% |
| 5,845 | 7.6M | 100% |

The current bigram table contains the 5,845 most frequent bigrams from a large code corpus.
The first ~1,600 bigrams already capture 99% of the unique n-grams.

## Maximum n-gram length

Increasing the maximum n-gram length produces more unique longer grams, with diminishing returns:

![Unique n-grams vs. max length](images/unique_ngrams_vs_max_length.png)

| Max length | Unique n-grams | vs. len=8 |
|-----------|---------------|-----------|
| 2 | 1.2M | 16% |
| 3 | 4.1M | 54% |
| 4 | 5.3M | 70% |
| 6 | 6.8M | 89% |
| 8 | 7.6M | 100% |
| 12 | 8.5M | 113% |
| 16 | 9.1M | 120% |
| 24 | 9.7M | 128% |
| 32 | 10.1M | 133% |
| 48 | 10.4M | 137% |
| 64 | 10.5M | 139% |

The default of 8 captures most of the discriminative power. Going to 16 adds ~20% more unique grams but doubles the scan window; going to 64 adds only ~39% total.
37 changes: 37 additions & 0 deletions crates/sparse-ngrams/benchmarks/performance.rs
@@ -0,0 +1,37 @@
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
use sparse_ngrams::{
collect_sparse_grams_deque, collect_sparse_grams_scan, max_sparse_grams, NGram,
};

fn bench_collect(c: &mut Criterion) {
    let inputs: Vec<(&str, Vec<u8>)> = vec![
        ("small_11B", b"hello world".to_vec()),
        (
            "medium_900B",
            "the quick brown fox jumps over the lazy dog. "
                .repeat(20)
                .into_bytes(),
        ),
        (
            "large_15KB",
            include_str!("../src/lib.rs").as_bytes().to_vec(),
        ),
    ];

    let mut group = c.benchmark_group("collect");
    for (name, input) in &inputs {
        let mut buf = vec![NGram::from_bytes(b"xx"); max_sparse_grams(input.len())];
        group.throughput(Throughput::Bytes(input.len() as u64));

        group.bench_with_input(BenchmarkId::new("deque", name), input, |b, input| {
            b.iter(|| collect_sparse_grams_deque(black_box(input), &mut buf))
        });
        group.bench_with_input(BenchmarkId::new("scan", name), input, |b, input| {
            b.iter(|| collect_sparse_grams_scan(black_box(input), &mut buf))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_collect);
criterion_main!(benches);
Binary file added crates/sparse-ngrams/src/bigrams.bin
Binary file not shown.
71 changes: 71 additions & 0 deletions crates/sparse-ngrams/src/deque.rs
@@ -0,0 +1,71 @@
//! Stack-allocated circular buffer (monotone deque).

use std::mem::MaybeUninit;

/// Deque element representing two neighboring bytes in the input.
#[derive(Debug, Clone, Copy)]
pub(crate) struct PosStateBytes {
    /// Absolute index position between the two bigram characters.
    /// I.e. 1 references the very first bigram.
    pub index: u32,
    pub value: u16,
}

/// Stack-allocated circular buffer holding up to `CAP` elements.
/// Replaces `VecDeque<PosStateBytes>`: avoids heap allocation and fits in a
/// single cache line for small CAP values.
pub(crate) struct FixedDeque<const CAP: usize> {
    data: [MaybeUninit<PosStateBytes>; CAP],
    start: u8,
    len: u8,
}

impl<const CAP: usize> FixedDeque<CAP> {
    pub fn new() -> Self {
        Self {
            data: [MaybeUninit::uninit(); CAP],
            start: 0,
            len: 0,
        }
    }

    #[inline]
    pub fn front(&self) -> Option<&PosStateBytes> {
        if self.len == 0 {
            None
        } else {
            // SAFETY: `len > 0`, so the slot at `start` was written by `push_back`.
            Some(unsafe { self.data[self.start as usize].assume_init_ref() })
        }
    }

    #[inline]
    pub fn back(&self) -> Option<&PosStateBytes> {
        if self.len == 0 {
            None
        } else {
            // Note: the sum is computed in `u8`, so `CAP` must be small enough
            // that `start + len` cannot overflow.
            let idx = (self.start + self.len - 1) as usize % CAP;
            // SAFETY: `len > 0`, so the last occupied slot was written by `push_back`.
            Some(unsafe { self.data[idx].assume_init_ref() })
        }
    }

    #[inline]
    pub fn pop_front(&mut self) {
        debug_assert!(self.len > 0);
        // `CAP as u8` binds tighter than `%`; this relies on `CAP` fitting in a `u8`.
        self.start = (self.start + 1) % CAP as u8;
        self.len -= 1;
    }

    #[inline]
    pub fn pop_back(&mut self) {
        debug_assert!(self.len > 0);
        self.len -= 1;
    }

    #[inline]
    pub fn push_back(&mut self, val: PosStateBytes) {
        debug_assert!((self.len as usize) < CAP);
        let idx = (self.start + self.len) as usize % CAP;
        self.data[idx] = MaybeUninit::new(val);
        self.len += 1;
    }
}
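
Because `start` and `len` are `u8` and are added before widening, the index arithmetic in this file is only valid for small `CAP`. One possible way to make that assumption explicit is a compile-time check, sketched below on a hypothetical miniature type `RingIndex` that models only the index arithmetic and is not part of this PR. The bound of 128 follows from `start < CAP` and `len <= CAP`, which gives `start + len <= 255`.

```rust
/// Hypothetical miniature of a ring buffer with `u8` indices. Only the index
/// arithmetic is modeled, not element storage.
struct RingIndex<const CAP: usize> {
    start: u8,
    len: u8,
}

impl<const CAP: usize> RingIndex<CAP> {
    /// Fails compilation for any instantiated `CAP` where `start + len`
    /// could exceed `u8::MAX`.
    const CAP_FITS_U8: () = assert!(CAP > 0 && CAP <= 128, "CAP must be in 1..=128");

    fn new() -> Self {
        // Referencing the constant forces its evaluation for this `CAP`.
        let () = Self::CAP_FITS_U8;
        Self { start: 0, len: 0 }
    }

    /// Slot that the next pushed element would occupy.
    fn next_slot(&self) -> usize {
        (self.start + self.len) as usize % CAP
    }

    fn push(&mut self) {
        debug_assert!((self.len as usize) < CAP);
        self.len += 1;
    }

    fn pop_front(&mut self) {
        debug_assert!(self.len > 0);
        self.start = (self.start + 1) % CAP as u8;
        self.len -= 1;
    }
}
```

An alternative with the same effect would be to widen `start` and `len` to `usize` before adding them, at the cost of a few extra casts.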