Rust implementation of SSHash: Sparse and Skew Hashing of k-mers.
A compressed dictionary for k-mers (strings of length k over the DNA alphabet {A, C, G, T}), based on minimal perfect hashing and succinct data structures.
This project grew out of my (Rob's) perpetual desire for a Rust implementation of the excellent SSHash data structure. We use SSHash heavily
for tools across the lab, including piscem, which we really want to move completely into Rust. I've wanted this translation to exist for years,
but never had the time for it, and couldn't justify it as a full-time PhD student project.
This initial port was almost entirely created by Claude Opus 4.5 and 4.6, over the span of about 2 days. It was created in a tight interactive loop
in which I (Rob) was involved, and the guidance toward a reasonable implementation (i.e. actually using appropriate succinct data structures, the
selection of libraries, and the direction about tackling critical optimizations) was provided by me. Nonetheless, the code itself was not written
by me, nor by any person. The same is true of the remainder of the README below (though I will be updating that as I continue to work on this
project). I feel that this aspect is a critical disclosure about the initial release. I plan to continue to iterate on this
library, and this will involve both further agent-driven edits, as well as, likely, considerable "manual" edits to the code to improve the
idioms, ergonomics and structure.
During the entire process of development, the C++ implementation (which was not written in an AI-assisted manner, and was almost entirely written by Giulio) was treated as the "guide star". While what Claude was able to accomplish in only a couple of days is, in my opinion, quite impressive, it certainly would not have been possible without such a clear and correct refrence implementation against which to check (and whose internals could be freely inspected).
Anyway, I hope this library is useful to others in addition to myself and our lab, and I expect it to be developed and improved going forward.
This library does not currently support the weighted dictionary functionality of SSHash. This was not needed for our purpose, and I wanted to remove non-essential features from the initial version.
Requires Rust 1.85+ (edition 2024).
cargo build --releaseThe binary is target/release/sshash.
# From FASTA/FASTQ (optionally gzipped)
sshash build --input sequences.fa.gz --k 31 --m 20 --output index
# From SPSS format (one unitig per line)
sshash build --input unitigs.spss --k 31 --m 20 --output index
# Canonical mode (reverse-complement normalization)
sshash build --input sequences.fa --k 31 --m 20 --output index --canonical# Single k-mer queries from a file (one k-mer per line, or FASTA)
sshash query --index index --query queries.fa --k 31
# Streaming queries (processes whole sequences, faster for high-hit-rate inputs)
sshash query --index index --query queries.fa --k 31 --streaming# Verify that every k-mer in the input can be found in the index
sshash check --index index --input sequences.fa.gz --k 31sshash bench --index indexAdd sshash-lib to your Cargo.toml:
[dependencies]
sshash-lib = { path = "crates/sshash-lib" }use sshash_lib::{Dictionary, Kmer, KmerBits};
type Kmer31 = Kmer<31>;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let dict = Dictionary::load("index")?;
let kmer = Kmer31::from_string("ACGTACGTACGTACGTACGTACGTACGTACG")?;
let id = dict.lookup::<31>(&kmer);
// Streaming queries over a sequence
let mut engine = dict.create_streaming_query::<31>();
// ...
Ok(())
}Cargo.toml # Workspace root
crates/
sshash-lib/ # Core library
src/
dictionary.rs # Dictionary: load, save, lookup, query
builder/ # Index construction pipeline
streaming_query.rs
kmer.rs # Kmer<K> with const-generic sizing
minimizer.rs # Minimizer extraction
minimizers_control_map.rs # MPHF-based minimizer→bucket map
spectrum_preserving_string_set.rs # SPSS unitig storage
sparse_and_skew_index.rs # Bucket dispatch (singleton/light/heavy)
offsets.rs # Elias-Fano string boundary offsets
sshash-cli/ # CLI binary
src/main.rs
| Crate | Purpose |
|---|---|
ph |
PHast minimal perfect hash functions |
sux |
BitFieldVec compact bitvectors |
needletail |
FASTA/FASTQ parsing |
rayon |
Parallel sorting during build |
Verified against the C++ reference implementation on:
- Salmonella enterica (4.8M k-mers): 100% correctness (regular and canonical)
- Human chromosome 1 (204M k-mers): 100% correctness
BSD 3-Clause
Giulio Ermanno Pibiri. "Sparse and Skew Hashing of K-Mers." Bioinformatics, 2022. Giulio Ermanno Pibiri and Rob Patro. "Optimizing sparse and skew hashing: faster k-mer dictionaries." BioRxiv, 2026.