RagnarGrootKoerkamp/sassy

WARNING: Versions up to 0.1.9 had a shameful bug in the sassy grep CLI where after every 1MB of input it would skip one record. Please update.

Sassy: SIMD-accelerated Approximate String Matching

Sassy is a library and tool for searching short strings in texts, a problem that goes by many names:

  • approximate string matching,
  • pattern matching,
  • fuzzy searching.

The motivating application is searching for short (length 20 to 100) DNA sequences in a human genome or, for example, in a set of reads. Sassy generally works well for patterns/queries up to length 1000, and supports the ASCII, DNA, and IUPAC alphabets.
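
To make the problem concrete, here is a minimal reference sketch in plain Python of the textbook dynamic program for approximate matching, where a match may start anywhere in the text at no cost. It is not Sassy's implementation, and the function name `end_costs` is hypothetical and only used for illustration.

```python
def end_costs(pattern: bytes, text: bytes) -> list[int]:
    """For every end position j in 0..=len(text), return the minimum edit
    distance between `pattern` and any substring of `text` ending at j."""
    m = len(pattern)
    # Column for the empty text prefix: i pattern characters are unmatched.
    col = list(range(m + 1))
    out = [col[m]]
    for c in text:
        new = [0]  # row 0 stays 0: a match may start anywhere for free
        for i in range(1, m + 1):
            diagonal = col[i - 1] + (pattern[i - 1] != c)  # match or mismatch
            new.append(min(diagonal, col[i] + 1, new[i - 1] + 1))
        col = new
        out.append(col[m])
    return out

costs = end_costs(b"ATCG", b"AAAATTGAAA")
hits = [(j, c) for j, c in enumerate(costs) if c <= 1]
print(hits)  # [(7, 1)]
```

With k = 1 the only qualifying end position is 7 with cost 1. This quadratic-time loop is only a specification of what is being computed; it says nothing about how Sassy computes it quickly.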

It has a grep-like mode for quick human inspection, as well as search to report locations of matches, and filter to only output (non)-matching records.

Highlights:

  • Sassy uses bitpacking and SIMD (both AVX2 and NEON supported). Its main novelty is tiling these in the text direction.
  • Support for overhang alignments where the pattern extends beyond the text.
  • Support for (case-insensitive) ASCII, DNA (ACGT), and IUPAC (=ACGT+NYR...) alphabets.
  • Rust library (cargo add sassy), binary (cargo install sassy, see details below), Python bindings (pip install sassy-rs), and C bindings (see below).
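
As a rough illustration of the bitpacking idea only, the sketch below implements Myers' classic bit-vector algorithm in plain Python: one machine word holds the +1/-1 differences of a whole column of the edit-distance matrix, so each text character is processed with a handful of bitwise operations. This is a textbook sketch, not Sassy's code; it shows neither SIMD nor the tiling along the text, and the name `bitpacked_end_costs` is hypothetical.

```python
def bitpacked_end_costs(pattern: bytes, text: bytes) -> list[int]:
    """Minimum edit distance of `pattern` to a substring ending at each
    text position 1..=len(text), using Myers' bit-vector recurrence."""
    m = len(pattern)
    assert m >= 1
    mask = (1 << m) - 1   # keep only the m bits that belong to the pattern
    top = 1 << (m - 1)    # bit of the last pattern character
    # peq[c] has bit i set when pattern[i] == c.
    peq: dict[int, int] = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv, score = mask, 0, m  # vertical +1 / -1 deltas of the current column
    out = []
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = mv | (~(xh | pv) & mask)  # horizontal +1 deltas
        mh = pv & xh                   # horizontal -1 deltas
        if ph & top:
            score += 1
        elif mh & top:
            score -= 1
        ph = (ph << 1) & mask  # shift in 0: a match may start anywhere
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
        out.append(score)
    return out

print(bitpacked_end_costs(b"ATCG", b"AAAATTGAAA"))
```

Python integers are unbounded, so the explicit `mask` stands in for the fixed word width that a real implementation would get from its integer or SIMD lane type.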

See the paper, and corresponding evals in evals/:

Rick Beeloo and Ragnar Groot Koerkamp.
Sassy: Searching Short DNA Strings in the 2020s.
bioRxiv, July 2025. https://doi.org/10.1101/2025.07.22.666207.

Installation

Prebuilt binaries

See the latest release.

You can also get these via

cargo binstall sassy

or via conda/mamba/pixi:

conda install -c bioconda sassy

Build from source

RUSTFLAGS="-C target-cpu=native" cargo install sassy

Sassy uses AVX2 or NEON instructions for performance reasons, which requires either target-cpu=native or target-cpu=x86-64-v3 on x64 machines. See this README for details and this blog for background. The same restrictions apply when using the sassy library in a larger project.

Sassy requires Rust 1.91 or newer. Get it via rustup update. (If your system-provided Rust installation is too old, switch to rustup.)

Usage

Sassy can be used via the CLI, or as Rust, Python, or C library.

0. Rust library

The library can be used to search for ASCII or DNA strings. A larger example can be found in src/lib.rs.

// cargo add sassy
use sassy::{Searcher, Match, profiles::Iupac, Strand};

let pattern = b"ATCG";
let text = b"AAAATTGAAA";
let k = 1;

// The Iupac profile supports N and YR... characters.
// If you are sure you only have ACGT input, then `profiles::Dna` is slightly faster.
let mut searcher = Searcher::<Iupac>::new_fwd();
let matches = searcher.search(pattern, &text, k);

assert_eq!(matches.len(), 1);

assert_eq!(matches[0].text_start, 3);
assert_eq!(matches[0].text_end, 7);
assert_eq!(matches[0].cost, 1);
assert_eq!(matches[0].strand, Strand::Fwd);
assert_eq!(matches[0].cigar.to_string(), "2=1X1=");

1. Command-line interface (CLI)

The CLI can be used via:

  1. sassy grep: to show nicely coloured output.
  2. sassy search: to write a .tsv of matching locations.
  3. sassy filter: to write a .fasta/.fastq of (non)-matching records.
  4. sassy crispr: to search for CRISPR guides.

grep, search, and filter all take the same arguments, and are implemented by forwarding to grep. Thus, they can all be combined via e.g.

sassy grep -p ACGTCAAACCTA -k 3 --matches matches.tsv --output filtered.fastq reads.fastq.gz

1.1: Grep for a pattern

Search a pattern ATGAGCA in text.fasta with ≤1 edit:

sassy grep --pattern ATGAGCA -k 1 text.fasta

To search for every record of a FASTA file as a pattern, use --pattern-fasta <fasta-file> instead of --pattern.

The grep output is coloured:

  • green shows matching characters,
  • orange shows mismatches,
  • red shows deleted characters (in pattern but not in text),
  • blue shows inserted characters (in text but not in pattern).

1.2: TSV output for matches

sassy search -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa > matches.tsv
# or
sassy search -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa --matches matches.tsv
# or
sassy grep   -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fa --matches matches.tsv

gives .tsv output like this:

pat_id	text_id	cost	strand	start	end	match_region	cigar
pattern	AC_000001.1__1_1	0	+	6	48	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_35	0	+	897	939	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_49	1	+	866	908	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCGCGCG	37=1X4=
pattern	AC_000001.1__1_64	0	-	1267	1309	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_67	0	+	600	642	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_68	0	-	1826	1868	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_78	3	-	4381	4425	GTACAGAAACGAGCGGATGGAAAATAGTAGTGAGCGGCCTCGCG	23=1X1I10=1I8=
pattern	AC_000001.1__1_92	0	-	6554	6596	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_94	0	-	6413	6455	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_115	2	+	2091	2131	GTACAGAAACGAGCATGGAAAGAGTAGTGAGCGCCTCGCG	14=2D26=
pattern	AC_000001.1__1_118	0	-	3062	3104	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_123	0	+	1416	1458	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
pattern	AC_000001.1__1_127	0	+	27	69	GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG	42=
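
A minimal sketch of reading this TSV with Python's standard library and sanity-checking each row. It assumes, based on the sample rows, that X, I, and D each count as one edit and that =, X, and I consume text characters; the helper name `cigar_summary` is hypothetical. In real use, replace the in-memory string with `open("matches.tsv")`.

```python
import csv
import io
import re

# Two rows copied from the sample output, joined with real tab characters.
header = ["pat_id", "text_id", "cost", "strand", "start", "end", "match_region", "cigar"]
rows = [
    ["pattern", "AC_000001.1__1_49", "1", "+", "866", "908",
     "GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCGCGCG", "37=1X4="],
    ["pattern", "AC_000001.1__1_78", "3", "-", "4381", "4425",
     "GTACAGAAACGAGCGGATGGAAAATAGTAGTGAGCGGCCTCGCG", "23=1X1I10=1I8="],
]
tsv = "\n".join("\t".join(r) for r in [header, *rows]) + "\n"

def cigar_summary(cigar: str) -> tuple[int, int]:
    """Return (cost, text_len) for a CIGAR string made of =, X, I, D runs."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([=XID])", cigar)]
    cost = sum(n for n, op in ops if op in "XID")       # every edit costs 1
    text_len = sum(n for n, op in ops if op in "=XI")   # ops that consume text
    return cost, text_len

checked = []
for rec in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
    start, end = int(rec["start"]), int(rec["end"])
    cost, text_len = cigar_summary(rec["cigar"])
    assert cost == int(rec["cost"])
    assert text_len == end - start == len(rec["match_region"])
    checked.append((rec["text_id"], cost, end - start))
print(checked)
```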

1.3: Filter matching records

sassy filter -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq > filtered.fq
# or
sassy filter -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq -o filtered.fq
# or
sassy grep   -p GTACAGAAACGAGCGGATGGAAAGAGTAGTGAGCGCCTCGCG -k 2 reads.fq -o filtered.fq

Writes a file containing only matching records. Use --invert to only write non-matching records.

1.4: CRISPR off-target search

Search for one or more guides in guides.txt:

sassy crispr --threads 8 --guide guides.txt --k 5 --max-n-frac 0.1 --output hits.tsv hg38.fasta

Allows <= k edits in the sgRNA, and the PAM (the last 3 characters of each guide) has to match exactly, unless --allow-pam-edits is given.
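
The split between the edited part and the exact part of a guide can be sketched as follows. This is an illustration, not Sassy's code: it assumes that IUPAC codes in the PAM (such as the N in NGG) match any base they stand for, which is consistent with the sample hits in this section, and the helper names `split_guide` and `pam_matches` are hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def split_guide(guide: str) -> tuple[str, str]:
    """Split a guide into (sgRNA part, PAM), the PAM being the last 3 characters."""
    return guide[:-3], guide[-3:]

def pam_matches(pam: str, site: str) -> bool:
    """True when every base of `site` is allowed by the IUPAC code at that PAM position."""
    return len(pam) == len(site) and all(
        s in IUPAC[p] for p, s in zip(pam.upper(), site.upper())
    )

spacer, pam = split_guide("GAGTCCGAGCAGAAGAAGAANGG")
print(spacer, pam)              # GAGTCCGAGCAGAAGAAGAA NGG
print(pam_matches(pam, "agg"))  # True
print(pam_matches(pam, "AGA"))  # False
```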

Output of the crispr command is a tab-delimited file with one row per hit, e.g.:

guide                    text_id  cost  strand  start     end       match_region             cigar
GAGTCCGAGCAGAAGAAGAANGG  chr21    5     +       5024135   5024154   GAGGCCACAGAGAAGAGGG      3=1X2=1D1=1D3=1D5=1D4=
GAGTCCGAGCAGAAGAAGAANGG  chr21    3     +       21087337  21087359  gagaccgaggagaagaaaaagg   3=1X5=1X7=1D5=
GAGTCCGAGCAGAAGAAGAANGG  chr21    3     -       9701297   9701320   GACTCGAGCATGAAGAAGAAAGG  2=1X1=1D6=1I12=
GAGTCCGAGCAGAAGAAGAANGG  chr21    5     -       46396975  46396998  CAGTCCCAGCAGACGACGGACGG  1X5=1X6=1X2=1X1=1X4=

The start and end are 0-based and half-open (i.e. the start is inclusive and the end is exclusive), and start is always less than end, regardless of the strand. The match_region is the sequence from the target file when strand is +, or the reverse complement of that sequence when strand is -, so that it always reads in the same orientation as the guide. The cigar is always oriented to read left-to-right along the provided guide and match_region sequences.
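
Many genomics tools expect 1-based inclusive regions instead. A small conversion sketch follows; the helper name `to_region` is hypothetical and the coordinates are taken from the first sample hit.

```python
def to_region(text_id: str, start: int, end: int) -> str:
    """Convert a 0-based half-open [start, end) interval into a
    1-based inclusive region string of the form `id:first-last`."""
    assert 0 <= start < end
    return f"{text_id}:{start + 1}-{end}"

region = to_region("chr21", 5024135, 5024154)
print(region)             # chr21:5024136-5024154
print(5024154 - 5024135)  # 19, the number of text characters in the match
```

With half-open coordinates the match length is simply end minus start, and in Python the matched text is `text[start:end]` without any off-by-one adjustment.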

Note that this searches for approximate occurrences of the guide sequence itself, and not for reverse-complement binding sites. If binding sites are to be found, please reverse-complement the input or output manually.
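
A minimal sketch of such a manual reverse complement, using the standard IUPAC complement table so that codes like N in a guide are preserved. The function name `revcomp` is hypothetical; it is not part of the Sassy CLI.

```python
# Standard IUPAC complements, upper and lower case.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)

def revcomp(seq: str) -> str:
    """Complement every character, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

guide = "GAGTCCGAGCAGAAGAAGAANGG"
print(revcomp(guide))  # CCNTTCTTCTTCTGCTCGGACTC
```

Applying `revcomp` twice returns the original sequence, which is a quick check that the table is symmetric.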

2. Python bindings

PyPI wheels can be installed with:

pip install sassy-rs

import sassy

pattern = b"ACTG"
text    = b"ACGGCTACGCAGCATCATCAGCAT"

searcher = sassy.Searcher("dna") # ascii / dna / iupac
matches  = searcher.search(pattern, text, k=1)

for m in matches:
    print(m)

See python/README.md for more details.

3. C library

See c/README.md for details. Quick example:

#include <math.h>    // NAN
#include <stdbool.h> // true
#include <stddef.h>  // size_t
#include <string.h>  // strlen

#include "sassy.h"

int main() {
    const char* pattern = "ACTG";
    const char* text    = "ACGGCTACGCAGCATCATCAGCAT";

    // DNA alphabet, with reverse complement, without overhang.
    sassy_SearcherType* searcher = sassy_searcher("dna", true, NAN);
    sassy_Match* out_matches = NULL;
    size_t n_matches = search(searcher,
                              pattern, strlen(pattern),
                              text, strlen(text),
                              1, // k=1
                              &out_matches);

    // Release the matches and the searcher once done.
    sassy_matches_free(out_matches, n_matches);
    sassy_searcher_free(searcher);
    return 0;
}
