Fast dot plot comparisons of DNA sequences using an FM-index. Written in Rust with PyO3 Python bindings.
- Read FASTA / gzipped FASTA files via needletail
- Build FM-indexes per sequence using rust-bio
- K-mer set intersection for efficient shared k-mer lookup
- Both-strand k-mer matching: forward (+) and reverse-complement (-) hits detected via compare_sequences_stranded
- Merge sequential k-mer runs into contiguous match blocks for both orientations:
  - Forward-strand co-linear diagonal merging (py_merge_kmer_runs)
  - RC anti-diagonal merging, for standard inverted repeats (py_merge_rev_runs)
  - RC co-diagonal merging, where both arms run in the same direction (py_merge_rev_fwd_runs)
  - Unified strand-aware entry-point (py_merge_runs)
- PAF format output for alignment records
- FM-index serialization/deserialization with serde + postcard
- All-vs-all dotplot visualization with matplotlib:
  - Forward hits drawn in blue (configurable via dot_color)
  - Reverse-complement hits drawn in red (configurable via rc_color)
  - Sequence names rendered once, at the bottom of each column and left of each row
  - SVG vector output in addition to PNG/PDF via the format parameter
  - Minimum alignment length filter (min_length) to suppress short/spurious hits before rendering
- Cross-index comparisons between two sequence sets (e.g. two genome assemblies)
- Relative sequence scaling in dotplot subpanels
- Gravity-centre contig ordering for maximum collinearity
- PafAlignment.filter_by_min_length(): discard short alignment records from a loaded PAF file
- Full Python bindings via PyO3
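
The shared k-mer lookup and run merging listed above can be pictured with a plain-Python sketch. This is not how rusty-dot does it (the library works on FM-indexes in Rust), and the names reverse_complement, kmer_positions, shared_kmer_hits and merge_forward_runs are hypothetical, introduced only for illustration. The sketch covers both-strand matching and forward-diagonal merging; the reverse-complement merging modes are left out.

```python
from collections import defaultdict

# Illustrative only: hypothetical helpers, not part of the rusty_dot API.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_positions(seq: str, k: int) -> dict:
    """Map every k-mer in seq to the list of its 0-based start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def shared_kmer_hits(query: str, target: str, k: int) -> list:
    """Return sorted (query_start, target_start, strand) for shared k-mers."""
    q_pos = kmer_positions(query, k)
    t_pos = kmer_positions(target, k)
    hits = []
    # Forward strand: intersect the two k-mer sets, then pair up positions.
    for kmer in q_pos.keys() & t_pos.keys():
        hits.extend((q, t, "+") for q in q_pos[kmer] for t in t_pos[kmer])
    # Reverse strand: look up the reverse complement of each query k-mer.
    for kmer, q_starts in q_pos.items():
        rc = reverse_complement(kmer)
        if rc in t_pos:
            hits.extend((q, t, "-") for q in q_starts for t in t_pos[rc])
    return sorted(hits)


def merge_forward_runs(hits, k: int) -> list:
    """Merge '+' hits that sit on one diagonal and advance one base at a time.

    Returns sorted (query_start, query_end, target_start, target_end) blocks
    with half-open end coordinates.
    """
    by_diagonal = defaultdict(list)
    for q, t, strand in hits:
        if strand == "+":
            by_diagonal[t - q].append(q)
    blocks = []
    for diag, q_starts in by_diagonal.items():
        q_starts.sort()
        run_start = prev = q_starts[0]
        for q in q_starts[1:]:
            if q != prev + 1:
                # Gap on this diagonal: close the current run.
                blocks.append((run_start, prev + k, run_start + diag, prev + k + diag))
                run_start = q
            prev = q
        blocks.append((run_start, prev + k, run_start + diag, prev + k + diag))
    return sorted(blocks)
```

For instance, shared_kmer_hits("AAACCCGGG", "TTAAACCCTT", 4) yields three forward hits on a single diagonal, and merge_forward_runs collapses them into one block spanning query 0..6 and target 2..8.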
Requirements:
- Rust: See rust-lang.org
- Python >=3.9 <3.14
# Clone this project repo
git clone https://github.com/Adamtaranto/rusty-dot.git && cd rusty-dot
# Install maturin build tool
pip install maturin
# Build and install the python package
maturin develop --release

Each sequence added to a SequenceIndex gets its own independent FM-index
(rust-bio FM-indexes are read-only once built and cannot be extended).
Calling add_sequence or load_fasta multiple times accumulates sequences
— it never merges or replaces the existing collection.
Re-using an existing sequence name emits a UserWarning and overwrites that
entry.
If a FASTA file contains duplicate sequence names, load_fasta raises a
ValueError before adding any sequences.
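
The naming rules above can be summarised with a toy stand-in. ToySequenceStore and its load_records method are hypothetical and only mimic the described behaviour (accumulate, warn and overwrite on a re-used name, reject a batch containing duplicates before adding anything); they say nothing about how SequenceIndex is implemented internally.

```python
import warnings


class ToySequenceStore:
    """Hypothetical stand-in that mimics the naming rules described above."""

    def __init__(self):
        self._sequences = {}

    def add_sequence(self, name, seq):
        # Re-using a name warns, then overwrites that single entry.
        if name in self._sequences:
            warnings.warn(f"Sequence '{name}' already exists; overwriting", UserWarning)
        self._sequences[name] = seq

    def load_records(self, records):
        # Validate the whole batch first so a duplicate name adds nothing.
        names = [name for name, _ in records]
        duplicates = sorted({n for n in names if names.count(n) > 1})
        if duplicates:
            raise ValueError(f"Duplicate sequence names in input: {duplicates}")
        for name, seq in records:
            self.add_sequence(name, seq)
        return names

    def sequence_names(self):
        return list(self._sequences)


store = ToySequenceStore()
store.load_records([("contig1", "ACGT"), ("contig2", "GGCC")])
store.add_sequence("contig3", "TTAA")  # accumulates alongside the first two
```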
from rusty_dot import SequenceIndex
from rusty_dot.dotplot import DotPlotter
# Build an index from a multi-sequence FASTA file
# Each sequence in the file gets its own independent FM-index entry
idx = SequenceIndex(k=15)
names = idx.load_fasta("assembly.fasta")
# load_fasta accumulates: calling it again adds more sequences, keeps existing ones
# idx.load_fasta("more_sequences.fasta") # would add to the same index
# List the sequences now held in the index
print(idx.sequence_names()) # ['contig1', 'contig2', 'contig3', ...]
# Print all pairwise PAF lines (every i ≠ j combination)
for line in idx.get_paf_all():
    print(line)
# Print PAF lines for one specific pair
for line in idx.get_paf("contig1", "contig2"):
    print(line)
# All-vs-all dotplot
# Forward (+) hits are drawn in blue, reverse-complement (-) hits in red.
# Sequence names appear once per column (bottom) and once per row (left).
plotter = DotPlotter(idx)
plotter.plot(output_path="all_vs_all.png", title="All vs All")
# Save as an SVG vector image instead of PNG
plotter.plot(output_path="all_vs_all.svg", title="All vs All")
# Filter out short alignments (< 500 bp) before plotting
plotter.plot(output_path="filtered.png", min_length=500)
# Single pairwise dotplot
plotter.plot_single("contig1", "contig2", output_path="pair.png")

Compare sequences from two separate FASTA files (e.g. two genome assemblies) and plot an all-vs-all grid with subpanels scaled by relative sequence length.
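
The cross-index example reorders contigs for collinearity before plotting. One way to picture a gravity-centre ordering is sketched here, under assumptions: each target contig is placed at the length-weighted mean position of its alignments along the concatenated query axis, and contigs with no alignments go last, longest first. The function gravity_order and its input layout are hypothetical; the library's actual weighting and tie-breaking may differ.

```python
def gravity_order(alignments, query_lengths, target_lengths):
    """Order target contigs by the weighted mean query position of their hits.

    alignments: iterable of (query_name, q_start, q_end, target_name).
    query_lengths / target_lengths: dict of name -> length; the insertion
    order of query_lengths defines the layout of the query axis.
    Assumed scheme for illustration, not the rusty_dot implementation.
    """
    # Offset of each query contig along the concatenated query axis.
    offsets, running = {}, 0
    for name, length in query_lengths.items():
        offsets[name] = running
        running += length

    weighted_sum, total_weight = {}, {}
    for q_name, q_start, q_end, t_name in alignments:
        weight = q_end - q_start  # longer alignments pull harder
        midpoint = offsets[q_name] + (q_start + q_end) / 2
        weighted_sum[t_name] = weighted_sum.get(t_name, 0.0) + weight * midpoint
        total_weight[t_name] = total_weight.get(t_name, 0) + weight

    # Matched contigs: ascending gravity centre.
    matched = sorted(
        (t for t in target_lengths if total_weight.get(t, 0) > 0),
        key=lambda t: weighted_sum[t] / total_weight[t],
    )
    # Unmatched contigs: appended at the end, longest first.
    unmatched = sorted(
        (t for t in target_lengths if total_weight.get(t, 0) == 0),
        key=lambda t: -target_lengths[t],
    )
    return matched + unmatched


query_lengths = {"qA": 1000, "qB": 1000}
target_lengths = {"t1": 500, "t2": 800, "t3": 300, "t4": 900}
alignments = [
    ("qB", 100, 300, "t1"),
    ("qA", 0, 400, "t2"),
    ("qA", 600, 700, "t2"),
]
order = gravity_order(alignments, query_lengths, target_lengths)
```

In this toy input t2 aligns to the first query contig and t1 to the second, so t2 sorts ahead of t1, followed by the unmatched t4 and t3 in descending length.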
from rusty_dot.dotplot import DotPlotter
from rusty_dot.paf_io import CrossIndex, PafAlignment, PafRecord
# --- Build a cross-index for two assemblies ---
cross = CrossIndex(k=15)
cross.load_fasta("genome_a.fasta", group="a") # query sequences (rows)
cross.load_fasta("genome_b.fasta", group="b") # target sequences (columns)
# --- Sort contigs for maximum collinearity ---
# Option 1: via CrossIndex (delegates to SequenceIndex.optimal_contig_order)
q_sorted, t_sorted = cross.reorder_contigs()
# Option 2: via PafAlignment gravity-centre algorithm
# Retrieve all cross-group PAF lines
paf_lines = cross.get_paf_all()
records = [PafRecord.from_line(line) for line in paf_lines]
aln = PafAlignment.from_records(records)
q_sorted, t_sorted = aln.reorder_contigs(
    query_names=cross.query_names,
    target_names=cross.target_names,
)
# Unmatched contigs are placed at the end, sorted by descending length.
# --- Plot with relative scaling ---
plotter = DotPlotter(cross)
plotter.plot(
    query_names=q_sorted,
    target_names=t_sorted,
    output_path="cross_dotplot.png",
    scale_sequences=True,  # subplot size proportional to sequence length
    title="Genome A vs Genome B",
)
# Save as SVG vector image for publication-quality output
plotter.plot(
    query_names=q_sorted,
    target_names=t_sorted,
    output_path="cross_dotplot.svg",
    scale_sequences=True,
    title="Genome A vs Genome B",
)
# Suppress short alignments (e.g. < 500 bp) from the plot
plotter.plot(
    query_names=q_sorted,
    target_names=t_sorted,
    output_path="cross_dotplot_filtered.png",
    scale_sequences=True,
    min_length=500,
    title="Genome A vs Genome B (≥500 bp alignments)",
)

Use PafAlignment.filter_by_min_length to remove short alignment records after
loading a PAF file. This is particularly useful for cleaned-up visualisations
when alignments have been merged from k-mer runs (which can be longer than the
k-mer size) or when working with a pre-computed PAF file.
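
To make the length filter concrete, here is a minimal sketch that applies the same idea directly to raw PAF lines. It assumes length is measured as the query span (PAF columns 3 and 4, the 0-based query start and end) and that the threshold is inclusive; filter_paf_lines is a hypothetical helper, not part of rusty_dot.

```python
def filter_paf_lines(paf_lines, min_length):
    """Keep PAF lines whose query span (end - start) is at least min_length."""
    kept = []
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        # PAF columns 3 and 4 hold the 0-based query start and end.
        q_start, q_end = int(fields[2]), int(fields[3])
        if q_end - q_start >= min_length:
            kept.append(line)
    return kept


# Two made-up records: an 800 bp block and a single 15 bp k-mer hit.
example_lines = [
    "contig1\t5000\t100\t900\t+\tcontig2\t6000\t200\t1000\t800\t800\t255",
    "contig1\t5000\t1200\t1215\t-\tcontig2\t6000\t40\t55\t15\t15\t255",
]
long_lines = filter_paf_lines(example_lines, 500)
```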
from rusty_dot.paf_io import PafAlignment
aln = PafAlignment.from_file("alignments.paf")
# Keep only alignments of at least 500 bp on the query
aln_long = aln.filter_by_min_length(500)
print(f"Records before: {len(aln)}, after: {len(aln_long)}")

# All pairwise alignments within a single index
paf_lines = idx.get_paf_all()
# Or one specific pair
paf_lines = idx.get_paf("contig1", "contig2", merge=True)
with open("alignments.paf", "w") as f:
    for line in paf_lines:
        f.write(line + "\n")

# Save the current index to a compact binary file
idx.save("my_index.bin")
# Load into a new index (k must match the saved index)
idx2 = SequenceIndex(k=15)
idx2.load("my_index.bin")