A Python library for making fast dot-plot comparisons of DNA sequences powered by Rust FM-Index

Adamtaranto/rusty-dot

License: GPL v3

rusty-dot

Fast dot plot comparisons of DNA sequences using an FM-Index. Written in Rust, with Python bindings via PyO3.

Features

  • Read FASTA / gzipped FASTA files via needletail
  • Build FM-indexes per sequence using rust-bio
  • K-mer set intersection for efficient shared k-mer lookup
  • Both-strand k-mer matching: forward (+) and reverse-complement (-) hits detected via compare_sequences_stranded
  • Merge sequential k-mer runs into contiguous match blocks for both orientations:
    • Forward-strand co-linear diagonal merging (py_merge_kmer_runs)
    • RC anti-diagonal merging: standard inverted repeats (py_merge_rev_runs)
    • RC co-diagonal merging: both arms run in the same direction (py_merge_rev_fwd_runs)
    • Unified strand-aware entry point (py_merge_runs)
  • PAF format output for alignment records
  • FM-index serialization/deserialization with serde + postcard
  • All-vs-all dotplot visualization with matplotlib:
    • Forward hits drawn in blue (configurable via dot_color)
    • Reverse-complement hits drawn in red (configurable via rc_color)
    • Sequence names rendered once — at the bottom of each column and left of each row
    • SVG vector output in addition to PNG/PDF via the format parameter
    • Minimum alignment length filter (min_length) to suppress short/spurious hits before rendering
  • Cross-index comparisons between two sequence sets (e.g. two genome assemblies)
  • Relative sequence scaling in dotplot subpanels
  • Gravity-centre contig ordering for maximum collinearity
  • PafAlignment.filter_by_min_length() — discard short alignment records from a loaded PAF file
  • Full Python bindings via PyO3
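
To make the shared k-mer lookup and forward-strand merging features concrete, here is a minimal pure-Python sketch. It is not the library's implementation: plain dictionaries stand in for the FM-index, and the helper names (kmer_positions, shared_kmer_hits, merge_forward_runs) are hypothetical. It assumes that forward hits on the same diagonal (constant target start minus query start) with consecutive query starts belong to one block.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each k-mer in seq to the list of its 0-based start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def shared_kmer_hits(query, target, k):
    """Return sorted (q_start, t_start) pairs for k-mers found in both sequences."""
    q_pos = kmer_positions(query, k)
    t_pos = kmer_positions(target, k)
    shared = q_pos.keys() & t_pos.keys()  # set intersection of the k-mer keys
    return sorted(
        (q, t) for kmer in shared for q in q_pos[kmer] for t in t_pos[kmer]
    )


def merge_forward_runs(hits, k):
    """Merge hits stepping along one diagonal into (q_start, q_end, t_start, t_end)."""
    by_diagonal = defaultdict(list)
    for q, t in hits:
        by_diagonal[t - q].append(q)

    blocks = []
    for diagonal, q_starts in by_diagonal.items():
        q_starts.sort()
        run_start = prev = q_starts[0]
        for q in q_starts[1:]:
            if q != prev + 1:  # gap on this diagonal: close the current run
                blocks.append(
                    (run_start, prev + k, run_start + diagonal, prev + k + diagonal)
                )
                run_start = q
            prev = q
        # Close the final run on this diagonal
        blocks.append((run_start, prev + k, run_start + diagonal, prev + k + diagonal))
    return sorted(blocks)


# Made-up example sequences, not taken from the project
query = "ACGTACGGTTCA"
target = "TTACGTACGGAA"
hits = shared_kmer_hits(query, target, k=4)
blocks = merge_forward_runs(hits, k=4)
print(blocks)  # end coordinates are exclusive
```

Five overlapping 4-mer hits on one diagonal collapse into a single 8 bp block, which is why merged blocks can be longer than the k-mer size.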

Installation

Requirements:

# Clone this project repo
git clone https://github.com/Adamtaranto/rusty-dot.git && cd rusty-dot

# Install maturin build tool
pip install maturin

# Build and install the python package
maturin develop --release

Quick Start — single multi-FASTA index

Each sequence added to a SequenceIndex gets its own independent FM-index (rust-bio FM-indexes are read-only once built and cannot be extended).

Calling add_sequence or load_fasta multiple times accumulates sequences; the existing collection is never cleared or merged into a single index.

The one exception is a name collision: re-using an existing sequence name emits a UserWarning and overwrites that entry.

If a FASTA file contains duplicate sequence names, load_fasta raises a ValueError before adding any sequences.

from rusty_dot import SequenceIndex
from rusty_dot.dotplot import DotPlotter

# Build an index from a multi-sequence FASTA file
# Each sequence in the file gets its own independent FM-index entry
idx = SequenceIndex(k=15)
names = idx.load_fasta("assembly.fasta")

# load_fasta accumulates: calling it again adds more sequences, keeps existing ones
# idx.load_fasta("more_sequences.fasta")   # would add to the same index

# List the sequences now held in the index
print(idx.sequence_names())   # ['contig1', 'contig2', 'contig3', ...]

# Print all pairwise PAF lines (every i ≠ j combination)
for line in idx.get_paf_all():
    print(line)

# Print PAF lines for one specific pair
for line in idx.get_paf("contig1", "contig2"):
    print(line)

# All-vs-all dotplot
# Forward (+) hits are drawn in blue, reverse-complement (-) hits in red.
# Sequence names appear once per column (bottom) and once per row (left).
plotter = DotPlotter(idx)
plotter.plot(output_path="all_vs_all.png", title="All vs All")

# Save as an SVG vector image instead of PNG
plotter.plot(output_path="all_vs_all.svg", title="All vs All")

# Filter out short alignments (< 500 bp) before plotting
plotter.plot(output_path="filtered.png", min_length=500)

# Single pairwise dotplot
plotter.plot_single("contig1", "contig2", output_path="pair.png")
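
The name-handling rules described before the example (accumulate, warn and overwrite on a re-used name, reject a FASTA file with duplicate names before adding anything) can be modelled with a small toy class. ToyIndex below is a hypothetical stand-in written for illustration, not the SequenceIndex implementation; it assumes well-formed FASTA headers and stores raw strings rather than FM-indexes.

```python
import io
import warnings
from collections import Counter


class ToyIndex:
    """Toy stand-in that mimics only the documented name-handling rules."""

    def __init__(self):
        self.sequences = {}

    def add_sequence(self, name, seq):
        # Re-using a name warns, then overwrites that single entry
        if name in self.sequences:
            warnings.warn(f"overwriting existing sequence {name!r}", UserWarning)
        self.sequences[name] = seq

    def load_fasta(self, handle):
        # Parse every record first; nothing is added to the index yet
        records, name, chunks = [], None, []
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                name, chunks = line[1:].split()[0], []
            elif line and name is not None:
                chunks.append(line)
        if name is not None:
            records.append((name, "".join(chunks)))

        # Validate the whole file up front so a bad file adds no sequences
        counts = Counter(n for n, _ in records)
        duplicates = sorted(n for n, c in counts.items() if c > 1)
        if duplicates:
            raise ValueError(f"duplicate sequence names in FASTA: {duplicates}")

        for n, s in records:
            self.add_sequence(n, s)
        return [n for n, _ in records]


idx = ToyIndex()
idx.load_fasta(io.StringIO(">a\nACGT\n>b\nGGCC\n"))

# A file with a duplicated name is rejected as a whole
try:
    idx.load_fasta(io.StringIO(">c\nAAAA\n>c\nTTTT\n"))
except ValueError as err:
    print(err)
names_after_bad_file = sorted(idx.sequences)

# Re-using an existing name warns and overwrites that entry
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    idx.add_sequence("a", "TTTT")
```

Validating before mutating keeps the index in its previous state when a file is rejected, which is the behaviour the ValueError rule implies.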

All-vs-All Dotplot Between Two Genomes

Compare sequences from two separate FASTA files (e.g. two genome assemblies) and plot an all-vs-all grid with subpanels scaled by relative sequence length.

from rusty_dot.dotplot import DotPlotter
from rusty_dot.paf_io import CrossIndex, PafAlignment, PafRecord

# --- Build a cross-index for two assemblies ---
cross = CrossIndex(k=15)
cross.load_fasta("genome_a.fasta", group="a")   # query sequences (rows)
cross.load_fasta("genome_b.fasta", group="b")   # target sequences (columns)

# --- Sort contigs for maximum collinearity ---
# Option 1: via CrossIndex (delegates to SequenceIndex.optimal_contig_order)
q_sorted, t_sorted = cross.reorder_contigs()

# Option 2: via PafAlignment gravity-centre algorithm
# Retrieve all cross-group PAF lines
paf_lines = cross.get_paf_all()

records = [PafRecord.from_line(line) for line in paf_lines]
aln = PafAlignment.from_records(records)
q_sorted, t_sorted = aln.reorder_contigs(
    query_names=cross.query_names,
    target_names=cross.target_names,
)
# Unmatched contigs are placed at the end, sorted by descending length.

# --- Plot with relative scaling ---

plotter = DotPlotter(cross)
plotter.plot(
    query_names=q_sorted,
    target_names=t_sorted,
    output_path="cross_dotplot.png",
    scale_sequences=True,   # subplot size proportional to sequence length
    title="Genome A vs Genome B",
)

# Save as SVG vector image for publication-quality output
plotter.plot(
    query_names=q_sorted,
    target_names=t_sorted,
    output_path="cross_dotplot.svg",
    scale_sequences=True,
    title="Genome A vs Genome B",
)

# Suppress short alignments (e.g. < 500 bp) from the plot
plotter.plot(
    query_names=q_sorted,
    target_names=t_sorted,
    output_path="cross_dotplot_filtered.png",
    scale_sequences=True,
    min_length=500,
    title="Genome A vs Genome B (≥500 bp alignments)",
)
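
For intuition about gravity-centre ordering, the sketch below shows one common reading of the idea: each target contig is placed at the length-weighted mean position of its alignments along the concatenated query axis, and unmatched contigs go last by descending length. The function name gravity_order, the tuple layout, and the weighting are assumptions for illustration; the library's exact algorithm may differ.

```python
from collections import defaultdict


def gravity_order(alignments, query_order, query_lengths, target_lengths):
    """Order target contigs by the weighted mean position of their query hits.

    alignments: iterable of (query_name, q_start, q_end, target_name) tuples.
    """
    # Offset of each query contig on a single concatenated query axis
    offsets, total = {}, 0
    for name in query_order:
        offsets[name] = total
        total += query_lengths[name]

    weighted_sum = defaultdict(float)
    weight = defaultdict(float)
    for q_name, q_start, q_end, t_name in alignments:
        length = q_end - q_start
        if length <= 0:
            continue  # ignore empty or malformed intervals
        midpoint = offsets[q_name] + (q_start + q_end) / 2
        weighted_sum[t_name] += midpoint * length
        weight[t_name] += length

    # Matched targets sorted by their gravity centre on the query axis
    matched = sorted(weight, key=lambda t: weighted_sum[t] / weight[t])
    # Unmatched targets go last, longest first
    unmatched = sorted(
        (t for t in target_lengths if t not in weight),
        key=lambda t: -target_lengths[t],
    )
    return matched + unmatched


# Made-up example data
query_lengths = {"qA": 1000, "qB": 500}
target_lengths = {"t1": 400, "t2": 800, "t3": 300, "t4": 600}
alignments = [
    ("qB", 100, 300, "t1"),
    ("qA", 0, 400, "t2"),
    ("qA", 600, 900, "t2"),
]
order = gravity_order(alignments, ["qA", "qB"], query_lengths, target_lengths)
print(order)
```

Here t2 lands first because its hits centre near the start of qA, t1 follows because it maps to qB, and the unmatched t4 and t3 trail in descending length.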

Filtering PAF Alignments by Length

Use PafAlignment.filter_by_min_length to remove short alignment records after loading a PAF file. This helps produce cleaner visualisations, both when alignments have been merged from k-mer runs (merged blocks can be longer than the k-mer size) and when working with a pre-computed PAF file.

from rusty_dot.paf_io import PafAlignment

aln = PafAlignment.from_file("alignments.paf")

# Keep only alignments of at least 500 bp on the query
aln_long = aln.filter_by_min_length(500)
print(f"Records before: {len(aln)}, after: {len(aln_long)}")
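
As a rough picture of what a query-length filter does, the sketch below applies the same threshold to raw PAF lines using only the standard library. It assumes the standard PAF column order (query start and end in columns 3 and 4) and measures length as the query span; the helper names and the example records are made up, and this is not how filter_by_min_length is implemented.

```python
def query_span(paf_line):
    """Return the aligned length on the query (query end minus query start)."""
    fields = paf_line.rstrip("\n").split("\t")
    return int(fields[3]) - int(fields[2])


def filter_paf_lines(lines, min_length):
    """Keep only PAF lines whose query span is at least min_length."""
    return [line for line in lines if query_span(line) >= min_length]


# Made-up records: one 800 bp merged block and one bare 15-mer hit
lines = [
    "contig1\t5000\t100\t900\t+\tcontig2\t6000\t200\t1000\t800\t800\t255",
    "contig1\t5000\t2000\t2015\t-\tcontig2\t6000\t3000\t3015\t15\t15\t255",
]
kept = filter_paf_lines(lines, 500)
print(f"Records before: {len(lines)}, after: {len(kept)}")
```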

Writing PAF Lines to a File

# All pairwise alignments within a single index
paf_lines = idx.get_paf_all()

# Or one specific pair (this reassigns paf_lines; use whichever call you need)
paf_lines = idx.get_paf("contig1", "contig2", merge=True)

with open("alignments.paf", "w") as f:
    for line in paf_lines:
        f.write(line + "\n")

Saving and Loading Indexes

# Save the current index to a compact binary file
idx.save("my_index.bin")

# Load into a new index (k must match the saved index)
idx2 = SequenceIndex(k=15)
idx2.load("my_index.bin")
