A Python package implementing MycoBLAST-style sequence identity calculations for DNA sequences, specifically designed for mycological DNA barcoding applications. This package provides sophisticated sequence alignment and scoring with various adjustments for homopolymer differences, IUPAC ambiguity codes, and sequencing artifacts.
Based on the MycoBLAST algorithm developed by Stephen Russell and Mycota Lab. See the foundational article: "Why NCBI BLAST identity scores can be misleading for fungi" which explains the theoretical basis and motivation for these sequence preprocessing adjustments.
- Homopolymer Length Normalization: Ignore differences in homopolymer run lengths (e.g., "AAA" vs "AAAA")
- Repeat Motif Adjustment: Handle dinucleotide and longer repeat motifs (e.g., "ATATAT" vs "ATATATAT")
- IUPAC Ambiguity Code Handling: Allow different ambiguity codes to match via nucleotide intersection
- MSA Dual-Gap Support: Correctly handle sequences from multi-sequence alignments (MSA) where both sequences may have gaps at the same position
- End Trimming: Skip mismatches in terminal regions to avoid sequencing artifacts (disabled by default, set
end_skip_distanceto enable) - Indel Normalization: Count contiguous indels as single evolutionary events
- LOCAL / GLOBAL Scoring Modes: Choose between overlap-only scoring (default) or full-alignment scoring that penalizes terminal gaps — critical for comparing sequences of very different lengths
- Comprehensive Alignment: Multi-stage bidirectional alignment optimization using edlib
- Flexible Configuration: Enable/disable individual adjustments as needed
pip install adjusted-identitypip install git+https://github.com/joshuaowalker/adjusted-identity.gitgit clone https://github.com/joshuaowalker/adjusted-identity.git
cd adjusted-identity
pip install -e ".[dev]"from adjusted_identity import align_and_score
# For raw sequences - handles alignment and scoring
result = align_and_score("ATCGAAAAATGTC", "ATCGAAAATGTC")
print(f"Adjusted identity: {result.identity:.3f}")
print(f"Coverage: seq1={result.seq1_coverage:.3f}, seq2={result.seq2_coverage:.3f}")from adjusted_identity import score_alignment
# For pre-aligned sequences from BLAST, BioPython, etc.
aligned_seq1 = "ATCG-AAAT" # From your alignment tool
aligned_seq2 = "ATCGAAAAT" # From your alignment tool
result = score_alignment(aligned_seq1, aligned_seq2)
print(f"Adjusted identity: {result.identity:.3f}")from adjusted_identity import align_and_score, RAW_ADJUSTMENT_PARAMS
# Traditional identity (no adjustments)
raw_result = align_and_score("ATCGAAAAATGTC", "ATCGAAAATGTC", RAW_ADJUSTMENT_PARAMS)
print(f"Traditional identity: {raw_result.identity:.3f}")
# Adjusted identity (with MycoBLAST adjustments)
adj_result = align_and_score("ATCGAAAAATGTC", "ATCGAAAATGTC")
print(f"Adjusted identity: {adj_result.identity:.3f}")
# Examine the scoring pattern
print(f"Score pattern: {adj_result.score_aligned}")
# '|' = exact match, '=' = ambiguous/homopolymer, ' ' = substitutionCommon scenario: ITS sequences with homopolymer differences due to sequencing artifacts.
from adjusted_identity import align_and_score
# Fungal ITS sequences with different homopolymer lengths
its_seq1 = "TCCGTAGGTGAACCTGCGGAAGGATCATTACCGAGTTTAAA" # 3 A's at end
its_seq2 = "TCCGTAGGTGAACCTGCGGAAGGATCATTACCGAGTTTTAAAA" # 4 A's at end
result = align_and_score(its_seq1, its_seq2)
print(f"Species identity: {result.identity:.3f}") # Should be ~1.0 with homopolymer adjustmentfrom adjusted_identity import align_and_score
# Sequences with IUPAC ambiguity codes
barcode1 = "ATCGRGTC" # R = A or G
barcode2 = "ATCGKGTC" # K = G or T (both R and K contain G)
result = align_and_score(barcode1, barcode2)
print(f"Identity with IUPAC handling: {result.identity:.3f}") # Should be 1.0
print(f"Score pattern: {result.score_aligned}") # Shows '=' for ambiguous matchesBy default, identity is calculated over the overlap region only (LOCAL mode). This works well for most use cases, but can be misleading when comparing a short sequence against a much longer one — the short fragment may match perfectly within its overlap while covering only a fraction of the target.
GLOBAL mode uses Needleman-Wunsch alignment and scores the full alignment including terminal gaps, giving a more conservative identity estimate:
from adjusted_identity import align_and_score, AdjustmentParams, ScoringMode
short_seq = "ATCG" * 55 # 220bp truncated sequence
full_seq = "ATCG" * 165 # 660bp full-length target
# LOCAL (default): scores only the overlapping region
local_result = align_and_score(short_seq, full_seq)
print(f"LOCAL identity: {local_result.identity:.3f}") # ~1.000
print(f"LOCAL seq2_coverage: {local_result.seq2_coverage:.3f}")
# GLOBAL: terminal gaps count against identity
global_params = AdjustmentParams(scoring_mode=ScoringMode.GLOBAL, normalize_indels=False)
global_result = align_and_score(short_seq, full_seq, global_params)
print(f"GLOBAL identity: {global_result.identity:.3f}") # ~0.333
print(f"GLOBAL seq2_coverage: {global_result.seq2_coverage:.3f}")Use LOCAL mode (the default) when you want overlap-only identity, e.g. when sequences are expected to be similar in length or when coverage is checked separately. Use GLOBAL mode when identity should reflect how much of the target is actually covered, e.g. for clustering or when comparing sequences of very different lengths.
Tip: You can also pass the mode as a string: AdjustmentParams(scoring_mode="global").
The score_aligned field provides a visual representation of how each position was scored:
|= Exact match between standard nucleotides (A=A, C=C, G=G, T=T)== Ambiguous match (IUPAC codes) or homopolymer/repeat extension(space) = Substitution (mismatch)-= Indel extension (normalized).= End-trimmed, dual-gap, or overhang position (not scored)
from adjusted_identity import align_and_score
result = align_and_score("ATCGRAAATGTC", "ATCGAAAAATGTC")
print(f"Seq1: {result.seq1_aligned}")
print(f"Seq2: {result.seq2_aligned}")
print(f"Score: {result.score_aligned}")
# Output might show: ||||==||||||
# ATCG = exact matches (||||)
# R vs A = ambiguous match (=)
# AAA vs AAAA = homopolymer extension (=)By default, the output alignment strings preserve the original gap positions from the aligner. However, since standard aligners don't distinguish homopolymer extensions from true indels, the gap placement may not reflect how positions were actually scored.
Use adjust_gaps=True to rewrite gap positions so the visualization directly matches the scoring interpretation:
from adjusted_identity import align_and_score
# These sequences differ only in homopolymer lengths (4A+3T vs 3A+4T)
seq1 = "AAAATTT"
seq2 = "AAATTTT"
# Default: alignment strings show A-vs-T at position 4 (looks like substitution)
result = align_and_score(seq1, seq2, adjust_gaps=False)
print(f"Seq1: {result.seq1_aligned}") # AAAATTT
print(f"Seq2: {result.seq2_aligned}") # AAATTTT
print(f"Score: {result.score_aligned}") # |||=||| (score shows extension, but alignment hides it)
# Adjusted: gaps inserted to reveal the homopolymer interpretation
result = align_and_score(seq1, seq2, adjust_gaps=True)
print(f"Seq1: {result.seq1_aligned}") # AAAA-TTT
print(f"Seq2: {result.seq2_aligned}") # AAA-TTTT
print(f"Score: {result.score_aligned}") # |||==||| (two extensions now visible)
# Identity is 1.0 in both cases - the adjustment is purely visual
print(f"Identity: {result.identity}") # 1.0The adjust_gaps=True mode is particularly useful when you need to visually inspect why positions were scored a certain way—each position in the output alignment corresponds directly to its scoring marker.
from adjusted_identity import align_and_score, AdjustmentParams
# Dinucleotide repeat differences (AT repeat from Russell article)
seq1 = "CGATAT--C" # Missing one AT unit
seq2 = "CGATATATC" # Has extra AT unit
# With repeat motif adjustment (default)
result = align_and_score(seq1, seq2)
print(f"Adjusted identity: {result.identity:.3f}") # Should be 1.0
# Control max repeat motif length
params = AdjustmentParams(
max_repeat_motif_length=3 # Detect up to trinucleotide repeats (e.g., CAG)
)
result = align_and_score("CAGCAG---TTC", "CAGCAGCAGTTC", params)The package correctly handles sequence pairs extracted from multi-sequence alignments (MSA), where both sequences may have gaps at the same position due to alignment with third sequences.
from adjusted_identity import score_alignment
# Sequences from MSA (e.g., spoa, MUSCLE, MAFFT output)
# Both sequences have gaps at positions 3-4 due to alignment with a third sequence
msa_seq1 = "AGA--TT"
msa_seq2 = "AGAT-TT"
result = score_alignment(msa_seq1, msa_seq2)
print(f"MSA identity: {result.identity:.3f}") # Should be 1.0 - 'T' recognized as homopolymer
print(f"Score pattern: {result.score_aligned}")
# Another example with consensus-based homopolymer detection
msa_seq1 = "AGG-AC" # G at position 2
msa_seq2 = "AG-GAC" # G at position 3
result = score_alignment(msa_seq1, msa_seq2)
print(f"MSA identity: {result.identity:.3f}") # Both G's recognized as homopolymer extensionsKey MSA features:
- Dual-gap handling: Positions where both sequences have '-' are excluded from scoring (marked with
.) - Consensus context: Homopolymer detection uses consensus nucleotides from both sequences
- Conflict resolution: When sequences disagree at context positions, homopolymer extension is not applied
from adjusted_identity import align_and_score, AdjustmentParams
# Custom adjustment parameters
custom_params = AdjustmentParams(
normalize_homopolymers=True, # Enable homopolymer adjustment
handle_iupac_overlap=False, # Disable IUPAC intersection
normalize_indels=True, # Enable indel normalization
end_skip_distance=10, # Skip 10bp from each end (default is 0 = disabled)
max_repeat_motif_length=2 # Detect up to dinucleotide repeats (default)
)
result = align_and_score(seq1, seq2, custom_params)This package provides a layered API design that separates sequence alignment from identity scoring, giving you maximum flexibility:
The core implementation that applies MycoBLAST-style adjustments to any pre-aligned sequences. This allows you to use any alignment library (BLAST, BioPython, edlib, etc.) with the adjusted identity algorithm.
Input: Just needs gapped sequences of equal length
Output: Adjusted identity metrics with detailed scoring information
A higher-level function that combines fast edlib alignment with the scoring algorithm. Provides BLAST-like infix alignment that's fast enough for production use without requiring external dependencies.
Input: Raw unaligned sequences
Output: Complete alignment and adjusted identity results
score_alignment(seq1_aligned, seq2_aligned, adjustment_params=None, scoring_format=None, adjust_gaps=False)
The core implementation - applies MycoBLAST-style adjustments to pre-aligned sequences from any source.
Use this when:
- You already have alignments from BLAST, BioPython, or other tools
- You want to integrate adjusted identity into existing pipelines
- You need maximum control over the alignment process
Parameters:
seq1_aligned,seq2_aligned(str): Pre-aligned sequences with gaps (must be same length)adjustment_params(AdjustmentParams, optional): Adjustment parametersscoring_format(ScoringFormat, optional): Scoring visualization formatadjust_gaps(bool, optional): IfTrue, rewrite gap positions so the output alignment matches the scoring interpretation. Output strings may have different length than input. Defaults toFalse.
Returns:
AlignmentResult: Scoring results and metrics
Example with BLAST alignment:
# After getting BLAST alignment results
from adjusted_identity import score_alignment
blast_seq1 = "ATCG-AAAT" # From BLAST output
blast_seq2 = "ATCGAAAAT" # From BLAST output
result = score_alignment(blast_seq1, blast_seq2)
print(f"Adjusted identity: {result.identity:.3f}")High-level convenience function that handles both alignment and scoring in one step. Uses semi-global (HW) alignment by default, or Needleman-Wunsch (NW) alignment when scoring_mode=ScoringMode.GLOBAL.
Use this when:
- You want a simple, fast solution without additional alignment tools
- You need BLAST-like performance for production use
- You're comparing raw sequences end-to-end
Parameters:
seq1,seq2(str): Raw DNA sequences to compareadjustment_params(AdjustmentParams, optional): Adjustment parametersscoring_format(ScoringFormat, optional): Scoring visualization formatadjust_gaps(bool, optional): IfTrue, rewrite gap positions so the output alignment matches the scoring interpretation. Output strings may have different length than input. Defaults toFalse.
Returns:
AlignmentResult: Contains identity metrics, alignment, and coverage information
Example:
from adjusted_identity import align_and_score
result = align_and_score("ATCGAAAATGTC", "ATCGAAAATGTC")
print(f"Identity: {result.identity:.3f}")Controls whether terminal gaps are included in identity scoring:
from adjusted_identity import ScoringMode
ScoringMode.LOCAL # Score overlap region only (default)
ScoringMode.GLOBAL # Score full alignment including terminal gaps (NW alignment)Configure which sequence adjustments to apply:
AdjustmentParams(
normalize_homopolymers=True, # Ignore homopolymer length differences
handle_iupac_overlap=True, # Allow IUPAC ambiguity intersections
normalize_indels=True, # Count contiguous indels as single events
end_skip_distance=0, # Skip first/last N nucleotides (0 = disabled by default)
max_repeat_motif_length=2, # Maximum repeat motif length to detect (1=homopolymers only, 2=dinucleotides, etc.)
scoring_mode=ScoringMode.LOCAL # LOCAL (overlap only) or GLOBAL (full alignment with terminal gaps)
)Customize alignment visualization characters:
ScoringFormat(
match='|', # Exact match (A=A, C=C, G=G, T=T)
ambiguous_match='=', # Ambiguous nucleotide match (any IUPAC code match)
substitution=' ', # Nucleotide substitution
indel_start=' ', # First position of indel
indel_extension='-', # Indel positions (normalization)
homopolymer_extension='=', # Homopolymer length difference
end_trimmed='.' # Position outside scoring region
)Results returned by alignment functions:
AlignmentResult(
identity=0.95, # Identity score (0.0-1.0)
mismatches=2, # Number of mismatches counted
scored_positions=40, # Positions used for identity calculation
seq1_coverage=0.98, # Fraction of seq1 in alignment
seq2_coverage=0.97, # Fraction of seq2 in alignment
seq1_aligned="ATCG-ATCG", # Aligned sequence 1 with gaps
seq2_aligned="ATCGATCG-", # Aligned sequence 2 with gaps
score_aligned="||||=|||" # Scoring visualization
)Note: When adjust_gaps=True, the seq1_aligned and seq2_aligned strings are rewritten so gap positions match the scoring interpretation. This makes the score_aligned visualization directly interpretable position-by-position, but the output length may differ from the input alignment.
DEFAULT_ADJUSTMENT_PARAMS: All adjustments enabled (recommended)RAW_ADJUSTMENT_PARAMS: No adjustments (traditional identity)DEFAULT_SCORING_FORMAT: Default visualization characters
The core score_alignment() function works with alignments from any source. Here are examples with popular alignment libraries:
# NOTE: This example is illustrative and not tested
from Bio.Blast import NCBIWWW, NCBIXML
from adjusted_identity import score_alignment
# Run BLAST search (example)
result_handle = NCBIWWW.qblast("blastn", "nt", query_sequence)
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records:
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
# Extract aligned sequences from BLAST HSP
query_aligned = hsp.query
subject_aligned = hsp.sbjct
# Apply adjusted identity scoring
adj_result = score_alignment(query_aligned, subject_aligned)
print(f"BLAST identity: {hsp.identities/hsp.align_length:.3f}")
print(f"Adjusted identity: {adj_result.identity:.3f}")from Bio import Align
from adjusted_identity import score_alignment
# Create BioPython aligner
aligner = Align.PairwiseAligner()
aligner.match_score = 2
aligner.mismatch_score = -1
# Example sequences
seq1 = "ATCGATCG"
seq2 = "ATCGTCG" # Missing one nucleotide
# Perform alignment
alignments = aligner.align(seq1, seq2)
best_alignment = alignments[0]
# Extract aligned sequences using indexing (idiomatic BioPython)
seq1_aligned = str(best_alignment[0])
seq2_aligned = str(best_alignment[1])
# Apply adjusted scoring
result = score_alignment(seq1_aligned, seq2_aligned)
print(f"Adjusted identity: {result.identity:.3f}")# NOTE: This example is illustrative template code
from adjusted_identity import score_alignment
def process_alignment_file(alignment_file):
"""Process alignments from any external tool."""
results = []
# Parse your alignment format (FASTA, SAM, custom, etc.)
for seq1_aligned, seq2_aligned in parse_alignment_file(alignment_file):
# Ensure sequences are the same length
assert len(seq1_aligned) == len(seq2_aligned)
# Apply adjusted identity scoring
result = score_alignment(seq1_aligned, seq2_aligned)
results.append(result)
return resultsThe end_skip_distance parameter implements "digital end trimming" to skip sequencing artifacts near read ends. Important: This parameter counts nucleotides (non-gap characters), not alignment positions.
End trimming only activates when sequences are long enough:
from adjusted_identity import align_and_score, AdjustmentParams
# Short sequences (< 2 × end_skip_distance nucleotides): NO trimming applied
short_seq1 = "ATCGATCG" # 8 nucleotides
short_seq2 = "ATCGATCG" # 8 nucleotides
result = align_and_score(short_seq1, short_seq2) # end_skip_distance=0 by default
print(f"Scored positions: {result.scored_positions}") # 8 (full sequence)
print(f"Score pattern: {result.score_aligned}") # "||||||||" (no trimming dots)
# Long sequences (≥ 2 × end_skip_distance nucleotides): Trimming applied
long_seq1 = "A" * 25 + "TCGX" + "T" * 25 # 54 nucleotides
long_seq2 = "A" * 25 + "TCGA" + "T" * 25 # 54 nucleotides
result = align_and_score(long_seq1, long_seq2)
print(f"Scored positions: {result.scored_positions}") # ~14 (middle region only)
print(f"Score pattern: {result.score_aligned}") # ".......|||| |||||......." (dots show trimmed regions)End trimming counts actual nucleotides in each sequence, ignoring gaps:
# This alignment has gaps, but nucleotide counting still works correctly
seq1_aligned = "AAA---TCGATCG---TTT" # 12 nucleotides (ignoring gaps)
seq2_aligned = "---AAATCGATCGTTT---" # 12 nucleotides (ignoring gaps)
# With end_skip_distance=5: skips first 5 and last 5 nucleotides from each sequence
# Only the middle "TCGATCG" region (2 nucleotides) would be scored# Disable end trimming completely
no_trim_params = AdjustmentParams(end_skip_distance=0)
result = align_and_score(long_seq1, long_seq2, no_trim_params)
# Use shorter trimming distance for smaller sequences
short_trim_params = AdjustmentParams(end_skip_distance=5)
result = align_and_score(medium_seq1, medium_seq2, short_trim_params)Rule of thumb: For sequences shorter than 2 × end_skip_distance nucleotides, end trimming has no effect and the entire alignment is scored.
from adjusted_identity import align_and_score
sequences = [
("seq1", "ATCGATCGATCG"),
("seq2", "ATCGATCGATCC"),
("seq3", "ATCGATCGATCG"),
]
reference = sequences[0][1]
results = []
for name, seq in sequences[1:]:
result = align_and_score(reference, seq)
results.append({
'name': name,
'identity': result.identity,
'coverage': min(result.seq1_coverage, result.seq2_coverage)
})
# Sort by identity
results.sort(key=lambda x: x['identity'], reverse=True)The score_aligned field shows how each position was scored:
result = align_and_score("AAA-TTT", "AAAATTT")
print(result.score_aligned) # "|||=|||"
# | = match
# = = homopolymer extension (ignored if adjustment enabled)Scoring symbols:
|: Exact match (A=A, C=C, G=G, T=T)=: Ambiguous match (IUPAC) or homopolymer/repeat extension(space): Substitution or indel start (counts as mismatch)-: Indel extension (ignored if normalization enabled).: End-trimmed, dual-gap, or overhang (not scored)
Run the comprehensive test suite:
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=adjusted_identity --cov-report=htmlThe test suite includes:
- Unit tests for all adjustment types
- Edge cases and error conditions
- Real-world mycological scenarios
- Performance tests with long sequences
- Documentation examples
This package implements the sequence preprocessing approach described in the MycoBLAST algorithm by Stephen Russell and Mycota Lab, adapted for general-purpose DNA sequence comparison. The foundational research is detailed in "Why NCBI BLAST identity scores can be misleading for fungi".
Standard sequence alignment algorithms use gap penalties that don't distinguish between different biological contexts. Specifically, they penalize gaps uniformly regardless of whether the gap occurs within a homopolymer run (e.g., AAAA vs AAA) or represents a true insertion/deletion event.
For many applications—particularly DNA barcoding where sequencing technology introduces homopolymer length variation as a technical artifact—it would be desirable to assign zero penalty to homopolymer length differences while still penalizing other indels normally. However, existing aligners don't support such context-dependent gap costs.
This package addresses the limitation by applying a post-hoc adjustment: we use a standard aligner to establish positional correspondence between sequences, then rescore the alignment with homopolymer-aware logic that treats length variation in repetitive tracts as neutral.
Starting in v0.2.0, this package uses a variant range algorithm that provides more accurate scoring for complex indel patterns, especially in multi-sequence alignments.
The key insight: Standard aligners don't know about homopolymers—they just find the minimum-edit alignment. This can produce patterns where a simple homopolymer expansion looks like a complex substitution or scattered indels.
How it works:
-
Find variant ranges: Scan the alignment for contiguous regions where sequences differ (gaps, mismatches, or both). These are bounded by matching positions on each side.
-
Extract alleles: For each variant range, pull out the gap-free content from each sequence. For example, in
TGC-C-TCvsTGCT--TC, the variant range yields alleles"C"and"T". -
Check for extensions: Ask whether each allele could be explained as a repeat of the adjacent context:
- Does
"C"extend the left contextC? Yes (homopolymer) - Does
"T"extend the right contextT? Yes (homopolymer)
- Does
-
Apply Occam's razor: If both alleles are valid extensions of their respective contexts, they represent equivalent repeat expansions → 0 edits. No mismatch is counted because both placements are biologically plausible.
Example:
seq1: ATTCA Traditional scoring: 1 substitution (T vs C)
seq2: ATCCA Variant range: T extends left T, C extends right C → 0 edits
This approach handles cases that position-by-position algorithms miss, such as "floating" nucleotides in MSA data where gap placement is arbitrary.
For the complete specification, see docs/SCORING_SPEC.md.
The adjustments are particularly valuable for:
- Fungal taxonomy: ITS sequences often have homopolymer differences
- DNA barcoding: Technical artifacts can obscure phylogenetic signal
- Sequence quality assessment: End-trimming handles poor-quality regions
- Phylogenetic analysis: IUPAC codes preserve ambiguous but valid matches
Credit: This implementation is based on the MycoBLAST algorithm developed by Stephen Russell and Mycota Lab. The theoretical framework and biological motivation are thoroughly explained in their foundational article.
- Fork the repository
- Create a feature branch:
git checkout -b feature-name - Make changes and add tests
- Run the test suite:
pytest - Submit a pull request
If you use this package in your research, please cite:
Walker, J. (2025). Adjusted Identity Calculator for DNA Sequences.
GitHub: https://github.com/joshuaowalker/adjusted-identity
Please also cite the foundational work:
Russell, S. (2025). Why NCBI BLAST identity scores can be misleading for fungi.
Mycota Lab. https://mycotalab.substack.com/p/why-ncbi-blast-identity-scores-can
BSD 2-Clause License - see LICENSE file for details.
See CHANGELOG.md for the full release history.