Skip to content

Calculates adjusted identity scores for DNA sequences a la MycoBLAST

License

Notifications You must be signed in to change notification settings

joshuaowalker/adjusted-identity

Repository files navigation

Adjusted Identity Calculator for DNA Sequences

PyPI version CI

A Python package implementing MycoBLAST-style sequence identity calculations for DNA sequences, specifically designed for mycological DNA barcoding applications. This package provides sophisticated sequence alignment and scoring with various adjustments for homopolymer differences, IUPAC ambiguity codes, and sequencing artifacts.

Based on the MycoBLAST algorithm developed by Stephen Russell and Mycota Lab. See the foundational article: "Why NCBI BLAST identity scores can be misleading for fungi" which explains the theoretical basis and motivation for these sequence preprocessing adjustments.

Features

  • Homopolymer Length Normalization: Ignore differences in homopolymer run lengths (e.g., "AAA" vs "AAAA")
  • Repeat Motif Adjustment: Handle dinucleotide and longer repeat motifs (e.g., "ATATAT" vs "ATATATAT")
  • IUPAC Ambiguity Code Handling: Allow different ambiguity codes to match via nucleotide intersection
  • MSA Dual-Gap Support: Correctly handle sequences from multi-sequence alignments (MSA) where both sequences may have gaps at the same position
  • End Trimming: Skip mismatches in terminal regions to avoid sequencing artifacts (disabled by default, set end_skip_distance to enable)
  • Indel Normalization: Count contiguous indels as single evolutionary events
  • LOCAL / GLOBAL Scoring Modes: Choose between overlap-only scoring (default) or full-alignment scoring that penalizes terminal gaps — critical for comparing sequences of very different lengths
  • Comprehensive Alignment: Multi-stage bidirectional alignment optimization using edlib
  • Flexible Configuration: Enable/disable individual adjustments as needed

Installation

From PyPI

pip install adjusted-identity

From GitHub (latest development version)

pip install git+https://github.com/joshuaowalker/adjusted-identity.git

Development Installation

git clone https://github.com/joshuaowalker/adjusted-identity.git
cd adjusted-identity
pip install -e ".[dev]"

Quick Start

Option 1: Complete Solution (align + score)

from adjusted_identity import align_and_score

# For raw sequences - handles alignment and scoring
result = align_and_score("ATCGAAAAATGTC", "ATCGAAAATGTC")
print(f"Adjusted identity: {result.identity:.3f}")
print(f"Coverage: seq1={result.seq1_coverage:.3f}, seq2={result.seq2_coverage:.3f}")

Option 2: Core Scoring Only (use with your alignments)

from adjusted_identity import score_alignment

# For pre-aligned sequences from BLAST, BioPython, etc.
aligned_seq1 = "ATCG-AAAT"  # From your alignment tool
aligned_seq2 = "ATCGAAAAT"  # From your alignment tool

result = score_alignment(aligned_seq1, aligned_seq2)
print(f"Adjusted identity: {result.identity:.3f}")

Compare with Traditional Identity

from adjusted_identity import align_and_score, RAW_ADJUSTMENT_PARAMS

# Traditional identity (no adjustments)
raw_result = align_and_score("ATCGAAAAATGTC", "ATCGAAAATGTC", RAW_ADJUSTMENT_PARAMS)
print(f"Traditional identity: {raw_result.identity:.3f}")

# Adjusted identity (with MycoBLAST adjustments)  
adj_result = align_and_score("ATCGAAAAATGTC", "ATCGAAAATGTC")
print(f"Adjusted identity: {adj_result.identity:.3f}")

# Examine the scoring pattern
print(f"Score pattern: {adj_result.score_aligned}")
# '|' = exact match, '=' = ambiguous/homopolymer, ' ' = substitution

Use Cases

Mycological DNA Barcoding

Common scenario: ITS sequences with homopolymer differences due to sequencing artifacts.

from adjusted_identity import align_and_score

# Fungal ITS sequences with different homopolymer lengths
its_seq1 = "TCCGTAGGTGAACCTGCGGAAGGATCATTACCGAGTTTAAA"    # 3 A's at end
its_seq2 = "TCCGTAGGTGAACCTGCGGAAGGATCATTACCGAGTTTTAAAA"   # 4 A's at end

result = align_and_score(its_seq1, its_seq2)
print(f"Species identity: {result.identity:.3f}")  # Should be ~1.0 with homopolymer adjustment

Handling Ambiguous Bases

from adjusted_identity import align_and_score

# Sequences with IUPAC ambiguity codes
barcode1 = "ATCGRGTC"  # R = A or G
barcode2 = "ATCGKGTC"  # K = G or T (both R and K contain G)

result = align_and_score(barcode1, barcode2)
print(f"Identity with IUPAC handling: {result.identity:.3f}")  # Should be 1.0
print(f"Score pattern: {result.score_aligned}")  # Shows '=' for ambiguous matches

Comparing Sequences of Different Lengths (LOCAL vs GLOBAL)

By default, identity is calculated over the overlap region only (LOCAL mode). This works well for most use cases, but can be misleading when comparing a short sequence against a much longer one — the short fragment may match perfectly within its overlap while covering only a fraction of the target.

GLOBAL mode uses Needleman-Wunsch alignment and scores the full alignment including terminal gaps, giving a more conservative identity estimate:

from adjusted_identity import align_and_score, AdjustmentParams, ScoringMode

short_seq = "ATCG" * 55   # 220bp truncated sequence
full_seq  = "ATCG" * 165  # 660bp full-length target

# LOCAL (default): scores only the overlapping region
local_result = align_and_score(short_seq, full_seq)
print(f"LOCAL identity: {local_result.identity:.3f}")   # ~1.000
print(f"LOCAL seq2_coverage: {local_result.seq2_coverage:.3f}")

# GLOBAL: terminal gaps count against identity
global_params = AdjustmentParams(scoring_mode=ScoringMode.GLOBAL, normalize_indels=False)
global_result = align_and_score(short_seq, full_seq, global_params)
print(f"GLOBAL identity: {global_result.identity:.3f}")  # ~0.333
print(f"GLOBAL seq2_coverage: {global_result.seq2_coverage:.3f}")

Use LOCAL mode (the default) when you want overlap-only identity, e.g. when sequences are expected to be similar in length or when coverage is checked separately. Use GLOBAL mode when identity should reflect how much of the target is actually covered, e.g. for clustering or when comparing sequences of very different lengths.

Tip: You can also pass the mode as a string: AdjustmentParams(scoring_mode="global").

Understanding Score Patterns

The score_aligned field provides a visual representation of how each position was scored:

  • | = Exact match between standard nucleotides (A=A, C=C, G=G, T=T)
  • = = Ambiguous match (IUPAC codes) or homopolymer/repeat extension
  • (space) = Substitution (mismatch)
  • - = Indel extension (normalized)
  • . = End-trimmed, dual-gap, or overhang position (not scored)
from adjusted_identity import align_and_score

result = align_and_score("ATCGRAAATGTC", "ATCGAAAAATGTC")
print(f"Seq1: {result.seq1_aligned}")
print(f"Seq2: {result.seq2_aligned}")
print(f"Score: {result.score_aligned}")
# Output might show: ||||==||||||
#                    ATCG = exact matches (||||)
#                    R vs A = ambiguous match (=)
#                    AAA vs AAAA = homopolymer extension (=)

Gap-Adjusted Visualization

By default, the output alignment strings preserve the original gap positions from the aligner. However, since standard aligners don't distinguish homopolymer extensions from true indels, the gap placement may not reflect how positions were actually scored.

Use adjust_gaps=True to rewrite gap positions so the visualization directly matches the scoring interpretation:

from adjusted_identity import align_and_score

# These sequences differ only in homopolymer lengths (4A+3T vs 3A+4T)
seq1 = "AAAATTT"
seq2 = "AAATTTT"

# Default: alignment strings show A-vs-T at position 4 (looks like substitution)
result = align_and_score(seq1, seq2, adjust_gaps=False)
print(f"Seq1:  {result.seq1_aligned}")   # AAAATTT
print(f"Seq2:  {result.seq2_aligned}")   # AAATTTT
print(f"Score: {result.score_aligned}")  # |||=|||  (score shows extension, but alignment hides it)

# Adjusted: gaps inserted to reveal the homopolymer interpretation
result = align_and_score(seq1, seq2, adjust_gaps=True)
print(f"Seq1:  {result.seq1_aligned}")   # AAAA-TTT
print(f"Seq2:  {result.seq2_aligned}")   # AAA-TTTT
print(f"Score: {result.score_aligned}")  # |||==|||  (two extensions now visible)

# Identity is 1.0 in both cases - the adjustment is purely visual
print(f"Identity: {result.identity}")    # 1.0

The adjust_gaps=True mode is particularly useful when you need to visually inspect why positions were scored a certain way—each position in the output alignment corresponds directly to its scoring marker.

Repeat Motif Handling

from adjusted_identity import align_and_score, AdjustmentParams

# Dinucleotide repeat differences (AT repeat from Russell article)
seq1 = "CGATAT--C"  # Missing one AT unit
seq2 = "CGATATATC"  # Has extra AT unit

# With repeat motif adjustment (default)
result = align_and_score(seq1, seq2)
print(f"Adjusted identity: {result.identity:.3f}")  # Should be 1.0

# Control max repeat motif length
params = AdjustmentParams(
    max_repeat_motif_length=3  # Detect up to trinucleotide repeats (e.g., CAG)
)
result = align_and_score("CAGCAG---TTC", "CAGCAGCAGTTC", params)

Multi-Sequence Alignment (MSA) Support

The package correctly handles sequence pairs extracted from multi-sequence alignments (MSA), where both sequences may have gaps at the same position due to alignment with third sequences.

from adjusted_identity import score_alignment

# Sequences from MSA (e.g., spoa, MUSCLE, MAFFT output)
# Both sequences have gaps at positions 3-4 due to alignment with a third sequence
msa_seq1 = "AGA--TT"
msa_seq2 = "AGAT-TT"

result = score_alignment(msa_seq1, msa_seq2)
print(f"MSA identity: {result.identity:.3f}")  # Should be 1.0 - 'T' recognized as homopolymer
print(f"Score pattern: {result.score_aligned}")

# Another example with consensus-based homopolymer detection
msa_seq1 = "AGG-AC"  # G at position 2
msa_seq2 = "AG-GAC"  # G at position 3

result = score_alignment(msa_seq1, msa_seq2)
print(f"MSA identity: {result.identity:.3f}")  # Both G's recognized as homopolymer extensions

Key MSA features:

  • Dual-gap handling: Positions where both sequences have '-' are excluded from scoring (marked with .)
  • Consensus context: Homopolymer detection uses consensus nucleotides from both sequences
  • Conflict resolution: When sequences disagree at context positions, homopolymer extension is not applied

Custom Adjustments

from adjusted_identity import align_and_score, AdjustmentParams

# Custom adjustment parameters
custom_params = AdjustmentParams(
    normalize_homopolymers=True,   # Enable homopolymer adjustment
    handle_iupac_overlap=False,    # Disable IUPAC intersection
    normalize_indels=True,         # Enable indel normalization
    end_skip_distance=10,         # Skip 10bp from each end (default is 0 = disabled)
    max_repeat_motif_length=2     # Detect up to dinucleotide repeats (default)
)

result = align_and_score(seq1, seq2, custom_params)

API Architecture

This package provides a layered API design that separates sequence alignment from identity scoring, giving you maximum flexibility:

Core Layer: score_alignment()

The core implementation that applies MycoBLAST-style adjustments to any pre-aligned sequences. This allows you to use any alignment library (BLAST, BioPython, edlib, etc.) with the adjusted identity algorithm.

Input: Just needs gapped sequences of equal length
Output: Adjusted identity metrics with detailed scoring information

Convenience Layer: align_and_score()

A higher-level function that combines fast edlib alignment with the scoring algorithm. Provides BLAST-like infix alignment that's fast enough for production use without requiring external dependencies.

Input: Raw unaligned sequences
Output: Complete alignment and adjusted identity results


API Reference

Core Function

score_alignment(seq1_aligned, seq2_aligned, adjustment_params=None, scoring_format=None, adjust_gaps=False)

The core implementation - applies MycoBLAST-style adjustments to pre-aligned sequences from any source.

Use this when:

  • You already have alignments from BLAST, BioPython, or other tools
  • You want to integrate adjusted identity into existing pipelines
  • You need maximum control over the alignment process

Parameters:

  • seq1_aligned, seq2_aligned (str): Pre-aligned sequences with gaps (must be same length)
  • adjustment_params (AdjustmentParams, optional): Adjustment parameters
  • scoring_format (ScoringFormat, optional): Scoring visualization format
  • adjust_gaps (bool, optional): If True, rewrite gap positions so the output alignment matches the scoring interpretation. Output strings may have different length than input. Defaults to False.

Returns:

  • AlignmentResult: Scoring results and metrics

Example with BLAST alignment:

# After getting BLAST alignment results
from adjusted_identity import score_alignment

blast_seq1 = "ATCG-AAAT"  # From BLAST output
blast_seq2 = "ATCGAAAAT"  # From BLAST output

result = score_alignment(blast_seq1, blast_seq2)
print(f"Adjusted identity: {result.identity:.3f}")

Convenience Function

align_and_score(seq1, seq2, adjustment_params=None, scoring_format=None, adjust_gaps=False)

High-level convenience function that handles both alignment and scoring in one step. Uses semi-global (HW) alignment by default, or Needleman-Wunsch (NW) alignment when scoring_mode=ScoringMode.GLOBAL.

Use this when:

  • You want a simple, fast solution without additional alignment tools
  • You need BLAST-like performance for production use
  • You're comparing raw sequences end-to-end

Parameters:

  • seq1, seq2 (str): Raw DNA sequences to compare
  • adjustment_params (AdjustmentParams, optional): Adjustment parameters
  • scoring_format (ScoringFormat, optional): Scoring visualization format
  • adjust_gaps (bool, optional): If True, rewrite gap positions so the output alignment matches the scoring interpretation. Output strings may have different length than input. Defaults to False.

Returns:

  • AlignmentResult: Contains identity metrics, alignment, and coverage information

Example:

from adjusted_identity import align_and_score

result = align_and_score("ATCGAAAATGTC", "ATCGAAAATGTC")
print(f"Identity: {result.identity:.3f}")

Configuration Classes

ScoringMode

Controls whether terminal gaps are included in identity scoring:

from adjusted_identity import ScoringMode

ScoringMode.LOCAL   # Score overlap region only (default)
ScoringMode.GLOBAL  # Score full alignment including terminal gaps (NW alignment)

AdjustmentParams

Configure which sequence adjustments to apply:

AdjustmentParams(
    normalize_homopolymers=True,    # Ignore homopolymer length differences
    handle_iupac_overlap=True,      # Allow IUPAC ambiguity intersections
    normalize_indels=True,          # Count contiguous indels as single events
    end_skip_distance=0,           # Skip first/last N nucleotides (0 = disabled by default)
    max_repeat_motif_length=2,     # Maximum repeat motif length to detect (1=homopolymers only, 2=dinucleotides, etc.)
    scoring_mode=ScoringMode.LOCAL # LOCAL (overlap only) or GLOBAL (full alignment with terminal gaps)
)

ScoringFormat

Customize alignment visualization characters:

ScoringFormat(
    match='|',                     # Exact match (A=A, C=C, G=G, T=T)
    ambiguous_match='=',           # Ambiguous nucleotide match (any IUPAC code match)
    substitution=' ',              # Nucleotide substitution
    indel_start=' ',               # First position of indel
    indel_extension='-',           # Indel positions (normalization)
    homopolymer_extension='=',     # Homopolymer length difference
    end_trimmed='.'               # Position outside scoring region
)

AlignmentResult

Results returned by alignment functions:

AlignmentResult(
    identity=0.95,                 # Identity score (0.0-1.0)
    mismatches=2,                  # Number of mismatches counted
    scored_positions=40,           # Positions used for identity calculation
    seq1_coverage=0.98,            # Fraction of seq1 in alignment
    seq2_coverage=0.97,            # Fraction of seq2 in alignment
    seq1_aligned="ATCG-ATCG",      # Aligned sequence 1 with gaps
    seq2_aligned="ATCGATCG-",      # Aligned sequence 2 with gaps
    score_aligned="||||=|||"       # Scoring visualization
)

Note: When adjust_gaps=True, the seq1_aligned and seq2_aligned strings are rewritten so gap positions match the scoring interpretation. This makes the score_aligned visualization directly interpretable position-by-position, but the output length may differ from the input alignment.

Constants

  • DEFAULT_ADJUSTMENT_PARAMS: All adjustments enabled (recommended)
  • RAW_ADJUSTMENT_PARAMS: No adjustments (traditional identity)
  • DEFAULT_SCORING_FORMAT: Default visualization characters

Integration with Other Alignment Tools

The core score_alignment() function works with alignments from any source. Here are examples with popular alignment libraries:

Using with NCBI BLAST

# NOTE: This example is illustrative and not tested
from Bio.Blast import NCBIWWW, NCBIXML
from adjusted_identity import score_alignment

# Run BLAST search (example)
result_handle = NCBIWWW.qblast("blastn", "nt", query_sequence)
blast_records = NCBIXML.parse(result_handle)

for blast_record in blast_records:
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            # Extract aligned sequences from BLAST HSP
            query_aligned = hsp.query
            subject_aligned = hsp.sbjct
            
            # Apply adjusted identity scoring
            adj_result = score_alignment(query_aligned, subject_aligned)
            
            print(f"BLAST identity: {hsp.identities/hsp.align_length:.3f}")
            print(f"Adjusted identity: {adj_result.identity:.3f}")

Using with BioPython PairwiseAligner

from Bio import Align
from adjusted_identity import score_alignment

# Create BioPython aligner
aligner = Align.PairwiseAligner()
aligner.match_score = 2
aligner.mismatch_score = -1

# Example sequences
seq1 = "ATCGATCG"
seq2 = "ATCGTCG"  # Missing one nucleotide

# Perform alignment
alignments = aligner.align(seq1, seq2)
best_alignment = alignments[0]

# Extract aligned sequences using indexing (idiomatic BioPython)
seq1_aligned = str(best_alignment[0])
seq2_aligned = str(best_alignment[1])

# Apply adjusted scoring
result = score_alignment(seq1_aligned, seq2_aligned)
print(f"Adjusted identity: {result.identity:.3f}")

Using with Custom/External Aligners

# NOTE: This example is illustrative template code
from adjusted_identity import score_alignment

def process_alignment_file(alignment_file):
    """Process alignments from any external tool."""
    results = []
    
    # Parse your alignment format (FASTA, SAM, custom, etc.)
    for seq1_aligned, seq2_aligned in parse_alignment_file(alignment_file):
        # Ensure sequences are the same length
        assert len(seq1_aligned) == len(seq2_aligned)
        
        # Apply adjusted identity scoring
        result = score_alignment(seq1_aligned, seq2_aligned)
        results.append(result)
    
    return results

Understanding End Trimming Behavior

The end_skip_distance parameter implements "digital end trimming" to skip sequencing artifacts near read ends. Important: This parameter counts nucleotides (non-gap characters), not alignment positions.

Automatic Activation

End trimming only activates when sequences are long enough:

from adjusted_identity import align_and_score, AdjustmentParams

# Short sequences (< 2 × end_skip_distance nucleotides): NO trimming applied
short_seq1 = "ATCGATCG"      # 8 nucleotides 
short_seq2 = "ATCGATCG"      # 8 nucleotides
result = align_and_score(short_seq1, short_seq2)  # end_skip_distance=0 by default
print(f"Scored positions: {result.scored_positions}")  # 8 (full sequence)
print(f"Score pattern: {result.score_aligned}")        # "||||||||" (no trimming dots)

# Long sequences (≥ 2 × end_skip_distance nucleotides): Trimming applied  
long_seq1 = "A" * 25 + "TCGX" + "T" * 25    # 54 nucleotides
long_seq2 = "A" * 25 + "TCGA" + "T" * 25    # 54 nucleotides
result = align_and_score(long_seq1, long_seq2)
print(f"Scored positions: {result.scored_positions}")  # ~14 (middle region only)
print(f"Score pattern: {result.score_aligned}")        # ".......|||| |||||......." (dots show trimmed regions)

Nucleotide vs Position Counting

End trimming counts actual nucleotides in each sequence, ignoring gaps:

# This alignment has gaps, but nucleotide counting still works correctly
seq1_aligned = "AAA---TCGATCG---TTT"  # 12 nucleotides (ignoring gaps)
seq2_aligned = "---AAATCGATCGTTT---"  # 12 nucleotides (ignoring gaps)

# With end_skip_distance=5: skips first 5 and last 5 nucleotides from each sequence
# Only the middle "TCGATCG" region (2 nucleotides) would be scored

Customizing End Trimming

# Disable end trimming completely
no_trim_params = AdjustmentParams(end_skip_distance=0)
result = align_and_score(long_seq1, long_seq2, no_trim_params)

# Use shorter trimming distance for smaller sequences
short_trim_params = AdjustmentParams(end_skip_distance=5) 
result = align_and_score(medium_seq1, medium_seq2, short_trim_params)

Rule of thumb: For sequences shorter than 2 × end_skip_distance nucleotides, end trimming has no effect and the entire alignment is scored.

Advanced Usage

Batch Processing

from adjusted_identity import align_and_score

sequences = [
    ("seq1", "ATCGATCGATCG"),
    ("seq2", "ATCGATCGATCC"),
    ("seq3", "ATCGATCGATCG"),
]

reference = sequences[0][1]
results = []

for name, seq in sequences[1:]:
    result = align_and_score(reference, seq)
    results.append({
        'name': name,
        'identity': result.identity,
        'coverage': min(result.seq1_coverage, result.seq2_coverage)
    })

# Sort by identity
results.sort(key=lambda x: x['identity'], reverse=True)

Understanding Scoring Patterns

The score_aligned field shows how each position was scored:

result = align_and_score("AAA-TTT", "AAAATTT")
print(result.score_aligned)  # "|||=|||"
# | = match
# = = homopolymer extension (ignored if adjustment enabled)

Scoring symbols:

  • |: Exact match (A=A, C=C, G=G, T=T)
  • =: Ambiguous match (IUPAC) or homopolymer/repeat extension
  • (space): Substitution or indel start (counts as mismatch)
  • -: Indel extension (ignored if normalization enabled)
  • .: End-trimmed, dual-gap, or overhang (not scored)

Testing

Run the comprehensive test suite:

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=adjusted_identity --cov-report=html

The test suite includes:

  • Unit tests for all adjustment types
  • Edge cases and error conditions
  • Real-world mycological scenarios
  • Performance tests with long sequences
  • Documentation examples

Background

This package implements the sequence preprocessing approach described in the MycoBLAST algorithm by Stephen Russell and Mycota Lab, adapted for general-purpose DNA sequence comparison. The foundational research is detailed in "Why NCBI BLAST identity scores can be misleading for fungi".

Why Post-Hoc Adjustment?

Standard sequence alignment algorithms use gap penalties that don't distinguish between different biological contexts. Specifically, they penalize gaps uniformly regardless of whether the gap occurs within a homopolymer run (e.g., AAAA vs AAA) or represents a true insertion/deletion event.

For many applications—particularly DNA barcoding where sequencing technology introduces homopolymer length variation as a technical artifact—it would be desirable to assign zero penalty to homopolymer length differences while still penalizing other indels normally. However, existing aligners don't support such context-dependent gap costs.

This package addresses the limitation by applying a post-hoc adjustment: we use a standard aligner to establish positional correspondence between sequences, then rescore the alignment with homopolymer-aware logic that treats length variation in repetitive tracts as neutral.

How the Variant Range Algorithm Works

Starting in v0.2.0, this package uses a variant range algorithm that provides more accurate scoring for complex indel patterns, especially in multi-sequence alignments.

The key insight: Standard aligners don't know about homopolymers—they just find the minimum-edit alignment. This can produce patterns where a simple homopolymer expansion looks like a complex substitution or scattered indels.

How it works:

  1. Find variant ranges: Scan the alignment for contiguous regions where sequences differ (gaps, mismatches, or both). These are bounded by matching positions on each side.

  2. Extract alleles: For each variant range, pull out the gap-free content from each sequence. For example, in TGC-C-TC vs TGCT--TC, the variant range yields alleles "C" and "T".

  3. Check for extensions: Ask whether each allele could be explained as a repeat of the adjacent context:

    • Does "C" extend the left context C? Yes (homopolymer)
    • Does "T" extend the right context T? Yes (homopolymer)
  4. Apply Occam's razor: If both alleles are valid extensions of their respective contexts, they represent equivalent repeat expansions → 0 edits. No mismatch is counted because both placements are biologically plausible.

Example:

seq1: ATTCA     Traditional scoring: 1 substitution (T vs C)
seq2: ATCCA     Variant range: T extends left T, C extends right C → 0 edits

This approach handles cases that position-by-position algorithms miss, such as "floating" nucleotides in MSA data where gap placement is arbitrary.

For the complete specification, see docs/SCORING_SPEC.md.

Why These Adjustments Matter

The adjustments are particularly valuable for:

  • Fungal taxonomy: ITS sequences often have homopolymer differences
  • DNA barcoding: Technical artifacts can obscure phylogenetic signal
  • Sequence quality assessment: End-trimming handles poor-quality regions
  • Phylogenetic analysis: IUPAC codes preserve ambiguous but valid matches

Credit: This implementation is based on the MycoBLAST algorithm developed by Stephen Russell and Mycota Lab. The theoretical framework and biological motivation are thoroughly explained in their foundational article.

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make changes and add tests
  4. Run the test suite: pytest
  5. Submit a pull request

Citation

If you use this package in your research, please cite:

Walker, J. (2025). Adjusted Identity Calculator for DNA Sequences. 
GitHub: https://github.com/joshuaowalker/adjusted-identity

Please also cite the foundational work:

Russell, S. (2025). Why NCBI BLAST identity scores can be misleading for fungi.
Mycota Lab. https://mycotalab.substack.com/p/why-ncbi-blast-identity-scores-can

License

BSD 2-Clause License - see LICENSE file for details.

Changelog

See CHANGELOG.md for the full release history.

About

Calculates adjusted identity scores for DNA sequences a la MycoBLAST

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  

Languages