Adjusted Identity Calculator for DNA Sequences

A Python package implementing MycoBLAST-style sequence identity calculations for DNA sequences, specifically designed for mycological DNA barcoding applications. This package provides sophisticated sequence alignment and scoring with various adjustments for homopolymer differences, IUPAC ambiguity codes, and sequencing artifacts.

Based on the MycoBLAST algorithm developed by Stephen Russell and Mycota Lab. See the foundational article: "Why NCBI BLAST identity scores can be misleading for fungi" which explains the theoretical basis and motivation for these sequence preprocessing adjustments.

Features

Homopolymer Length Normalization: Ignore differences in homopolymer run lengths (e.g., "AAA" vs "AAAA")
Repeat Motif Adjustment: Handle dinucleotide and longer repeat motifs (e.g., "ATATAT" vs "ATATATAT")
IUPAC Ambiguity Code Handling: Allow different ambiguity codes to match via nucleotide intersection
MSA Dual-Gap Support: Correctly handle sequences from multi-sequence alignments (MSA) where both sequences may have gaps at the same position
End Trimming: Skip mismatches in terminal regions to avoid sequencing artifacts (disabled by default, set end_skip_distance to enable)
Indel Normalization: Count contiguous indels as single evolutionary events
LOCAL / GLOBAL Scoring Modes: Choose between overlap-only scoring (default) or full-alignment scoring that penalizes terminal gaps — critical for comparing sequences of very different lengths
Comprehensive Alignment: Multi-stage bidirectional alignment optimization using edlib
Flexible Configuration: Enable/disable individual adjustments as needed

Installation

From PyPI

pip install adjusted-identity

From GitHub (latest development version)

pip install git+https://github.com/joshuaowalker/adjusted-identity.git

Development Installation

git clone https://github.com/joshuaowalker/adjusted-identity.git
cd adjusted-identity
pip install -e ".[dev]"

Quick Start

Option 1: Complete Solution (align + score)

from adjusted_identity import align_and_score

# For raw sequences - handles alignment and scoring
result = align_and_score("ATCGAAAAATGTC", "ATCGAAAATGTC")
print(f"Adjusted identity: {result.identity:.3f}")
print(f"Coverage: seq1={result.seq1_coverage:.3f}, seq2={result.seq2_coverage:.3f}")

Option 2: Core Scoring Only (use with your alignments)

from adjusted_identity import score_alignment

# For pre-aligned sequences from BLAST, BioPython, etc.
aligned_seq1 = "ATCG-AAAT"  # From your alignment tool
aligned_seq2 = "ATCGAAAAT"  # From your alignment tool

result = score_alignment(aligned_seq1, aligned_seq2)
print(f"Adjusted identity: {result.identity:.3f}")

Compare with Traditional Identity

from adjusted_identity import align_and_score, RAW_ADJUSTMENT_PARAMS

# Traditional identity (no adjustments)
raw_result = align_and_score("ATCGAAAAATGTC", "ATCGAAAATGTC", RAW_ADJUSTMENT_PARAMS)
print(f"Traditional identity: {raw_result.identity:.3f}")

# Adjusted identity (with MycoBLAST adjustments)  
adj_result = align_and_score("ATCGAAAAATGTC", "ATCGAAAATGTC")
print(f"Adjusted identity: {adj_result.identity:.3f}")

# Examine the scoring pattern
print(f"Score pattern: {adj_result.score_aligned}")
# '|' = exact match, '=' = ambiguous/homopolymer, ' ' = substitution

Use Cases

Mycological DNA Barcoding

Common scenario: ITS sequences with homopolymer differences due to sequencing artifacts.

from adjusted_identity import align_and_score

# Fungal ITS sequences with different homopolymer lengths
its_seq1 = "TCCGTAGGTGAACCTGCGGAAGGATCATTACCGAGTTTAAA"    # 3 A's at end
its_seq2 = "TCCGTAGGTGAACCTGCGGAAGGATCATTACCGAGTTTTAAAA"   # 4 A's at end

result = align_and_score(its_seq1, its_seq2)
print(f"Species identity: {result.identity:.3f}")  # Should be ~1.0 with homopolymer adjustment

Handling Ambiguous Bases

from adjusted_identity import align_and_score

# Sequences with IUPAC ambiguity codes
barcode1 = "ATCGRGTC"  # R = A or G
barcode2 = "ATCGKGTC"  # K = G or T (both R and K contain G)

result = align_and_score(barcode1, barcode2)
print(f"Identity with IUPAC handling: {result.identity:.3f}")  # Should be 1.0
print(f"Score pattern: {result.score_aligned}")  # Shows '=' for ambiguous matches

Comparing Sequences of Different Lengths (LOCAL vs GLOBAL)

By default, identity is calculated over the overlap region only (LOCAL mode). This works well for most use cases, but can be misleading when comparing a short sequence against a much longer one — the short fragment may match perfectly within its overlap while covering only a fraction of the target.

GLOBAL mode uses Needleman-Wunsch alignment and scores the full alignment including terminal gaps, giving a more conservative identity estimate:

from adjusted_identity import align_and_score, AdjustmentParams, ScoringMode

short_seq = "ATCG" * 55   # 220bp truncated sequence
full_seq  = "ATCG" * 165  # 660bp full-length target

# LOCAL (default): scores only the overlapping region
local_result = align_and_score(short_seq, full_seq)
print(f"LOCAL identity: {local_result.identity:.3f}")   # ~1.000
print(f"LOCAL seq2_coverage: {local_result.seq2_coverage:.3f}")

# GLOBAL: terminal gaps count against identity
global_params = AdjustmentParams(scoring_mode=ScoringMode.GLOBAL, normalize_indels=False)
global_result = align_and_score(short_seq, full_seq, global_params)
print(f"GLOBAL identity: {global_result.identity:.3f}")  # ~0.333
print(f"GLOBAL seq2_coverage: {global_result.seq2_coverage:.3f}")

Use LOCAL mode (the default) when you want overlap-only identity, e.g. when sequences are expected to be similar in length or when coverage is checked separately. Use GLOBAL mode when identity should reflect how much of the target is actually covered, e.g. for clustering or when comparing sequences of very different lengths.

Tip: You can also pass the mode as a string: AdjustmentParams(scoring_mode="global").

Understanding Score Patterns

The score_aligned field provides a visual representation of how each position was scored:

| = Exact match between standard nucleotides (A=A, C=C, G=G, T=T)
= = Ambiguous match (IUPAC codes) or homopolymer/repeat extension
(space) = Substitution (mismatch)
- = Indel extension (normalized)
. = End-trimmed, dual-gap, or overhang position (not scored)

from adjusted_identity import align_and_score

result = align_and_score("ATCGRAAATGTC", "ATCGAAAAATGTC")
print(f"Seq1: {result.seq1_aligned}")
print(f"Seq2: {result.seq2_aligned}")
print(f"Score: {result.score_aligned}")
# Output might show: ||||==||||||
#                    ATCG = exact matches (||||)
#                    R vs A = ambiguous match (=)
#                    AAA vs AAAA = homopolymer extension (=)

Gap-Adjusted Visualization

By default, the output alignment strings preserve the original gap positions from the aligner. However, since standard aligners don't distinguish homopolymer extensions from true indels, the gap placement may not reflect how positions were actually scored.

Use adjust_gaps=True to rewrite gap positions so the visualization directly matches the scoring interpretation:

from adjusted_identity import align_and_score

# These sequences differ only in homopolymer lengths (4A+3T vs 3A+4T)
seq1 = "AAAATTT"
seq2 = "AAATTTT"

# Default: alignment strings show A-vs-T at position 4 (looks like substitution)
result = align_and_score(seq1, seq2, adjust_gaps=False)
print(f"Seq1:  {result.seq1_aligned}")   # AAAATTT
print(f"Seq2:  {result.seq2_aligned}")   # AAATTTT
print(f"Score: {result.score_aligned}")  # |||=|||  (score shows extension, but alignment hides it)

# Adjusted: gaps inserted to reveal the homopolymer interpretation
result = align_and_score(seq1, seq2, adjust_gaps=True)
print(f"Seq1:  {result.seq1_aligned}")   # AAAA-TTT
print(f"Seq2:  {result.seq2_aligned}")   # AAA-TTTT
print(f"Score: {result.score_aligned}")  # |||==|||  (two extensions now visible)

# Identity is 1.0 in both cases - the adjustment is purely visual
print(f"Identity: {result.identity}")    # 1.0

The adjust_gaps=True mode is particularly useful when you need to visually inspect why positions were scored a certain way—each position in the output alignment corresponds directly to its scoring marker.

Repeat Motif Handling

from adjusted_identity import align_and_score, AdjustmentParams

# Dinucleotide repeat differences (AT repeat from Russell article)
seq1 = "CGATAT--C"  # Missing one AT unit
seq2 = "CGATATATC"  # Has extra AT unit

# With repeat motif adjustment (default)
result = align_and_score(seq1, seq2)
print(f"Adjusted identity: {result.identity:.3f}")  # Should be 1.0

# Control max repeat motif length
params = AdjustmentParams(
    max_repeat_motif_length=3  # Detect up to trinucleotide repeats (e.g., CAG)
)
result = align_and_score("CAGCAG---TTC", "CAGCAGCAGTTC", params)

Multi-Sequence Alignment (MSA) Support

The package correctly handles sequence pairs extracted from multi-sequence alignments (MSA), where both sequences may have gaps at the same position due to alignment with third sequences.

from adjusted_identity import score_alignment

# Sequences from MSA (e.g., spoa, MUSCLE, MAFFT output)
# Both sequences have gaps at positions 3-4 due to alignment with a third sequence
msa_seq1 = "AGA--TT"
msa_seq2 = "AGAT-TT"

result = score_alignment(msa_seq1, msa_seq2)
print(f"MSA identity: {result.identity:.3f}")  # Should be 1.0 - 'T' recognized as homopolymer
print(f"Score pattern: {result.score_aligned}")

# Another example with consensus-based homopolymer detection
msa_seq1 = "AGG-AC"  # G at position 2
msa_seq2 = "AG-GAC"  # G at position 3

result = score_alignment(msa_seq1, msa_seq2)
print(f"MSA identity: {result.identity:.3f}")  # Both G's recognized as homopolymer extensions

Key MSA features:

Dual-gap handling: Positions where both sequences have '-' are excluded from scoring (marked with .)
Consensus context: Homopolymer detection uses consensus nucleotides from both sequences
Conflict resolution: When sequences disagree at context positions, homopolymer extension is not applied

Custom Adjustments

from adjusted_identity import align_and_score, AdjustmentParams

# Custom adjustment parameters
custom_params = AdjustmentParams(
    normalize_homopolymers=True,   # Enable homopolymer adjustment
    handle_iupac_overlap=False,    # Disable IUPAC intersection
    normalize_indels=True,         # Enable indel normalization
    end_skip_distance=10,         # Skip 10bp from each end (default is 0 = disabled)
    max_repeat_motif_length=2     # Detect up to dinucleotide repeats (default)
)

result = align_and_score(seq1, seq2, custom_params)

API Architecture

This package provides a layered API design that separates sequence alignment from identity scoring, giving you maximum flexibility:

Core Layer: `score_alignment()`

The core implementation that applies MycoBLAST-style adjustments to any pre-aligned sequences. This allows you to use any alignment library (BLAST, BioPython, edlib, etc.) with the adjusted identity algorithm.

Input: Just needs gapped sequences of equal length
Output: Adjusted identity metrics with detailed scoring information

Convenience Layer: `align_and_score()`

A higher-level function that combines fast edlib alignment with the scoring algorithm. Provides BLAST-like infix alignment that's fast enough for production use without requiring external dependencies.

Input: Raw unaligned sequences
Output: Complete alignment and adjusted identity results

API Reference

Core Function

`score_alignment(seq1_aligned, seq2_aligned, adjustment_params=None, scoring_format=None, adjust_gaps=False)`

The core implementation - applies MycoBLAST-style adjustments to pre-aligned sequences from any source.

Use this when:

You already have alignments from BLAST, BioPython, or other tools
You want to integrate adjusted identity into existing pipelines
You need maximum control over the alignment process

Parameters:

seq1_aligned, seq2_aligned (str): Pre-aligned sequences with gaps (must be same length)
adjustment_params (AdjustmentParams, optional): Adjustment parameters
scoring_format (ScoringFormat, optional): Scoring visualization format
adjust_gaps (bool, optional): If True, rewrite gap positions so the output alignment matches the scoring interpretation. Output strings may have different length than input. Defaults to False.

Returns:

AlignmentResult: Scoring results and metrics

Example with BLAST alignment:

# After getting BLAST alignment results
from adjusted_identity import score_alignment

blast_seq1 = "ATCG-AAAT"  # From BLAST output
blast_seq2 = "ATCGAAAAT"  # From BLAST output

result = score_alignment(blast_seq1, blast_seq2)
print(f"Adjusted identity: {result.identity:.3f}")

Convenience Function

`align_and_score(seq1, seq2, adjustment_params=None, scoring_format=None, adjust_gaps=False)`

High-level convenience function that handles both alignment and scoring in one step. Uses semi-global (HW) alignment by default, or Needleman-Wunsch (NW) alignment when scoring_mode=ScoringMode.GLOBAL.

Use this when:

You want a simple, fast solution without additional alignment tools
You need BLAST-like performance for production use
You're comparing raw sequences end-to-end

Parameters:

seq1, seq2 (str): Raw DNA sequences to compare
adjustment_params (AdjustmentParams, optional): Adjustment parameters
scoring_format (ScoringFormat, optional): Scoring visualization format
adjust_gaps (bool, optional): If True, rewrite gap positions so the output alignment matches the scoring interpretation. Output strings may have different length than input. Defaults to False.

Returns:

AlignmentResult: Contains identity metrics, alignment, and coverage information

Example:

from adjusted_identity import align_and_score

result = align_and_score("ATCGAAAATGTC", "ATCGAAAATGTC")
print(f"Identity: {result.identity:.3f}")

Configuration Classes

`ScoringMode`

Controls whether terminal gaps are included in identity scoring:

from adjusted_identity import ScoringMode

ScoringMode.LOCAL   # Score overlap region only (default)
ScoringMode.GLOBAL  # Score full alignment including terminal gaps (NW alignment)

`AdjustmentParams`

Configure which sequence adjustments to apply:

AdjustmentParams(
    normalize_homopolymers=True,    # Ignore homopolymer length differences
    handle_iupac_overlap=True,      # Allow IUPAC ambiguity intersections
    normalize_indels=True,          # Count contiguous indels as single events
    end_skip_distance=0,           # Skip first/last N nucleotides (0 = disabled by default)
    max_repeat_motif_length=2,     # Maximum repeat motif length to detect (1=homopolymers only, 2=dinucleotides, etc.)
    scoring_mode=ScoringMode.LOCAL # LOCAL (overlap only) or GLOBAL (full alignment with terminal gaps)
)

`ScoringFormat`

Customize alignment visualization characters:

ScoringFormat(
    match='|',                     # Exact match (A=A, C=C, G=G, T=T)
    ambiguous_match='=',           # Ambiguous nucleotide match (any IUPAC code match)
    substitution=' ',              # Nucleotide substitution
    indel_start=' ',               # First position of indel
    indel_extension='-',           # Indel positions (normalization)
    homopolymer_extension='=',     # Homopolymer length difference
    end_trimmed='.'               # Position outside scoring region
)

`AlignmentResult`

Results returned by alignment functions:

AlignmentResult(
    identity=0.95,                 # Identity score (0.0-1.0)
    mismatches=2,                  # Number of mismatches counted
    scored_positions=40,           # Positions used for identity calculation
    seq1_coverage=0.98,            # Fraction of seq1 in alignment
    seq2_coverage=0.97,            # Fraction of seq2 in alignment
    seq1_aligned="ATCG-ATCG",      # Aligned sequence 1 with gaps
    seq2_aligned="ATCGATCG-",      # Aligned sequence 2 with gaps
    score_aligned="||||=|||"       # Scoring visualization
)

Note: When adjust_gaps=True, the seq1_aligned and seq2_aligned strings are rewritten so gap positions match the scoring interpretation. This makes the score_aligned visualization directly interpretable position-by-position, but the output length may differ from the input alignment.

Constants

DEFAULT_ADJUSTMENT_PARAMS: All adjustments enabled (recommended)
RAW_ADJUSTMENT_PARAMS: No adjustments (traditional identity)
DEFAULT_SCORING_FORMAT: Default visualization characters

Integration with Other Alignment Tools

The core score_alignment() function works with alignments from any source. Here are examples with popular alignment libraries:

Using with NCBI BLAST

# NOTE: This example is illustrative and not tested
from Bio.Blast import NCBIWWW, NCBIXML
from adjusted_identity import score_alignment

# Run BLAST search (example)
result_handle = NCBIWWW.qblast("blastn", "nt", query_sequence)
blast_records = NCBIXML.parse(result_handle)

for blast_record in blast_records:
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            # Extract aligned sequences from BLAST HSP
            query_aligned = hsp.query
            subject_aligned = hsp.sbjct
            
            # Apply adjusted identity scoring
            adj_result = score_alignment(query_aligned, subject_aligned)
            
            print(f"BLAST identity: {hsp.identities/hsp.align_length:.3f}")
            print(f"Adjusted identity: {adj_result.identity:.3f}")

Using with BioPython PairwiseAligner

from Bio import Align
from adjusted_identity import score_alignment

# Create BioPython aligner
aligner = Align.PairwiseAligner()
aligner.match_score = 2
aligner.mismatch_score = -1

# Example sequences
seq1 = "ATCGATCG"
seq2 = "ATCGTCG"  # Missing one nucleotide

# Perform alignment
alignments = aligner.align(seq1, seq2)
best_alignment = alignments[0]

# Extract aligned sequences using indexing (idiomatic BioPython)
seq1_aligned = str(best_alignment[0])
seq2_aligned = str(best_alignment[1])

# Apply adjusted scoring
result = score_alignment(seq1_aligned, seq2_aligned)
print(f"Adjusted identity: {result.identity:.3f}")

Using with Custom/External Aligners

# NOTE: This example is illustrative template code
from adjusted_identity import score_alignment

def process_alignment_file(alignment_file):
    """Process alignments from any external tool."""
    results = []
    
    # Parse your alignment format (FASTA, SAM, custom, etc.)
    for seq1_aligned, seq2_aligned in parse_alignment_file(alignment_file):
        # Ensure sequences are the same length
        assert len(seq1_aligned) == len(seq2_aligned)
        
        # Apply adjusted identity scoring
        result = score_alignment(seq1_aligned, seq2_aligned)
        results.append(result)
    
    return results

Understanding End Trimming Behavior

The end_skip_distance parameter implements "digital end trimming" to skip sequencing artifacts near read ends. Important: This parameter counts nucleotides (non-gap characters), not alignment positions.

Automatic Activation

End trimming only activates when sequences are long enough:

from adjusted_identity import align_and_score, AdjustmentParams

# Short sequences (< 2 × end_skip_distance nucleotides): NO trimming applied
short_seq1 = "ATCGATCG"      # 8 nucleotides 
short_seq2 = "ATCGATCG"      # 8 nucleotides
result = align_and_score(short_seq1, short_seq2)  # end_skip_distance=0 by default
print(f"Scored positions: {result.scored_positions}")  # 8 (full sequence)
print(f"Score pattern: {result.score_aligned}")        # "||||||||" (no trimming dots)

# Long sequences (≥ 2 × end_skip_distance nucleotides): Trimming applied  
long_seq1 = "A" * 25 + "TCGX" + "T" * 25    # 54 nucleotides
long_seq2 = "A" * 25 + "TCGA" + "T" * 25    # 54 nucleotides
result = align_and_score(long_seq1, long_seq2)
print(f"Scored positions: {result.scored_positions}")  # ~14 (middle region only)
print(f"Score pattern: {result.score_aligned}")        # ".......|||| |||||......." (dots show trimmed regions)

Nucleotide vs Position Counting

End trimming counts actual nucleotides in each sequence, ignoring gaps:

# This alignment has gaps, but nucleotide counting still works correctly
seq1_aligned = "AAA---TCGATCG---TTT"  # 12 nucleotides (ignoring gaps)
seq2_aligned = "---AAATCGATCGTTT---"  # 12 nucleotides (ignoring gaps)

# With end_skip_distance=5: skips first 5 and last 5 nucleotides from each sequence
# Only the middle "TCGATCG" region (2 nucleotides) would be scored

Customizing End Trimming

# Disable end trimming completely
no_trim_params = AdjustmentParams(end_skip_distance=0)
result = align_and_score(long_seq1, long_seq2, no_trim_params)

# Use shorter trimming distance for smaller sequences
short_trim_params = AdjustmentParams(end_skip_distance=5) 
result = align_and_score(medium_seq1, medium_seq2, short_trim_params)

Rule of thumb: For sequences shorter than 2 × end_skip_distance nucleotides, end trimming has no effect and the entire alignment is scored.

Advanced Usage

Batch Processing

from adjusted_identity import align_and_score

sequences = [
    ("seq1", "ATCGATCGATCG"),
    ("seq2", "ATCGATCGATCC"),
    ("seq3", "ATCGATCGATCG"),
]

reference = sequences[0][1]
results = []

for name, seq in sequences[1:]:
    result = align_and_score(reference, seq)
    results.append({
        'name': name,
        'identity': result.identity,
        'coverage': min(result.seq1_coverage, result.seq2_coverage)
    })

# Sort by identity
results.sort(key=lambda x: x['identity'], reverse=True)

Understanding Scoring Patterns

The score_aligned field shows how each position was scored:

result = align_and_score("AAA-TTT", "AAAATTT")
print(result.score_aligned)  # "|||=|||"
# | = match
# = = homopolymer extension (ignored if adjustment enabled)

Scoring symbols:

|: Exact match (A=A, C=C, G=G, T=T)
=: Ambiguous match (IUPAC) or homopolymer/repeat extension
(space): Substitution or indel start (counts as mismatch)
-: Indel extension (ignored if normalization enabled)
.: End-trimmed, dual-gap, or overhang (not scored)

Testing

Run the comprehensive test suite:

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=adjusted_identity --cov-report=html

The test suite includes:

Unit tests for all adjustment types
Edge cases and error conditions
Real-world mycological scenarios
Performance tests with long sequences
Documentation examples

Background

This package implements the sequence preprocessing approach described in the MycoBLAST algorithm by Stephen Russell and Mycota Lab, adapted for general-purpose DNA sequence comparison. The foundational research is detailed in "Why NCBI BLAST identity scores can be misleading for fungi".

Why Post-Hoc Adjustment?

Standard sequence alignment algorithms use gap penalties that don't distinguish between different biological contexts. Specifically, they penalize gaps uniformly regardless of whether the gap occurs within a homopolymer run (e.g., AAAA vs AAA) or represents a true insertion/deletion event.

For many applications—particularly DNA barcoding where sequencing technology introduces homopolymer length variation as a technical artifact—it would be desirable to assign zero penalty to homopolymer length differences while still penalizing other indels normally. However, existing aligners don't support such context-dependent gap costs.

This package addresses the limitation by applying a post-hoc adjustment: we use a standard aligner to establish positional correspondence between sequences, then rescore the alignment with homopolymer-aware logic that treats length variation in repetitive tracts as neutral.

How the Variant Range Algorithm Works

Starting in v0.2.0, this package uses a variant range algorithm that provides more accurate scoring for complex indel patterns, especially in multi-sequence alignments.

The key insight: Standard aligners don't know about homopolymers—they just find the minimum-edit alignment. This can produce patterns where a simple homopolymer expansion looks like a complex substitution or scattered indels.

How it works:

Find variant ranges: Scan the alignment for contiguous regions where sequences differ (gaps, mismatches, or both). These are bounded by matching positions on each side.
Extract alleles: For each variant range, pull out the gap-free content from each sequence. For example, in TGC-C-TC vs TGCT--TC, the variant range yields alleles "C" and "T".
Check for extensions: Ask whether each allele could be explained as a repeat of the adjacent context:
- Does "C" extend the left context C? Yes (homopolymer)
- Does "T" extend the right context T? Yes (homopolymer)
Apply Occam's razor: If both alleles are valid extensions of their respective contexts, they represent equivalent repeat expansions → 0 edits. No mismatch is counted because both placements are biologically plausible.

Example:

seq1: ATTCA     Traditional scoring: 1 substitution (T vs C)
seq2: ATCCA     Variant range: T extends left T, C extends right C → 0 edits

This approach handles cases that position-by-position algorithms miss, such as "floating" nucleotides in MSA data where gap placement is arbitrary.

For the complete specification, see docs/SCORING_SPEC.md.

Why These Adjustments Matter

The adjustments are particularly valuable for:

Fungal taxonomy: ITS sequences often have homopolymer differences
DNA barcoding: Technical artifacts can obscure phylogenetic signal
Sequence quality assessment: End-trimming handles poor-quality regions
Phylogenetic analysis: IUPAC codes preserve ambiguous but valid matches

Credit: This implementation is based on the MycoBLAST algorithm developed by Stephen Russell and Mycota Lab. The theoretical framework and biological motivation are thoroughly explained in their foundational article.

Contributing

Fork the repository
Create a feature branch: git checkout -b feature-name
Make changes and add tests
Run the test suite: pytest
Submit a pull request

Citation

If you use this package in your research, please cite:

Walker, J. (2025). Adjusted Identity Calculator for DNA Sequences. 
GitHub: https://github.com/joshuaowalker/adjusted-identity

Please also cite the foundational work:

Russell, S. (2025). Why NCBI BLAST identity scores can be misleading for fungi.
Mycota Lab. https://mycotalab.substack.com/p/why-ncbi-blast-identity-scores-can

License

BSD 2-Clause License - see LICENSE file for details.

Changelog

See CHANGELOG.md for the full release history.

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
.github/workflows		.github/workflows
adjusted_identity		adjusted_identity
docs		docs
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml
setup.py		setup.py

License

joshuaowalker/adjusted-identity

Folders and files

Latest commit

History

Repository files navigation

Adjusted Identity Calculator for DNA Sequences

Features

Installation

From PyPI

From GitHub (latest development version)

Development Installation

Quick Start

Option 1: Complete Solution (align + score)

Option 2: Core Scoring Only (use with your alignments)

Compare with Traditional Identity

Use Cases

Mycological DNA Barcoding

Handling Ambiguous Bases

Comparing Sequences of Different Lengths (LOCAL vs GLOBAL)

Understanding Score Patterns

Gap-Adjusted Visualization

Repeat Motif Handling

Multi-Sequence Alignment (MSA) Support

Custom Adjustments

API Architecture

Core Layer: score_alignment()

Convenience Layer: align_and_score()

API Reference

Core Function

score_alignment(seq1_aligned, seq2_aligned, adjustment_params=None, scoring_format=None, adjust_gaps=False)

Convenience Function

align_and_score(seq1, seq2, adjustment_params=None, scoring_format=None, adjust_gaps=False)

Configuration Classes

ScoringMode

AdjustmentParams

ScoringFormat

AlignmentResult

Constants

Integration with Other Alignment Tools

Using with NCBI BLAST

Using with BioPython PairwiseAligner

Using with Custom/External Aligners

Understanding End Trimming Behavior

Automatic Activation

Nucleotide vs Position Counting

Customizing End Trimming

Advanced Usage

Batch Processing

Understanding Scoring Patterns

Testing

Background

Why Post-Hoc Adjustment?

How the Variant Range Algorithm Works

Why These Adjustments Matter

Contributing

Citation

License

Changelog

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 2

Uh oh!

Languages

Core Layer: `score_alignment()`

Convenience Layer: `align_and_score()`

`score_alignment(seq1_aligned, seq2_aligned, adjustment_params=None, scoring_format=None, adjust_gaps=False)`

`align_and_score(seq1, seq2, adjustment_params=None, scoring_format=None, adjust_gaps=False)`

`ScoringMode`

`AdjustmentParams`

`ScoringFormat`

`AlignmentResult`

Packages