Seed Aligner is a lightweight genetic preprocessing module designed to solve a fundamental issue in comparative genomic analysis:
not all sequence assemblies in databases start at the same genomic region.
This variation, where some sequences start internally while others start at the beginning or the end of the genome, can disrupt alignment and embedding analyses such as those performed by Covary. Seed Aligner addresses this by locating a conserved "seed" consensus region across sequences and normalizing sequence orientations.
Unlike conventional Multiple Sequence Alignment (MSA) tools that align entire sequences, Seed Aligner focuses on identifying a short, conserved seed region and reassembling the sequences around it.
- Seed Region Identification: Finds a consensus sequence (default = 100 nt, taken from the start of the reference) shared across all genomes.
- Sequence Reorientation: Repositions fragments flanking the seed to ensure all sequences start consistently.
- MSA-Free Normalization: Reduces computational cost by skipping full alignments.
- Colab Compatible: Runs entirely on Google Colab as a Jupyter Notebook for fast prototyping.
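
The seed identification step can be illustrated with a minimal sketch. It assumes the seed is simply the first `seed_len` nucleotides of the reference and that it is located by exact string matching. The function names `extract_seed` and `locate_seed` are hypothetical and are not taken from the notebook, which may tolerate mismatches or work differently.

```python
def extract_seed(reference: str, seed_len: int = 100) -> str:
    """Return the first seed_len nucleotides of the reference, upper-cased."""
    reference = reference.upper()
    if len(reference) < seed_len:
        raise ValueError("reference is shorter than the requested seed length")
    return reference[:seed_len]


def locate_seed(genome: str, seed: str) -> int:
    """Return the 0-based index of the first exact seed match, or -1 if absent."""
    return genome.upper().find(seed)


# Toy data: an 8 nt seed keeps the example readable (the stated default is 100 nt).
reference = "ACGTTGCA" + "GGCCTTAA" * 3
seed = extract_seed(reference, seed_len=8)

genomes = {
    "already_normalized": reference,
    "starts_internally": "GGCCTTAA" + "ACGTTGCA" + "GGCCTTAA" * 2,
    "no_seed": "TTTTTTTTTTTT",
}
positions = {name: locate_seed(seq, seed) for name, seq in genomes.items()}
```

A position of `0` means the genome already starts at the seed, a positive position means it needs repositioning, and `-1` marks a genome in which this exact seed does not occur.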
- Input: Multi-FASTA file containing complete genomes.
- Seed Detection: Paste the reference sequence or assembly.
- Sequence Rearrangement:
  - If the genome starts after the seed → shift 5′ fragment to the end.
  - If the genome starts before the seed → ensure seed alignment consistency.
- Output: Normalized FASTA file suitable for Covary input and other FASTA-based analyses.
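
The workflow above can be sketched end to end with the standard library only. This is an illustration under stated assumptions, not the notebook's code: the helpers `parse_fasta`, `rotate_to_seed`, and `normalize` are hypothetical names, the seed is matched exactly, each genome is treated as circular so the fragment 5′ of the seed can be moved to the end, and a genome without the seed raises an error.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a multi-FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def rotate_to_seed(sequence: str, seed: str) -> str:
    """Move everything 5' of the seed to the end so the sequence starts at the seed."""
    idx = sequence.find(seed)
    if idx == -1:
        # Assumption: a missing seed is treated as an error in this sketch.
        raise ValueError("seed not found in sequence")
    return sequence[idx:] + sequence[:idx]


def normalize(fasta_text: str, reference: str, seed_len: int = 100) -> str:
    """Return multi-FASTA text in which every record starts at the seed."""
    seed = reference.upper()[:seed_len]
    records = [
        f">{header}\n{rotate_to_seed(seq, seed)}"
        for header, seq in parse_fasta(fasta_text.splitlines())
    ]
    return "\n".join(records) + "\n"


# Toy input: a 4 nt seed ("ACGT") followed by AGTCC and TTGAC.
reference = "ACGTAGTCCTTGAC"
fasta_in = ">ref\nACGTAGTCCTTGAC\n>shifted\nTTGACACGT\nAGTCC\n"
fasta_out = normalize(fasta_in, reference, seed_len=4)
print(fasta_out)
```

In this toy input the `shifted` record starts at `TTGAC`, so the rotation moves that fragment to the end and both records come out identical to the reference.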
Reference assembly: [SEED] ... AGTCC ... TTGAC
| Changes | Example (reorientation) |
|---|---|
| Original sequence | TTGAC ... [SEED] ... AGTCC |
| Normalized output | [SEED] ... AGTCC ... TTGAC |
Each normalized sequence now starts at the seed region, just as the reference assembly does.
You can open the notebook directly in Google Colab: