Description
Context
There's a push to replace the Genome Detective hardcoded files with a Nextclade dataset, as suggested in this comment and this comment. This change aims to improve automation and consistency in genotyping processes.
Creating a Nextclade dataset could require:
- investigating clade-defining mutations or
- creating a guide tree based on the Genome Detective results (`ORF1_type` and `ORF2_type`) and `augur traits`.
Current situation:
- A rough visual exploration of the samples shows that we have Genome Detective results for approximately 1,900 out of 51,047 norovirus samples with year information.
- Existing typing tools (such as CDC typing tool and RIVM typing tool) are web-only, limiting automation
- CDC provides a comprehensive list of reference sequences
A Nextclade dataset would greatly facilitate automated genotyping.
Potential next steps:
- Build a guide tree based on CDC reference sequences and identified clades
  - [ ] Scraped and cleaned into CDC_references (3).xls (estimated labor: 3+ hours)
    - due to recombination, genotype follows VP1 (capsid)
  - [ ] Pull capsid sequences and annotate headers into norovirus_cdc_reference.fasta.txt (estimated labor: 15 mins)
  - [ ] Optional: Pull reference sequences from Chhabra et al., 2019 Table 1 and dual nomenclature from Table 2
- Compare results against other tools
  - Compare against Genome Detective results for `ORF1_type` and `ORF2_type`
  - Compare against CDC typing tool
  - Compare against RIVM typing tool
- If results are consistent across tools, identify clade-defining mutations
  - Check for homoplasies (mutation similarities that arise from convergent or parallel evolution rather than from common ancestry), which can result in inconsistent tree topologies
  - Mask any homoplasies that disagree with the collection date information
  - Check for important indels compared to the chosen reference (evaluate the reference)
  - Check if the root needs to be different than the reference in the guide tree
- Develop Nextclade dataset files (reference sequence, `pathogen.json`, etc.)
- Implement the Nextclade workflow
- Implement rules that test the Nextclade dataset using example data
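
The comparison and testing steps above could start from something like the following rough sketch, which checks Nextclade clade calls against another tool's genotype calls and lists the disagreements. It assumes the Nextclade TSV output is used (with its `seqName` and `clade` columns) and that the capsid genotype is compared against `ORF2_type`. The `strain` column name, the helper functions, and the example rows are made up for illustration and are not taken from the actual workflow.

```python
import csv
import io


def read_calls(tsv_text, id_column, call_column):
    """Map sequence ID -> genotype call from tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_column]: row[call_column] for row in reader}


def compare_calls(nextclade_calls, expected_calls):
    """Return (number of shared samples, list of disagreements)."""
    # Only samples typed by both tools can be compared.
    shared = sorted(set(nextclade_calls) & set(expected_calls))
    mismatches = [
        (name, nextclade_calls[name], expected_calls[name])
        for name in shared
        if nextclade_calls[name] != expected_calls[name]
    ]
    return len(shared), mismatches


# Made-up example rows, for illustration only.
nextclade_tsv = (
    "seqName\tclade\n"
    "sample_1\tGII.4\n"
    "sample_2\tGII.17\n"
    "sample_3\tGI.3\n"
)
# Hypothetical export of Genome Detective results; "strain" is an assumed ID column.
genome_detective_tsv = (
    "strain\tORF2_type\n"
    "sample_1\tGII.4\n"
    "sample_2\tGII.6\n"
)

nextclade_calls = read_calls(nextclade_tsv, "seqName", "clade")
expected_calls = read_calls(genome_detective_tsv, "strain", "ORF2_type")
n_shared, mismatches = compare_calls(nextclade_calls, expected_calls)

print(f"{n_shared - len(mismatches)}/{n_shared} shared calls agree")
for name, got, expected in mismatches:
    print(f"{name}: Nextclade={got}, expected={expected}")
```

In a workflow rule, the same comparison could fail the run when the mismatch list is non-empty for the example data. Samples typed by only one tool are ignored here, which matters given that only a small fraction of samples currently have Genome Detective results.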