Add `Nextclade` workflow for Norovirus genotyping

### Context

There's a push to replace the Genome Detective hardcoded files with a Nextclade dataset, as suggested in [this comment](https://github.com/blab/norovirus/issues/2#issuecomment-2406207024) and [this comment](https://github.com/nextstrain/norovirus/pull/3#discussion_r1817122832). This change aims to make genotyping more automated and more consistent.

**Creating a Nextclade dataset could require:**

1. investigating clade-defining mutations or 
2. creating a guide tree based on the Genome Detective results (`ORF1_type` and `ORF2_type`) and `augur traits`. 
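
For option 2, a minimal sketch of how the `augur traits` call might be assembled from Python, assuming a guide tree and a metadata TSV with the Genome Detective columns already exist. All file paths and the helper name `build_augur_traits_command` are hypothetical; only the `augur traits` flags themselves are real.

```python
import shlex


def build_augur_traits_command(tree, metadata, columns, output):
    """Assemble an `augur traits` call that infers ancestral states
    for the given metadata columns on the guide tree."""
    return [
        "augur", "traits",
        "--tree", tree,
        "--metadata", metadata,
        "--columns", *columns,
        "--confidence",                # also report per-node confidence
        "--output-node-data", output,
    ]


cmd = build_augur_traits_command(
    tree="results/tree.nwk",            # hypothetical path
    metadata="data/metadata.tsv",       # hypothetical path
    columns=["ORF1_type", "ORF2_type"],
    output="results/traits.json",       # hypothetical path
)

# Print the command for review; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```

The inferred `ORF1_type` and `ORF2_type` values on internal nodes could then serve as candidate clade labels for the guide tree, though they would need checking against the reference sequences.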

**Current situation:**

* A [rough visual exploration of the samples](https://github.com/j23414/generated-reports/blob/main/reports/norovirus.pdf) shows that we have Genome Detective results for approximately 1,900 out of 51,047 norovirus samples with year information. 
* Existing typing tools (such as the [CDC typing tool](https://www.rivm.nl/mpf/typingtool/norovirus/) and the [RIVM typing tool](https://www.rivm.nl/mpf/typingtool/norovirus/)) are web-only, which limits automation.
* CDC provides a [comprehensive list of reference sequences](https://calicivirustypingtool.cdc.gov/becerance.cgi).

A Nextclade dataset would greatly facilitate automated genotyping.

### Potential next steps:

* [ ] Build a guide tree based on [CDC reference sequences](https://calicivirustypingtool.cdc.gov/becerance.cgi) and identified clades
	* [ ] Scraped and cleaned into [CDC_references (3).xls](https://github.com/user-attachments/files/22191978/CDC_references.3.xls) (estimated labor: 3+ hours)
		* Note: due to recombination, genotype follows VP1 (capsid)
	* [ ] Pull capsid sequences and annotate headers into [norovirus_cdc_reference.fasta.txt](https://github.com/user-attachments/files/18970284/norovirus_cdc_reference.fasta.txt) (estimated labor: 15 min)
	* [ ] Optional: Pull reference sequences from [Chhabra et al., 2019](https://pmc.ncbi.nlm.nih.gov/articles/PMC7011714/) Table 1 and dual nomenclature from Table 2
* [ ] Compare results against other tools
	* [ ] Compare against Genome Detective results for `ORF1_type` and `ORF2_type`
	* [ ] Compare against [CDC typing tool](https://www.rivm.nl/mpf/typingtool/norovirus/)
	* [ ] Compare against [RIVM typing tool](https://www.rivm.nl/mpf/typingtool/norovirus/)
* [ ] If results are consistent across tools, identify clade-defining mutations
	* [ ] Check for homoplasies (shared mutations that arise from convergent or parallel evolution rather than from common ancestry), which can result in inconsistent tree topologies
	* [ ] Mask any homoplasies that disagree with the collection date information
	* [ ] Check for important indels compared to the chosen reference (evaluate the reference)
	* [ ] Check if the [root needs to be different than the reference](https://docs.nextstrain.org/en/latest/guides/bioinformatics/root-and-ref-seqs.html) in the guide tree
* [ ] Develop Nextclade dataset files (reference sequence, `pathogen.json`, etc.)
* [ ] Implement the Nextclade workflow
* [ ] Implement rules that test the Nextclade dataset using example data
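
The "compare results against other tools" step could be sketched roughly as follows, assuming each tool's calls have already been parsed into a `{strain: genotype}` dictionary. The function name `compare_calls` and the strain names are hypothetical, and the calls shown are toy values, not real results.

```python
from collections import Counter


def compare_calls(new_calls, reference_calls):
    """Compare genotype calls for strains typed by both tools.

    Returns (concordance, mismatches), where mismatches counts each
    (new_call, reference_call) pair that disagrees.
    """
    shared = sorted(set(new_calls) & set(reference_calls))
    mismatches = Counter()
    agree = 0
    for strain in shared:
        new, ref = new_calls[strain], reference_calls[strain]
        if new == ref:
            agree += 1
        else:
            mismatches[(new, ref)] += 1
    # Concordance is undefined when no strain was typed by both tools.
    concordance = agree / len(shared) if shared else float("nan")
    return concordance, mismatches


# Toy example data (made up for illustration only).
nextclade_calls = {"s1": "GII.4", "s2": "GII.17", "s3": "GI.3"}
genome_detective_calls = {"s1": "GII.4", "s2": "GII.4", "s4": "GI.3"}

concordance, mismatches = compare_calls(nextclade_calls, genome_detective_calls)
print(f"concordance: {concordance:.2f}")
for (new, ref), count in mismatches.most_common():
    print(f"{new} vs {ref}: {count}")
```

Running the same comparison once for `ORF1_type` and once for `ORF2_type`, and once per external tool, would show which genotypes disagree most often and therefore where the guide tree may need more reference sequences.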
