Description
Context
Nextclade references that appear on clades.nextstrain.org live in this repository in the nextclade/dataset_config/ directory. In principle, users can point their build configurations at these files to run the workflow with different standard references. However, we organize Nextclade references by lineage, segment, and accession, for example nextclade/dataset_config/h3n2/ha/EPI1857216/reference.fasta and nextclade/dataset_config/h3n2/na/EPI1857215/reference.fasta. Because the accession differs for each segment, users cannot refer to these files with a single lineage-and-segment path in their build configurations like config/h3n2/{segment}/reference.fasta. Since each build defines parameters for all segments at once, there is no way to make a multiple-segment build from the Nextclade references with the current configuration schema.
This issue arose through discussion of #208.
Description
We should provide an easy way for users to refer to Nextclade references in the dataset_config, so they can make builds based on, for example, A/Darwin/6/2021 instead of A/Wisconsin/67/2005.
Possible solution
We could change the schema for the build config such that we define any segment-specific parameters with a nested structure. For example, we could define the reference and annotation parameters for a build like this:
    reference:
      ha: "config/h3n2/ha/reference.fasta"
      na: "config/h3n2/na/reference.fasta"
    annotation:
      ha: "config/h3n2/ha/genemap.gff"
      na: "config/h3n2/na/genemap.gff"

Although this form is more verbose and redundant in the example above, it would let users define references that live at arbitrary paths, including the Nextclade references, instead of only paths that fit the layout of the workflow's own configuration. Then, we could define Nextclade references for H3N2 HA and NA like this:
    reference:
      ha: "nextclade/dataset_config/h3n2/ha/EPI1857216/reference.fasta"
      na: "nextclade/dataset_config/h3n2/na/EPI1857215/reference.fasta"
    annotation:
      ha: "nextclade/dataset_config/h3n2/ha/EPI1857216/annotation.gff"
      na: "nextclade/dataset_config/h3n2/na/EPI1857215/annotation.gff"

We could implement this change as a backward-compatible alternative to the current implementation by checking the type of the reference and annotation fields.
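A rough sketch of what that type check might look like, in Python since the workflow is Snakemake-based. The helper name `resolve_segment_path` is hypothetical, and the sketch assumes the current schema stores a single string that may contain a `{segment}` placeholder; the workflow's real lookup code may differ.

```python
from collections.abc import Mapping


def resolve_segment_path(value, segment):
    """Return the path for one segment (hypothetical helper, not existing workflow code).

    `value` is either a single string (assumed current schema, optionally
    containing a "{segment}" placeholder) or a mapping of segment name to
    path (the nested schema proposed in this issue).
    """
    if isinstance(value, str):
        # Current-style value: one template shared by all segments.
        return value.format(segment=segment)
    if isinstance(value, Mapping):
        # Proposed-style value: one explicit path per segment.
        if segment not in value:
            raise KeyError(f"no path defined for segment '{segment}'")
        return value[segment]
    raise TypeError(f"expected a string or mapping, got {type(value).__name__}")


# Both styles resolve through the same call.
old_style = "config/h3n2/{segment}/reference.fasta"
new_style = {
    "ha": "nextclade/dataset_config/h3n2/ha/EPI1857216/reference.fasta",
    "na": "nextclade/dataset_config/h3n2/na/EPI1857215/reference.fasta",
}
print(resolve_segment_path(old_style, "ha"))
print(resolve_segment_path(new_style, "na"))
```

Existing build configs that use a single string would keep working unchanged, and only builds that need per-segment paths would opt into the nested form.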