
Allow users to run the workflow with Nextclade references for multiple segments #209

@huddlej


Context

Nextclade references that appear on clades.nextstrain.org live in this repository in the nextclade/dataset_config/ directory. In principle, users can point their build configurations at these files to run the workflow with different standard references. However, we organize Nextclade references by lineage, segment, and accession, as in nextclade/dataset_config/h3n2/ha/EPI1857216/reference.fasta and nextclade/dataset_config/h3n2/na/EPI1857215/reference.fasta. Because the accession differs for each segment, users cannot refer to these files with a single lineage-and-segment path template in their build configurations like config/h3n2/{segment}/reference.fasta. And because each build defines parameters for all segments at once, there is no way to make a multiple-segment build from these references with the current configuration schema.
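To make the mismatch concrete, here is a minimal sketch. It assumes the workflow fills a `{segment}` placeholder in a single path template; Python's `str.format` stands in for whatever wildcard expansion the workflow actually uses, and the template pointing into nextclade/dataset_config/ is hypothetical.

```python
# Hypothetical template a user might try to point at the Nextclade references.
# Only {segment} varies, so there is no place to put the per-segment accession.
template = "nextclade/dataset_config/h3n2/{segment}/reference.fasta"

# Actual locations of the Nextclade references quoted in this issue.
nextclade_paths = {
    "ha": "nextclade/dataset_config/h3n2/ha/EPI1857216/reference.fasta",
    "na": "nextclade/dataset_config/h3n2/na/EPI1857215/reference.fasta",
}

# Fill the template for each segment (str.format is an assumed stand-in
# for the workflow's own wildcard expansion).
resolved = {segment: template.format(segment=segment) for segment in nextclade_paths}

# Collect the segments whose resolved path misses the real file.
mismatched = [
    segment for segment in nextclade_paths
    if resolved[segment] != nextclade_paths[segment]
]
```

Every segment ends up in `mismatched`, because the accession directory differs per segment and the template has no way to express it.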

This issue arose through discussion of #208.

Description

We should provide an easy way for users to refer to Nextclade references in nextclade/dataset_config/, so they can make builds based on, for example, A/Darwin/6/2021 instead of A/Wisconsin/67/2005.

Possible solution

We could change the schema for the build config such that we define any segment-specific parameters with a nested structure. For example, we could define the reference and annotation parameters for a build like this:

reference:
  ha: "config/h3n2/ha/reference.fasta"
  na: "config/h3n2/na/reference.fasta"
annotation:
  ha: "config/h3n2/ha/genemap.gff"
  na: "config/h3n2/na/genemap.gff"

Although this schema is more verbose and redundant for the example above, it would allow users to define references that live at arbitrary paths, including the Nextclade references, instead of only at paths that fit the layout of the workflow's own configuration. Then, we could define Nextclade references for H3N2 HA and NA like this:

reference:
  ha: "nextclade/dataset_config/h3n2/ha/EPI1857216/reference.fasta"
  na: "nextclade/dataset_config/h3n2/na/EPI1857215/reference.fasta"
annotation:
  ha: "nextclade/dataset_config/h3n2/ha/EPI1857216/annotation.gff"
  na: "nextclade/dataset_config/h3n2/na/EPI1857215/annotation.gff"

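We could implement this change as a backward-compatible alternative to the current implementation by checking the type of the reference and annotation fields: a string would keep the current behavior, and a mapping keyed by segment would use the new nested behavior.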
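A minimal sketch of that type check might look like the following. The function name `resolve_segment_path` is hypothetical and not part of the workflow, and `str.format` is an assumed stand-in for however the workflow currently expands `{segment}` in a string path.

```python
def resolve_segment_path(value, segment):
    """Return the path for one segment from either config style.

    Hypothetical helper: `value` is the build's `reference` or `annotation`
    field, and `segment` is a segment name such as "ha" or "na".
    """
    if isinstance(value, dict):
        # Proposed nested style: one explicit path per segment.
        if segment not in value:
            raise KeyError(f"No path configured for segment '{segment}'")
        return value[segment]

    if isinstance(value, str):
        # Current style: a single path, optionally with a {segment} placeholder.
        # str.format is an assumption standing in for the workflow's expansion.
        return value.format(segment=segment)

    raise TypeError(
        f"Expected a string or a mapping of segment to path, got {type(value).__name__}"
    )


# Current style keeps working unchanged.
old_style = "config/h3n2/{segment}/reference.fasta"

# Proposed style pointing at the Nextclade references from this issue.
new_style = {
    "ha": "nextclade/dataset_config/h3n2/ha/EPI1857216/reference.fasta",
    "na": "nextclade/dataset_config/h3n2/na/EPI1857215/reference.fasta",
}

old_na = resolve_segment_path(old_style, "na")
new_na = resolve_segment_path(new_style, "na")
```

Raising an error for a segment that is missing from the mapping, rather than falling back to a default path, is one possible design choice; it would surface incomplete multi-segment configurations early.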


Labels: enhancement
