Description
Context
We allow users to define their own reference FASTA and annotation GFF in their build configurations. However, clade and subclade definitions use nucleotide positions from a specific reference sequence. When the user's reference sequence differs in coordinates from the reference used to define clades and subclades, they need a way to offset the clade coordinates for their custom reference. Since the workflow downloads clades from GitHub, we should not encourage users to locally modify clade definition files that will likely be overwritten by subsequent runs of the workflow.
The Nextclade workflow deals with this same issue by introducing a custom rule to offset the clade coordinates using a configuration-based offset value (e.g., the offset for H3N2 HA reference A/Darwin/6/2021).
This issue arose through discussion of #208.
Description
We should allow users to define custom nucleotide offsets for their custom references, so clade and subclade definitions work as expected for their builds.
Possible solution
One solution would be to copy the offset_clades rule from Nextclade into the core workflow such that we always generate the offset clades files even when no offset is defined by the user. We could allow users to define a build-level offset per reference sequence. For example, if we used this pattern I've recommended for defining segment-specific parameters, we could support an optional build-level field for reference_offset that could look like this:
reference:
  ha: "nextclade/dataset_config/h3n2/ha/EPI1857216/reference.fasta"
  na: "nextclade/dataset_config/h3n2/na/EPI1857215/reference.fasta"
reference_offset:
  ha: -17

Since the default offset would be 0, users would only need to define nonzero offsets in this way.
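
As a rough illustration of what such a rule might do, here is a minimal Python sketch. It is not the existing offset_clades rule from the Nextclade workflow. It assumes the clade definitions are a tab-separated table with the columns clade, gene, site, and alt, that nucleotide rows are marked with gene "nuc", and that the offset is added to the original site. The function names get_reference_offset and offset_clades are hypothetical, and the sign convention would need to be checked against the Nextclade rule.

```python
import csv
import io


def get_reference_offset(build_config, segment):
    """Look up the optional build-level offset for a segment, defaulting to 0."""
    return build_config.get("reference_offset", {}).get(segment, 0)


def offset_clades(tsv_text, offset=0):
    """Shift nucleotide sites in a clade definition table by `offset`.

    Assumes columns clade, gene, site, alt and that nucleotide rows use
    gene == "nuc". Amino acid rows and clade inheritance rows are copied
    unchanged because their coordinates do not depend on the nucleotide
    numbering of the reference.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    output = io.StringIO()
    writer = csv.DictWriter(
        output,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        if row["gene"] == "nuc":
            row["site"] = str(int(row["site"]) + offset)
        writer.writerow(row)
    return output.getvalue()


# Hypothetical build configuration mirroring the example above.
build = {"reference_offset": {"ha": -17}}

# Made-up clade rows, for illustration only.
clades = (
    "clade\tgene\tsite\talt\n"
    "example\tnuc\t100\tA\n"
    "example\tHA1\t159\tY\n"
)

ha_clades = offset_clades(clades, get_reference_offset(build, "ha"))
na_clades = offset_clades(clades, get_reference_offset(build, "na"))
```

Because the lookup falls back to 0, running the same step for a segment without a configured offset (na in this sketch) returns the table unchanged, which matches the idea of always generating the offset clades files.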