work needed to allow users to run analyses separate to pathogen workflows #74

@jameshadfield

Updated after dev-chat discussion 2025-05-29/30. Previously based on these meeting notes, 2024-11-21/22.

This is an overview issue around the tasks, topics and ideas related to external collaborators/users running workflows separate to the pathogen-repo itself.

  1. external directory support
  • The CLI now supports nextstrain {setup,run} etc. There's a big list of nice-to-haves, but this part is essentially done.
  • Measles supports this interface fully. Avian-flu works for nextstrain run, but doesn't conform to the standard file structure so nextstrain setup shows errors.
    • Jover is planning on pushing this out to zika etc.
  • [implementation] we are using the shared repo to vendor common code
  1. merging private data
  • Running list of pathogens supporting this in various capacities: Provide a generic pattern for including additional user data alongside curated data #72 (comment)
  • I think we're all happy with the interface used in avian-flu being used more widely
    • Implementation may be improved by using shared code for remote resources or Snakemake v8's support
  • Not going to contemplate curation of user data at this stage, but if we do, the proposal was to enable a config hook such that private data (in the analysis directory) is passed through a user-defined program (etc.) before the merge
  • PGCoE aim for 25-26 is to link up subsampling and private data
  • Potentially our default inputs should include an (optional) default location for private data, i.e. if these files exist in the analysis directory then they'll be used without needing to write a config overlay.
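
The "default location for private data" idea above could look roughly like the following. This is a minimal sketch only: the `private/` file names, the `resolve_inputs` helper and the shape of the input entries are all hypothetical and are not taken from any pathogen repo.

```python
from pathlib import Path

# Hypothetical default locations, relative to the analysis directory.
# The issue does not specify any file names; these are placeholders.
DEFAULT_PRIVATE_INPUTS = {
    "metadata": "private/metadata.tsv",
    "sequences": "private/sequences.fasta",
}


def resolve_inputs(analysis_dir, curated_inputs, defaults=DEFAULT_PRIVATE_INPUTS):
    """Return the curated inputs, plus one extra 'private' input if any of the
    default private-data files exist in the analysis directory."""
    analysis_dir = Path(analysis_dir)
    found = {
        kind: str(analysis_dir / rel)
        for kind, rel in defaults.items()
        if (analysis_dir / rel).is_file()
    }
    if not found:
        # Nothing at the default location: behave exactly as before.
        return list(curated_inputs)
    return [*curated_inputs, {"name": "private", **found}]
```

With this shape, a user who drops files into the default location gets them merged without writing a config overlay, and a user who does nothing sees no change in behaviour.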
  1. Generalized subsampling
  • PGCoE aim for 24-25 to have a general augur subsample command. No proximity needed at this stage.
  • We added weighted sampling already and we need to work out what use-cases can be achieved by this alone (i.e. clarify where we actually need a subsampling command)
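
To help reason about which use-cases weighted sampling alone can cover, here is a small illustration of weighted sampling without replacement. It is not augur's implementation; it uses the Efraimidis-Spirakis key method (draw `u` uniformly, rank by `u ** (1 / weight)`), and the function name and arguments are made up for this sketch.

```python
import random


def weighted_sample(records, weights, k, seed=0):
    """Pick up to k records without replacement, favouring larger weights.

    Each record gets the key u ** (1 / weight) for a uniform random u;
    the k records with the largest keys are returned.
    """
    rng = random.Random(seed)
    keyed = []
    for record, weight in zip(records, weights):
        if weight <= 0:
            # A non-positive weight means the record is never sampled.
            continue
        keyed.append((rng.random() ** (1.0 / weight), record))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in keyed[:k]]
```

For example, giving every strain from one region a weight of 5 and every other strain a weight of 1 over-represents that region without fixing its exact count, which is the kind of request that may not need a separate subsampling command.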
  1. consistent config syntax
  • Big picture config syntax stuff is probably a long-term thing, and not blocking here
    • First step: try out the globbing syntax on a non-wildcard repo
  • We need a way to encode a null value, and it'd probably be good to standardise this
  • Ultimately each repo will need to have its own docs…
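
One possible convention for the null value mentioned above, sketched here purely for discussion: a `null` in a user's config overlay (Python `None` once the YAML is loaded) removes the corresponding default rather than being stored. The `merge_config` helper and the example keys are hypothetical; no repo is known to implement exactly this.

```python
def merge_config(base, overlay):
    """Recursively merge overlay into base and return a new dict.

    Assumed convention: a value of None in the overlay deletes that key
    from the merged result instead of setting it to None.
    """
    merged = dict(base)
    for key, value in overlay.items():
        if value is None:
            merged.pop(key, None)
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Made-up keys: the overlay switches off a default without touching its sibling.
defaults = {"filter": {"min_length": 100, "exclude": "exclude.txt"}}
overlay = {"filter": {"exclude": None}}
result = merge_config(defaults, overlay)
```

Here `result` keeps `min_length` and drops `exclude`. Whatever convention is chosen, the point is that every repo treats "explicitly unset" the same way.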
  1. Workflow versioning, docs etc
  • Immediately we should have a changelog and description of how to run (markdown's fine) in each repo we expect to run via external analysis directories
  • Longer term (medium term?), docs.nextstrain.org will have repo-specific (pathogen-specific) docs sub-projects, linked into reference docs for shared functionality.
    • Aim: each mature pathogen repo has such a docs project
    • Easiest way to roll this out may be a skeleton in the repo guide, but avoid the situation where placeholder text will make it into pathogen repos themselves
  • We’ve played with JSON schemas (for the config) and auto-generating HTML docs from this. That effort wasn’t successful enough to merge but I think this will be where we eventually end up.
  • Also consider implementing one-off checks within code, e.g. when making changes to the configs
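
As a rough idea of what a one-off check within code might look like while a schema-based approach is still unsettled, the sketch below validates a config by hand and collects readable error messages. The `inputs` key, its `name`/`metadata`/`sequences` fields and the `check_config` function are assumptions made for illustration, not an existing config format.

```python
def check_config(config):
    """Return a list of human-readable problems; an empty list means the
    config passed these (hypothetical) checks."""
    errors = []
    inputs = config.get("inputs")
    if not isinstance(inputs, list) or not inputs:
        errors.append("'inputs' must be a non-empty list")
        return errors
    seen = set()
    for i, entry in enumerate(inputs):
        if not isinstance(entry, dict):
            errors.append(f"inputs[{i}] must be a mapping")
            continue
        name = entry.get("name")
        if not isinstance(name, str) or not name:
            errors.append(f"inputs[{i}] needs a non-empty string 'name'")
        elif name in seen:
            errors.append(f"inputs[{i}] repeats the name {name!r}")
        else:
            seen.add(name)
        if "metadata" not in entry and "sequences" not in entry:
            errors.append(f"inputs[{i}] must provide 'metadata' or 'sequences'")
    return errors
```

A workflow could call such a function at start-up and abort with the collected messages, which gives users clearer feedback than a failure deep inside a rule.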
