Support dynamic CLI calls through config centralization #98

@victorlin

Motivation

copied from nextstrain/ebola@a74f20e

Our current approach of connecting config YAML keys/values to a command line invocation requires a large amount of in-the-middle code, notably the params block in the Snakemake rule. A good example of this is nextstrain/ebola@37aced5.

This in-the-middle code is both painful to write and inherently per-repo.

Description

slightly modified version of nextstrain/ebola@a74f20e

Parse the config YAML into per-rule command-line arguments. This code, and the mapping from YAML key names to argument names, could be centralised or read in from a schema (etc), which has the additional benefit of consistent YAML key names across repos. Essentially we shift the cost to a centralised bit of code and make the per-repo 'rule' block much simpler to write, and more powerful, as it has access to a larger range of config options than we'd typically write by hand. (E.g. no more commits like the one linked above.) Furthermore, nothing here stops a repo-specific addition to the config, or repo-specific rules / functionality.
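As a rough illustration of the idea, here is a minimal sketch of turning one rule's config block into an argument string. This is not the prototype from the linked commit: the function name `config_to_cli_args`, the `key_to_flag` override table, and the default convention of mapping `some_key` to `--some-key` are all assumptions made for this example.

```python
import shlex


def config_to_cli_args(config, rule_name, key_to_flag=None):
    """Build a shell-quoted argument string from config[rule_name].

    Hypothetical helper: keys map to flags as some_key -> --some-key
    unless an override is given in key_to_flag.
    """
    key_to_flag = key_to_flag or {}
    parts = []
    for key, value in config.get(rule_name, {}).items():
        flag = key_to_flag.get(key, "--" + key.replace("_", "-"))
        if value is None or value is False:
            continue  # unset or disabled option: emit nothing
        if value is True:
            parts.append(flag)  # boolean switch with no value
        elif isinstance(value, (list, tuple)):
            parts.append(flag)  # one flag followed by several values
            parts.extend(shlex.quote(str(v)) for v in value)
        else:
            parts.extend([flag, shlex.quote(str(value))])
    return " ".join(parts)


example_config = {
    "filter": {
        "min_length": 5000,
        "group_by": ["country", "year"],
        "exclude": None,
    }
}
print(config_to_cli_args(example_config, "filter"))
```

Quoting each value with `shlex.quote` keeps values containing spaces or shell metacharacters intact when the string is interpolated into a `shell:` block.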

A side-effect of this implementation is that it leads us towards a YAML structure where each rule has an associated block. This is already the dominant style in many repos (ebola, avian-flu, etc), although we often store keys such as id_column and files.x elsewhere. With YAML anchors I think it's plausible that we have all config variables for a rule stored under the rule's name, which is clarifying for the author and should help with our efforts to document and describe the config YAML.

Relevant work

augur subsample effectively does this for augur filter, and can be used as a drop-in replacement (e.g. measles).

rule filter:
    input:
        config="path/to/config.yaml"
    shell:
        """
        augur subsample --config {input.config}
        """

@jameshadfield has prototyped this using a workflow-level helper function get_config_args() in nextstrain/ebola@a74f20e.

rule filter:
    params:
        args = get_config_args("filter")
    shell:
        """
        augur filter {params.args}
        """

Special consideration: filepath values

1. Resolving filepaths

Files are typically given as relative filepaths to CLI arguments. To support both defaults and custom user files, these relative filepaths are searched for in multiple directories using a helper function. Example:

rule translate:
    input:
        reference = resolve_config_path(config["files"]["reference"])
    shell:
        """
        augur translate --reference-sequence {input.reference}
        """

With dynamic inputs/params, this filepath resolution would need to happen elsewhere. More discussion on this in nextstrain/public#23.
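To make the search behaviour concrete, the following is a minimal sketch of resolving a relative filepath against an ordered list of directories. It is not the actual `resolve_config_path` implementation: the name `find_in_search_dirs` and the "first existing match wins" rule are assumptions for illustration only.

```python
from pathlib import Path


def find_in_search_dirs(relative_path, search_dirs):
    """Return the first existing match for relative_path.

    Hypothetical helper: directories are tried in the order given, so
    listing a user directory before the defaults lets custom files win.
    """
    for directory in search_dirs:
        candidate = Path(directory) / relative_path
        if candidate.exists():
            return str(candidate)
    searched = ", ".join(str(d) for d in search_dirs)
    raise FileNotFoundError(f"{relative_path!r} not found in: {searched}")
```

A centralised argument builder could apply a function like this to every value known to be a filepath before it assembles the command line.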

2. Snakemake's input change detection

brought up by @j23414 in nextstrain/mumps#45 (comment)

When a filepath is defined in Snakemake's input block, the mtime rerun trigger allows Snakemake to intelligently rerun the workflow upon modifications to the file. This is not possible with filepaths defined elsewhere, such as in the current augur subsample and get_config_args() implementations.

The functionality could potentially be brought back with a workflow-level helper function that dynamically adds inputs.
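One possible shape for such a helper is sketched below. It only picks the file-valued options out of a rule's config block so that they can be declared as inputs. The name `config_path_inputs` and the idea of passing an explicit list of path-valued keys (which could equally come from a schema) are assumptions, not an existing API.

```python
def config_path_inputs(config, rule_name, path_keys):
    """Collect the file-valued options of config[rule_name].

    Hypothetical helper: path_keys names the options whose values are
    filepaths. Keys that are missing or empty in the config are skipped.
    """
    rule_config = config.get(rule_name, {})
    inputs = {}
    for key in path_keys:
        value = rule_config.get(key)
        if value:
            inputs[key] = value  # a single path or a list of paths
    return inputs


example_config = {
    "filter": {
        "exclude": "defaults/exclude.txt",
        "min_length": 5000,
    }
}
print(config_path_inputs(example_config, "filter", ["exclude", "include"]))
```

In a Snakefile, the returned dict could then be handed to the rule's input block, for example through an input function wrapped in `unpack()`, so that Snakemake tracks those files for changes.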

See also: nextstrain/public#31
