Commit 7e76787

[ingest] snakemake surgery
This refactors the snakemake graph away from seasonal-flu-specific targets towards generic, config-defined outputs. The final (seasonal-flu) targets are preserved, and the config structure makes them more obvious. A target for avian-flu is added, and this should make it obvious how we could easily add new targets of interest.

We use "dataset" for these targets rather than "lineage" as the intention is to decouple the concepts, and the "avian-flu" dataset/target should make this obvious. (Aside: it's always bugged me that lineage=h3n2 also meant we restricted to human hosts, but I chose not to rename things here as that concept is very ingrained in this repo.)

The **seasonal flu** phylo builds work off this commit just fine -- you can avoid round-tripping through S3 via the following (for h3n2):

    mkdir -p data/h3n2
    cp ingest/results/h3n2/* data/h3n2/

The **avian-flu** phylo builds _work_ when using these data, but there are further things to implement, such as fixing strain name mismatches (e.g. when using LABEL metadata, include/exclude files) as well as the missing GenoFLU metadata.
1 parent bef811b commit 7e76787

File tree

9 files changed (+277, -216 lines)


ingest/Snakefile

Lines changed: 7 additions & 4 deletions
@@ -8,19 +8,22 @@ workdir: workflow.current_basedir
 # Use default configuration values. Override with Snakemake's --configfile/--config options.
 configfile: "defaults/config.yaml"
 
+VALID_DATASETS = list(config['filtering'].keys())
+
 wildcard_constraints:
-    # Expected lineages that should match the standardized output lineages
+    # Expected datasets should match the standardized outputs of the `filtering` block
+    # (example datasets are "h3n2", "avian-flu")
     # in scripts/standardized-lineage
-    lineage = r'h1n1pdm|h3n2|vic|yam',
+    dataset = r'|'.join(VALID_DATASETS),
     segment = r'pb2|pb1|pa|ha|np|na|mp|ns',
     # Constrain GISAID pair names to "gisaid_cache" or YYYY-MM-DD-N
     gisaid_pair = r'gisaid_cache|\d{4}-\d{2}-\d{2}(-\d+)?'
 
 
 rule all:
     input:
-        metadata = expand("results/{lineage}/metadata.tsv", lineage=config["lineages"]),
-        sequences = expand("results/{lineage}/{segment}.fasta", lineage=config["lineages"], segment=config["segments"]),
+        metadata = expand("results/{dataset}/metadata.tsv", dataset=VALID_DATASETS),
+        sequences = expand("results/{dataset}/{segment}.fasta", dataset=VALID_DATASETS, segment=config["segments"]),
 
 
 include: "rules/prepare_ndjson.smk"
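The new `{dataset}` wildcard is constrained to the keys of the config's `filtering` block. A minimal Python sketch of how that constraint behaves (the `config` dict here is a stand-in for `defaults/config.yaml`, not the real file; Snakemake anchors wildcard constraints, so `re.fullmatch` models the check):

```python
import re

# Stand-in for the `filtering` block of defaults/config.yaml
config = {"filtering": {"h3n2": {}, "h1n1pdm": {}, "vic": {}, "yam": {}, "avian-flu": {}}}

VALID_DATASETS = list(config["filtering"].keys())
dataset_pattern = r"|".join(VALID_DATASETS)  # "h3n2|h1n1pdm|vic|yam|avian-flu"

# Only configured dataset names satisfy the {dataset} wildcard constraint
assert re.fullmatch(dataset_pattern, "avian-flu") is not None
assert re.fullmatch(dataset_pattern, "h5n1") is None
```

Adding a new dataset to the `filtering` block therefore automatically widens both the wildcard constraint and the `rule all` targets.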

ingest/build-configs/nextstrain-automation/upload.smk

Lines changed: 10 additions & 10 deletions
@@ -18,10 +18,10 @@ def all_processed_gisaid_pairs(wildcards):
 rule upload_all:
     input:
         ndjson="results/upload/gisaid.ndjson.upload",
-        metadata=expand("results/upload/{lineage}/metadata.tsv.upload",
-                        lineage=config["lineages"]),
-        sequences=expand("results/upload/{lineage}/{segment}.fasta.upload",
-                         lineage=config["lineages"],
+        metadata=expand("results/upload/{dataset}/metadata.tsv.upload",
+                        dataset=list(config['filtering'].keys())),
+        sequences=expand("results/upload/{dataset}/{segment}.fasta.upload",
+                         dataset=list(config['filtering'].keys()),
                          segment=config["segments"]),
         mv_processed=all_processed_gisaid_pairs,
 
@@ -73,33 +73,33 @@ rule mv_processed_gisaid_pair:
 
 rule upload_metadata:
     input:
-        metadata="results/{lineage}/metadata.tsv",
+        metadata="results/{dataset}/metadata.tsv",
     output:
-        flag="results/upload/{lineage}/metadata.tsv.upload",
+        flag="results/upload/{dataset}/metadata.tsv.upload",
     params:
         s3_dst=config["s3_dst"],
     shell:
         r"""
        ./vendored/upload-to-s3 \
            --quiet \
            {input.metadata:q} \
-            {params.s3_dst:q}/{wildcards.lineage}/metadata.tsv.xz \
+            {params.s3_dst:q}/{wildcards.dataset}/metadata.tsv.xz \
            2>&1 | tee {output.flag:q}
        """
 
 
 rule upload_sequences:
     input:
-        sequences="results/{lineage}/{segment}.fasta",
+        sequences="results/{dataset}/{segment}.fasta",
     output:
-        flag="results/upload/{lineage}/{segment}.fasta.upload",
+        flag="results/upload/{dataset}/{segment}.fasta.upload",
     params:
         s3_dst=config["s3_dst"],
     shell:
         r"""
        ./vendored/upload-to-s3 \
            --quiet \
            {input.sequences:q} \
-            {params.s3_dst:q}/{wildcards.lineage}/{wildcards.segment}/sequences.fasta.xz \
+            {params.s3_dst:q}/{wildcards.dataset}/{wildcards.segment}/sequences.fasta.xz \
            2>&1 | tee {output.flag:q}
        """
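For illustration, the `expand()` calls above enumerate one upload flag per dataset × segment combination; a rough Python equivalent with stand-in values (not the repo's real config):

```python
from itertools import product

datasets = ["h3n2", "avian-flu"]  # stand-in for config['filtering'].keys()
segments = ["ha", "na"]           # stand-in for config["segments"]

# Mimics expand("results/upload/{dataset}/{segment}.fasta.upload", ...)
flags = [
    f"results/upload/{d}/{s}.fasta.upload"
    for d, s in product(datasets, segments)
]
assert "results/upload/avian-flu/ha.fasta.upload" in flags
assert len(flags) == 4  # 2 datasets x 2 segments
```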

ingest/defaults/config.yaml

Lines changed: 95 additions & 29 deletions
@@ -16,18 +16,14 @@
 # If left empty, workflow will glob for all `data/<YYYY-MM-DD-N>-metadata.xls`
 # to include all pairs as input. These will be sorted in reverse order to
 # prioritize the later downloads during deduplication.
+#
+# If you set this to "gisaid_cache" (e.g. via `--config gisaid_pairs='["gisaid_cache"]'`)
+# then the pipeline will run without consuming any xlsx/fasta files
 gisaid_pairs: []
+
 # GISAID EPI ISL field to deduplicate the GISAID records by id prior to curation
 gisaid_id_field: Isolate_Id
 
-# Expected lineages that should match the standardized output lineages
-# in scripts/standardized-lineage
-lineages:
-  - h1n1pdm
-  - h3n2
-  - vic
-  - yam
-
 segments:
   - pb2
   - pb1
@@ -70,7 +66,6 @@ curate:
   new_lineage_field: "lineage"
   lineage_annotations: "defaults/lineages.tsv"
   host_field: "host"
-  hosts_to_include: ["Human"]
   # List of date fields to standardize to ISO format YYYY-MM-DD
   date_fields: ["date", "date_submitted"]
   # List of expected date formats that are present in the date fields provided above
@@ -114,25 +109,96 @@ curate:
   # The GISAID ID field used to prioritize records during strain deduplication
   gisaid_id_field: "gisaid_epi_isl"
   # The prioritized strain ids for strain deduplication.
-  # The {lineage} is a wildcard that will be filled by Snakemake
-  prioritized_strain_ids: "defaults/{lineage}/prioritized_strain_ids.tsv"
   # Column added to metadata to annotate which strains are reference strains
   reference_column: "is_reference"
-  # The list of metadata columns to keep in the final output of the curation pipeline.
-  metadata_columns:
-    - strain
-    - gisaid_epi_isl
-    - date
-    - date_submitted
-    - region
-    - country
-    - division
-    - location
-    - passage_category
-    - originating_lab
-    - submitting_lab
-    - age
-    - gender
-    - gisaid_strain
-    - gihsn_sample
-    - is_reference
+
+
+# The *filtering* block determines how the curated (all-influenza) NDJSON is sliced and diced
+filtering:
+  # The seasonal-flu phylo workflows start from TSV/FASTAs per lineage
+  # NOTE [james]: This isn't all h3n2 as it's filtered to host=human. I toyed with the idea of naming
+  # the dataset "seasonal-h3n2" for this reason, but the "h3n2" term is so ubiquitous in the
+  # codebase that it felt egregious.
+  h3n2:
+    lineages: h3n2
+    additional_field: host
+    additional_field_values: human
+    prioritized_strain_ids: defaults/h3n2/prioritized_strain_ids.tsv
+    reference_strains: ../config/h3n2/reference_strains.txt
+    metadata_columns: &seasonal-flu-metadata-columns
+      - strain
+      - gisaid_epi_isl
+      - date
+      - date_submitted
+      - region
+      - country
+      - division
+      - location
+      - passage_category
+      - originating_lab
+      - submitting_lab
+      - age
+      - gender
+      - gisaid_strain
+      - gihsn_sample
+      - is_reference
+  h1n1pdm:
+    lineages: h1n1pdm
+    additional_field: host
+    additional_field_values: human
+    reference_strains: ../config/h1n1pdm/reference_strains.txt
+    metadata_columns: *seasonal-flu-metadata-columns
+  vic:
+    lineages: vic
+    additional_field: host
+    additional_field_values: human
+    reference_strains: ../config/vic/reference_strains.txt
+    metadata_columns: *seasonal-flu-metadata-columns
+  yam:
+    lineages: yam
+    additional_field: host
+    additional_field_values: human
+    reference_strains: ../config/yam/reference_strains.txt
+    metadata_columns: *seasonal-flu-metadata-columns
+
+  # The avian-flu workflows do the filtering themselves via `augur filter`
+  # using the config key `subtype_query`. To prevent too many changes for the present time
+  # we're going to continue to provision one big "avian-flu" dataset. (We may want to change
+  # this in the future.)
+  avian-flu:
+    lineages:
+      - h5nx # this'll include more than avian-flu currently subsamples to, but that's ok!
+      - h7n9
+      - h9n2
+    metadata_columns: # <https://github.com/nextstrain/avian-flu/blob/f963447179c2b500b5598f056054374d3c9557a0/ingest/rules/ingest_fauna.smk#L37>
+      - strain
+      # Note: Fauna's 'virus' field (always "avian-flu") dropped
+      # Note: Fauna's 'isolate_id' remapped to 'accession_ha'
+      - date
+      - region
+      - country
+      - division
+      - location
+      - host
+      - domestic_status
+      - subtype # Note: identical to *lineage* for all A-type, non-h1n1pdm viruses
+      - originating_lab
+      - submitting_lab
+      - authors
+      - PMID
+      - gisaid_clade
+      # Note: h5 clade no longer available
+      - pathogenicity # newly added (not used in fauna)
+
+  # This dataset (for testing purposes only) replicates the previous "data/seasonal_flu.ndjson"
+  # (target: `data/seasonal-flu-for-diffing/curated_gisaid.ndjson`)
+  # Note that it won't go all the way to TSV/FASTA as it's missing config params
+  # seasonal-flu-for-diffing:
+  #   lineages:
+  #     - h3n2
+  #     - h1n1pdm
+  #     - vic
+  #     - yam
+  #   additional_field: host
+  #   additional_field_values: human

ingest/defaults/h1n1pdm/prioritized_strain_ids.tsv

Lines changed: 0 additions & 1 deletion
This file was deleted.

ingest/defaults/vic/prioritized_strain_ids.tsv

Lines changed: 0 additions & 1 deletion
This file was deleted.

ingest/defaults/yam/prioritized_strain_ids.tsv

Lines changed: 0 additions & 1 deletion
This file was deleted.
