Commit fcda4b1

WIP strain name matching

1 parent c34c448 commit fcda4b1

File tree

6 files changed: +430, -0 lines changed
Lines changed: 35 additions & 0 deletions

# Strain name matching

With our move away from fauna to a curated all-influenza ingest pipeline we have introduced a number of strain name changes. This immediately presents a problem, as we have myriad hardcoded lists of strain names, such as outliers-to-drop lists and force-include lists. This workflow (which is rather ad hoc!) attempts to match up the old (i.e. fauna) strain names with their updated strain names.

We use a combination of fuzzy matching and, where possible, a hardcoded map of fauna strain names to new strain names.
## How to run

```
cd ingest
snakemake --cores 4 --snakefile build-configs/strain-name-matching/Snakefile -pf -n
```
## What files you'll need

* The original fauna metadata files, e.g. (from the base directory) `aws s3 cp s3://nextstrain-data-private/files/workflows/seasonal-flu/h1n1pdm/metadata.tsv.xz - | xz -c -d > data/h1n1pdm/metadata.tsv`. Note that this file won't represent fauna data forever!

* A number of files from the normal ingest pipeline: `snakemake --cores 2 --config gisaid_pairs='["gisaid_cache"]' -pf data/curated_gisaid.ndjson data/avian-flu/curated_gisaid.ndjson results/h3n2/metadata.tsv results/h1n1pdm/metadata.tsv results/vic/metadata.tsv results/yam/metadata.tsv`
## Hardcoded strain maps

For seasonal-flu datasets we can query the EPI_ISLs of the fauna data against the curated data to create lookups.
This table reports how many of the fauna strain names have matches in our data. For avian-flu it's a little trickier, so we leverage the existing diff-avian-flu script to create lookups.

| Dataset   | Updated | Missing | Unchanged |
| --------- | ------- | ------- | --------- |
| H1N1pdm   | 1,729   | 1,659   | 149,143   |
| H3N2      | 2,125   | 2,674   | 177,009   |
| vic       | 1,338   | 286     | 66,283    |
| yam       | 367     | 10      | 21,946    |
| avian-flu | 34,426  | 1,531   | 26,754    |
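The counts above come from joining fauna and curated records on EPI_ISL and classifying each fauna strain. As a toy sketch of that classification (made-up records; the real work is done by `match-strain-names.py` and `diff-avian-flu.py` below):

```python
def classify(fauna: dict[str, str], curated: dict[str, str]) -> dict[str, int]:
    """Join fauna and curated {EPI_ISL: strain} maps and tally name changes."""
    counts = {"updated": 0, "missing": 0, "unchanged": 0}
    for epi_isl, fauna_name in fauna.items():
        new_name = curated.get(epi_isl)
        if new_name is None:
            counts["missing"] += 1      # EPI_ISL absent from curated data
        elif new_name == fauna_name:
            counts["unchanged"] += 1    # same strain name in both
        else:
            counts["updated"] += 1      # name changed; goes into the strain map
    return counts

fauna = {"EPI_ISL_1": "A/Lyon/123/2020", "EPI_ISL_2": "A/Texas/4/2021", "EPI_ISL_3": "A/Gone/1/2019"}
curated = {"EPI_ISL_1": "A/Lyon/123/2020", "EPI_ISL_2": "A/Texas/04/2021"}
print(classify(fauna, curated))  # {'updated': 1, 'missing': 1, 'unchanged': 1}
```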
Lines changed: 83 additions & 0 deletions
include: "../../rules/remote_files.smk"

import os
configfile: os.path.join(workflow.basedir, "config.yaml")

print(f"{config=}")

rule all:
    input:
        files = [f"results/strain-name-matching/{d}/{f}" for d in config['strain-lists'] for f in config['strain-lists'][d]]


rule metadata_strains:
    input:
        "results/{dataset}/metadata.tsv",
    output:
        "results/strain-name-matching/{dataset}/metadata.txt"
    run:
        from augur.io import read_metadata
        m = read_metadata(input[0])
        with open(output[0], 'w') as fh:
            print("\n".join(m.index.tolist()), file=fh)

rule compute_strain_maps_via_epi_isl_lookup:
    """
    Computes maps of (old) fauna strain names to (new) curated strain names, where possible.
    Fauna inputs via (e.g.) "aws s3 cp s3://nextstrain-data-private/files/workflows/seasonal-flu/vic/metadata.tsv.xz - | xz -c -d > data/vic/metadata.tsv"
    """
    input:
        fauna=lambda w: config['strain-maps'][w.dataset],
        curated="data/curated_gisaid.ndjson", # hardcoded
    output:
        tsv = "results/strain-name-matching/{dataset}/strain-name-map.tsv"
    wildcard_constraints:
        dataset = "h3n2|h1n1pdm|vic|yam"
    shell:
        """
        ./scripts/match-strain-names.py \
            --original-metadata {input.fauna} \
            --new-metadata {input.curated} \
            --changed {output.tsv}
        """

rule compute_strain_maps_via_avian_flu_diff:
    """
    Computes maps of (old) fauna strain names to (new) curated strain names, where possible.
    """
    input:
        fauna=lambda w: config['strain-maps'][w.dataset],
        curated="data/avian-flu/curated_gisaid.ndjson", # hardcoded
    output:
        tsv = "results/strain-name-matching/{dataset}/strain-name-map.tsv"
    wildcard_constraints:
        dataset = "avian-flu"
    shell:
        """
        ./scripts/diff-avian-flu.py \
            --truth {input.fauna} \
            --query {input.curated} \
            --output-strain-map {output.tsv}
        """

def strain_map(wildcards):
    if config['strain-maps'].get(wildcards.dataset, False):
        # Fill in the dataset explicitly rather than relying on wildcard expansion
        return f"results/strain-name-matching/{wildcards.dataset}/strain-name-map.tsv"
    return []

rule match_strains:
    input:
        metadata="results/strain-name-matching/{dataset}/metadata.txt",
        strains=lambda w: path_or_url(config['strain-lists'][w.dataset][w.file_type]),
        strain_map_tsv=strain_map,
    output:
        strains="results/strain-name-matching/{dataset}/{file_type}",
    shell:
        """
        ./scripts/strain-name-fuzzer.py \
            --curated-strains {input.metadata:q} \
            --query-strains {input.strains:q} \
            --strain-map {input.strain_map_tsv:q} \
            > {output.strains:q}
        """
Lines changed: 42 additions & 0 deletions
strain-maps:
  # The value here points to the FAUNA data
  h1n1pdm: "../data/{dataset}/metadata.tsv"
  h3n2: "../data/{dataset}/metadata.tsv"
  vic: "../data/{dataset}/metadata.tsv"
  yam: "../data/{dataset}/metadata.tsv"
  avian-flu: "targets/avian-flu-fauna-ha.tsv"


strain-lists:
  h3n2:
    "outliers.txt": "../config/h3n2/outliers.txt"
    "reference_strains.txt": "../config/h3n2/reference_strains.txt"
    "ha_prioritised_seqs_file.tsv": "../config/h3n2/ha/prioritized_seqs_file.tsv"

  vic:
    "outliers.txt": "../config/vic/outliers.txt"
    "reference_strains.txt": "../config/vic/reference_strains.txt"

  yam:
    "outliers.txt": "../config/yam/outliers.txt"
    "reference_strains.txt": "../config/yam/reference_strains.txt"

  h1n1pdm:
    "outliers.txt": "../config/h1n1pdm/outliers.txt"
    "reference_strains.txt": "../config/h1n1pdm/reference_strains.txt"

  avian-flu:
    h5n1_dropped_strains_h5n1.txt: https://raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h5n1/dropped_strains_h5n1.txt
    h5n1_include_strains_h5n1_2y.txt: https://raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h5n1/include_strains_h5n1_2y.txt
    h5n1_include_strains_h5n1_all-time.txt: https://raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h5n1/include_strains_h5n1_all-time.txt

    h5nx_dropped_strains_h5nx.txt: https://raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h5nx/dropped_strains_h5nx.txt
    h5nx_include_strains_h5nx_2y.txt: https://raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h5nx/include_strains_h5nx_2y.txt
    h5nx_include_strains_h5nx_all-time.txt: https://raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h5nx/include_strains_h5nx_all-time.txt

    h7n9_dropped_strains_h7n9.txt: https://raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h7n9/dropped_strains_h7n9.txt
    h7n9_include_strains_h7n9_all-time.txt: https://raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h7n9/include_strains_h7n9_all-time.txt

    h9n2_dropped_strains_h9n2.txt: https://raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h9n2/dropped_strains_h9n2.txt
    h9n2_include_strains_h9n2_all-time.txt: https://raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h9n2/include_strains_h9n2_all-time.txt
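`path_or_url` (included from `remote_files.smk`, which isn't part of this commit) is what allows the avian-flu entries above to be URLs while the seasonal-flu entries are local paths. A minimal sketch of the distinction it has to draw (hypothetical — the real helper presumably also arranges for remote files to be fetched):

```python
def is_remote(value: str) -> bool:
    """True for http(s) URLs, False for local paths."""
    return value.startswith(("http://", "https://"))

print(is_remote("../config/h3n2/outliers.txt"))  # False: a plain relative path
```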

ingest/scripts/diff-avian-flu.py

Lines changed: 11 additions & 0 deletions

@@ -435,6 +435,7 @@ def compare_records(truth_records, query_records):
 parser = argparse.ArgumentParser(description=__doc__)
 parser.add_argument("--truth", required=True, metavar="FILE", help="Source of truth file (TSV)")
 parser.add_argument("--query", required=True, metavar="FILE", help="Query file (NDJSON)")
+parser.add_argument("--output-strain-map", required=False, metavar="FILE", help="Output a map of strain names (FAUNA -> NEW)")

 args = parser.parse_args()

@@ -454,6 +455,16 @@ def compare_records(truth_records, query_records):
 remap_via_simplified_strain(missing_keys, added_keys, truth_records, query_records)
 (truth_keys, query_keys, missing_keys, added_keys, common_keys) = compare_records(truth_records, query_records)

+if args.output_strain_map:
+    print(f"\nWriting out strain map of these {len(common_keys)} matches if the keys differ")
+    with open(args.output_strain_map, 'w') as fh:
+        for key in common_keys:
+            truth_strain = truth_records[key]['strain']  # FAUNA
+            query_strain = query_records[key]['strain']  # CURATED NDJSON
+            if truth_strain != query_strain:
+                print(f"{truth_strain}\t{query_strain}", file=fh)
+    print("DONE.")
+
 print()
 # Allow missing "truth" keys IF --extra-truth, otherwise report the rows
Lines changed: 51 additions & 0 deletions

#!/usr/bin/env python3
"""
Attempts to link strain names via a common ID
"""
import argparse
import sys
from augur.io import read_metadata
from augur.io.json import load_ndjson

type EpiIsl = str
type StrainName = str

def parse_existing_metadata(fname: str) -> dict[EpiIsl, StrainName]:
    df = read_metadata(fname)
    return dict(zip(df['gisaid_epi_isl'], df.index))

def parse_ndjson(fname: str) -> dict[EpiIsl, StrainName]:
    with open(fname) as fh:
        # load_ndjson reads lazily, so consume it while the file is still open
        records = load_ndjson(fh)
        return {record['gisaid_epi_isl']: record['strain'] for record in records}


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--original-metadata", required=True, help="Original (fauna) metadata TSV with 'strain' and 'gisaid_epi_isl' columns")
    parser.add_argument("--new-metadata", required=True, help="Curated NDJSON")
    parser.add_argument("--changed", required=True, help="TSV of strain names which have changed. Map is FAUNA -> NEW")
    args = parser.parse_args()

    fauna = parse_existing_metadata(args.original_metadata)
    curated = parse_ndjson(args.new_metadata)

    unchanged, changed, missing = 0, 0, 0
    with open(args.changed, 'w') as changed_fh:
        print("FAUNA_STRAIN\tCURATED_STRAIN", file=changed_fh)
        for epi_isl, fauna_name in fauna.items():
            new_name = curated.get(epi_isl, None)
            if new_name is None:
                missing += 1
                continue
            if fauna_name == new_name:
                unchanged += 1
            else:
                changed += 1
                print(f"{fauna_name}\t{new_name}", file=changed_fh)

    print(f"Using {args.original_metadata} as source of truth...", file=sys.stderr)
    print(f"{missing=:,}", file=sys.stderr)
    print(f"{changed=:,}", file=sys.stderr)
    print(f"{unchanged=:,}", file=sys.stderr)
