Sync these up with the source code. In fetch_from_ncbi, the reference to 'config.sources' seems to be outdated – the code is hardcoded to fetch from GenBank.
Generalize the name before adding rules for other data sources.
This is largely inspired by ebola ingest¹ which recently switched to Pathoplexus data. Many parts were copied directly with adjustments to conform to the repo's current structure and syntax. ¹ nextstrain/ebola@979b2dc
Accessions in **/include.txt updated with the following command:
for FILE in phylogenetic/defaults/{all-lineages,lineage-1A,lineage-2}/include.txt; do
  tail -n +2 ingest/results/metadata.tsv | awk -F'\t' '{print $1"\t"$4}' | while IFS=$'\t' read -r new old; do
    sed -i '' "s/^${old%.*} /${new} /" "$FILE"
  done
done
Command:
tail -n +2 ingest/results/metadata.tsv | awk -F'\t' '{print $1"\t"$4}' | while IFS=$'\t' read -r new old; do
  sed -i '' "s/^${old%.*} /${new} /" "ingest/defaults/annotations.tsv"
done
| insdcRawReadsAccession: sra_accession | ||
| displayName: strain | ||
| geoLocCountry: country | ||
| geoLocAdmin1: division |
Division
Similar to nextstrain/rsv#99, geoLocAdmin1 is problematic because its values vary in granularity and format. Ideally these would be fixed in Pathoplexus, but for now we may have to process them with a custom script. Example:
- Albany Co., Ny
- Ny, Albany
- New York
- New York City
I believe all of these are simply New York in the current dataset.
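A minimal sketch of what such a custom script could look like, assuming a simple lookup table is enough for now. The function name `normalize_division` and the table `DIVISION_ALIASES` are hypothetical, and the four spellings are one reading of the example above, not an exhaustive list from the dataset:

```python
import re

# Hypothetical alias table: lower-cased, whitespace-collapsed raw value -> division.
# Only these New York spellings are taken from the example in this thread.
DIVISION_ALIASES = {
    "albany co., ny": "New York",
    "ny, albany": "New York",
    "new york": "New York",
    "new york city": "New York",
}


def normalize_division(raw: str) -> str:
    """Return the canonical division for a raw geoLocAdmin1 value.

    Values without a known alias are returned unchanged (apart from
    surrounding whitespace) so that nothing is silently rewritten.
    """
    key = re.sub(r"\s+", " ", raw).strip().lower()
    return DIVISION_ALIASES.get(key, raw.strip())


if __name__ == "__main__":
    for value in ["Albany Co., Ny", "Ny, Albany", "New York City", "Texas"]:
        print(f"{value!r} -> {normalize_division(value)!r}")
```

Passing unknown values through unchanged keeps the mapping reviewable: anything not listed in the table still shows up in the output as-is and can be added later or fixed upstream.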
| accessionVersion: PPX_accession | ||
| insdcAccessionFull: INSDC_accession | ||
| insdcRawReadsAccession: sra_accession | ||
| displayName: strain |
Display name
displayName: strain was copied from the mapping in ebola, but strain is not exported in metadata_columns, so this line is unnecessary and should be removed.
If we wanted to provide a proper strain name, it might require something like nextstrain/ebola@de65325. I had started on that in 3b78d9a, but I think that can be a separate effort independent from using Pathoplexus.
| geoLocCountry: country | ||
| geoLocAdmin1: division | ||
| geoLocAdmin2: location | ||
| sampleCollectionDate: date |
Date
There are some date differences in the current preview datasets compared to the live datasets. These have been addressed by properly applying the curated annotations in 34ea83e, but I haven't updated the previews yet.
The remaining differences seem to be improvements:
| accession | NCBI Datasets-based date | Pathoplexus-based date |
|---|---|---|
| PP445046 | 2020-10-27 | |
| OZ185635 | 2016-07-06 | |
| OZ187637 | 2016-07-26 | |
| PQ468649 | 2023-XX-XX | |
| PP580188 | 2021-08-04 | |
| PX056319 | 2023-09-21 | |
| PX453246 | 2025-07-31 | |
| PX453247 | 2025-07-24 | |
| PX453248 | 2025-07-24 | |
| PX453249 | 2025-07-10 | |
| PX453250 | 2025-07-17 | |
| PX453251 | 2025-07-31 | |
| PX453252 | 2025-08-13 | |
| PX453253 | 2025-08-13 |
diffing command
awk -F'\t' '
  NR==FNR {
    if (FNR==1) {
      for (i=1; i<=NF; i++) metaCol[$i]=i
      metaAccIdx=metaCol["accession"]
      metaDateIdx=metaCol["date"]
      next
    }
    metaDate[$(metaAccIdx)]=$(metaDateIdx)
    next
  }
  FNR==1 {
    for (i=1; i<=NF; i++) openCol[$i]=i
    openAccIdx=openCol["INSDC_accession"]
    openDateIdx=openCol["date"]
    print "accession\tmetadata_date\tmetadata_open_date"
    next
  }
  {
    acc=$(openAccIdx)
    sub(/\.[0-9]+$/, "", acc)
    if (acc in metaDate) {
      metaVal=metaDate[acc]
      openVal=$(openDateIdx)
      if (metaVal != openVal)
        printf "%s\t%s\t%s\n", acc, metaVal, openVal
    }
  }
' ingest/results/metadata_ncbi_datasets.tsv ingest/results/metadata_open.tsv
| insdc_accession: 'INSDC_accession' | ||
| # The list of metadata columns to keep in the final output of the curation pipeline. | ||
| metadata_columns: [ |
Exported columns
metadata_columns and ppx_metadata_fields were largely copied from ebola. They should be adjusted to reflect existing usage in WNV (example: lineage should be added).
Unassigned myself since I don't plan to keep working on this at the moment. Anyone can feel free to pick it up!
Description of proposed changes
This PR updates the ingest and phylogenetic workflows to work with Pathoplexus as the main data source.
Previews:
Related issue(s)
Closes #112
Review threads
Checklist