Use Pathoplexus #113

Draft
victorlin wants to merge 5 commits into main from victorlin/use-pathoplexus

@victorlin victorlin commented Oct 16, 2025

Description of proposed changes

This PR updates the ingest and phylogenetic workflows to work with Pathoplexus as the main data source.

Previews:

Related issue(s)

Closes #112

Review threads

Checklist

  • Update example data
  • Audit annotations and geolocation rules since these were based on NCBI Datasets accessions and metadata
  • Checks pass
  • Update changelog
  • Post-merge: update link to commit in Pathoplexus guide

Sync these up with the source code.

In fetch_from_ncbi, the reference to 'config.sources' seems to be outdated –
the code is hardcoded to fetch from GenBank.
Generalize the name before adding rules for other data sources.
This is largely inspired by the ebola ingest¹, which recently switched to
Pathoplexus data. Many parts were copied directly, with adjustments to
conform to this repo's current structure and syntax.

¹ nextstrain/ebola@979b2dc
Accessions in **/include.txt updated with the following command:

    for FILE in phylogenetic/defaults/{all-lineages,lineage-1A,lineage-2}/include.txt; do
      tail -n +2 ingest/results/metadata.tsv | awk -F'\t' '{print $1"\t"$4}' | while IFS=$'\t' read -r new old; do
        sed -i '' "s/^${old%.*} /${new} /" "$FILE"
      done
    done
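
The `${old%.*}` expansion in the loop above removes the shortest trailing `.suffix` from the old accession, so a versioned accession can match an unversioned entry in `include.txt`. A minimal sketch of that behavior, where the function name `strip_version` and the sample version suffix `.1` are made up for illustration:

```shell
# Hypothetical helper: drop the shortest trailing ".something" from an accession.
strip_version() {
  printf '%s\n' "${1%.*}"
}

strip_version "PP445046.1"   # prints PP445046
strip_version "PP445046"     # no dot, so the value is printed unchanged
```

Note that `sed -i ''` is the BSD/macOS form of in-place editing; GNU sed expects `-i` without the empty argument.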
@victorlin victorlin self-assigned this Oct 16, 2025
Command:

    tail -n +2 ingest/results/metadata.tsv | awk -F'\t' '{print $1"\t"$4}' | while IFS=$'\t' read -r new old; do
      sed -i '' "s/^${old%.*} /${new} /" "ingest/defaults/annotations.tsv"
    done
insdcRawReadsAccession: sra_accession
displayName: strain
geoLocCountry: country
geoLocAdmin1: division

Division

Similar to nextstrain/rsv#99, geoLocAdmin1 is problematic because its values vary in granularity and format. Ideally the values would be fixed in Pathoplexus, but for now we may have to process them with a custom script. Example:

  • Albany Co., Ny
  • Ny, Albany
  • New York
  • New York City

I believe all of these are simply New York in the current dataset.
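
One possible shape for such a custom step is an explicit lookup of known variants. The sketch below is only an illustration: the function name `normalize_division` is hypothetical, and the mapping assumes, per the comment above, that all four example values should become `New York`.

```shell
# Hypothetical helper: map known geoLocAdmin1 variants to one division name.
# Assumption: all four values listed above correspond to "New York".
normalize_division() {
  case "$1" in
    "Albany Co., Ny" | "Ny, Albany" | "New York" | "New York City")
      printf '%s\n' "New York" ;;
    *)
      # Unknown values pass through unchanged so they can be reviewed later.
      printf '%s\n' "$1" ;;
  esac
}

normalize_division "Ny, Albany"
```

An explicit list keeps the rule easy to audit, at the cost of needing an update whenever a new variant appears.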

accessionVersion: PPX_accession
insdcAccessionFull: INSDC_accession
insdcRawReadsAccession: sra_accession
displayName: strain

Display name

displayName: strain was copied from the mapping in ebola, but strain is not exported in metadata_columns, so this line is unnecessary and should be removed.
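
A quick way to spot this kind of leftover is to list mapping targets that do not appear among the exported columns. This is a rough sketch, not part of the workflow: the function name `unexported_targets` is hypothetical, and the column list passed in the example is a made-up subset rather than the real `metadata_columns`.

```shell
# Hypothetical helper.
# $1: newline-separated "source: target" mapping lines
# $2: space-separated names of exported columns
unexported_targets() {
  printf '%s\n' "$1" | awk -F': *' -v cols="$2" '
    BEGIN {
      n = split(cols, names, " ")
      for (i = 1; i <= n; i++) keep[names[i]] = 1
    }
    # Print the target of any mapping whose target is not exported.
    NF >= 2 && !($2 in keep) { print $2 }
  '
}

field_map='accessionVersion: PPX_accession
displayName: strain
sampleCollectionDate: date'

# Illustrative column list only.
unexported_targets "$field_map" "PPX_accession date"
```

With this illustrative input, the only target reported is `strain`.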

If we wanted to provide a proper strain name, it might require something like nextstrain/ebola@de65325. I had started on that in 3b78d9a, but I think that can be a separate effort independent from using Pathoplexus.

geoLocCountry: country
geoLocAdmin1: division
geoLocAdmin2: location
sampleCollectionDate: date

Date

There are some date differences in the current preview datasets compared to the live datasets. These have been addressed by properly applying the curated annotations in 34ea83e; I just haven't updated the previews yet.

The remaining differences seem to be improvements:

accession NCBI Datasets-based date Pathoplexus-based date
PP445046 2020-10-27
OZ185635 2016-07-06
OZ187637 2016-07-26
PQ468649 2023-XX-XX
PP580188 2021-08-04
PX056319 2023-09-21
PX453246 2025-07-31
PX453247 2025-07-24
PX453248 2025-07-24
PX453249 2025-07-10
PX453250 2025-07-17
PX453251 2025-07-31
PX453252 2025-08-13
PX453253 2025-08-13
Diffing command:

    awk -F'\t' '
    NR==FNR {
      if (FNR==1) {
        for (i=1; i<=NF; i++) metaCol[$i]=i
        metaAccIdx=metaCol["accession"]
        metaDateIdx=metaCol["date"]
        next
      }
      metaDate[$(metaAccIdx)]=$(metaDateIdx)
      next
    }
    FNR==1 {
      for (i=1; i<=NF; i++) openCol[$i]=i
      openAccIdx=openCol["INSDC_accession"]
      openDateIdx=openCol["date"]
      print "accession\tmetadata_date\tmetadata_open_date"
      next
    }
    {
      acc=$(openAccIdx)
      sub(/\.[0-9]+$/, "", acc)
      if (acc in metaDate) {
        metaVal=metaDate[acc]
        openVal=$(openDateIdx)
        if (metaVal != openVal)
          printf "%s\t%s\t%s\n", acc, metaVal, openVal
      }
    }
    ' ingest/results/metadata_ncbi_datasets.tsv ingest/results/metadata_open.tsv

insdc_accession: 'INSDC_accession'

# The list of metadata columns to keep in the final output of the curation pipeline.
metadata_columns: [

Exported columns

metadata_columns and ppx_metadata_fields were largely copied from ebola. They should be adjusted to reflect existing usage in WNV (example: lineage should be added).

@victorlin victorlin removed their assignment Feb 4, 2026

Unassigned myself since I don't plan to keep working on this at the moment. Anyone can feel free to pick it up!
