Using data sets at https://cami-challenge.org/reference-databases/.
(Data sets are already on farm at ~ctbrown/scratch3/2025-cami-old-db)
Grab RefSeq genomic:
curl -JLO https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES/RefSeq_genomic_20190108.tar
mkdir -p genomes
cd genomes
tar xf ../RefSeq_genomic_20190108.tar
cd ../
Grab taxonomy:
curl -JLO https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES/ncbi_taxonomy.tar
tar xvf ncbi_taxonomy.tar
cd ncbi_taxonomy
tar xzf taxdump.tar.gz
cd ..
Grab accession2taxid from a bunch of places:
curl -JLO https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES/ncbi_taxonomy_accession2taxid.tar
cd ncbi_taxonomy
tar xvf ../ncbi_taxonomy_accession2taxid.tar
curl -JLO https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/wgs.accession2taxid.gz
curl -JLO https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/dead_wgs.accession2taxid.gz
curl -JLO https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/dead_nucl.accession2taxid.gz
cd ../
Grab an interactive session on a compute node for the remaining steps:
srun -p high2 --time=48:00:00 --nodes=1 --cpus-per-task=64 --mem=80GB --pty /bin/bash
Based initially on scripts from https://github.com/dib-lab/2018-ncbi-lineages/.
Make a list of the genomes:
find genomes -type f > genome-list.txt
Then extract nucleotide accessions from the genomes:
./get-seq-acc-for-genomes.py genome-list.txt -o genome-list.accs.csv
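For orientation, here is a minimal sketch of what this step might look like. It assumes that one sequence accession per genome file is enough to look up a taxid, and that the accession is the first whitespace-delimited token of the first FASTA header. The function name `first_seq_accession` is hypothetical; this is not necessarily what get-seq-acc-for-genomes.py does.

```python
import gzip


def first_seq_accession(path):
    """Return the accession of the first FASTA record in `path`, or None.

    Assumption: the accession is the first token after '>' in the header,
    e.g. '>NZ_XXXX01000001.1 some description' -> 'NZ_XXXX01000001.1'.
    """
    # Genome files may or may not be gzipped; pick the opener by extension.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fp:
        for line in fp:
            if line.startswith(">"):
                fields = line[1:].split()
                return fields[0] if fields else None
    # No FASTA header found: this genome would end up in the
    # "missing sequence accession" bucket.
    return None
```

Files with no header line return None, which is one way a genome could end up without a sequence accession.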
Build a combined list of nucleotide accessions to taxids:
./tsv-to-parquet.py \
ncbi_taxonomy/ncbi_taxonomy_accession2taxid/nucl_gb.accession2taxid.gz \
ncbi_taxonomy/wgs.accession2taxid.gz \
ncbi_taxonomy/dead_nucl.accession2taxid.gz \
ncbi_taxonomy/dead_wgs.accession2taxid.gz \
-o accession2taxid.parquet
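As a rough illustration of what gets combined here: NCBI accession2taxid files are gzipped TSVs with a header row of accession, accession.version, taxid, gi. The sketch below reads several such files into one accession.version to taxid mapping using only the standard library. The function names are hypothetical, and an in-memory dict is only reasonable for small test files; the real inputs are very large, which is presumably why the pipeline writes Parquet instead.

```python
import csv
import gzip


def iter_accession2taxid(path):
    """Yield (accession.version, taxid) pairs from one gzipped TSV file."""
    with gzip.open(path, "rt", newline="") as fp:
        reader = csv.DictReader(fp, delimiter="\t")
        for row in reader:
            yield row["accession.version"], int(row["taxid"])


def load_accession2taxid(paths):
    """Merge several accession2taxid files into a single dict.

    Assumption for this sketch: if an accession appears in more than one
    file, the first file listed wins.
    """
    acc2taxid = {}
    for path in paths:
        for acc, taxid in iter_accession2taxid(path):
            acc2taxid.setdefault(acc, taxid)
    return acc2taxid
```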
Get the taxIDs for the sequence accessions from the genome list:
./join-seqacc-taxid.py genome-list.accs.csv accession2taxid.parquet -o genome-list.taxid.parquet
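Conceptually, this is a left join of the genome accessions against the accession-to-taxid table. A minimal sketch with plain dicts follows; the name `join_taxids` and the dict-based inputs are assumptions for illustration, not the interface of join-seqacc-taxid.py.

```python
def join_taxids(genome_accs, acc2taxid):
    """Look up a taxid for each genome's sequence accession.

    genome_accs: dict of genome filename -> sequence accession
    acc2taxid:   dict of sequence accession -> taxid
    Returns (matched, missing): matched maps genome filename -> taxid,
    missing lists genomes whose accession has no taxid.
    """
    matched = {}
    missing = []
    for genome, acc in genome_accs.items():
        taxid = acc2taxid.get(acc)
        if taxid is None:
            missing.append(genome)
        else:
            matched[genome] = taxid
    return matched, missing
```

Keeping the unmatched genomes in a separate list makes it easy to report how many genomes lack a taxid after the join.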
Finally, make lineage & manysketch files:
./make-manysketch-and-lineage.py genome-list.taxid.parquet genome-list.txt \
--nodes ncbi_taxonomy/nodes.dmp --names ncbi_taxonomy/names.dmp \
--output-manysketch-csv manysketch.csv --output-lineage lineages.csv
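To show how a lineage can be derived from the taxdump files, the sketch below parses nodes.dmp (taxid, parent taxid, rank, ...) and names.dmp (taxid, name, unique name, name class), whose fields are separated by tab-pipe-tab, and then walks from a taxid up to the root. The function names and the choice of seven ranks are assumptions for illustration; make-manysketch-and-lineage.py may do this differently.

```python
RANKS = ("superkingdom", "phylum", "class", "order",
         "family", "genus", "species")


def split_dmp(line):
    """Split one taxdump line on the tab-pipe-tab field separator."""
    return line.rstrip("\t|\n").split("\t|\t")


def load_nodes(lines):
    """Return (parents, ranks) dicts keyed by integer taxid."""
    parents, ranks = {}, {}
    for line in lines:
        fields = split_dmp(line)
        taxid = int(fields[0])
        parents[taxid] = int(fields[1])
        ranks[taxid] = fields[2]
    return parents, ranks


def load_names(lines):
    """Return taxid -> scientific name, ignoring synonyms etc."""
    names = {}
    for line in lines:
        fields = split_dmp(line)
        if fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


def lineage_for(taxid, parents, ranks, names):
    """Walk up to the root; return one (taxid, name) or None per rank."""
    found = {}
    while True:
        rank = ranks.get(taxid)
        if rank in RANKS:
            found[rank] = (taxid, names.get(taxid, ""))
        parent = parents.get(taxid)
        # The root node is its own parent; unknown taxids have no parent.
        if parent is None or parent == taxid:
            break
        taxid = parent
    return [found.get(rank) for rank in RANKS]
```

Typical usage would be to open ncbi_taxonomy/nodes.dmp and ncbi_taxonomy/names.dmp, pass the file objects to `load_nodes` and `load_names`, and call `lineage_for` once per genome taxid.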
aaaaaand... build!
sourmash scripts manysketch manysketch.csv -p k=21,k=31,k=51,dna -p skipm1n3 -p skipm2n3 -o cami-refseq-db.sig.zip
A full taxpath is required for outputting bioboxes format from sourmash for CAMI II comparisons.
Here, we use taxonkit to add taxpath information to the lineages file.
To install taxonkit and its Python bindings (pytaxonkit), you can use the conda environment.yml file provided in this repository.
conda env create -f environment.yml
conda activate cami-db
Then, run the script to convert taxids to lineages:
./taxid-to-lineages.taxonkit.py lineages.csv \
--data-dir ./ncbi_taxonomy \
-o lineages.taxpath.csv
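For readers unfamiliar with taxpaths: in the CAMI profiling (bioboxes) format, as we understand it, a taxpath is the list of taxids from superkingdom down to the taxon, joined with `|`, with an empty slot for any intermediate rank that is missing. The sketch below builds such a string from a per-rank list of taxids. The function name `make_taxpath` is hypothetical, and this does not use taxonkit or reproduce taxid-to-lineages.taxonkit.py; it only illustrates the target format.

```python
def make_taxpath(rank_taxids):
    """Join per-rank taxids into a pipe-separated taxpath string.

    rank_taxids: taxids ordered from superkingdom to species, with None
    for ranks that have no assignment. Trailing None entries are dropped
    so the path ends at the most specific assigned rank.
    """
    trimmed = list(rank_taxids)
    while trimmed and trimmed[-1] is None:
        trimmed.pop()
    return "|".join("" if taxid is None else str(taxid) for taxid in trimmed)


# Example: class, order, and family are unassigned here.
print(make_taxpath([2, 1224, None, None, None, 561, 562]))
# prints 2|1224||||561|562
```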
sourmash sig summarize cami-refseq-db.sig.zip
should show:
num signatures: 705715
** examining manifest...
total hashes: 3469770870
summary of sketches:
141143 sketches with DNA, k=21, scaled=1000 579319798 total hashes
141143 sketches with DNA, k=31, scaled=1000 578226823 total hashes
141143 sketches with DNA, k=51, scaled=1000 579472574 total hashes
141143 sketches with skipm1n3, k=21, scaled=1000 576592042 total hashes
141143 sketches with skipm2n3, k=21, scaled=1000 1156159633 total hashes
and the lineages file should have 141,143 rows + 1 header in it:
wc -l lineages.taxpath.csv
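The numbers above are internally consistent: 5 sketch types times 141,143 genomes gives 705,715 signatures, and the per-type hash counts sum to the reported total. A small sketch for re-checking this, plus a CSV-aware row count as an alternative to `wc -l` (which would overcount if any quoted field contained a newline), follows. The constant and function names are hypothetical; the values are copied from the summary output.

```python
import csv

EXPECTED_GENOMES = 141143
SKETCH_TYPES = 5  # DNA k=21, k=31, k=51, plus skipm1n3 and skipm2n3
HASHES_PER_TYPE = [579319798, 578226823, 579472574, 576592042, 1156159633]

# Cross-check the figures reported by `sourmash sig summarize`.
assert SKETCH_TYPES * EXPECTED_GENOMES == 705715
assert sum(HASHES_PER_TYPE) == 3469770870


def count_data_rows(path):
    """Count CSV records in `path`, excluding the header row."""
    with open(path, newline="") as fp:
        reader = csv.reader(fp)
        next(reader, None)  # skip the header, if present
        return sum(1 for _ in reader)
```

With the file in place, `count_data_rows("lineages.taxpath.csv") == EXPECTED_GENOMES` should hold.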
We're missing sequence accessions and/or taxids for about 500 genomes in total: about 200 are missing sequence accessions, and about 300 are missing taxids.