ZipStrain is a strain-resolution metagenomics toolkit for:
- profiling mapped reads into per-position nucleotide counts
- comparing profiles at genome and gene levels
- running large profiling/comparison jobs in local or Slurm batch mode
- building local reference-genome databases from abundance outputs (currently Sylph)
Official docs: https://OlmLab.github.io/ZipStrain/
ZipStrain is developed by Parsa Ghadermazi and team at the Olm Lab, University of Colorado Boulder.
pip install zipstrain
zipstrain test

Detailed setup: docs/installation.md
zipstrain utilities prepare_profiling \
--reference-fasta <reference.fasta> \
--gene-fasta <genes.fna> \
--stb-file <mapping.stb> \
--output-dir <profiling_assets_dir>

This creates:
- genomes_bed_file.bed
- gene_range_table.tsv
- genome_lengths.parquet
Input CSV:
sample_name,bamfile
sample1,/path/to/sample1.bam
sample2,/path/to/sample2.bam

Run profiling:
zipstrain profile \
--input-table <samples.csv> \
--stb-file <mapping.stb> \
--gene-range-table <profiling_assets_dir/gene_range_table.tsv> \
--bed-file <profiling_assets_dir/genomes_bed_file.bed> \
--genome-length-file <profiling_assets_dir/genome_lengths.parquet> \
--run-dir <profile_run_dir>

Optional execution controls:
- --execution-mode local|slurm
- --slurm-config <slurm.json> (required when --execution-mode slurm)
- --container-engine local|docker|apptainer
- --num-procs, --task-per-batch, --max-concurrent-batches, --poll-interval
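The input table for zipstrain profile can be generated rather than typed by hand. The following is a minimal sketch, assuming all BAM files sit in one directory and that each file stem is an acceptable sample name; the helper name write_sample_table is hypothetical and not part of ZipStrain.

```python
import csv
from pathlib import Path


def write_sample_table(bam_dir, output_csv):
    """Write a sample_name,bamfile table for every *.bam file in bam_dir."""
    bam_paths = sorted(Path(bam_dir).glob("*.bam"))
    with open(output_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample_name", "bamfile"])
        for bam in bam_paths:
            # Assumption: the file stem (name without .bam) is used as the sample name.
            writer.writerow([bam.stem, str(bam.resolve())])
    # Return the sample names so the caller can check what was written.
    return [bam.stem for bam in bam_paths]
```

The resulting file can then be passed to --input-table.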
Input CSV columns (required):
- profile_name
- profile_location
- reference_db_id
- gene_db_id
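A small script can write this CSV and catch missing fields before the database is built. This is a sketch only: the four column names come from the list above, while the helper name write_profile_db_csv and the idea of rejecting empty values are assumptions for illustration.

```python
import csv

# Required columns, as listed above.
REQUIRED_COLUMNS = ["profile_name", "profile_location", "reference_db_id", "gene_db_id"]


def write_profile_db_csv(rows, output_csv):
    """Write profile rows to CSV after checking every required column is filled."""
    for index, row in enumerate(rows):
        missing = [col for col in REQUIRED_COLUMNS if not row.get(col)]
        if missing:
            raise ValueError(f"row {index} is missing: {', '.join(missing)}")
    with open(output_csv, "w", newline="") as handle:
        # Extra keys in a row are dropped so the file holds only the required columns.
        writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

Each row is a plain dict keyed by the column names; the values themselves (paths, database IDs) are whatever your own project uses.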
zipstrain utilities build-profile-db \
--profile-db-csv <profiles.csv> \
--output-file <profile_db.parquet>

null_model settings are no longer required for comparison config objects. Legacy null-model keys in older JSON files are ignored for backward compatibility.
Genome comparison config:
zipstrain utilities build-genome-comparison-config \
--profile-db <profile_db.parquet> \
--gene-db-id <gene_db_id> \
--reference-genome-id <reference_id> \
--scope all \
--min-cov 5 \
--min-gene-compare-len 200 \
--stb-file-loc <mapping.stb> \
--output-file <genome_compare.json>

Gene comparison config:
zipstrain utilities build-gene-comparison-config \
--profile-db <profile_db.parquet> \
--gene-db-id <gene_db_id> \
--reference-genome-id <reference_id> \
--scope all:all \
--min-cov 5 \
--min-gene-compare-len 200 \
--stb-file-loc <mapping.stb> \
--output-file <gene_compare.json>

Genome comparisons:
zipstrain compare genomes \
--genome-comparison-object <genome_compare.json> \
--run-dir <compare_run_dir> \
--engine polars \
--calculate ani+ibs+identical_genes \
--duckdb-memory-limit 4GB \
--duckdb-threads 8

Gene comparisons:
zipstrain compare genes \
--gene-comparison-object <gene_compare.json> \
--run-dir <gene_compare_run_dir> \
--engine duckdb \
--ani-method popani \
--duckdb-memory-limit 4GB \
--duckdb-threads 8

Notes:
- --engine supports polars or duckdb.
- --calculate controls genome metrics: ani, ibs, identical_genes (all supported). Default is all.
- In scoped comparisons (--genome or --scope not all), the polars path uses DuckDB prefiltering first.
- --duckdb-memory-limit and --duckdb-threads are available in both single and batch compare interfaces.
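When comparison commands are assembled from a script, it can help to build the --calculate value from a list of metrics and reject typos early. The sketch below assumes the "+"-joined form shown in the commands above and treats an empty selection as all; the helper name build_calculate_value is hypothetical, and ZipStrain's own validation may differ.

```python
# Metric names as listed in the notes above.
GENOME_METRICS = ("ani", "ibs", "identical_genes")


def build_calculate_value(metrics=None):
    """Return a value for --calculate, e.g. 'ani+ibs', or 'all' when nothing is selected."""
    if not metrics:
        return "all"
    unknown = sorted(set(metrics) - set(GENOME_METRICS))
    if unknown:
        raise ValueError(f"unknown metric(s): {', '.join(unknown)}")
    # Emit metrics in a fixed order and drop duplicates.
    return "+".join(m for m in GENOME_METRICS if m in metrics)
```

For example, build_calculate_value(["ibs", "ani"]) yields the string ani+ibs.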
Genome-level single compare:
zipstrain utilities single_compare_genome \
--mpileup-contig-1 <sampleA_profile.parquet> \
--mpileup-contig-2 <sampleB_profile.parquet> \
--stb-file <mapping.stb> \
--genome all \
--calculate ani+ibs+identical_genes \
--engine duckdb \
--duckdb-memory-limit 2GB \
--duckdb-threads 8 \
--duckdb-temp-directory /tmp \
--output-file <sampleA_sampleB_comparison.parquet>

Gene-level single compare:
zipstrain utilities single_compare_gene \
--mpileup-contig-1 <sampleA_profile.parquet> \
--mpileup-contig-2 <sampleB_profile.parquet> \
--stb-file <mapping.stb> \
--scope all:all \
--engine polars \
--output-file <sampleA_sampleB_gene_comparison.parquet>

Each profile task produces:
- <sample_name>.parquet
- <sample_name>_genome_stats.parquet
- <sample_name>_gene_stats.parquet
For zipstrain profile, these files are written inside task directories under <run_dir>/batch_*/<sample_name>/.
<sample_name>.parquet columns:
chrom,genome,gene,pos,A,C,G,T
Rows are sorted by genome, chrom, pos ascending.
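To make the profile layout concrete, the sketch below works on rows represented as plain dicts with the columns listed above. It assumes that total depth at a position is the sum of the four base counts, and it checks the documented sort order. Reading the parquet file itself needs a parquet reader such as polars or pyarrow and is not shown; the example values are made up.

```python
def position_depth(row):
    """Total depth at one position, assumed to be A + C + G + T."""
    return row["A"] + row["C"] + row["G"] + row["T"]


def is_sorted(rows):
    """Check that rows are ordered by genome, then chrom, then pos, ascending."""
    keys = [(row["genome"], row["chrom"], row["pos"]) for row in rows]
    return keys == sorted(keys)


# Made-up rows for illustration only.
example_rows = [
    {"chrom": "contig_1", "genome": "genome_A", "gene": "gene_1", "pos": 10, "A": 7, "C": 0, "G": 1, "T": 0},
    {"chrom": "contig_1", "genome": "genome_A", "gene": "gene_1", "pos": 11, "A": 0, "C": 9, "G": 0, "T": 0},
]
```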
<sample_name>_genome_stats.parquet columns:
- genome
- coverage
- breadth
- genome_length
- gap_mean
- gap_std
- 5x_cov_sites
- heterogeneity
- ber
- fug
- reads_mapped
<sample_name>_gene_stats.parquet columns:
- genome
- gene
- length
- breadth
- coverage
Batch runs write final merged results to:
<run_dir>/Outputs/all_comparisons.parquet
Columns:
- Always: genome, sample_1, sample_2
- If ani requested: total_positions, share_allele_pos, genome_pop_ani
- If ibs requested: max_consecutive_length
- If identical_genes requested: shared_genes_count, identical_gene_count, perc_id_genes
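The column rules above can be written down as a lookup, which is handy for checking a merged output against the metrics that were requested. This is a sketch: the column names are taken from the list above, the function name expected_columns is hypothetical, and the order of columns in the actual file is not guaranteed to match.

```python
# Columns present in every genome comparison output.
ALWAYS_COLUMNS = ["genome", "sample_1", "sample_2"]

# Extra columns contributed by each requested metric.
METRIC_COLUMNS = {
    "ani": ["total_positions", "share_allele_pos", "genome_pop_ani"],
    "ibs": ["max_consecutive_length"],
    "identical_genes": ["shared_genes_count", "identical_gene_count", "perc_id_genes"],
}


def expected_columns(calculate="all"):
    """List the columns expected for a --calculate value such as 'ani+ibs' or 'all'."""
    requested = list(METRIC_COLUMNS) if calculate == "all" else calculate.split("+")
    columns = list(ALWAYS_COLUMNS)
    for metric in requested:
        if metric not in METRIC_COLUMNS:
            raise ValueError(f"unknown metric: {metric}")
        columns.extend(METRIC_COLUMNS[metric])
    return columns
```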
Batch runs write final merged results to:
<run_dir>/Outputs/all_gene_comparisons.parquet
Columns:
- genome
- gene
- total_positions
- share_allele_pos
- ani
- sample_1
- sample_2
Comparison runners create structured run directories. Typical files include:
- <run_dir>/batch_events.log (global progress/event log)
- <run_dir>/batch_*/batch.log (per-batch log)
- <run_dir>/Outputs/all_comparisons.parquet or <run_dir>/Outputs/all_gene_comparisons.parquet
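A quick way to see how far a run has progressed is to look for the files listed above. The following sketch relies only on those documented paths; the function name summarize_run_dir and the shape of the returned dict are assumptions for illustration.

```python
from pathlib import Path


def summarize_run_dir(run_dir):
    """Report which of the typical run-directory files are present."""
    run_dir = Path(run_dir)
    outputs = run_dir / "Outputs"
    merged_names = ("all_comparisons.parquet", "all_gene_comparisons.parquet")
    return {
        # Global progress/event log.
        "event_log": (run_dir / "batch_events.log").exists(),
        # Names of batch directories that already contain a per-batch log.
        "batches_with_log": sorted(p.parent.name for p in run_dir.glob("batch_*/batch.log")),
        # Final merged result files that exist so far.
        "merged_outputs": [name for name in merged_names if (outputs / name).exists()],
    }
```

An empty merged_outputs list suggests the run has not yet written its final merged result.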
zipstrain utilities build-genome-db \
--tool sylph \
--abundance-table <sylph_abundance.csv> \
--cache-dir <genome_cache_dir> \
--output-dir <reference_output_dir> \
--download-retries 3 \
--retry-backoff-seconds 1.0 \
--download-workers 4

This writes:
- <reference_output_dir>/reference_genomes.fna
- <reference_output_dir>/reference_genomes.stb
- <reference_output_dir>/genome_db_build_report.txt (includes failed accession IDs, if any)
The cache directory stores downloaded genomes and reuses existing files across runs.
Only genomes with non-zero abundance in at least one sample are included.
For Sylph input, accessions are read from the Genome_file column (GTDB path versions supported).
If Genome_file paths are local, they are cached directly before any download fallback.
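The selection rules above can be illustrated with a short sketch. It is not ZipStrain's actual parsing: it assumes accessions follow the NCBI assembly pattern (GCA_ or GCF_, nine digits, a dot, and a version number) somewhere in the Genome_file path, and that abundances are available as one list of per-sample values per genome. Both helper names are hypothetical.

```python
import re

# NCBI assembly accession, e.g. GCF_000005845.2 (assumed to appear in the path).
ACCESSION_PATTERN = re.compile(r"GC[AF]_\d{9}\.\d+")


def accession_from_genome_file(genome_file):
    """Return the first assembly accession found in a Genome_file value, or None."""
    match = ACCESSION_PATTERN.search(genome_file)
    return match.group(0) if match else None


def genomes_to_include(abundance_by_genome):
    """Keep genomes whose abundance is non-zero in at least one sample."""
    return sorted(
        genome
        for genome, values in abundance_by_genome.items()
        if any(value > 0 for value in values)
    )
```

A value with no recognizable accession returns None, which a caller could treat as a local path or report as a failure.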
Detailed walkthrough: docs/GenomeDBFromSylph.md
Nextflow workflow documentation:
Pipeline entrypoint in this repository:
zipstrain.nf
- CLI reference: docs/cli.md
- End-to-end tutorial: docs/Tutorial.md
- API notes: docs/api.md