ZipStrain

ZipStrain is a strain-resolution metagenomics toolkit for:

  • profiling mapped reads into per-position nucleotide counts
  • comparing profiles at genome and gene levels
  • running large profiling/comparison jobs in local or Slurm batch mode
  • building local reference-genome databases from abundance outputs (currently Sylph)

Official docs: https://OlmLab.github.io/ZipStrain/

ZipStrain is developed by Parsa Ghadermazi and team at the Olm Lab, University of Colorado Boulder.

Installation

pip install zipstrain
zipstrain test

Detailed setup: docs/installation.md

Quick Start (Current CLI)

1. Prepare profiling assets

zipstrain utilities prepare_profiling \
  --reference-fasta <reference.fasta> \
  --gene-fasta <genes.fna> \
  --stb-file <mapping.stb> \
  --output-dir <profiling_assets_dir>

This creates:

  • genomes_bed_file.bed
  • gene_range_table.tsv
  • genome_lengths.parquet

2. Profile multiple BAM files

Input CSV:

sample_name,bamfile
sample1,/path/to/sample1.bam
sample2,/path/to/sample2.bam
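
If the BAM files sit in one directory, the input table can be generated instead of written by hand. The following is a minimal sketch, not part of ZipStrain: the function name `write_sample_table` is hypothetical, and it assumes each sample name should be the BAM file name without the `.bam` extension.

```python
import csv
from pathlib import Path


def write_sample_table(bam_dir, out_csv):
    """Write a sample_name,bamfile CSV listing every *.bam file in bam_dir."""
    bam_paths = sorted(Path(bam_dir).glob("*.bam"))
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample_name", "bamfile"])
        for bam in bam_paths:
            # Assumption: the sample name is the file name without ".bam".
            writer.writerow([bam.stem, str(bam.resolve())])
    return len(bam_paths)
```

Absolute paths are written so the table stays valid regardless of the directory the profiling command is launched from.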

Run profiling:

zipstrain profile \
  --input-table <samples.csv> \
  --stb-file <mapping.stb> \
  --gene-range-table <profiling_assets_dir/gene_range_table.tsv> \
  --bed-file <profiling_assets_dir/genomes_bed_file.bed> \
  --genome-length-file <profiling_assets_dir/genome_lengths.parquet> \
  --run-dir <profile_run_dir>

Optional execution controls:

  • --execution-mode local|slurm
  • --slurm-config <slurm.json> (required when --execution-mode slurm)
  • --container-engine local|docker|apptainer
  • --num-procs, --task-per-batch, --max-concurrent-batches, --poll-interval

3. Build a profile database for comparisons

Input CSV columns (required):

  • profile_name
  • profile_location
  • reference_db_id
  • gene_db_id

zipstrain utilities build-profile-db \
  --profile-db-csv <profiles.csv> \
  --output-file <profile_db.parquet>
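
The profile CSV can be assembled from a finished profiling run. The sketch below is illustrative only: `write_profile_db_csv` is a hypothetical helper, it relies on the `<run_dir>/batch_*/<sample_name>/` task layout described under Profiling outputs, and it assumes that `profile_location` should point at `<sample_name>.parquet` and that every profile shares the same two database IDs.

```python
import csv
from pathlib import Path


def write_profile_db_csv(run_dir, reference_db_id, gene_db_id, out_csv):
    """Collect <run_dir>/batch_*/<sample>/<sample>.parquet into a profile CSV."""
    rows = []
    for sample_dir in sorted(Path(run_dir).glob("batch_*/*")):
        # Assumption: the main profile is named after its task directory.
        profile = sample_dir / f"{sample_dir.name}.parquet"
        if profile.is_file():  # skips log files and tasks with no profile yet
            rows.append([sample_dir.name, str(profile), reference_db_id, gene_db_id])
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(
            ["profile_name", "profile_location", "reference_db_id", "gene_db_id"]
        )
        writer.writerows(rows)
    return rows
```

The returned rows make it easy to check how many profiles were found before building the database.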

4. Build comparison config objects

null_model settings are no longer required in comparison config objects. Legacy null-model keys in older JSON files are ignored for backward compatibility.

Genome comparison config:

zipstrain utilities build-genome-comparison-config \
  --profile-db <profile_db.parquet> \
  --gene-db-id <gene_db_id> \
  --reference-genome-id <reference_id> \
  --scope all \
  --min-cov 5 \
  --min-gene-compare-len 200 \
  --stb-file-loc <mapping.stb> \
  --output-file <genome_compare.json>

Gene comparison config:

zipstrain utilities build-gene-comparison-config \
  --profile-db <profile_db.parquet> \
  --gene-db-id <gene_db_id> \
  --reference-genome-id <reference_id> \
  --scope all:all \
  --min-cov 5 \
  --min-gene-compare-len 200 \
  --stb-file-loc <mapping.stb> \
  --output-file <gene_compare.json>

5. Run batch comparisons

Genome comparisons:

zipstrain compare genomes \
  --genome-comparison-object <genome_compare.json> \
  --run-dir <compare_run_dir> \
  --engine polars \
  --calculate ani+ibs+identical_genes \
  --duckdb-memory-limit 4GB \
  --duckdb-threads 8

Gene comparisons:

zipstrain compare genes \
  --gene-comparison-object <gene_compare.json> \
  --run-dir <gene_compare_run_dir> \
  --engine duckdb \
  --ani-method popani \
  --duckdb-memory-limit 4GB \
  --duckdb-threads 8

Notes:

  • --engine supports polars or duckdb.
  • --calculate selects the genome metrics to compute: ani, ibs, and identical_genes, joined with + as in the examples above. By default all are computed.
  • In scoped comparisons (--genome or --scope set to anything other than all), the polars engine first prefilters the data with DuckDB.
  • --duckdb-memory-limit and --duckdb-threads are available in both single and batch compare interfaces.

6. Single pair compare (optional)

Genome-level single compare:

zipstrain utilities single_compare_genome \
  --mpileup-contig-1 <sampleA_profile.parquet> \
  --mpileup-contig-2 <sampleB_profile.parquet> \
  --stb-file <mapping.stb> \
  --genome all \
  --calculate ani+ibs+identical_genes \
  --engine duckdb \
  --duckdb-memory-limit 2GB \
  --duckdb-threads 8 \
  --duckdb-temp-directory /tmp \
  --output-file <sampleA_sampleB_comparison.parquet>

Gene-level single compare:

zipstrain utilities single_compare_gene \
  --mpileup-contig-1 <sampleA_profile.parquet> \
  --mpileup-contig-2 <sampleB_profile.parquet> \
  --stb-file <mapping.stb> \
  --scope all:all \
  --engine polars \
  --output-file <sampleA_sampleB_gene_comparison.parquet>

Current Output Files

Profiling outputs

Each profile task produces:

  • <sample_name>.parquet
  • <sample_name>_genome_stats.parquet
  • <sample_name>_gene_stats.parquet

For zipstrain profile, these files are written inside task directories under <run_dir>/batch_*/<sample_name>/.

<sample_name>.parquet columns:

  • chrom, genome, gene, pos, A, C, G, T

Rows are sorted by genome, chrom, pos ascending.

<sample_name>_genome_stats.parquet columns:

  • genome
  • coverage
  • breadth
  • genome_length
  • gap_mean
  • gap_std
  • 5x_cov_sites
  • heterogeneity
  • ber
  • fug
  • reads_mapped

<sample_name>_gene_stats.parquet columns:

  • genome
  • gene
  • length
  • breadth
  • coverage

Genome comparison outputs

Batch runs write final merged results to:

  • <run_dir>/Outputs/all_comparisons.parquet

Columns:

  • Always: genome, sample_1, sample_2
  • If ani requested: total_positions, share_allele_pos, genome_pop_ani
  • If ibs requested: max_consecutive_length
  • If identical_genes requested: shared_genes_count, identical_gene_count, perc_id_genes
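
The column rules above can be written down as a small lookup, which helps when validating a result file in downstream scripts. This is a sketch, not ZipStrain code: `expected_genome_columns` is a hypothetical helper, it assumes metrics are joined with `+` as in the example commands, and it makes no claim about the column order in the actual Parquet file.

```python
# Columns documented for each requested metric.
METRIC_COLUMNS = {
    "ani": ["total_positions", "share_allele_pos", "genome_pop_ani"],
    "ibs": ["max_consecutive_length"],
    "identical_genes": ["shared_genes_count", "identical_gene_count", "perc_id_genes"],
}


def expected_genome_columns(calculate):
    """Return the documented columns for a value such as 'ani+ibs'."""
    requested = calculate.split("+")
    unknown = [metric for metric in requested if metric not in METRIC_COLUMNS]
    if unknown:
        raise ValueError(f"unknown metrics: {unknown}")
    columns = ["genome", "sample_1", "sample_2"]  # always present
    # Follow the documented metric order, not the order of the request.
    for metric, metric_columns in METRIC_COLUMNS.items():
        if metric in requested:
            columns.extend(metric_columns)
    return columns
```

For example, `expected_genome_columns("ani+ibs")` lists the three fixed columns followed by the ANI and IBS columns, and an unrecognized metric name raises `ValueError`.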

Gene comparison outputs

Batch runs write final merged results to:

  • <run_dir>/Outputs/all_gene_comparisons.parquet

Columns:

  • genome
  • gene
  • total_positions
  • share_allele_pos
  • ani
  • sample_1
  • sample_2

Run Directory Layout (Batch Runners)

Comparison runners create structured run directories. Typical files include:

  • <run_dir>/batch_events.log (global progress/event log)
  • <run_dir>/batch_*/batch.log (per-batch log)
  • <run_dir>/Outputs/all_comparisons.parquet or <run_dir>/Outputs/all_gene_comparisons.parquet

Build Reference FASTA/STB from Abundances

zipstrain utilities build-genome-db \
  --tool sylph \
  --abundance-table <sylph_abundance.csv> \
  --cache-dir <genome_cache_dir> \
  --output-dir <reference_output_dir> \
  --download-retries 3 \
  --retry-backoff-seconds 1.0 \
  --download-workers 4

This writes:

  • <reference_output_dir>/reference_genomes.fna
  • <reference_output_dir>/reference_genomes.stb
  • <reference_output_dir>/genome_db_build_report.txt (includes failed accession IDs, if any)

The cache directory stores downloaded genomes and reuses existing files across runs. Only genomes with non-zero abundance in at least one sample are included. For Sylph input, accessions are read from the Genome_file column (GTDB path versions supported). If Genome_file paths are local, they are cached directly before any download fallback.
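
The non-zero-abundance rule can be illustrated with a short sketch. It is not ZipStrain's implementation: `nonzero_accessions` is a hypothetical helper, the abundance column name `Sequence_abundance` is an assumption about the input table, and pulling accessions out of `Genome_file` with a `GCA_`/`GCF_` pattern is only one plausible way to handle GTDB-style paths.

```python
import re

# NCBI assembly accessions look like GCA_000000000.1 or GCF_000000000.1.
ACCESSION_RE = re.compile(r"GC[AF]_\d+\.\d+")


def nonzero_accessions(rows, abundance_column="Sequence_abundance"):
    """Return sorted accessions with abundance > 0 in at least one row.

    Each row is a mapping with a 'Genome_file' entry and an abundance entry
    (the abundance column name is an assumption, see the text above).
    """
    keep = set()
    for row in rows:
        if float(row[abundance_column]) <= 0:
            continue  # zero abundance in this sample
        match = ACCESSION_RE.search(row["Genome_file"])
        if match:
            keep.add(match.group(0))
    return sorted(keep)
```

Rows read with `csv.DictReader` can be passed in directly, since abundance values are converted with `float` before the comparison.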

Detailed walkthrough: docs/GenomeDBFromSylph.md

Nextflow

Nextflow workflow documentation:

Pipeline entrypoint in this repository:

  • zipstrain.nf

Additional Documentation