HYMET (Hybrid Metagenomic Tool)


HYMET performs contig-level metagenomic classification by combining Mash-based candidate selection, minimap2 alignment, and a weighted-LCA resolver. The repository includes the classifier, the CAMI benchmark harness, real-data case-study tooling, and auxiliary scripts.

Feature Snapshot

  • Candidate filtering – Mash containment scores cap the number of references passed to minimap2 (5000 by default; the CAMI harness overrides this to 1500).
  • CLI workflows – bin/hymet provides run, bench, case, ablation, truth build-zymo, artifacts, version, and legacy subcommands with consistent metadata outputs.
  • Benchmark automation – The CAMI harness produces evaluation tables, runtime logs, and figures from a single driver script.
  • Case-study tooling – Dedicated scripts execute MGnify and Zymo contig workflows and perform reference ablation experiments.
  • Deployment options – Install via Bioconda, Docker/Singularity images, or a source checkout with the supplied environment file.
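The weighted-LCA resolver named above can be illustrated with a short sketch. This is not HYMET's implementation; the function name, the lineage-tuple input, and the 0.75 vote threshold are assumptions chosen for the example. The idea: assign each contig to the deepest taxon that still holds a qualified majority of the total alignment weight.

```python
from collections import defaultdict

def weighted_lca(hits, threshold=0.75):
    """Resolve one contig's hits to the deepest taxon holding at least
    `threshold` of the total alignment weight.

    hits: list of (lineage, weight), where lineage is a tuple of taxa
    ordered root -> leaf, e.g. ("Bacteria", "Firmicutes", "Bacillus").
    """
    total = sum(w for _, w in hits)
    if total == 0:
        return ()
    assignment = ()
    depth = 0
    while True:
        # Tally weight per taxon at this depth, restricted to lineages
        # consistent with the assignment made so far.
        votes = defaultdict(float)
        for lineage, w in hits:
            if len(lineage) > depth and lineage[:depth] == assignment:
                votes[lineage[depth]] += w
        if not votes:
            return assignment
        taxon, weight = max(votes.items(), key=lambda kv: kv[1])
        if weight / total < threshold:
            return assignment  # no qualified majority: stop at this rank
        assignment = assignment + (taxon,)
        depth += 1
```

With an even 50/50 split at phylum level, the call resolves only to the shared root; a 90/10 split resolves all the way down the majority lineage.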

(Figures: HYMET F1 by taxonomic rank (RUN_0); HYMET case-study runtime summary, canonical suite.)

What’s Included

  • bench/ – CAMI benchmark harness, database builders, evaluation, and plotting scripts.
  • case/ – Real-data case study runner plus reference ablation tooling.
  • workflows/ – High-level runners (e.g., the CAMI suite) that stage artefacts under results/<scenario>/<suite>/.
  • bin/ – Python CLI entry points (preferred interface for new workflows).
  • scripts/ – Legacy Perl/Bash helpers retained for reproducibility (main.pl, config.pl, etc.).
  • testdataset/ – Utilities to assemble small synthetic evaluation sets.
  • data/, taxonomy_files/ – Expected locations for downloaded references and taxonomy dumps.

Quick Start (First-Time Setup)

After cloning the repository and creating your conda environment:

# 1. Create and activate environment
mamba env create -f environment.yml
conda activate hymet_env

# 2. Initialize HYMET (downloads NCBI taxonomy, creates stubs, verifies installation)
bin/hymet init

# 3. Fetch Mash sketches from Zenodo (required for classification)
tools/fetch_sketches.sh

# 4. Verify everything is ready
bin/hymet init

The init command automatically downloads NCBI taxonomy (~60MB) and generates the required hierarchy file. Use --skip-taxonomy to skip this if you already have the files or want to set them up manually.

Quick Start Commands

# Single-sample classification
bin/hymet run --contigs /path/to/sample.fna --out results/sample --threads 16

# CAMI benchmark (HYMET + baselines; preset 'contigs' runs the default panel)
bin/hymet bench --manifest bench/cami_manifest.tsv --tools contigs --threads 16

# Case-study bundle (MGnify gut + Zymo mock community)
./case/run_cases_full.sh --threads 16   # canonical + gut + zymo suites; add --suite to limit the run
# or run a single manifest via the CLI:
bin/hymet case --manifest case/manifest_zymo.tsv --threads 8

# Reference ablation experiment
bin/hymet ablation --sample zymo_mc --taxa 1423,562 --levels 0,0.5,1.0 --threads 4

# Refresh supplementary tables and figures
bin/hymet artifacts

bin/hymet auto-detects HYMET_ROOT. Export it explicitly (export HYMET_ROOT=/path/to/HYMET) if you prefer running from arbitrary directories. The legacy Perl entry point remains available as bin/hymet legacy -- ….

Using HYMET Beyond Benchmarks

HYMET’s run subcommand is the supported way to classify your own assemblies or read sets; the CAMI harness is just a bundled example. A typical ad-hoc run looks like this:

conda activate hymet_env
export HYMET_ROOT=/path/to/HYMET      # optional when running from a cloned repo
# replace `bin/hymet` with `hymet` if the package is installed into the active environment
bin/hymet run \
  --contigs my_assembly.fna \
  --out results/my_assembly \
  --threads 32 \
  --cand-max 500 \
  --species-dedup
  1. Prepare inputs – Provide a contig FASTA via --contigs or a read FASTA/FASTQ via --reads. Place the pre-built Mash sketches under HYMET/data/ (see Preparing Data below) and point taxonomy_files/ at a fresh NCBI dump. HYMET will download any missing reference genomes into CACHE_ROOT on demand.
  2. Launch classification – hymet run stages the sample under OUTDIR, copies your FASTA/FASTQ into OUTDIR/input/, screens candidates with Mash, downloads the needed references (or reuses a cache hit), aligns with minimap2, and resolves calls with the weighted LCA resolver. Tweak --cand-max, --species-dedup, --threads, --cache-root, or --assembly-summary-dir to fit your hardware and naming conventions. When running via the benchmark harness, add --keep-work if you want it to retain intermediates under OUTDIR/work/ for debugging.
  3. Consume outputs – Every run writes:
    • OUTDIR/classified_sequences.tsv – one row per contig/read with lineage, rank, TaxID, and confidence (compatible with downstream Krona/plots).
    • OUTDIR/hymet.sample_0.cami.tsv – CAMI-format profile built from the classification table (rename or symlink as desired for multi-sample batches).
    • OUTDIR/metadata.json – reproducibility snapshot (HYMET commit, sketch checksums, cache key, tool versions, tunables).
    • OUTDIR/logs/ plus OUTDIR/work/ – diagnostics and reusable intermediates. For benchmark runs, pass --keep-work (or export KEEP_HYMET_WORK=1) to prevent the harness from deleting minimap2 artefacts; direct bin/hymet run invocations keep them by default.
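As a quick sanity check on a finished run, the classification table can be summarised with a few lines of Python. The column names used here ('rank', 'confidence') are assumptions based on the description above; check the header of your own classified_sequences.tsv before relying on them.

```python
import csv
from collections import Counter

def rank_counts(path, min_confidence=0.0):
    """Tally classified contigs per assigned rank from a
    classified_sequences.tsv-style file.

    Assumes a tab-separated header with at least 'rank' and
    'confidence' columns (assumption -- verify against your output).
    """
    counts = Counter()
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if float(row.get("confidence", 0) or 0) >= min_confidence:
                counts[row.get("rank", "unranked")] += 1
    return counts
```

Raising min_confidence gives a quick feel for how many calls survive a stricter cutoff before any downstream Krona plotting.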

For throughput runs, iterate over samples in a simple shell loop or a workflow manager, changing only --contigs/--reads and --out per sample; the cache key (hash of selected_genomes.txt) lets multiple runs share downloaded references automatically.
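The per-sample loop described above can equally be driven from Python. This is an illustrative sketch (the run_batch helper is not part of HYMET); it builds one bin/hymet run command per sample, varying only --contigs and --out so that the reference cache is shared across runs:

```python
import subprocess
from pathlib import Path

def run_batch(samples, out_root="results", threads=16,
              hymet="bin/hymet", dry_run=False):
    """Invoke `bin/hymet run` once per FASTA in `samples`.

    Only --contigs and --out change between invocations, so all runs
    reuse the same downloaded-reference cache. Set dry_run=True to
    inspect the commands without executing them.
    """
    cmds = []
    for fasta in samples:
        out = Path(out_root) / Path(fasta).stem  # one out dir per sample
        cmd = [hymet, "run",
               "--contigs", str(fasta),
               "--out", str(out),
               "--threads", str(threads)]
        cmds.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return cmds
```

A workflow manager (Snakemake, Nextflow) achieves the same effect; the key point is that only the input and output paths vary per sample.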

Cache note: HYMET uses a two-level cache: per-sample genome FASTAs under CACHE_ROOT/<sha1>/ and shared NCBI assembly summaries (~1.6 GB) under ASSEMBLY_SUMMARY_DIR. Summaries are refreshed every 14 days and reused across all runs to avoid redundant downloads.
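For intuition, the per-sample cache key can be sketched as a SHA-1 digest of the candidate list; the exact derivation is an assumption here (the authoritative scheme lives in the HYMET source), but it shows why two samples selecting the same genomes land in the same cache directory.

```python
import hashlib
from pathlib import Path

def cache_key(selected_genomes: Path) -> str:
    """Illustrative sketch (not HYMET's exact scheme): hash the
    candidate-genome list so identical selections map to the same
    CACHE_ROOT/<sha1>/ directory."""
    return hashlib.sha1(selected_genomes.read_bytes()).hexdigest()
```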

Reproducing CAMI suites

  • Follow the detailed playbook in docs/reproducibility.md for the original manuscript run. The published artefacts now live under results/cami/canonical/run_<timestamp>/ (raw outputs, tables, figures, metadata).

  • Use workflows/run_cami_suite.sh to stage new CAMI experiments. Every invocation creates results/<scenario>/<suite>/run_<timestamp>/ and fills it with raw/, tables/, figures/, and metadata.json. For example:

    THREADS=8 CACHE_ROOT=data/downloaded_genomes/cache_bench \
    workflows/run_cami_suite.sh \
      --scenario cami \
      --suite contig_full

    Tool panels default to the selections in workflows/config/cami_suite.cfg; provide --contig-tools and/or --read-tools to override when experimenting. All artefacts for that run appear in results/cami/contig_full/run_<timestamp>/; nothing under bench/out/ or results/cami/canonical/ is touched.

Installation Options

  • Bioconda / mamba – mamba install -c bioconda hymet. Installs the CLI and dependencies into the active environment.
  • Docker – docker build -t hymet . followed by docker run --rm -it hymet hymet --help. The image bundles the benchmark harness; bind data/cache directories as needed.
  • Singularity / Apptainer – apptainer build hymet.sif Singularity.def followed by apptainer exec hymet.sif hymet --help. Mirrors the Docker build for HPC clusters.
  • Source checkout – git clone https://github.com/ieeta-pt/HYMET.git, then cd HYMET and mamba env create -f environment.yml. Recommended for development; activate the environment before using bin/hymet. For exact pins, use environment.lock.yml.

Preparing Data

  1. References & taxonomy – Download the Mash sketches from the Zenodo archive (10.5281/zenodo.17428354) with:
    tools/fetch_sketches.sh    # defaults to the Zenodo record + checksum verification
    tools/verify_sketches.sh   # optional: confirm local files match the archive
    Place NCBI taxonomy dumps under HYMET/taxonomy_files/. Builders in bench/db/ derive tool-specific indices on demand.
  2. CAMI subsets – Use bench/fetch_cami.sh (supports --dry-run) to download the contigs listed in bench/cami_manifest.tsv.
  3. Case-study contigs – case/fetch_case_data.sh downloads the Zymo mock community assembly (http://nanopore.s3.climb.ac.uk/mockcommunity/v3/7cd60d3b-eafb-48d1-9aab-c8701232f2f8.ctg.cns.fa) and the MGnify gut contigs (https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00794604/file/ERZ24911249_FASTA.fasta.gz) and stages them under /data/case/.
  4. Truth tables – CAMI truth lives under bench/data/; case-study truth files (including curated Zymo panels) live under case/truth/.

Outputs at a Glance

  • Canonical CAMI artefacts: results/cami/canonical/run_<timestamp>/ (raw benchmark outputs, summary tables, figures, metadata).
  • Reviewer suites: results/cami/<suite>/run_<timestamp>/ (one folder per run). Raw per-tool outputs are grouped by mode; derived tables/figures live alongside metadata for immediate inspection.
  • Case studies and ablations follow the same pattern under results/cases/…/run_<timestamp>/ and results/ablation/…/run_<timestamp>/.
  • Bench runs write intermediate outputs to BENCH_OUT_ROOT (defaults to bench/out/); unless you pass --no-publish, a snapshot is also mirrored to results/<scenario>/<suite>/run_<timestamp>/…. Case workflows follow the same convention: case/run_case.sh (and bin/hymet case) publish into results/cases/<suite>/run_<timestamp>/… unless you override the destination with --out.
  • Use python bench/plot/make_figures.py --bench-root bench --tables <run>/tables --outdir <run>/figures to regenerate figures for any run (keeps data and plots in sync).
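The run_<timestamp> layout used throughout results/ can be sketched in a few lines; the timestamp format and the run_dir helper name are assumptions for illustration, not part of HYMET's API.

```python
from datetime import datetime
from pathlib import Path

def run_dir(scenario, suite, root="results"):
    """Create a results/<scenario>/<suite>/run_<timestamp>/ directory
    with the raw/, tables/, and figures/ subfolders described above.
    The timestamp format is an assumption for this sketch."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    d = Path(root) / scenario / suite / f"run_{stamp}"
    for sub in ("raw", "tables", "figures"):
        (d / sub).mkdir(parents=True, exist_ok=True)
    return d
```

Because every invocation gets a fresh timestamped folder, new experiments never overwrite the canonical artefacts.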

Documentation & Reporting

  • CAMI harness details: bench/README.md, latest metrics: bench/results_summary.md.
  • Case-study workflows: case/README.md, results recap: case/results_summary.md.
  • Reproducibility playbook: docs/reproducibility.md.

Repository Layout

HYMET/
├── bin/                 # CLI entry points (Python)
├── bench/               # CAMI harness (runners, builders, plots)
├── case/                # Real-data case study + ablation toolkit
├── docs/                # Additional guides
├── results/             # Canonical artefacts (cami, cases, ablation, …)
├── workflows/           # Repro runners that populate results/<scenario>/<suite>/
├── scripts/             # Legacy helpers (Perl/Bash)
├── testdataset/         # Synthetic dataset utilities
└── data/, taxonomy_files/, …  # Downloaded references and taxonomy dumps

The maintained workflow is through the Python CLI. Legacy scripts (config.pl, main.pl, scripts/*.sh) are retained for historical pipelines but no longer required for fresh runs.

Support & Citation

  • Open issues and feature requests on the GitHub tracker.
  • Cite HYMET using CITATION.cff in this repository.

About

HYMET: A Hybrid Metagenomic Pipeline for Accurate and Efficient Taxonomic Classification
