HYMET (Hybrid Metagenomic Tool)


HYMET performs contig-level metagenomic classification by combining Mash-based candidate selection, minimap2 alignment, and a weighted-LCA resolver. The repository includes the classifier, the CAMI benchmark harness, real-data case-study tooling, and auxiliary scripts.

Feature Snapshot

  • Candidate filtering – Mash containment scores cap the number of references passed to minimap2 (5000 by default; the CAMI harness overrides this to 1500).
  • CLI workflows – bin/hymet provides run, bench, case, ablation, truth build-zymo, artifacts, version, and legacy subcommands with consistent metadata outputs.
  • Benchmark automation – The CAMI harness produces evaluation tables, runtime logs, and figures from a single driver script.
  • Case-study tooling – Dedicated scripts execute MGnify and Zymo contig workflows and perform reference ablation experiments.
  • Deployment options – Install via Bioconda, Docker/Singularity images, or a source checkout with the supplied environment file.
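The weighted-LCA resolver named above can be illustrated with a short sketch. This is not HYMET's implementation; the function name, the lineage-tuple input, and the 0.75 vote threshold are assumptions chosen for the example. The idea: assign each contig to the deepest taxon that still holds a qualified majority of the total alignment weight.

```python
from collections import defaultdict

def weighted_lca(hits, threshold=0.75):
    """Resolve one contig's hits to the deepest taxon holding at least
    `threshold` of the total alignment weight.

    hits: list of (lineage, weight), where lineage is a tuple of taxa
    ordered root -> leaf, e.g. ("Bacteria", "Firmicutes", "Bacillus").
    """
    total = sum(w for _, w in hits)
    if total == 0:
        return ()
    assignment = ()
    depth = 0
    while True:
        # Tally weight per taxon at this depth, restricted to lineages
        # consistent with the assignment made so far.
        votes = defaultdict(float)
        for lineage, w in hits:
            if len(lineage) > depth and lineage[:depth] == assignment:
                votes[lineage[depth]] += w
        if not votes:
            return assignment
        taxon, weight = max(votes.items(), key=lambda kv: kv[1])
        if weight / total < threshold:
            return assignment  # no qualified majority: stop at this rank
        assignment = assignment + (taxon,)
        depth += 1
```

With an even 50/50 split at phylum level, the call resolves only to the shared root; a 90/10 split resolves all the way down the majority lineage.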

(Figures: HYMET F1 by taxonomic rank (RUN_0); HYMET case-study runtime summary, canonical suite.)

What’s Included

  • bench/ – CAMI benchmark harness, database builders, evaluation, and plotting scripts.
  • case/ – Real-data case study runner plus reference ablation tooling.
  • workflows/ – High-level runners (e.g., the CAMI suite) that stage artefacts under results/<scenario>/<suite>/.
  • bin/ – Python CLI entry points (preferred interface for new workflows).
  • scripts/ – Legacy Perl/Bash helpers retained for reproducibility (main.pl, config.pl, etc.).
  • testdataset/ – Utilities to assemble small synthetic evaluation sets.
  • data/, taxonomy_files/ – Expected locations for downloaded references and taxonomy dumps.

Quick Start (First-Time Setup)

After cloning the repository and creating your conda environment:

# 1. Create and activate environment
mamba env create -f environment.yml
conda activate hymet_env

# 2. Initialize HYMET (downloads NCBI taxonomy, creates stubs, verifies installation)
bin/hymet init

# 3. Fetch Mash sketches from Zenodo (required for classification)
tools/fetch_sketches.sh

# 4. Verify everything is ready
bin/hymet init

The init command automatically downloads NCBI taxonomy (~60MB) and generates the required hierarchy file. Use --skip-taxonomy to skip this if you already have the files or want to set them up manually.

Quick Start Commands

# Single-sample classification
bin/hymet run --contigs /path/to/sample.fna --out results/sample --threads 16

# CAMI benchmark (HYMET + baselines; preset 'contigs' runs the default panel)
bin/hymet bench --manifest bench/cami_manifest.tsv --tools contigs --threads 16

# Case-study bundle (MGnify gut + Zymo mock community)
./case/run_cases_full.sh --threads 16   # canonical + gut + zymo suites; add --suite to limit the run
# or run a single manifest via the CLI:
bin/hymet case --manifest case/manifest_zymo.tsv --threads 8

# Reference ablation experiment
bin/hymet ablation --sample zymo_mc --taxa 1423,562 --levels 0,0.5,1.0 --threads 4

# Refresh supplementary tables and figures
bin/hymet artifacts

bin/hymet auto-detects HYMET_ROOT. Export it explicitly (export HYMET_ROOT=/path/to/HYMET) if you prefer running from arbitrary directories. The legacy Perl entry point remains available as bin/hymet legacy -- ….

Using HYMET Beyond Benchmarks

HYMET’s run subcommand is the supported way to classify your own assemblies or read sets; the CAMI harness is just a bundled example. A typical ad-hoc run looks like this:

conda activate hymet_env
export HYMET_ROOT=/path/to/HYMET      # optional when running from a cloned repo
# replace `bin/hymet` with `hymet` if the package is installed into the active environment
bin/hymet run \
  --contigs my_assembly.fna \
  --out results/my_assembly \
  --threads 32 \
  --cand-max 500 \
  --species-dedup
  1. Prepare inputs – Provide a contig FASTA via --contigs or a read FASTA/FASTQ via --reads. Place the pre-built Mash sketches under HYMET/data/ (see Preparing Data below) and point taxonomy_files/ at a fresh NCBI dump. HYMET will download any missing reference genomes into CACHE_ROOT on demand.
  2. Launch classification – hymet run stages the sample under OUTDIR, copies your FASTA/FASTQ into OUTDIR/input/, screens candidates with Mash, downloads the needed references (or reuses a cache hit), aligns with minimap2, and resolves calls with the weighted LCA resolver. Tweak --cand-max, --species-dedup, --threads, --cache-root, or --assembly-summary-dir to fit your hardware and naming conventions. When running via the benchmark harness, add --keep-work if you want it to retain intermediates under OUTDIR/work/ for debugging.
  3. Consume outputs – Every run writes:
    • OUTDIR/classified_sequences.tsv – one row per contig/read with lineage, rank, TaxID, and confidence (compatible with downstream Krona/plots).
    • OUTDIR/hymet.sample_0.cami.tsv – CAMI-format profile built from the classification table (rename or symlink as desired for multi-sample batches).
    • OUTDIR/metadata.json – reproducibility snapshot (HYMET commit, sketch checksums, cache key, tool versions, tunables).
    • OUTDIR/logs/ plus OUTDIR/work/ – diagnostics and reusable intermediates. For benchmark runs, pass --keep-work (or export KEEP_HYMET_WORK=1) to prevent the harness from deleting minimap2 artefacts; direct bin/hymet run invocations keep them by default.
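As a quick sanity check on a finished run, the classification table can be summarised with a few lines of Python. The column names used here ('rank', 'confidence') are assumptions based on the description above; check the header of your own classified_sequences.tsv before relying on them.

```python
import csv
from collections import Counter

def rank_counts(path, min_confidence=0.0):
    """Tally classified contigs per assigned rank from a
    classified_sequences.tsv-style file.

    Assumes a tab-separated header with at least 'rank' and
    'confidence' columns (assumption -- verify against your output).
    """
    counts = Counter()
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if float(row.get("confidence", 0) or 0) >= min_confidence:
                counts[row.get("rank", "unranked")] += 1
    return counts
```

Raising min_confidence gives a quick feel for how many calls survive a stricter cutoff before any downstream Krona plotting.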

For throughput runs, iterate over samples in a simple shell loop or a workflow manager, changing only --contigs/--reads and --out per sample; the cache key (hash of selected_genomes.txt) lets multiple runs share downloaded references automatically.
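The per-sample loop described above can equally be driven from Python. This is an illustrative sketch (the run_batch helper is not part of HYMET); it builds one bin/hymet run command per sample, varying only --contigs and --out so that the reference cache is shared across runs:

```python
import subprocess
from pathlib import Path

def run_batch(samples, out_root="results", threads=16,
              hymet="bin/hymet", dry_run=False):
    """Invoke `bin/hymet run` once per FASTA in `samples`.

    Only --contigs and --out change between invocations, so all runs
    reuse the same downloaded-reference cache. Set dry_run=True to
    inspect the commands without executing them.
    """
    cmds = []
    for fasta in samples:
        out = Path(out_root) / Path(fasta).stem  # one out dir per sample
        cmd = [hymet, "run",
               "--contigs", str(fasta),
               "--out", str(out),
               "--threads", str(threads)]
        cmds.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return cmds
```

A workflow manager (Snakemake, Nextflow) achieves the same effect; the key point is that only the input and output paths vary per sample.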

Cache note: HYMET uses a two-level cache: per-sample genome FASTAs under CACHE_ROOT/<sha1>/ and shared NCBI assembly summaries (~1.6 GB) under ASSEMBLY_SUMMARY_DIR. Summaries are refreshed every 14 days and reused across all runs to avoid redundant downloads.
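For intuition, the per-sample cache key can be sketched as a SHA-1 digest of the candidate list; the exact derivation is an assumption here (the authoritative scheme lives in the HYMET source), but it shows why two samples selecting the same genomes land in the same cache directory.

```python
import hashlib
from pathlib import Path

def cache_key(selected_genomes: Path) -> str:
    """Illustrative sketch (not HYMET's exact scheme): hash the
    candidate-genome list so identical selections map to the same
    CACHE_ROOT/<sha1>/ directory."""
    return hashlib.sha1(selected_genomes.read_bytes()).hexdigest()
```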

Reproducing CAMI suites

  • Follow the detailed playbook in docs/reproducibility.md for the original manuscript run. The published artefacts now live under results/cami/canonical/run_<timestamp>/ (raw outputs, tables, figures, metadata).

  • Use workflows/run_cami_suite.sh to stage new CAMI experiments. Every invocation creates results/<scenario>/<suite>/run_<timestamp>/ and fills it with raw/, tables/, figures/, and metadata.json. For example:

    THREADS=8 CACHE_ROOT=data/downloaded_genomes/cache_bench \
    workflows/run_cami_suite.sh \
      --scenario cami \
      --suite contig_full

    Tool panels default to the selections in workflows/config/cami_suite.cfg; provide --contig-tools and/or --read-tools to override when experimenting. All artefacts for that run appear in results/cami/contig_full/run_<timestamp>/; nothing under bench/out/ or results/cami/canonical/ is touched.

Installation Options

  • Bioconda / mamba – mamba install -c bioconda hymet. Installs the CLI and dependencies into the active environment.
  • Docker – docker build -t hymet . followed by docker run --rm -it hymet hymet --help. The image bundles the benchmark harness; bind data/cache directories as needed.
  • Singularity / Apptainer – apptainer build hymet.sif Singularity.def followed by apptainer exec hymet.sif hymet --help. Mirrors the Docker build for HPC clusters.
  • Source checkout – git clone https://github.com/ieeta-pt/HYMET.git, then cd HYMET and mamba env create -f environment.yml. Recommended for development; activate the environment before using bin/hymet. For exact pins, use environment.lock.yml.

Preparing Data

  1. References & taxonomy – Download the Mash sketches from the Zenodo archive (10.5281/zenodo.17428354) with:
    tools/fetch_sketches.sh    # defaults to the Zenodo record + checksum verification
    tools/verify_sketches.sh   # optional: confirm local files match the archive
    Place NCBI taxonomy dumps under HYMET/taxonomy_files/. Builders in bench/db/ derive tool-specific indices on demand.
  2. CAMI subsets – Use bench/fetch_cami.sh (supports --dry-run) to download the contigs listed in bench/cami_manifest.tsv.
  3. Case-study contigs – case/fetch_case_data.sh downloads the Zymo mock community assembly (http://nanopore.s3.climb.ac.uk/mockcommunity/v3/7cd60d3b-eafb-48d1-9aab-c8701232f2f8.ctg.cns.fa) and the MGnify gut contigs (https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00794604/file/ERZ24911249_FASTA.fasta.gz) and stages them under /data/case/.
  4. Truth tables – CAMI truth lives under bench/data/; case-study truth files (including curated Zymo panels) live under case/truth/.

Outputs at a Glance

  • Canonical CAMI artefacts: results/cami/canonical/run_<timestamp>/ (raw benchmark outputs, summary tables, figures, metadata).
  • Reviewer suites: results/cami/<suite>/run_<timestamp>/ (one folder per run). Raw per-tool outputs are grouped by mode; derived tables/figures live alongside metadata for immediate inspection.
  • Case studies and ablations follow the same pattern under results/cases/…/run_<timestamp>/ and results/ablation/…/run_<timestamp>/.
  • Bench runs write intermediate outputs to BENCH_OUT_ROOT (defaults to bench/out/); unless you pass --no-publish, a snapshot is also mirrored to results/<scenario>/<suite>/run_<timestamp>/…. Case workflows follow the same convention: case/run_case.sh (and bin/hymet case) publish into results/cases/<suite>/run_<timestamp>/… unless you override the destination with --out.
  • Use python bench/plot/make_figures.py --bench-root bench --tables <run>/tables --outdir <run>/figures to regenerate figures for any run (keeps data and plots in sync).
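The run_<timestamp> layout used throughout results/ can be sketched in a few lines; the timestamp format and the run_dir helper name are assumptions for illustration, not part of HYMET's API.

```python
from datetime import datetime
from pathlib import Path

def run_dir(scenario, suite, root="results"):
    """Create a results/<scenario>/<suite>/run_<timestamp>/ directory
    with the raw/, tables/, and figures/ subfolders described above.
    The timestamp format is an assumption for this sketch."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    d = Path(root) / scenario / suite / f"run_{stamp}"
    for sub in ("raw", "tables", "figures"):
        (d / sub).mkdir(parents=True, exist_ok=True)
    return d
```

Because every invocation gets a fresh timestamped folder, new experiments never overwrite the canonical artefacts.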

Documentation & Reporting

  • CAMI harness details: bench/README.md, latest metrics: bench/results_summary.md.
  • Case-study workflows: case/README.md, results recap: case/results_summary.md.
  • Reproducibility playbook: docs/reproducibility.md.

Repository Layout

HYMET/
├── bin/                 # CLI entry points (Python)
├── bench/               # CAMI harness (runners, builders, plots)
├── case/                # Real-data case study + ablation toolkit
├── docs/                # Additional guides
├── results/             # Canonical artefacts (cami, cases, ablation, …)
├── workflows/           # Repro runners that populate results/<scenario>/<suite>/
├── scripts/             # Legacy helpers (Perl/Bash)
├── testdataset/         # Synthetic dataset utilities
└── data/, taxonomy_files/, …  # Downloaded references and taxonomy dumps

The maintained workflow is through the Python CLI. Legacy scripts (config.pl, main.pl, scripts/*.sh) are retained for historical pipelines but no longer required for fresh runs.

Support & Citation

  • Open issues and feature requests on the GitHub tracker.
  • Cite HYMET using CITATION.cff in this repository.

About

HYMET: A Hybrid Metagenomic Pipeline for Accurate and Efficient Taxonomic Classification
