HYMET performs contig-level metagenomic classification by combining Mash-based candidate selection, minimap2 alignment, and a weighted-LCA resolver. The repository includes the classifier, the CAMI benchmark harness, real-data case-study tooling, and auxiliary scripts.
- Candidate filtering – Mash containment scores cap the number of references passed to minimap2 (5000 by default; the CAMI harness overrides this to 1500).
- CLI workflows – `bin/hymet` provides `run`, `bench`, `case`, `ablation`, `truth build-zymo`, `artifacts`, `version`, and `legacy` subcommands with consistent metadata outputs.
- Benchmark automation – The CAMI harness produces evaluation tables, runtime logs, and figures from a single driver script.
- Case-study tooling – Dedicated scripts execute MGnify and Zymo contig workflows and perform reference ablation experiments.
- Deployment options – Install via Bioconda, Docker/Singularity images, or a source checkout with the supplied environment file.
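The candidate-filtering step described above can be illustrated with a toy score table (plain shell, not HYMET's actual code; the accessions and scores below are made up): keep only the top-`CAND_MAX` references by containment score before handing them to minimap2.

```shell
# Conceptual sketch of the candidate cap: rank toy references by a
# made-up containment score and keep the best CAND_MAX of them.
CAND_MAX=2
tmp=$(mktemp)
printf 'GCF_A\t0.93\nGCF_B\t0.71\nGCF_C\t0.88\n' > "$tmp"
# sort numerically (descending) on the score column, then truncate
candidates=$(sort -t$'\t' -k2,2gr "$tmp" | head -n "$CAND_MAX" | cut -f1)
echo "$candidates"
rm -f "$tmp"
```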
| Directory | Purpose |
|---|---|
| `bench/` | CAMI benchmark harness, database builders, evaluation, and plotting scripts. |
| `case/` | Real-data case study runner plus reference ablation tooling. |
| `workflows/` | High-level runners (e.g., CAMI suite) that stage artefacts under `results/<scenario>/<suite>/`. |
| `bin/` | Python CLI entry points (preferred interface for new workflows). |
| `scripts/` | Legacy Perl/Bash helpers retained for reproducibility (`main.pl`, `config.pl`, etc.). |
| `testdataset/` | Utilities to assemble small synthetic evaluation sets. |
| `data/`, `taxonomy_files/` | Expected locations for downloaded references and taxonomy dumps. |
After cloning the repository:

```bash
# 1. Create and activate the environment
mamba env create -f environment.yml
conda activate hymet_env

# 2. Initialize HYMET (downloads NCBI taxonomy, creates stubs, verifies installation)
bin/hymet init

# 3. Fetch Mash sketches from Zenodo (required for classification)
tools/fetch_sketches.sh

# 4. Verify everything is ready
bin/hymet init
```

The `init` command automatically downloads the NCBI taxonomy (~60 MB) and generates the required hierarchy file. Use `--skip-taxonomy` if you already have the files or want to set them up manually.
```bash
# Single-sample classification
bin/hymet run --contigs /path/to/sample.fna --out results/sample --threads 16

# CAMI benchmark (HYMET + baselines; preset 'contigs' runs the default panel)
bin/hymet bench --manifest bench/cami_manifest.tsv --tools contigs --threads 16

# Case-study bundle (MGnify gut + Zymo mock community)
./case/run_cases_full.sh --threads 16   # canonical + gut + zymo suites; add --suite to limit the run
# or run a single manifest via the CLI:
bin/hymet case --manifest case/manifest_zymo.tsv --threads 8

# Reference ablation experiment
bin/hymet ablation --sample zymo_mc --taxa 1423,562 --levels 0,0.5,1.0 --threads 4

# Refresh supplementary tables and figures
bin/hymet artifacts
```

`bin/hymet` auto-detects `HYMET_ROOT`. Export it explicitly (`export HYMET_ROOT=/path/to/HYMET`) if you prefer running from arbitrary directories. The legacy Perl entry point remains available as `bin/hymet legacy -- …`.
HYMET’s `run` subcommand is the supported way to classify your own assemblies or read sets; the CAMI harness is just a bundled example. A typical ad-hoc run looks like this:
```bash
conda activate hymet_env
export HYMET_ROOT=/path/to/HYMET   # optional when running from a cloned repo

# replace `bin/hymet` with `hymet` if the package is installed into the active environment
bin/hymet run \
    --contigs my_assembly.fna \
    --out results/my_assembly \
    --threads 32 \
    --cand-max 500 \
    --species-dedup
```

- Prepare inputs – Provide a contig FASTA via `--contigs` or a read FASTA/FASTQ via `--reads`. Place the pre-built Mash sketches under `HYMET/data/` (see Preparing Data below) and point `taxonomy_files/` at a fresh NCBI dump. HYMET will download any missing reference genomes into `CACHE_ROOT` on demand.
- Launch classification – `hymet run` stages the sample under `OUTDIR`, copies your FASTA/FASTQ into `OUTDIR/input/`, screens candidates with Mash, downloads the needed references (or reuses a cache hit), aligns with minimap2, and resolves calls with the weighted LCA resolver. Tweak `--cand-max`, `--species-dedup`, `--threads`, `--cache-root`, or `--assembly-summary-dir` to fit your hardware and naming conventions. When running via the benchmark harness, add `--keep-work` if you want to retain intermediates under `OUTDIR/work/` for debugging.
- Consume outputs – Every run writes:
  - `OUTDIR/classified_sequences.tsv` – one row per contig/read with lineage, rank, TaxID, and confidence (compatible with downstream Krona/plots).
  - `OUTDIR/hymet.sample_0.cami.tsv` – CAMI-format profile built from the classification table (rename or symlink as desired for multi-sample batches).
  - `OUTDIR/metadata.json` – reproducibility snapshot (HYMET commit, sketch checksums, cache key, tool versions, tunables).
  - `OUTDIR/logs/` plus `OUTDIR/work/` – diagnostics and reusable intermediates. For benchmark runs, pass `--keep-work` (or export `KEEP_HYMET_WORK=1`) to prevent the harness from deleting minimap2 artefacts; direct `bin/hymet run` invocations keep them by default.
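As a toy illustration of consuming `classified_sequences.tsv`, the snippet below tallies calls per rank with `awk`. The column layout used here (query, rank, TaxID, lineage, confidence) is an assumption for illustration only; check the header of your own file first.

```shell
# Build a toy classified_sequences.tsv (assumed column order: query, rank,
# taxid, lineage, confidence) and count classifications per rank.
tmp=$(mktemp)
printf 'query\trank\ttaxid\tlineage\tconfidence\n' > "$tmp"
printf 'contig_1\tspecies\t562\tBacteria;Escherichia coli\t0.97\n' >> "$tmp"
printf 'contig_2\tgenus\t561\tBacteria;Escherichia\t0.81\n' >> "$tmp"
printf 'contig_3\tspecies\t1423\tBacteria;Bacillus subtilis\t0.92\n' >> "$tmp"

# skip the header, tally column 2 (rank), print "rank count" pairs
summary=$(awk -F'\t' 'NR > 1 { n[$2]++ } END { for (r in n) print r, n[r] }' "$tmp" | sort)
echo "$summary"
rm -f "$tmp"
```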
For throughput runs, iterate over samples in a simple shell loop or a workflow manager, changing only `--contigs`/`--reads` and `--out` per sample; the cache key (a hash of `selected_genomes.txt`) lets multiple runs share downloaded references automatically.
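Such a per-sample loop might look like the following dry-run sketch (the assembly paths are placeholders; remove the `echo` to actually launch the runs):

```shell
# Dry-run batch sketch: prints one `bin/hymet run` command per assembly.
# Replace `echo` with nothing (or pipe the output to `bash`) to execute.
THREADS=16
batch_cmds=$(
    for fna in assemblies/sampleA.fna assemblies/sampleB.fna; do
        sample=$(basename "$fna" .fna)
        echo "bin/hymet run --contigs $fna --out results/$sample --threads $THREADS"
    done
)
echo "$batch_cmds"
```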
Cache note: HYMET uses a two-level cache: per-sample genome FASTAs under `CACHE_ROOT/<sha1>/`, and shared NCBI assembly summaries (~1.6 GB) under `ASSEMBLY_SUMMARY_DIR`. Summaries are refreshed every 14 days and reused across all runs to avoid redundant downloads.
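If you want to locate a run's cache directory by hand, hashing `selected_genomes.txt` reproduces the key; whether the key is the plain SHA-1 of that file is an assumption here, so treat this as illustrative:

```shell
# Illustrative only: the cache note above says genomes live under
# CACHE_ROOT/<sha1>/ keyed on selected_genomes.txt. Assuming the key is the
# file's plain SHA-1 (unverified), it can be recomputed like this:
tmp=$(mktemp)   # stand-in for a run's selected_genomes.txt
printf 'GCF_000005845.2\nGCF_000009045.1\n' > "$tmp"
key=$(sha1sum "$tmp" | awk '{print $1}')
echo "cache dir would be: \$CACHE_ROOT/$key/"
rm -f "$tmp"
```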
- Follow the detailed playbook in `docs/reproducibility.md` for the original manuscript run. The published artefacts now live under `results/cami/canonical/run_<timestamp>/` (raw outputs, tables, figures, metadata).
- Use `workflows/run_cami_suite.sh` to stage new CAMI experiments. Every invocation creates `results/<scenario>/<suite>/run_<timestamp>/` and fills it with `raw/`, `tables/`, `figures/`, and `metadata.json`. For example:

  ```bash
  THREADS=8 CACHE_ROOT=data/downloaded_genomes/cache_bench \
      workflows/run_cami_suite.sh \
      --scenario cami \
      --suite contig_full
  ```

- Tool panels default to the selections in `workflows/config/cami_suite.cfg`; provide `--contig-tools` and/or `--read-tools` to override when experimenting. All artefacts for that run appear in `results/cami/contig_full/run_<timestamp>/`; nothing under `bench/out/` or `results/cami/canonical/` is touched.
| Method | Command | Notes |
|---|---|---|
| Bioconda / mamba | `mamba install -c bioconda hymet` | Installs the CLI and dependencies into the active environment. |
| Docker | `docker build -t hymet .`<br>`docker run --rm -it hymet hymet --help` | Image bundles the benchmark harness; bind data/cache directories as needed. |
| Singularity / Apptainer | `apptainer build hymet.sif Singularity.def`<br>`apptainer exec hymet.sif hymet --help` | Mirrors the Docker build for HPC clusters. |
| Source checkout | `git clone https://github.com/ieeta-pt/HYMET.git`<br>`cd HYMET`<br>`mamba env create -f environment.yml` | Recommended for development; activate the environment before using `bin/hymet`. For exact pins, use `environment.lock.yml`. |
- References & taxonomy – Download the Mash sketches from the Zenodo archive (`10.5281/zenodo.17428354`) with:

  ```bash
  tools/fetch_sketches.sh    # defaults to the Zenodo record + checksum verification
  tools/verify_sketches.sh   # optional: confirm local files match the archive
  ```

  Place NCBI taxonomy dumps under `HYMET/taxonomy_files/`. Builders in `bench/db/` derive tool-specific indices on demand.
- CAMI subsets – Use `bench/fetch_cami.sh` (supports `--dry-run`) to download the contigs listed in `bench/cami_manifest.tsv`.
- Case-study contigs – `case/fetch_case_data.sh` downloads the Zymo mock community assembly (http://nanopore.s3.climb.ac.uk/mockcommunity/v3/7cd60d3b-eafb-48d1-9aab-c8701232f2f8.ctg.cns.fa) and the MGnify gut contigs (https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00794604/file/ERZ24911249_FASTA.fasta.gz) and stages them under `/data/case/`.
- Truth tables – CAMI truth lives under `bench/data/`; case-study truth files (including curated Zymo panels) live under `case/truth/`.
- Canonical CAMI artefacts: `results/cami/canonical/run_<timestamp>/` (raw benchmark outputs, summary tables, figures, metadata).
- Reviewer suites: `results/cami/<suite>/run_<timestamp>/` (one folder per run). Raw per-tool outputs are grouped by mode; derived tables/figures live alongside metadata for immediate inspection.
- Case studies and ablations follow the same pattern under `results/cases/…/run_<timestamp>/` and `results/ablation/…/run_<timestamp>/`.
- Bench runs write intermediate outputs to `BENCH_OUT_ROOT` (defaults to `bench/out/`); unless you pass `--no-publish`, a snapshot is also mirrored to `results/<scenario>/<suite>/run_<timestamp>/…`. Case workflows follow the same convention: `case/run_case.sh` (and `bin/hymet case`) publish into `results/cases/<suite>/run_<timestamp>/…` unless you override the destination with `--out`.
- Use `python bench/plot/make_figures.py --bench-root bench --tables <run>/tables --outdir <run>/figures` to regenerate figures for any run (keeps data and plots in sync).
- CAMI harness details: `bench/README.md`; latest metrics: `bench/results_summary.md`.
- Case-study workflows: `case/README.md`; results recap: `case/results_summary.md`.
- Reproducibility playbook: `docs/reproducibility.md`.
```
HYMET/
├── bin/          # CLI entry points (Python)
├── bench/        # CAMI harness (runners, builders, plots)
├── case/         # Real-data case study + ablation toolkit
├── docs/         # Additional guides
├── results/      # Canonical artefacts (cami, cases, ablation, …)
├── workflows/    # Repro runners that populate results/<scenario>/<suite>/
├── scripts/      # Legacy helpers (Perl/Bash)
├── testdataset/  # Synthetic dataset utilities
└── data/, taxonomy_files/, …  # Downloaded references and taxonomy dumps
```
The maintained workflow is through the Python CLI. Legacy scripts (`config.pl`, `main.pl`, `scripts/*.sh`) are retained for historical pipelines but are no longer required for fresh runs.
- Open issues and feature requests on the GitHub tracker.
- Cite HYMET using `CITATION.cff` in this repository.

