This repository contains the official implementation of the study "Life at the extremes: Maximally divergent microbes with similar genomic signatures linked to extreme environments" (Safari et al., 2025; preprint).
In that work, we showed that extremophiles — despite belonging to maximally divergent lineages — can converge toward highly similar genomic k-mer signatures when adapting to extreme environments (temperature, pH, and beyond). These convergent patterns highlight the role of large-scale mutational and selective pressures in shaping microbial genomes under stress.
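As a rough illustration of what a genomic k-mer signature is, the sketch below computes the relative frequency of every DNA k-mer in a sequence. It is a minimal example, not the code used in this repository; the function name `kmer_signature` and the choice to skip windows containing ambiguous bases are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_signature(sequence: str, k: int) -> dict:
    """Return the relative frequency of all 4**k DNA k-mers in `sequence`.

    Hypothetical helper for illustration; not part of the extprime package.
    """
    alphabet = "ACGT"
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if all(base in alphabet for base in kmer):
            counts[kmer] += 1
    total = sum(counts.values())
    # Emit every possible k-mer so signatures of different genomes align.
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product(alphabet, repeat=k)
    }


signature = kmer_signature("ACGTACGTNACGT", k=2)
print(len(signature), signature["AC"])
```

Two genomes with similar signatures have similar vectors of this kind, regardless of how distant they are taxonomically.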
# From repo root
pip install -e .
Download the extremophile genome assemblies and metadata from Zenodo (DOI: link).
Place the downloaded assemblies (FASTA, .fna) and the metadata file under data/.
All results will be written to outputs/ automatically.
The results of all experiments reported in this study are available in the results/ folder.
Tests whether random genome proxy choice changes classification accuracy.
python3 src/extprime/pipelines/pipeline_supervised.py \
--exp_type exp1 --max_k 6 --data_root data --output_root outputs
Compares accuracy across proxy lengths (k set by --max_k).
python3 src/extprime/pipelines/pipeline_supervised.py \
--exp_type exp2 --max_k 6 --data_root data --output_root outputs
Varies n in the composite genome proxy to measure its impact.
python3 src/extprime/pipelines/pipeline_supervised.py \
--exp_type exp3 --max_k 6 --data_root data --output_root outputs
Optional: Add --whole_genome to use entire genomes instead of proxies.
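To make the idea of a composite genome proxy concrete, here is a minimal sketch that assumes such a proxy is the concatenation of n non-overlapping fragments drawn from random positions in a genome. How the pipeline actually selects fragments is not stated here, so the function `composite_proxy`, its parameters, and the sampling scheme are all hypothetical.

```python
import random


def composite_proxy(genome: str, n: int, fragment_length: int, seed: int = 0) -> str:
    """Concatenate n randomly placed, non-overlapping fragments of `genome`.

    Hypothetical helper for illustration; not the pipeline's implementation.
    """
    if n * fragment_length > len(genome):
        raise ValueError("genome too short for the requested proxy")
    rng = random.Random(seed)
    # Bases left over once n fragments are reserved; spread them as random gaps.
    slack = len(genome) - n * fragment_length
    offsets = sorted(rng.randint(0, slack) for _ in range(n))
    # Shifting the i-th sorted offset by i fragments guarantees no overlap.
    starts = [offset + i * fragment_length for i, offset in enumerate(offsets)]
    return "".join(genome[s:s + fragment_length] for s in starts)


genome = "ACGT" * 250  # toy 1,000 bp genome
proxy = composite_proxy(genome, n=3, fragment_length=100, seed=1)
print(len(proxy))
```

With n=1 this reduces to a single random fragment, which is the kind of random proxy choice whose effect exp1 tests.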
Results are written under:
outputs/{exp_type}/{env}/fragments_{length}/...
Each folder contains the generated FASTA for that environment and the model outputs produced by the pipeline.
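The layout above can be reproduced in a script that post-processes results. The sketch below builds the expected directory from its components; `result_dir` is a hypothetical helper, and the example values for the environment and fragment length are assumptions rather than guaranteed folder names.

```python
from pathlib import Path


def result_dir(output_root: str, exp_type: str, env: str, length: int) -> Path:
    """Build outputs/{exp_type}/{env}/fragments_{length} (hypothetical helper)."""
    return Path(output_root) / exp_type / env / f"fragments_{length}"


# Example values are assumptions; substitute those used in your own run.
path = result_dir("outputs", "exp1", "Temperature", 100000)
print(path.as_posix())
```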
--exp_type {exp1,exp2,exp3,tuning} – Choose the experiment
--max_k INT – Maximum k-mer length considered by the models
--data_root PATH – Input root (default: data)
--output_root PATH – Results root (default: outputs)
--whole_genome – Use entire genomes instead of proxies (optional)
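For readers who want to wrap the supervised pipeline in their own scripts, the options listed above map naturally onto an `argparse` parser. The following is only a sketch mirroring the documented flags and defaults; it is not the parser defined in pipeline_supervised.py, and marking `--exp_type` and `--max_k` as required is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented supervised options."""
    parser = argparse.ArgumentParser(description="Supervised pipeline options (sketch)")
    # Assumption: the experiment type and max k must always be given.
    parser.add_argument("--exp_type", choices=["exp1", "exp2", "exp3", "tuning"], required=True)
    parser.add_argument("--max_k", type=int, required=True)
    parser.add_argument("--data_root", default="data")
    parser.add_argument("--output_root", default="outputs")
    parser.add_argument("--whole_genome", action="store_true")
    return parser


args = build_parser().parse_args(["--exp_type", "exp1", "--max_k", "6"])
print(args.data_root, args.output_root, args.whole_genome)
```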
python3 src/extprime/pipelines/pipeline_unsupervised.py \
--exp_type non-parametric --k_mer 6 --data_root path_to_the_subfragments --output_root outputs \
--fragement_length 100000 --n_clusters 4 --env Temperature
--exp_type {parametric, non-parametric} – Choose the experiment
--k_mer INT – K-mer length
--fragement_length INT – Fragment length
--n_clusters INT – Number of clusters (default: 4; not needed for non-parametric)
--outputs_root PATH – Output results directory
--env {pH,Temperature} – Environment type
--fragment_path PATH – Path to a fragment FASTA file
--data_root PATH – Input data directory
python3 src/extprime/analysis/distance_calculator.py --data_root path_to_the_subfragments
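The distance metric used by distance_calculator.py is not described here. As a hedged illustration of how fragments can be compared through their k-mer composition, the sketch below computes pairwise Manhattan distances between k-mer frequency vectors. The metric, the helper names, and the toy fragments are assumptions chosen for the example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int) -> list:
    """Relative frequencies of all 4**k k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def manhattan(u: list, v: list) -> float:
    """Sum of absolute differences (an assumed metric, for illustration)."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Toy fragments; real inputs would be read from the subfragment FASTA files.
fragments = {"frag_a": "ACGTACGTAC", "frag_b": "ACGTACGTAC", "frag_c": "GGGGGGGGGG"}
vectors = {name: kmer_frequencies(seq, k=2) for name, seq in fragments.items()}
distances = {
    (a, b): manhattan(vectors[a], vectors[b])
    for a in fragments
    for b in fragments
}
print(distances[("frag_a", "frag_c")])
```

Identical fragments are at distance 0, and fragments that share no k-mers reach the maximum of 2 under this metric.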