diff --git a/README.md b/README.md index 409b338c1..9cae060a6 100644 --- a/README.md +++ b/README.md @@ -6,75 +6,116 @@ ![Python Version](https://img.shields.io/badge/python-3.10%2B-blue) ![Issues](https://img.shields.io/github/issues/monarch-initiative/pheval) -## Overview +PhEval (Phenotypic Inference Evaluation Framework) is a **modular, reproducible benchmarking framework** for evaluating **phenotype-driven prioritisation tools**, such as gene, variant, and disease prioritisation algorithms. -The absence of standardised benchmarks and data standardisation for Variant and Gene Prioritisation Algorithms (VGPAs) presents a significant challenge in the field of genomic research. To address this, we developed PhEval, a novel framework designed to streamline the evaluation of VGPAs that incorporate phenotypic data. PhEval offers several key benefits: +It is designed to support **fair comparison across tools, tool versions, datasets, and knowledge updates**, addressing a long-standing gap in standardised evaluation for phenotype-based methods. -- Automated Processes: Reduces manual effort by automating various evaluation tasks, thus enhancing efficiency. -- Standardisation: Ensures consistency and comparability in evaluation methodologies, leading to more reliable and standardised assessments. -- Reproducibility: Facilitates reproducibility in research by providing a standardised platform, allowing for consistent validation of algorithms. -- Comprehensive Benchmarking: Enables thorough benchmarking of algorithms, providing well-founded comparisons and deeper insights into their performance. +πŸ“– **Full documentation:** https://monarch-initiative.github.io/pheval/ +--- -PhEval is a valuable tool for researchers looking to improve the accuracy and reliability of VGPA evaluations through a structured and standardised approach. +## Why PhEval? -For more information please see the full [documentation](https://monarch-initiative.github.io/pheval/). 
+Evaluating phenotype-driven prioritisation tools is challenging because performance depends on many moving parts, including: -## Download and Installation +- Phenotype representations and noise +- Ontology structure and versioning +- Gene and disease mappings +- Tool-specific scoring and ranking strategies +- Input cohorts and simulation approaches + +PhEval provides a framework that makes these factors **explicit, controlled, and comparable**. + +Key features: + +- **Standardised outputs** across tools +- **Reproducible benchmarking** with recorded metadata +- **Plugin-based architecture** for extensibility +- **Separation of execution and evaluation** +- Support for **gene, variant, and disease prioritisation** + +--- + +## Installation + +PhEval requires **Python 3.10 or later**. + +Install from PyPI: -1. Ensure you have Python 3.10 or greater installed. -2. Install with `pip`: ```bash pip install pheval ``` -3. See list of all PhEval utility commands: + +This installs: + +* The core pheval CLI (for running tools via plugins) +* `pheval-utils` (for data preparation, benchmarking, and analysis) + +Verify installation: + ```bash +pheval --help pheval-utils --help ``` -## Usage +## How PhEval is used -The PhEval CLI offers a variety of commands categorised into two main types: **Runner Implementations** and **Utility Commands**. Below is an overview of each category, detailing how they can be utilised to perform various tasks within PhEval. +PhEval workflows typically consist of three phases: -### Runner Implementations +1. Prepare data + Prepare and manipulate phenopackets and related inputs (e.g. VCFs). +2. Run tools + Execute phenotype-driven prioritisation tools via plugin-provided runners using: + ```bash + pheval run --runner ... + ``` +3. Benchmark and analyse + Compare results across runs using standardised metrics and plots. -The primary command used within PhEval is `pheval run`. 
This command is responsible for executing concrete VGPA runner implementations, that we sometimes term as plugins. By using pheval run, users can leverage these runner implementations to: execute the VGPA on a set of test corpora, produce tool-specific result outputs, and post-process tool-specific outputs to PhEval standardised TSV outputs. +Each phase is documented in detail in the user documentation. -Some concrete PhEval runner implementations include the [Exomiser runner](https://github.com/monarch-initiative/pheval.exomiser) and the [Phen2Gene runner](https://github.com/monarch-initiative/pheval.phen2gene). The full list of currently implemented runners can be found [here](https://monarch-initiative.github.io/pheval/plugins/) +## Plugins and runners -Please read the [documentation](https://monarch-initiative.github.io/pheval/developing_a_pheval_plugin/) for a step-by-step for creating your own PhEval plugin. +PhEval itself is tool-agnostic. -### Utility Commands +Support for specific tools is provided via plugins, which implement runners responsible for: -In addition to the main `run` command, PhEval provides a set of utility commands designed to enhance the overall functionality of the CLI. These commands can be used to set up and configure experiments, streamline data preparation, and benchmark the performance of various VGPA runner implementations. By utilising these utilities, users can optimise their experimental workflows, ensure reproducibility, and compare the efficiency and accuracy of different approaches. The utility commands offer a range of options that facilitate the customisation and fine-tuning to suit diverse research objectives. 
+* Preparing tool inputs +* Executing the tool +* Converting raw outputs into PhEval standardised results -#### Example Usage +A list of available plugins is maintained in the documentation: -To add noise to an existing corpus of phenopackets, this could be used to assess the robustness of VGPAs when less relevant or unreliable phenotype data is introduced: -```bash -pheval-utils scramble-phenopackets --phenopacket-dir /phenopackets --scramble-factor 0.5 --output-dir /scrambled_phenopackets_0.5 -``` +Plugins: https://monarch-initiative.github.io/pheval/plugins/ -To update the gene symbols and identifiers to a specific namespace: -```bash -pheval-utils update-phenopackets --phenopacket-dir /phenopackets --output-dir /updated_phenopackets --gene-identifier ensembl_id -``` +Each plugin repository contains tool-specific installation instructions and examples. -To prepare VCF files for a corpus of phenopackets, spiking in the known causative variants: -```bash -pheval-utils create-spiked-vcfs --phenopacket-dir /phenopackets --hg19-template-vcf /template_hg19.vcf --hg38-template-vcf /template_hg38.vcf --output-dir /vcf -``` +## Documentation -Alternatively, you can wrap all corpus preparatory commands into a single step. Specifying `--variant-analysis`/`--gene-analysis`/`--disease-analysis` will check the phenopackets for complete records documenting the known entities. If template vcf(s) are provided this will spike VCFs with the known variant for the corpus. If a `--gene-identifier` is specified then the corpus of phenopackets is updated. 
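These three responsibilities map onto the methods every runner implements: `prepare`, `run`, and `post_process`. A minimal standalone sketch of that shape (illustrative only β€” a real plugin subclasses `pheval.runners.runner.PhEvalRunner` and is registered under the `pheval.plugins` entry point; the class name and paths here are hypothetical):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExampleRunner:
    """Sketch of a runner; real plugins subclass pheval.runners.runner.PhEvalRunner."""

    input_dir: Path
    testdata_dir: Path
    output_dir: Path

    def prepare(self) -> None:
        # Prepare tool inputs (e.g. convert phenopackets to the tool's input format).
        print("preparing")

    def run(self) -> None:
        # Execute the tool over the prepared inputs.
        print("running")

    def post_process(self) -> None:
        # Convert raw tool outputs into PhEval standardised results.
        print("post processing")


runner = ExampleRunner(Path("input"), Path("testdata"), Path("output"))
runner.prepare()
runner.run()
runner.post_process()
```

The three methods are invoked in that order by `pheval run`; the plugin repositories linked above show complete implementations.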
-```bash -pheval-utils prepare-corpus \ - --phenopacket-dir /phenopackets \ - --variant-analysis \ - --gene-analysis \ - --gene-identifier ensembl_id \ - --hg19-template-vcf /template_hg19.vcf \ - --hg38-template-vcf /template_hg38.vcf \ - --output-dir /vcf -``` +The PhEval documentation is organised by audience and task: +* Getting started: installation and first steps +* Using PhEval: running tools, plugins, and workflows +* Utilities: data preparation, phenopacket manipulation, simulations +* Benchmarking: executing benchmarks, metrics, and plots +* Developer documentation: plugin development and API reference + +Start here: https://monarch-initiative.github.io/pheval/ + +## Contributions + +Contributions are welcome across: + +* Code +* Documentation +* Testing +* Plugins and integrations + +## Citation + +If you use **PhEval** in your research, please cite the following publication: + +> **Bridges, Y., Souza, V. d., Cortes, K. G., et al.** +> *Towards a standard benchmark for phenotype-driven variant and gene prioritisation algorithms: PhEval – Phenotypic Inference Evaluation Framework.* +> **BMC Bioinformatics** 26, 87 (2025). +> https://doi.org/10.1186/s12859-025-06105-4 -See the [documentation](https://monarch-initiative.github.io/pheval/executing_a_benchmark/) for instructions on benchmarking and evaluating the performance of various VGPAs. 
diff --git a/docs/CODE_OF_CONDUCT.md b/docs/CODE_OF_CONDUCT.md deleted file mode 100644 index b3c57d751..000000000 --- a/docs/CODE_OF_CONDUCT.md +++ /dev/null @@ -1,46 +0,0 @@ -# Contributor Covenant Code of Conduct - -## Our Pledge - -In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to make participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation. - -## Our Standards - -Examples of behavior that contributes to creating a positive environment include: - -* Using welcoming and inclusive language -* Being respectful of differing viewpoints and experiences -* Gracefully accepting constructive criticism -* Focusing on what is best for the community -* Showing empathy towards other community members - -Examples of unacceptable behavior by participants include: - -* The use of sexualized language or imagery and unwelcome sexual attention or advances -* Trolling, insulting/derogatory comments, and personal or political attacks -* Public or private harassment -* Publishing others' private information, such as a physical or electronic address, without explicit permission -* Other conduct which could reasonably be considered inappropriate in a professional setting - -## Our Responsibilities - -Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior. 
- -Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. - -## Scope - -This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers. - -## Enforcement - -Instances of abusive, harassing, or otherwise unacceptable behavior. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately. - -Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership. 
- -## Attribution - -This code of conduct has been derived from the excellent code of conduct of the [ATOM project](https://github.com/atom/atom/blob/master/CODE_OF_CONDUCT.md) which in turn is adapted from the [Contributor Covenant][homepage], version 1.4, available at [https://contributor-covenant.org/version/1/4][version] - -[homepage]: https://contributor-covenant.org -[version]: https://contributor-covenant.org/version/1/4/ \ No newline at end of file diff --git a/docs/about.md b/docs/about.md deleted file mode 100644 index d56d40180..000000000 --- a/docs/about.md +++ /dev/null @@ -1,3 +0,0 @@ -# PhEval - Phenotypic Inference Evaluation Framework - -Many variant prioritization tools (such as [Exomiser](https://exomiser.readthedocs.io/) and other computational approaches) rely on ontologies and phenotype matching, sometimes involving complex processes such as cross-species inference. The performance of such tools is exceedingly hard to evaluate because of the many factors involved: changes to the structure of the ontology, cross-species mappings, and semantic similarity algorithms can have significant consequences. Furthermore, the lack of suitable real-world problems/corpora leads to the situation that many algorithms are evaluated using simulations, which may fail to capture real-world scenarios. The lack of an evaluation framework that enables studying effects on data and knowledge inputs on real-world problems makes it difficult to optimize algorithms. To this end, we are developing a modular Phenotypic Inference Evaluation Framework (PhEval), which is delivered as a community resource. 
\ No newline at end of file diff --git a/docs/benchmarking/executing_a_benchmark.md b/docs/benchmarking/executing_a_benchmark.md new file mode 100644 index 000000000..c1143b98c --- /dev/null +++ b/docs/benchmarking/executing_a_benchmark.md @@ -0,0 +1,138 @@ +# Executing a Benchmark + +This page describes how to execute a benchmark, configure benchmarking parameters, and interpret the resulting outputs. + +It assumes that one or more PhEval runs have already been completed using plugin-provided runners. + +--- + +## After runner execution + +After executing a run, an output directory structure similar to the following is produced: + +```tree +. +β”œβ”€β”€ pheval_disease_results +β”‚ β”œβ”€β”€ patient_1-disease_result.parquet +β”œβ”€β”€ pheval_gene_results +β”‚ β”œβ”€β”€ patient_1-gene_result.parquet +β”œβ”€β”€ pheval_variant_results +β”‚ β”œβ”€β”€ patient_1-variant_result.parquet +β”œβ”€β”€ raw_results +β”‚ β”œβ”€β”€ patient_1.json +β”œβ”€β”€ results.yml +└── tool_input_commands + └── tool_input_commands.txt +``` + +Which result directories are present depends on the configuration used during runner execution. + +The contents of the `pheval_*_results` directories are consumed during benchmarking. + +--- + +## Benchmarking configuration file + +Benchmarking is configured using a YAML file supplied to the CLI. 
+
+### Example configuration
+
+```yaml
+benchmark_name: tool_version_update_benchmark
+runs:
+  - run_identifier: run_identifier_1
+    results_dir: /path/to/results_dir_1
+    phenopacket_dir: /path/to/phenopacket_dir
+    gene_analysis: true
+    variant_analysis: false
+    disease_analysis: true
+    threshold:
+    score_order: descending
+  - run_identifier: run_identifier_2
+    results_dir: /path/to/results_dir_2
+    phenopacket_dir: /path/to/phenopacket_dir
+    gene_analysis: true
+    variant_analysis: true
+    disease_analysis: true
+    threshold:
+    score_order: descending
+plot_customisation:
+  gene_plots:
+    plot_type: bar_cumulative
+    rank_plot_title:
+    roc_curve_title:
+    precision_recall_title:
+  disease_plots:
+    plot_type: bar_cumulative
+    rank_plot_title:
+    roc_curve_title:
+    precision_recall_title:
+  variant_plots:
+    plot_type: bar_cumulative
+    rank_plot_title:
+    roc_curve_title:
+    precision_recall_title:
+```
+
+The `benchmark_name` is used to name the DuckDB database that stores benchmarking statistics.
+It should not contain whitespace or special characters.
+
+---
+
+## Runs section
+
+Each entry in the `runs` list specifies a completed run to include in the benchmark.
+
+Required fields:
+
+- `run_identifier` β†’ A human-readable identifier used in tables and plots.
+- `results_dir` β†’ Path to the directory containing `pheval_gene_results`, `pheval_variant_results`, and/or `pheval_disease_results`.
+- `phenopacket_dir` β†’ Path to the phenopacket directory used during runner execution.
+- `gene_analysis`, `variant_analysis`, `disease_analysis` β†’ Boolean flags indicating which analyses to include.
+
+Optional fields:
+
+- `threshold` β†’ Score threshold for result inclusion.
+- `score_order` β†’ Ranking order (`ascending` or `descending`).
+
+---
+
+## Plot customisation
+
+The `plot_customisation` section allows optional control over plot appearance.
+
+Available options:
+
+- `plot_type` β†’ One of `bar_cumulative`, `bar_non_cumulative`, or `bar_stacked`.
+- `rank_plot_title` β†’ Custom title for ranking summary plots.
+- `roc_curve_title` β†’ Custom title for ROC plots.
+- `precision_recall_title` β†’ Custom title for precision–recall plots.
+
+If left unspecified, default titles and plot types are used.
+
+---
+
+## Executing the benchmark
+
+Once the configuration file is prepared, benchmarking can be executed with:
+
+```bash
+pheval-utils benchmark --run-yaml benchmarking_config.yaml
+```
+
+!!! note "Command note"
+    As of `pheval` version **0.5.0**, this command is `benchmark`.
+    In earlier versions, the equivalent command was `generate-benchmark-stats`.
+    See the [v0.5.1 release notes](https://github.com/monarch-initiative/pheval/releases/tag/0.5.1) for more details.
+
+---
+
+## Outputs and interpretation
+
+Benchmarking produces:
+
+- A DuckDB database containing computed statistics and comparisons between runs
+- Rank-based and binary classification plots
+
+These outputs can be used to compare tools, configurations, and experimental conditions in a reproducible manner.
diff --git a/docs/benchmarking/index.md b/docs/benchmarking/index.md
new file mode 100644
index 000000000..1c6e5681f
--- /dev/null
+++ b/docs/benchmarking/index.md
@@ -0,0 +1,55 @@
+# Benchmarking and analysis
+
+This section describes how PhEval is used to **benchmark and compare phenotype-driven prioritisation methods** once tool execution has completed.
+
+Benchmarking in PhEval is designed to support **controlled, reproducible evaluation** across:
+
+- Tools and tool versions
+- Cohorts and simulation strategies
+- Ontology and knowledge-base updates
+
+This section focuses on *analysis*, not execution. Tools are executed via runners provided by plugins.
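As an illustration of what the rank-based side of this analysis computes (a simplified sketch, not PhEval's actual implementation), top-k hit rates and mean reciprocal rank can be derived from the rank each run assigns to the known entity in each case:

```python
def top_k_rate(ranks: list[int], k: int) -> float:
    """Fraction of cases where the known entity is ranked within the top k."""
    return sum(1 for r in ranks if 0 < r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: list[int]) -> float:
    """Mean of 1/rank; a rank of 0 denotes 'not found' and contributes nothing."""
    return sum(1 / r for r in ranks if r > 0) / len(ranks)


# Hypothetical ranks of the causative gene across four cases for one run.
ranks = [1, 3, 0, 2]  # 0 = causative gene not returned at all
print(top_k_rate(ranks, k=1))   # 0.25
print(top_k_rate(ranks, k=3))   # 0.75
print(mean_reciprocal_rank(ranks))
```

Comparing such summaries across runs is what the benchmarking utilities automate, alongside binary classification metrics and plots.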
+
+---
+
+## What benchmarking means in PhEval
+
+In PhEval, benchmarking refers to the process of:
+
+- Consuming **PhEval standardised results** produced by runners
+- Computing rank-based and binary classification metrics
+- Comparing performance across multiple runs
+- Generating plots and summary statistics for interpretation
+
+Benchmarking operates over one or more completed runs and assumes that tool execution has already taken place.
+
+---
+
+## High-level benchmarking workflow
+
+A typical benchmarking workflow consists of:
+
+1. **Execute one or more runners**
+   Runners produce PhEval-standardised outputs for gene, variant, and/or disease prioritisation.
+
+2. **Configure benchmarking parameters**
+   A YAML configuration file specifies which runs to include and how benchmarking should be performed.
+
+3. **Run benchmarking and analysis**
+   PhEval utilities compute metrics, comparisons, and plots across the specified runs.
+
+Each of these steps is described in more detail in the following pages.
+
+---
+
+## What benchmarking produces
+
+Benchmarking generates:
+
+- Ranking-based statistics
+- Binary classification statistics
+- Comparative summaries between runs
+- Plots for visual comparison
+- A single DuckDB database containing all computed metrics and comparisons
+
+These outputs support both exploratory analysis and formal evaluation.
diff --git a/docs/contact.md b/docs/contact.md
deleted file mode 100644
index c507185ba..000000000
--- a/docs/contact.md
+++ /dev/null
@@ -1,9 +0,0 @@
-# Contact
-
-The preferred way to contact the PhEval team is through the [issue tracker](https://github.com/monarch-initiative/pheval/issues) (for problems with PhEval) or the [GitHub discussions](https://github.com/monarch-initiative/pheval/discussions) (for general questions).
- -You can find any of the members of the PhEval core team on GitHub: - -https://github.com/orgs/monarch-initiative/teams/pheval-team - -Their GitHub profiles usually also provide email addresses. diff --git a/docs/contributing.md b/docs/contributing.md deleted file mode 100644 index cedafa4ca..000000000 --- a/docs/contributing.md +++ /dev/null @@ -1,32 +0,0 @@ -# Contributions - -:+1: First of all: Thank you for taking the time to contribute! - -The following is a set of guidelines for contributing to the PhEval framework. -These guidelines are not strict rules. Use your best judgment, and feel free to propose -changes to this document in a pull request. - -## Table Of Contents - -- [Contributions](#contributions) - - [Table Of Contents](#table-of-contents) - - [Code of Conduct](#code-of-conduct) - - [Guidelines for Contributions and Requests](#guidelines-for-contributions-and-requests) - - [Reporting problems with the data model](#reporting-problems-with-the-data-model) - - - -## Code of Conduct - -The monarch-technical-documentation team strives to create a welcoming environment for editors, users and other contributors. -Please carefully read our [Code of Conduct](CODE_OF_CONDUCT.md). - - - -## Guidelines for Contributions and Requests - - - -### Reporting problems with the data model - -Please use our [Issue Tracker](https://github.com/monarch-initiative/pheval/issues/) for reporting problems with the ontology. diff --git a/docs/developing_a_pheval_plugin.md b/docs/developing_a_pheval_plugin.md deleted file mode 100644 index 9ce0f783e..000000000 --- a/docs/developing_a_pheval_plugin.md +++ /dev/null @@ -1,383 +0,0 @@ -# Developing a PhEval Plugin - -## Description - -Plugin development allows PhEval to be extensible, as we have designed it. -The plugin goal is to be flexible through custom runner implementations. This plugin development enhances the PhEval functionality. You can build one quickly using this step-by-step process. 
- -==_All custom Runners implementations must implement all_ **PhevalRunner** _methods_== - -::: src.pheval.runners.runner.PhEvalRunner - handler: python - options: - members: - - PhEvalRunner - show_root_heading: false - show_source: true ---- - -## Step-by-Step Plugin Development Process - -The plugin structure is derived from a [cookiecutter](https://cookiecutter.readthedocs.io/en/stable/) template, [cookiecutter](https://github.com/monarch-initiative/pheval-runner-template), and it uses [MkDocs](https://mkdocstrings.github.io), [tox](https://tox.wiki/en/latest/) and [uv](https://docs.astral.sh/uv/) as core dependencies. -This allows PhEval extensibility to be standardised in terms of documentation and dependency management. - -### 1. Cookiecutter scaffold - -First, install the cruft package. Cruft enables keeping projects up-to-date with future updates made to this original template. - -Install the latest release of cruft from pip - -```bash -pip install cruft -``` - -> **_NOTE:_** You may encounter an error with the naming of the project layout if using an older release of cruft. To avoid this, make sure you have installed the latest release version. - -Next, create a project using the cookiecutter template. - -``` -cruft create https://github.com/monarch-initiative/pheval-runner-template -``` - -### 2. Further setup - -#### Install uv if you haven't already. - -``` -pip install uv -``` - -#### Install dependencies - -``` -uv sync - -source .venv/bin/activate -``` -> **Note:** -> -> The PhEval runner template uses `uv` by default, but this is **not required**. -> Any Python packaging or dependency manager (e.g. Poetry) may be used. -> This only affects how the plugin is installed β€” PhEval only requires a valid `pheval.plugins` entry point. - -#### Run tox to see if the setup works - -``` -uv run tox -``` - -### 3. Implement PhEval Custom Runner - -In the project structure generated by Cookiecutter, you'll find `runner.py` located in the `src` directory. 
This is where you'll define the methods required to develop the plugin. Specifically, you'll implement the prepare, run, and post-process methods, which are essential for executing the pheval run command. -```python -"""Runner.""" - -from dataclasses import dataclass -from pathlib import Path - -from pheval.runners.runner import PhEvalRunner - - -@dataclass -class CustomRunner(PhEvalRunner): - """Runner class implementation.""" - - input_dir: Path - testdata_dir: Path - tmp_dir: Path - output_dir: Path - config_file: Path - version: str - - def prepare(self): - """Prepare.""" - print("preparing") - - def run(self): - """Run.""" - print("running") - - def post_process(self): - """Post Process.""" - print("post processing") - -``` - - -The Cookiecutter will automatically populate the plugins section in the `pyproject.toml` file. If you decide to modify the path of `runner.py` or rename its class, be sure to update the corresponding entries in this section accordingly: - -```toml - -[project.entry-points."pheval.plugins"] -customrunner = "pheval_plugin_example.runner:CustomRunner" -``` - -> Please Note that the path here and naming of the class is case-sensitive. - - -### 4. Implementing PhEval helper methods - -Streamlining the creation of your custom PhEval runner can be facilitated by leveraging PhEval's versatile helper methods, where applicable. - -Within PhEval, numerous public methods have been designed to assist in your runner methods. The utilisation of these helper methods is optional, yet they are crafted to enhance the overall implementation process. - -#### Utility methods - -The `PhenopacketUtil` class is designed to aid in the collection of specific data from a Phenopacket. 
- -::: src.pheval.utils.phenopacket_utils.PhenopacketUtil - handler: python - options: - members: - - PhenopacketUtil - show_root_heading: false - show_source: true ---- - -`PhenopacketUtil` proves particularly beneficial in scenarios where the tool for which you're crafting a runner implementation does not directly accept Phenopackets as inputs. Instead, it might require elementsβ€”such as HPO IDsβ€” via the command-line interface (CLI). In this context, leveraging PhenopacketUtil within the runner's preparation phase enables the extraction of observed phenotypic features from the Phenopacket input, facilitating seamless processing. - -An example of how this could be implemented is outlined here: - -```python -from pheval.utils.phenopacket_utils import phenopacket_reader -from pheval.utils.phenopacket_utils import PhenopacketUtil - -phenopacket = phenopacket_reader("/path/to/phenopacket.json") -phenopacket_util = PhenopacketUtil(phenopacket) -# To return a list of all observed phenotypes for a phenopacket -observed_phenotypes = phenopacket_util.observed_phenotypic_features() -# To extract just the HPO ID as a list -observed_phenotypes_hpo_ids = [ - observed_phenotype.type.id for observed_phenotype in observed_phenotypes -] -``` -#### Additional tool-specific configurations - -For the `pheval run` command to execute successfully, a `config.yaml` should be found within the input directory supplied on the CLI. - -```yaml -tool: -tool_version: -variant_analysis: -gene_analysis: -disease_analysis: -tool_specific_configuration_options: -``` - -The `tool_specific_configuration_options` is an optional field that can be populated with any variables specific to your runner implementation that is required for the running of your tool. - -All other fields are required to be filled in. The `variant_analysis`, `gene_analysis`, and `disease_analysis` are set as booleans (`false` or `true` and are for specifying what type of analysis/prioritisation the tool outputs. 
- -To populate the `tool_specific_configurations_options` with customised data, we suggest using the `pydantic` package as it can easily parse the data from the yaml structure. - -e.g., - -_Define a `BaseModel` class with the fields that will populate the `tool_specific_configuration_options`_ - -```python -from pydantic import BaseModel, Field - -class CustomisedConfigurations(BaseModel): - """ - Class for defining the customised configurations in tool_specific_configurations field, - within the input_dir config.yaml - Args: - environment (str): Environment to run - """ - environment: str = Field(...) -``` - -_Within your runner parse the field into an object._ - -```python -from dataclasses import dataclass -from pheval.runners.runner import PhEvalRunner -from pathlib import Path - -@dataclass -class CustomPhevalRunner(PhEvalRunner): - """CustomPhevalRunner Class.""" - - input_dir: Path - testdata_dir: Path - tmp_dir: Path - output_dir: Path - config_file: Path - version: str - - def prepare(self): - """prepare method.""" - print("preparing") - config = CustomisedConfigurations.parse_obj( - self.input_dir_config.tool_specific_configuration_options - ) - environment = config.environment - - def run(self): - """run method.""" - print("running with custom pheval runner") - - def post_process(self): - """post_process method.""" - print("post processing") - - -``` - -#### Post-processing methods - -PhEval currently supports the benchmarking of gene, variant, and disease prioritisation results. - -To benchmark these result types, PhEval parquet result files need to be generated. - -PhEval can deal with the ranking and generation of these files to the correct location. However, the runner implementation must handle the extraction of essential data from the tool-specific raw results. This involves transforming them into a polars dataframe with the required columns for the benchmark type. 
- -The columns representing essential information extracted from tool-specific output for gene, variant, and disease prioritisation are defined as follows: - -::: src.pheval.post_processing.validate_result_format.ResultSchema - handler: python - options: - members: - - GENE_RESULT_SCHEMA - - VARIANT_RESULT_SCHEMA - - DISEASE_RESULT_SCHEMA - show_root_heading: false - show_source: true ---- - -The `grouping_id` column is _**optional**_ and is designed to handle cases where entities should be jointly ranked without being penalised. -For example, in the ranking of compound heterozygous variant which occurs when two or more variants, inherited together, -contribute to a phenotype. For this purpose, variants that are part of the same compound heterozygous group -(e.g., within the same gene) should be assigned the same `grouping_id`. -This ensures they are ranked as a single entity, preserving their combined significance. -Variants that are not part of any compound heterozygous group should each have a unique `grouping_id`. -This approach prevents any unintended overlap in ranking and ensures that each group or individual variant is accurately represented. -The use of the `grouping_id` would also be suitable for the ranking and prioritisation of polygenic diseases. - -Depending on whether you need to generate gene, variant, and or disease results depends on the final method called to generate the results from the polars dataframe. The methods are outlined below: - -> ⚠️ **Breaking Change (v0.5.0):** -> The helper method `generate_pheval_result` has been **replaced with three separate methods** for each result type: -> - `generate_gene_result` -> - `generate_variant_result` -> - `generate_disease_result` -> Update your runner implementation to call the appropriate method based on the type of result your tool produces. 
- - -::: src.pheval.post_processing.post_processing.generate_gene_result - handler: python - options: - show_root_heading: false - show_source: true ---- - -::: src.pheval.post_processing.post_processing.generate_variant_result - handler: python - options: - show_root_heading: false - show_source: true ---- - - -::: src.pheval.post_processing.post_processing.generate_disease_result - handler: python - options: - show_root_heading: false - show_source: true ---- - -An example of how the method can be called is outlined here: - -```python -from pheval.post_processing.post_processing import generate_gene_result, SortOrder - -generate_gene_result( - results=pheval_gene_result, # this is the polars dataframe containing extracted PhEval result requirements - sort_order=SortOrder.DESCENDING, # or can be ASCENDING - this determines in which order the scores will be ranked - output_dir=output_directory, # this can be accessed from the runner instance e.g., self.output_dir - result_path=result_path # this is the path to the tool-specific raw results file - phenopacket_dir=phenopacket_dir # this is the path to the directory containing the phenopackets -) -``` - -#### Adding metadata to the results.yml - -By default, PhEval will write a `results.yml` to the output directory supplied on the CLI. - -The `results.yml` contains basic metadata regarding the run configuration, however, there is also the option to add customised run metadata to the `results.yml` in the `tool_specific_configuration_options` field. - -To achieve this, you'll need to create a `construct_meta_data()` method within your runner implementation. This method is responsible for appending customised metadata to the metadata object in the form of a defined dataclass. It should return the entire metadata object once the addition is completed. 
- -e.g., - -_Defined customised metadata dataclass:_ - -```python -from dataclasses import dataclass - -@dataclass -class CustomisedMetaData: - customised_field: str -``` - -_Example of implementation in the runner._ - -```python -from dataclasses import dataclass -from pheval.runners.runner import PhEvalRunner -from pathlib import Path - -@dataclass -class CustomPhevalRunner(PhEvalRunner): - """CustomPhevalRunner Class.""" - - input_dir: Path - testdata_dir: Path - tmp_dir: Path - output_dir: Path - config_file: Path - version: str - - def prepare(self): - """prepare method.""" - print("preparing") - - def run(self): - """run method.""" - print("running with custom pheval runner") - - def post_process(self): - """post_process method.""" - print("post processing") - - def construct_meta_data(self): - """Add metadata.""" - self.meta_data.tool_specific_configuration_options = CustomisedMetaData(customised_field="customised_value") - return self.meta_data - -``` - -### 6. Test it. - -To update your custom pheval runner implementation, you must first install the package - -``` -uv sync -``` - -Now you have to be able to run PhEval passing your custom runner as parameter. e.g., - -``` -pheval run -i ./input_dir -t ./test_data_dir -r 'customphevalrunner' -o output_dir -``` - -The `-r` parameter stands for your plugin runner class name, and it must be entirely lowercase. - -Output: - -``` -preparing -running -post processing -``` - diff --git a/docs/executing_a_benchmark.md b/docs/executing_a_benchmark.md deleted file mode 100644 index c185a5b61..000000000 --- a/docs/executing_a_benchmark.md +++ /dev/null @@ -1,110 +0,0 @@ -# Executing a Benchmark - -PhEval is designed for benchmarking algorithms across various datasets. To execute a benchmark using PhEval, you need to: - -1. Execute your runner; generating the PhEval standardised parquet outputs for gene/variant/disease prioritisation. -2. Configure the benchmarking parameters. -3. Run the benchmark. 
- -PhEval will generate various performance reports, allowing you to easily compare the effectiveness of different algorithms. - -## After the Runner Execution - -After executing a run, you may be left with an output directory structure like so: - -```tree -. -β”œβ”€β”€ pheval_disease_results -β”‚Β Β  β”œβ”€β”€ patient_1-disease_result.parquet -β”œβ”€β”€ pheval_gene_results -β”‚Β Β  β”œβ”€β”€ patient_1-gene_result.parquet -β”œβ”€β”€ pheval_variant_results -β”‚Β Β  β”œβ”€β”€ patient_1-variant_result.parquet -β”œβ”€β”€ raw_results -β”‚Β Β  β”œβ”€β”€ patient_1.json -β”œβ”€β”€ results.yml -└── tool_input_commands - └── tool_input_commands.txt -``` -Whether you have populated `pheval_disease_results`, `pheval_gene_results`, and `pheval_variant_results` directories will depend on what is specified in the `config.yaml` for the runner execution. It is the results in these directories that are consumed in the benchmarking to produce the statistical comparison reports. - -## Benchmarking Configuration File - -To configure the benchmarking parameters, a YAML configuration file should be created and supplied to the CLI command. 
- -An outline of the configuration file structure follows below: - -```yaml -benchmark_name: exomiser_14_benchmark -runs: - - run_identifier: run_identifier_1 - results_dir: /path/to/results_dir_1 - phenopacket_dir: /path/to/phenopacket_dir - gene_analysis: true - variant_analysis: false - disease_analysis: true - threshold: - score_order: descending - - run_identifier: run_identifier_2 - results_dir: /path/to/results_dir_2 - phenopacket_dir: /path/to/phenopacket_dir - gene_analysis: true - variant_analysis: true - disease_analysis: true - threshold: - score_order: descending -plot_customisation: - gene_plots: - plot_type: bar_cumulative - rank_plot_title: - roc_curve_title: - precision_recall_title: - disease_plots: - plot_type: bar_cumulative - rank_plot_title: - roc_curve_title: - precision_recall_title: - variant_plots: - plot_type: bar_cumulative - rank_plot_title: - roc_curve_title: - precision_recall_title: - -``` - -The `benchmark_name` is what will be used to name the duckdb database that will contain all the ranking and binary statistics as well as comparisons between runs. The name provided should not have any whitespace or special characters. - -### Runs section - -The `runs` section specifies which run configurations should be included in the benchmarking. For each run configuration you will need to populate the following parameters: - -- `run_identifier`: The identifier associated with the run - this should be meaningful as it will be used in the naming in tables and plots. -- `results_dir`: The full path to the root directory where the directories `pheval_gene_results`/`pheval_variant_results`/`pheval_disease_results` can be found. -- `phenopacket_dir`: The full path to the phenopacket directory used during the runner execution. -- `gene_analysis`: Boolean specifying whether to perform benchmarking for gene prioritisation analysis. 
-- `variant_analysis`: Boolean specifying whether to perform benchmarking for variant prioritisation analysis -- `disease_analysis`: Boolean specifying whether to perform benchmarking for disease prioritisation analysis -- `threshold`: OPTIONAL score threshold to consider for inclusion of results. -- `score_order`: Ordering of results for ranking. Either ascending or descending. - -### Plot customisation section - -The `plot_customisation` section specifies any additional customisation to the plots output from the benchmarking. Here you can specify title names for all the plots output, as well as the plot type for displaying the summary ranking stats. This section is split by the plots output from the gene, variant and disease prioritisation benchmarking. The parameters in this section do not need to be populated - however, if left blank it will default to generic titles. The parameters as follows are: - -- `plot_type`: The plot type output for the summary rank stats plot. This can be either, bar_cumulative, bar_non_cumulative or bar_stacked. -- `rank_plot_title`: The customised title for the summary rank stats plot. -- `roc_curve_title`: The customised title for the ROC curve plot. -- `precision_recall_title` The customised title for the precision-recall curve plot. - -## Executing the benchmark - -After configuring the benchmarking YAML, executing the benchmark is relatively simple. - -```bash -pheval-utils benchmark --run-yaml benchmarking_config.yaml -``` - -> **Note:** As of `pheval-utils` version **0.5.0** onwards, the command is `benchmark`. -> In earlier versions, the equivalent command was `generate-benchmark-stats`. -> See the [v0.5.1 release notes](https://github.com/monarch-initiative/pheval/releases/tag/0.5.1) for more details. 
- 
diff --git a/docs/getting_started/getting_started.md b/docs/getting_started/getting_started.md
new file mode 100644
index 000000000..815abcc0d
--- /dev/null
+++ b/docs/getting_started/getting_started.md
@@ -0,0 +1,28 @@
+# Getting started with PhEval
+
+This section helps new users orient themselves: what PhEval is, what problems it is designed to solve, and where to go next.
+
+## Video walkthrough
+
+If you prefer a guided walkthrough, start here:
+
+- [**Introduction to PhEval and running a simple benchmark**](https://www.youtube.com/watch?v=nIPzVN99UWc)
+
+## High-level workflow
+
+At a high level, PhEval workflows involve:
+
+1. Installing PhEval and any required plugins
+2. Producing PhEval-standardised results using tool-specific runners
+3. Benchmarking and analysing those results
+
+Each of these steps is documented in more detail elsewhere.
+
+---
+
+## Suggested reading order
+
+- New to PhEval: start with [Installation](installation.md), then read [Plugins and runners](../using_pheval/plugins_and_runners.md)
+- Benchmarking existing outputs: go to [Benchmarking](../benchmarking/index.md)
+- Running robustness or simulation experiments: go to [Utilities](../utilities/index.md)
+- Extending PhEval with a new tool: go to [Developing a PhEval plugin](../resources_for_contributors/developing_a_pheval_plugin.md)
\ No newline at end of file
diff --git a/docs/getting_started/installation.md b/docs/getting_started/installation.md
new file mode 100644
index 000000000..c6396284f
--- /dev/null
+++ b/docs/getting_started/installation.md
@@ -0,0 +1,114 @@
+# Installation
+
+This page explains how to install **PhEval** and verify that it is available on your system.
+Tool-specific execution details are handled by **plugins** and documented in their respective repositories.
+ +--- + +## Prerequisites + +Before installing PhEval, ensure you have: + +- **Python 3.10 or newer** +- A working Python environment (virtualenv, conda, uv, or similar) +- `pip` or a compatible Python package manager + +It is strongly recommended to install PhEval in an isolated environment. + +--- + +## Create a virtual environment (recommended) + +Using `venv`: + +```bash +python -m venv pheval-env +source pheval-env/bin/activate +``` + +On macOS/Linux, you should now see your environment name in the shell prompt. + +## Install PhEval + +Install the latest released version from PyPI: + +```bash +pip install pheval +``` + +This installs: + +* The PhEval command-line interface +* The `pheval-utils` command-line utilities +* Shared utilities used by plugins and runners + +## Verify the installation + +Check that PhEval is installed correctly: + +```bash +pheval --help +``` + +You should see output similar to: + +```bash +Usage: pheval [OPTIONS] COMMAND [ARGS]... + +Options: + --help Show this message and exit. + +Commands: + run Execute a phenotype-driven tool via a runner + update Download or update required mapping resources +``` + +Verify that the utility commands are also available: + +```bash +pheval-utils --help +``` + +You should see a list of available utility commands, including data preparation and benchmarking. + +If either command is not found, ensure your virtual environment is activated. + +## Install a plugin + +PhEval does not execute tools directly. +Instead, plugins provide runners that implement tool-specific preparation, execution, and post-processing. + +After installing PhEval, install one or more plugins corresponding to the tools you want to evaluate. + +Each plugin: + +* Documents its own installation requirements +* Exposes one or more runners +* Explains how to invoke those runners using `pheval run` + +See the Plugins section for a list of available plugins and links to their documentation. 
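After installing PhEval and any plugins, you can confirm from Python which versions are present in the active environment. This is a generic standard-library check, not a PhEval API:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(distribution: str) -> str:
    """Return the installed version of a distribution, or a fallback message."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return "not installed"

# Reports the PhEval version if the install succeeded.
print(installed_version("pheval"))
```

If this prints `not installed`, the package was installed into a different environment than the one Python is running from.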
+ +## Update mapping resources (recommended) + +Some workflows require up-to-date ontology and identifier mappings. + +You can download or refresh these resources using: + +```bash +pheval update +``` + +This step is recommended before running benchmarks, particularly when working with gene or disease identifiers. + +## Next steps + +Once PhEval is installed: + +* Learn how plugins and runners fit into the execution model +β†’ [Plugins and runners](../using_pheval/plugins_and_runners.md)οΏΌ +* Prepare input data and corpora +β†’ [Utilities](../utilities/index.md) +* Benchmark and analyse results +β†’ [Benchmarking](../benchmarking/index.md) +* Write your own runner (for developers) +β†’ [Developing a PhEval plugin](../resources_for_contributors/developing_a_pheval_plugin.md) \ No newline at end of file diff --git a/docs/images/pheval-logo-2025-01-30_icon white.svg b/docs/images/pheval-logo-2025-01-30_icon white.svg new file mode 100644 index 000000000..e9139fe3d --- /dev/null +++ b/docs/images/pheval-logo-2025-01-30_icon white.svg @@ -0,0 +1,16 @@ + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/images/pheval_workflow.png b/docs/images/pheval_workflow.png new file mode 100644 index 000000000..4510842f6 Binary files /dev/null and b/docs/images/pheval_workflow.png differ diff --git a/docs/index.md b/docs/index.md index 52dc3682c..5518b1090 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,31 +1,99 @@ -# Home - -## Introduction - -PhEval - Phenotypic Inference Evaluation Framework - -### PhEval: Tool-specific processing (VP pipeline) - -```mermaid -flowchart LR - PC-->DP - PC[(Phenopackets Corpus)] - SSSOM[Semantic Similarity Profiles Mapping Commons]-->|OAK-SEMSIM|DP[Data Prepare] - KG[Source data KG - Monarch KG]-->|KGX-BIOLINK|DP[Data Prepare] - ONT[Ontologies - Phenio]-->|OAK-ONTO|DP[Data Prepare] - DP-->RP[Run Prepare] - RP-->PR[PhEval Runner] - PR-->DP2[Data Process] - ER[Exomiser Runner]-->PR - EDP[Exomiser Data Prepare]-->DP - 
 ERP[Exomiser Run Prepare]-->RP
- PPP[Disease-profile similarity prediction Post-process]-->DP2
- PV[Phenotype/Variant]-->DP2
- GVP[Gene VP Post-process]-->DP2
- EPP[Exomiser Post Process]-->GVP
- GVP-->VPR[VP Report]
-```
-
-**Quick links:**
-
-- [GitHub page](https://github.com/monarch-initiative/pheval/)
+# PhEval
+
+PhEval (Phenotypic Inference Evaluation Framework) is a modular benchmarking framework for evaluating **phenotype-driven prioritisation tools** (e.g. [Exomiser](https://github.com/exomiser/Exomiser)) and related methods that use ontologies, phenotype matching, and semantic similarity.
+
+Phenotype-based methods are widely used in rare disease diagnostics and research, but **robust evaluation is challenging**. Tool performance can change substantially depending on:
+
+- Ontology structure and versioning
+- Phenotype, gene, and disease mappings
+- Semantic similarity methods and scoring strategies
+- Underlying knowledge resources and input cohorts
+
+PhEval was designed to support **fair, reproducible, and controlled evaluation** of these methods by explicitly separating data preparation, tool execution, and analysis, and by standardising how results are represented and compared.
+
+---
+
+## What you can do with PhEval
+
+- Benchmark phenotype-based prioritisation tools on real or simulated cohorts
+- Compare results across **tools**, **tool versions**, **ontology versions**, and **knowledge updates**
+- Quantify the impact of methodological or knowledge changes on downstream performance
+- Produce standardised results suitable for consistent analysis and reporting
+- Extend the framework via **plugins** for new tools or workflows
+
+---
+
+## PhEval workflow
+
+PhEval is organised around three core phases: **Data Preparation**, **Run**, and **Analyse**.
+ +pheval_workflow + +**Data Preparation** + +- Phenopackets and/or VCF files +- Ontologies, mappings, and supporting databases + +This stage isolates cohort construction and knowledge resources from tool execution. + +**Run** + +- Tool execution via PhEval runners +- Support for multiple tools and versions +- Standardised execution and output handling + +**Analyse** + +- Post-processing into a PhEval standardised results format +- Rank-based and binary classification metrics +- Comparable plots and summaries across runs + +--- + +## Start here + +Choose the entry point that best matches what you want to do: + +- **Run a phenotype-driven tool under PhEval:** + Install a plugin, then execute the tool via its runner using the PhEval CLI + (e.g. `pheval run --runner `). + β†’ [Plugins](plugins/index.md) +- **Prepare and manipulate input data:** + Utilities for preparing phenopackets, creating spiked VCFs, scrambling phenotypes, and updating resources. + β†’ [Utilities and data preparation](utilities/) +- **Benchmark and analyse results:** + Compare PhEval-standardised results across tools, versions, and experimental conditions. + β†’ [Benchmarking](benchmarking/executing_a_benchmark.md) + +- **Extend PhEval:** + Implement new runners or customise workflows by writing plugins. + β†’ [Developing a PhEval Plugin](resources_for_contributors/developing_a_pheval_plugin.md) +- **Developer reference:** + Implementation details. + β†’ [API Documentation](resources_for_contributors/api_reference/index.md) + +--- + +## Who is PhEval for? 
+ + +- Researchers developing or evaluating phenotype-driven prioritisation methods +- Teams assessing the impact of tool, ontology, or knowledge updates +- Ontology and knowledge-graph developers studying downstream effects +- Anyone needing **transparent, repeatable benchmarking** over phenotyped cohorts + +--- + +## Project links + +- Source code and issues: + +--- + +## Contact and support + +For bugs, feature requests, or questions: + +- Open an issue on GitHub (preferred) +- Use the contact details listed in the repository + +PhEval is developed as part of the Monarch Initiative ecosystem. diff --git a/docs/mermaid.css b/docs/mermaid.css deleted file mode 100644 index 04492b8b2..000000000 --- a/docs/mermaid.css +++ /dev/null @@ -1,3 +0,0 @@ -div.mermaid { - text-align: center; -} \ No newline at end of file diff --git a/docs/plugins.md b/docs/plugins.md deleted file mode 100644 index b72fcc09e..000000000 --- a/docs/plugins.md +++ /dev/null @@ -1,16 +0,0 @@ -A full list of implemented PhEval runners are listed below along with links to the original tool: - -| Tool | PhEval plugin | Comment | -|-------------|----------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------| -| Exomiser | [Exomiser runner](https://github.com/monarch-initiative/pheval.exomiser) | The link to the original tool can be found [here](https://github.com/exomiser/Exomiser) | -| Phen2Gene | [Phen2Gene runner](https://github.com/monarch-initiative/pheval.phen2gene) | The link to the original tool can be found [here](https://github.com/WGLab/Phen2Gene) | -| LIRICAL | [LIRICAL runner](https://github.com/monarch-initiative/pheval.lirical) | The link to the original tool can be found [here](https://github.com/TheJacksonLaboratory/LIRICAL) | -| SvAnna | [SvAnna runner](https://github.com/monarch-initiative/pheval.svanna) | The link to the original tool can be 
found [here](https://github.com/TheJacksonLaboratory/SvAnna) | -| GADO | [GADO runner](https://github.com/monarch-initiative/pheval.gado) | The link to the original tool can be found [here](https://github.com/molgenis/systemsgenetics/wiki/GADO-Command-line) | -| Template | [Template runner](https://github.com/monarch-initiative/pheval.template) | | -| OntoGPT | [OntoGPT runner](https://github.com/monarch-initiative/pheval.ontogpt) | | -| ELDER | [ELDER runner](https://github.com/iQuxLE/ELDER) | | -| MALCO | [MALCO runner](https://github.com/monarch-initiative/pheval.llm.git) | | -| AI MARRVEL | [AI MARRVEL runner](https://github.com/monarch-initiative/pheval.ai_marrvel.git) | The link to the original tool can be found [here](https://github.com/LiuzLab/AI_MARRVEL.git) | -| OAK | [OAK runner](https://github.com/monarch-initiative/pheval.oak.git) | | -| PhenoGenius | [PhenoGenius runner](https://github.com/monarch-initiative/pheval.phenogenius) | The link to the original tool can be found [here](https://github.com/kyauy/PhenoGeniusCli) | diff --git a/docs/plugins/index.md b/docs/plugins/index.md new file mode 100644 index 000000000..b9f413df4 --- /dev/null +++ b/docs/plugins/index.md @@ -0,0 +1,43 @@ +# Plugins + +This page lists **available PhEval plugins** and the phenotype-driven tools they integrate. + +Plugins are separate Python packages that extend PhEval with tool-specific logic. Each plugin exposes one or more **runners** that can be invoked via the PhEval CLI. + +For details on how plugins and runners fit into the execution model, see: + +- [Using PhEval β†’ Plugins and runners](../using_pheval/plugins_and_runners.md) + +For tool-specific usage, configuration options, and examples, always refer to the **plugin README**. + +--- + +## Available plugins + +The table below lists currently implemented PhEval plugins, with links to both the PhEval runner and the original tool where applicable. 
+ +| Tool | PhEval plugin | Original tool | Notes | +|-------------|----------------------------------------------------------------------------------|---------------|----------------| +| Example | [pheval.example](https://github.com/monarch-initiative/pheval.example) | | Example Runner | +| Exomiser | [pheval.exomiser](https://github.com/monarch-initiative/pheval.exomiser) | [Exomiser](https://github.com/exomiser/Exomiser) | | +| Phen2Gene | [pheval.phen2gene](https://github.com/monarch-initiative/pheval.phen2gene) | [Phen2Gene](https://github.com/WGLab/Phen2Gene) | | +| LIRICAL | [pheval.lirical](https://github.com/monarch-initiative/pheval.lirical) | [LIRICAL](https://github.com/TheJacksonLaboratory/LIRICAL) | | +| SvAnna | [pheval.svanna](https://github.com/monarch-initiative/pheval.svanna) | [SvAnna](https://github.com/TheJacksonLaboratory/SvAnna) | | +| GADO | [pheval.gado](https://github.com/monarch-initiative/pheval.gado) | [GADO](https://github.com/molgenis/systemsgenetics/wiki/GADO-Command-line) | | +| PhenoGenius | [pheval.phenogenius](https://github.com/monarch-initiative/pheval.phenogenius) | [PhenoGenius](https://github.com/kyauy/PhenoGeniusCli) | | +| AI MARRVEL | [pheval.ai_marrvel](https://github.com/monarch-initiative/pheval.ai_marrvel.git) | [AI-MARRVEL](https://github.com/LiuzLab/AI_MARRVEL.git) | | +| OAK | [pheval.oak](https://github.com/monarch-initiative/pheval.oak.git) | | | +| OntoGPT | [pheval.ontogpt](https://github.com/monarch-initiative/pheval.ontogpt) | | | +| MALCO | [pheval.llm](https://github.com/monarch-initiative/pheval.llm.git) | | | +| ELDER | [ELDER](https://github.com/monarch-initiative/ELDER) | | | + +--- + +## Notes + +- Plugins may expose **multiple runners**, depending on supported modes or input types. +- Plugin availability does not imply identical functionality across tools. +- Experimental or research plugins may change more frequently than core plugins. 
+ +Users should always consult the plugin repository for the most up-to-date information. + diff --git a/docs/resources_for_contributors/api_reference/index.md b/docs/resources_for_contributors/api_reference/index.md new file mode 100644 index 000000000..1c6668fcd --- /dev/null +++ b/docs/resources_for_contributors/api_reference/index.md @@ -0,0 +1,23 @@ +# API reference + +This section contains **auto-generated API documentation** for PhEval internals. + +These pages are intended for: + +- Developers extending PhEval +- Plugin authors needing internal references +- Contributors working on core functionality + +If you are looking for **how to use PhEval**, start with: + +- Using PhEval +- Utilities +- Benchmarking + +--- + +## Notes + +- API documentation is generated automatically on each deployment +- Content reflects the current `main` branch +- Public interfaces are documented; internal details may change \ No newline at end of file diff --git a/docs/resources_for_contributors/contributions.md b/docs/resources_for_contributors/contributions.md new file mode 100644 index 000000000..17e65313f --- /dev/null +++ b/docs/resources_for_contributors/contributions.md @@ -0,0 +1,138 @@ +# Contributing to PhEval + +Thank you for your interest in contributing to **PhEval**. +Contributions are welcome across code, documentation, testing, and ecosystem plugins. + +This page outlines **how to contribute**, **what kinds of contributions are encouraged**, and **how to get started**. + +--- + +## Ways to contribute + +You can contribute to PhEval in several ways: + +- Reporting bugs or unexpected behaviour +- Suggesting enhancements or new features +- Improving documentation or examples +- Adding tests or improving coverage +- Developing new plugins or runners +- Maintaining or extending existing plugins + +Not all contributions need to involve code - documentation and feedback are equally valuable. 
+ +--- + +## Before you start + +Before opening an issue or pull request, please: + +1. Check existing issues to avoid duplication +2. Ensure you are using a supported version of PhEval +3. Read the relevant documentation section (especially for plugins and runners) + +If you are unsure whether something is a bug or a usage issue, opening a discussion or issue is encouraged. + +--- + +## Reporting issues + +Bugs, questions, and feature requests should be reported via GitHub issues. + +When reporting an issue, please include: + +- PhEval version +- Plugin name and version (if applicable) +- Command(s) executed +- Relevant configuration files (sanitised if needed) +- Error messages or stack traces +- Expected vs observed behaviour + +Clear, minimal examples help issues get resolved faster. + +--- + +## Contributing code + +### General guidelines + +- Keep changes focused and scoped +- Follow existing code style and patterns +- Add tests where appropriate +- Update documentation if behaviour changes + +For larger changes, consider opening an issue first to discuss design and scope. + +--- + +### Development setup + +PhEval uses modern Python tooling. A typical development setup involves: + +- Python 3.10+ +- `uv` for dependency management +- `coverage` for testing +- `ruff` for linting and formatting + +After cloning the repository: + +```bash +uv sync +uv run coverage run -p -m pytest --durations=20 tests +``` + +Ensure all tests pass before submitting a pull request. + +--- + +## Pull requests + +When submitting a pull request: + +- Clearly describe what the change does +- Reference any related issues +- Explain breaking changes explicitly +- Ensure CI passes + +Pull requests should be small enough to review easily whenever possible. 
+ +--- + +## Plugin contributions + +If you are contributing a new plugin: + +- Use the PhEval runner template +- Follow the standard runner interface +- Ensure standardised result schemas are respected +- Verify result file stem matching with phenopackets +- Document tool-specific configuration clearly in the plugin README + +See: + +- [`Developing a PhEval plugin`](developing_a_pheval_plugin.md) for detailed guidance + +--- + +## Documentation contributions + +Documentation is built using **MkDocs Material**. + +You can contribute by: + +- Fixing typos or clarifying wording +- Adding examples +- Improving structure or navigation + +Documentation-only pull requests are welcome. + +--- + +## Getting help + +If you are unsure where to start: + +- Open an issue describing what you would like to contribute +- Ask questions in an existing issue or discussion +- Start with documentation improvements to familiarise yourself with the project + +We appreciate all contributions, big or small, thank you for helping improve PhEval. diff --git a/docs/resources_for_contributors/developing_a_pheval_plugin.md b/docs/resources_for_contributors/developing_a_pheval_plugin.md new file mode 100644 index 000000000..79ae8300e --- /dev/null +++ b/docs/resources_for_contributors/developing_a_pheval_plugin.md @@ -0,0 +1,422 @@ +# Developing a PhEval plugin + +This guide explains how to develop a **PhEval plugin** that exposes a **runner** and produces **PhEval standardised results** that can be benchmarked consistently. + +## Video walkthrough + +If you prefer a guided walkthrough, start here: + +[Write your own PhEval runner](https://www.youtube.com/watch?v=GMYzQO4OcfU) + +> !!! abstract "**Key takeaways**" + 1. A runner must implement all `PhEvalRunner` methods (`prepare`, `run`, `post_process`). + 2. Your runner must write **standardised result files** with the required columns for the benchmark type. + 3. 
**Result filenames must match phenopacket filenames** (file stem matching) so PhEval can align outputs to cases. + +--- + +## Standardised result schemas (required) + +PhEval benchmarking operates on **standardised result files**. +Each result file must conform exactly to the required schema for the type of prioritisation being produced. + +Schemas are **validated** during post-processing. +Missing or incorrectly named columns will cause validation to fail. + +--- + +### Gene prioritisation results + +Each gene result must contain the following columns: + +| Column name | Type | Description | +|------------------|------------|------------------------------| +| `gene_symbol` | `pl.String` | Gene symbol | +| `gene_identifier` | `pl.String` | Gene identifier | +| `score` | `pl.Float64` | Tool-specific score | +| `grouping_id` | `pl.Utf8` | Optional grouping identifier | + +--- + +### Variant prioritisation results + +Each variant result must contain the following columns: + +| Column name | Type | Description | +|------------|------------|-------------| +| `chrom` | `pl.String` | Chromosome | +| `start` | `pl.Int64` | Start position | +| `end` | `pl.Int64` | End position | +| `ref` | `pl.String` | Reference allele | +| `alt` | `pl.String` | Alternate allele | +| `score` | `pl.Float64` | Tool-specific score | +| `grouping_id` | `pl.Utf8` | Optional grouping identifier | + +--- + +### Disease prioritisation results + +Each disease result must contain the following columns: + +| Column name | Type | Description | +|---------------------|------------|--------------------------------------| +| `disease_identifier` | `pl.String` | Disease identifier | +| `score` | `pl.Float64` | Tool-specific score | + +### The `grouping_id` column (optional but important) + +`grouping_id` is optional and enables **joint ranking** of entities that should be treated as a single unit without penalty. 
+ +Typical examples include: + +- Compound heterozygous variants (multiple variants contributing together) +- Grouped variant representations within the same gene +- Polygenic or grouped signals where multiple items should be evaluated jointly + +**How to use it** + +- Variants in the same group share the same `grouping_id` +- Variants not in any group should each have a unique `grouping_id` + +This preserves ranking semantics when benchmarking. + +--- + +## Result file naming (required) + +PhEval aligns result files to cases using **filename stem matching**. + +> !!! danger "**Rule:**" + The **result filename stem must exactly match the phenopacket filename stem**. + +Example: + +- Phenopacket: `patient_001.json` +- Result filename: `patient_001-exomiser.json` +- Processed result filename passed to PhEval: `patient_001.json` + +If the stems do not match, PhEval cannot reliably associate results with +phenopackets, and benchmarking may be incomplete or incorrect. + +> !!! tip "**Recommendation:**" + Always derive result filenames programmatically from the phenopacket stem. +--- + +## Step-by-step plugin development + +PhEval plugins are typically derived from the runner template and standardised tooling. +The recommended approach uses the PhEval runner template, MkDocs, tox, and uv. + +The template is available [here](https://github.com/monarch-initiative/pheval-runner-template) + +--- + +### 1. Scaffold a new plugin + +Install `cruft` (used to create projects from the template and keep them up to date): + +```bash +pip install cruft +``` + +Create a project using the template: + +```bash +cruft create https://github.com/monarch-initiative/pheval-runner-template +``` + +--- + +### 2. 
Environment and dependencies + +Install `uv` (if you do not already use it): + +```bash +pip install uv +``` + +Install dependencies and activate the environment: + +```bash +uv sync +source .venv/bin/activate +``` + +Run the test suite to confirm the setup: + +```bash +uv run tox +``` + +> !!! note + The template uses `uv` by default, but this is not required. + You may use any packaging/dependency manager. + PhEval only requires a valid `pheval.plugins` entry point. + +--- + +### 3. Implement your custom runner + +In the generated template, implement your runner in `runner.py` (under `src/`). + +At minimum, implement `prepare`, `run`, and `post_process`: + +```python +"""Runner.""" + +from dataclasses import dataclass +from pathlib import Path + +from pheval.runners.runner import PhEvalRunner + + +@dataclass +class CustomRunner(PhEvalRunner): + """Runner class implementation.""" + + input_dir: Path + testdata_dir: Path + tmp_dir: Path + output_dir: Path + config_file: Path + version: str + + def prepare(self): + """Prepare inputs.""" + print("preparing") + + def run(self): + """Execute the tool.""" + print("running") + + def post_process(self): + """Convert raw outputs to PhEval standardised results.""" + print("post processing") +``` + +--- + +### 4. Register the runner entry point + +The template populates your `pyproject.toml` entry points. +If you rename the runner class or move files, update this accordingly: + +```toml +[project.entry-points."pheval.plugins"] +customrunner = "pheval_plugin_example.runner:CustomRunner" +``` + +> !!! tip + The module path and class name are case-sensitive. 
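Since benchmarking aligns result files to phenopackets by filename stem, a runner's post-processing usually derives result names from the raw output names by stripping tool-specific suffixes. A minimal sketch, assuming a hypothetical `-exomiser` suffix:

```python
from pathlib import Path

def phenopacket_stem(raw_result: Path, tool_suffix: str) -> str:
    """Strip a tool-specific suffix so the stem matches the phenopacket stem."""
    return raw_result.stem.removesuffix(tool_suffix)

# Raw output 'patient_001-exomiser.json' aligns with phenopacket 'patient_001.json'.
print(phenopacket_stem(Path("patient_001-exomiser.json"), "-exomiser"))  # patient_001
```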
+ +--- + +## Tool-specific configuration (config.yaml) + +For `pheval run` to execute, the input directory must contain a `config.yaml`: + +```yaml +tool: +tool_version: +variant_analysis: +gene_analysis: +disease_analysis: +tool_specific_configuration_options: +``` + +- `variant_analysis`, `gene_analysis`, `disease_analysis` must be booleans (`true` / `false`) +- `tool_specific_configuration_options` is optional and may include plugin-specific configuration + +### Parsing tool-specific configuration (recommended) + +Using `pydantic` can simplify parsing: + +```python +from pydantic import BaseModel, Field + +class CustomisedConfigurations(BaseModel): + environment: str = Field(...) +``` + +Then parse in your runner: + +```python +config = CustomisedConfigurations.parse_obj( + self.input_dir_config.tool_specific_configuration_options +) +environment = config.environment +``` + +--- + +## Post-processing: generating standardised results + +PhEval can handle ranking and writing result files in the correct locations. +Your runner’s post-processing must: + +1. Read tool-specific raw outputs +2. Extract the required fields +3. Construct a Polars DataFrame with the required schema +4. Call the appropriate PhEval helper method to write standardised results + +### Result generation helpers + +!!! warning "Breaking change (v0.5.0)" + + `generate_pheval_result` was replaced with: + + - `generate_gene_result` + - `generate_variant_result` + - `generate_disease_result` + +#### Generating gene result files + +Use `generate_gene_result` to write PhEval-standardised gene results +from a Polars DataFrame. 
+
+```python
+from pheval.post_processing.post_processing import (
+    generate_gene_result,
+    SortOrder,
+)
+
+generate_gene_result(
+    results=pheval_gene_result,  # Polars DataFrame (gene schema)
+    sort_order=SortOrder.DESCENDING,  # or SortOrder.ASCENDING
+    output_dir=output_directory,  # typically self.output_dir
+    result_path=result_path,  # path to raw tool output, stem MUST match phenopacket stem exactly
+    phenopacket_dir=phenopacket_dir,  # directory containing phenopackets
+)
+```
+
+#### Generating variant result files
+
+Use `generate_variant_result` to write PhEval-standardised variant results.
+
+```python
+from pheval.post_processing.post_processing import (
+    generate_variant_result,
+    SortOrder,
+)
+
+generate_variant_result(
+    results=pheval_variant_result,  # Polars DataFrame (variant schema)
+    sort_order=SortOrder.DESCENDING,
+    output_dir=output_directory,
+    result_path=result_path,  # stem must match phenopacket stem
+    phenopacket_dir=phenopacket_dir,
+)
+```
+
+#### Generating disease result files
+
+Use `generate_disease_result` to write PhEval-standardised disease results.
+
+```python
+from pheval.post_processing.post_processing import (
+    generate_disease_result,
+    SortOrder,
+)
+
+generate_disease_result(
+    results=pheval_disease_result,  # Polars DataFrame (disease schema)
+    sort_order=SortOrder.DESCENDING,
+    output_dir=output_directory,
+    result_path=result_path,  # stem must match phenopacket stem
+    phenopacket_dir=phenopacket_dir,
+)
+```
+
+!!! important
+    The stem of `result_path` must exactly match the phenopacket stem.
+    This often requires stripping tool-specific suffixes from raw output filenames.
+
+---
+
+## Adding metadata to results.yml (optional)
+
+PhEval writes a `results.yml` file to the output directory by default.
+You can add customised metadata by overriding `construct_meta_data()`. 
+ +Example dataclass: + +```python +from dataclasses import dataclass + +@dataclass +class CustomisedMetaData: + customised_field: str +``` + +Runner implementation: + +```python +def construct_meta_data(self): + self.meta_data.tool_specific_configuration_options = CustomisedMetaData( + customised_field="customised_value" + ) + return self.meta_data +``` + +--- + +## Helper utilities (optional) + +PhEval provides helper methods that can simplify runner implementations. + +### PhenopacketUtil + +Useful for extracting observed phenotypes when tools do not accept phenopackets directly: + +::: src.pheval.utils.phenopacket_utils.PhenopacketUtil + handler: python + options: + members: + - PhenopacketUtil + show_root_heading: false + show_source: true + +Example usage: + +```python +from pheval.utils.phenopacket_utils import phenopacket_reader, PhenopacketUtil + +phenopacket = phenopacket_reader("/path/to/phenopacket.json") +phenopacket_util = PhenopacketUtil(phenopacket) + +observed_phenotypes = phenopacket_util.observed_phenotypic_features() +observed_phenotypes_hpo_ids = [p.type.id for p in observed_phenotypes] +``` + +--- + +## Testing your runner + +Install dependencies: + +```bash +uv sync +``` + +Run PhEval using your custom runner: + +```bash +pheval run -i ./input_dir -t ./test_data_dir -r customrunner -o output_dir +``` + +Notes: + +- the `-r/--runner` value must match the entry point name (lowercase) +- confirm that standardised result files are produced and validate correctly +- confirm that result file stems match the phenopacket file stems + +--- + +## Checklist before release + +- Runner implements `prepare`, `run`, `post_process` +- Entry point registered under `pheval.plugins` +- Standardised results conform to required schema(s) +- Result filenames use **phenopacket stem matching** +- Optional: `grouping_id` correctly set for grouped ranking scenarios +- Optional: `results.yml` metadata populated where useful diff --git a/docs/roadmap.md 
b/docs/roadmap.md deleted file mode 100644 index b766775b4..000000000 --- a/docs/roadmap.md +++ /dev/null @@ -1,27 +0,0 @@ -# Roadmap - -The Roadmap is a rough plan, changes are expected throughout the year. - -## 2023 - -### Q1 - -1. Finalising the PhEval architecture (draft is done) -1. End-to-end pipeline for testing PhEval with Exomiser and two versions of HPO -1. Submitting a poster to Biocuration which outlines the full vision - -### Q2 - -1. Focus on an analytic framework around PhEval, focusing on studying how changes to ontologies affect changes in variant prioritisation -1. Extend phenotype pipeline to enable base releases and alternative patterns - -### Q3 - -1. Improving the analytic framework of PhEval, especially phenotype analysis -1. All intermediate files of pipeline have a corresponding LinkML model -1. Focus on studying the effect of KG snippets (p2ds) on VP performance - -### Q4 - -1. Drafting a PhEval paper -1. Building standalone pipeline that reports changes in algorithm behaviours to ontology developers. diff --git a/docs/styleguide.md b/docs/styleguide.md deleted file mode 100644 index b7bb14d41..000000000 --- a/docs/styleguide.md +++ /dev/null @@ -1,6 +0,0 @@ -# Monarch Style Guide for PhEval - - - - -- No code in CLI methods \ No newline at end of file diff --git a/docs/using_pheval/plugins_and_runners.md b/docs/using_pheval/plugins_and_runners.md new file mode 100644 index 000000000..edf64e27c --- /dev/null +++ b/docs/using_pheval/plugins_and_runners.md @@ -0,0 +1,96 @@ +# Plugins and runners + +This page defines **how execution works in PhEval**. + +It is the single authoritative description of plugins, runners, and how phenotype-driven tools are run under the PhEval framework. + +## Execution model + +PhEval provides a general execution framework and command-line interface. +It does not embed logic for running individual tools. 
+
+Instead:
+
+- **Plugins** extend PhEval with tool-specific functionality
+- **Runners** (provided by plugins) implement complete execution workflows
+
+PhEval is responsible for orchestration.
+Runners are responsible for execution semantics.
+
+## Plugins
+
+A plugin is a Python package that integrates a specific tool with PhEval.
+
+A plugin typically:
+
+- Depends on PhEval
+- Wraps a specific phenotype-driven tool
+- Registers one or more runners
+- Documents tool-specific configuration and usage
+
+Plugins are installed separately from PhEval.
+Once installed, their runners are automatically discovered.
+
+## Runners
+
+A runner is the unit of execution within PhEval.
+
+Each runner is responsible for the full workflow for a given tool and configuration:
+
+1. Preparing inputs (phenopackets, variants, resources)
+2. Executing the tool
+3. Post-processing raw outputs into the PhEval standardised results format
+
+Runners encapsulate tool-specific assumptions while conforming to shared PhEval interfaces.
+
+## Running a runner
+
+All tool execution is performed via the PhEval CLI.
+
+The general pattern is:
+
+    pheval run --runner <runner-name> [options]
+
+Where:
+
+- `<runner-name>` identifies a runner exposed by an installed plugin
+
+PhEval manages discovery and orchestration.
+The runner controls execution logic and output generation.
+
+## Multiple runners
+
+A plugin may expose multiple runners.
+
+This allows support for:
+
+- Different tool modes
+- Different input types
+- Alternative workflows or configurations
+
+Each runner is invoked explicitly by name.
+
+## Outputs and standardisation
+
+All runners are expected to produce outputs in the **PhEval standardised results format**.
+
+This standardisation allows results to be:
+
+- Benchmarked using shared metrics
+- Compared across tools and versions
+- Analysed using common plotting and reporting utilities
+
+This is a core design principle of PhEval. 
+ +## Where execution details live + +Tool-specific instructions do not live in the main PhEval documentation. + +They are documented in: + +- The plugin README +- Tool-specific configuration guides +- Examples provided by plugin authors + +This avoids duplication and keeps framework documentation focused and stable. + diff --git a/docs/utilities/data_preparation.md b/docs/utilities/data_preparation.md new file mode 100644 index 000000000..cdbf64dd9 --- /dev/null +++ b/docs/utilities/data_preparation.md @@ -0,0 +1,126 @@ +# Data preparation utilities + +This page documents **data preparation utilities** provided by PhEval. +These commands are used to prepare, normalise, and organise input data *before* running phenotype-driven tools via plugins. + +This page only covers commands related to **data preparation**. +Variant spiking and other specialised workflows are documented elsewhere. + +--- + +## Purpose + +Data preparation utilities help to: + +- Construct phenopacket corpora for evaluation +- Normalise gene identifiers +- Ensure consistent input structure across cohorts +- Reduce technical variability unrelated to tool performance + +These steps are particularly important when benchmarking across tools, versions, or knowledge resources. + +--- + +## Preparing a phenopacket corpus + +The `prepare-corpus` command is used to prepare a directory of phenopackets for downstream analysis. 
+
+Typical use cases include:
+
+- Validating that phenopackets contain the required records
+- Preparing separate corpora for gene-, disease-, or variant-based analyses
+- Optionally generating associated VCFs for variant-based workflows
+
+### Basic example
+
+Prepare a corpus of phenopackets for gene-based analysis:
+
+```bash
+pheval-utils prepare-corpus \
+  --phenopacket-dir phenopackets/ \
+  --gene-analysis \
+  --output-dir prepared_corpus/
+```
+
+Prepare a corpus of phenopackets for gene-based analysis and update all gene identifiers to Ensembl IDs:
+
+```bash
+pheval-utils prepare-corpus \
+  --phenopacket-dir phenopackets/ \
+  --gene-analysis \
+  --gene-identifier ensembl_id \
+  --output-dir prepared_corpus/
+```
+
+### Variant-based analysis example
+
+Prepare a corpus for variant-based analysis:
+
+```bash
+pheval-utils prepare-corpus \
+  --phenopacket-dir phenopackets/ \
+  --variant-analysis \
+  --output-dir prepared_corpus/
+```
+
+Prepare a corpus for variant-based analysis and spike variants into an hg38 VCF template:
+
+```bash
+pheval-utils prepare-corpus \
+  --phenopacket-dir phenopackets/ \
+  --variant-analysis \
+  --hg38-template-vcf hg38_template.vcf \
+  --output-dir prepared_corpus/
+```
+
+!!! note "Notes"
+    - At least one of `--variant-analysis`, `--gene-analysis`, or `--disease-analysis` should be specified.
+    - For variant-based analysis, a VCF template or directory is required.
+    - The prepared output directory is used as input to runners provided by plugins.
+
+---
+
+## Updating phenopackets and identifiers
+
+The `update-phenopackets` command is used to update gene symbols and identifiers in existing phenopackets. 
+ +This is useful when: + +- Phenopackets contain outdated gene identifiers +- A consistent identifier scheme is required across a cohort +- Benchmarking is performed across different database or ontology versions + +### Example: update a directory of phenopackets + +Update phenopackets to include Ensembl gene identifiers: + +```bash +pheval-utils update-phenopackets \ + --phenopacket-dir phenopackets/ \ + --gene-identifier ensembl_id \ + --output-dir updated_phenopackets/ +``` + +### Example: update a single phenopacket + +```bash +pheval-utils update-phenopackets \ + --phenopacket-path case_001.json \ + --gene-identifier hgnc_id \ + --output-dir updated_case/ +``` + +--- + +## How data preparation fits into a workflow + +A typical workflow using data preparation utilities looks like: + +1. Collect or generate phenopackets +2. Prepare and normalise phenopackets using data preparation utilities +3. Run tools via plugin-provided runners using `pheval run` +4. Benchmark and analyse the resulting outputs + +Not all workflows require all preparation steps, but these utilities help ensure reproducibility and consistency. + diff --git a/docs/utilities/index.md b/docs/utilities/index.md new file mode 100644 index 000000000..9cfaaa42e --- /dev/null +++ b/docs/utilities/index.md @@ -0,0 +1,66 @@ +# Utilities + +This section documents the **utility commands** provided with PhEval. +These utilities support data preparation, manipulation, and experimental workflows that sit *around* tool execution and benchmarking. + +They are not required for every use case, but are commonly used when preparing cohorts, running robustness experiments, or standardising inputs. 
+ +## What the utilities are for + +PhEval utilities are designed to help with tasks such as: + +- Preparing phenopacket corpora for evaluation +- Updating identifiers and mappings in existing data +- Generating synthetic or perturbed inputs for robustness testing +- Supporting benchmarking and downstream analysis + +They are provided via the `pheval-utils` command-line interface, which is installed automatically when installing PhEval. + +## Scope and boundaries + +Utilities are intentionally separated from: + +- **tool execution**, which is handled by runners via plugins +- **benchmarking logic**, which is documented in the Benchmarking section + +This separation keeps workflows modular and reproducible. + +## Categories of utilities + +The utilities fall broadly into the following categories. + +### [Data preparation](data_preparation.md) + +Commands used to prepare or normalise input data before execution, including: + +- Preparing corpora of phenopackets +- Updating gene symbols and identifiers +- Ensuring consistent formats for downstream tools + + +### [Phenotype scrambling and noise experiments](phenotype_scrambling.md) + +Commands used to introduce noise or perturbations into phenotype data. +These are commonly used to assess robustness and sensitivity of phenotype-driven methods. + + +### [Variant-related utilities](variant_utilities.md) + +Commands that operate on variant-level data, such as creating spiked VCFs for controlled evaluation experiments. + + +### [Resource and mapping updates](resource_updates.md) + +Commands used to download or update shared resources, such as ontology mappings and identifier tables. + + +## How utilities fit into a workflow + +A typical workflow using utilities might look like: + +1. Prepare or update phenopacket data using utilities +2. Execute tools via runners using `pheval run` +3. Benchmark and analyse results + +Not all workflows require utilities; they are optional building blocks. 
+
diff --git a/docs/utilities/phenotype_scrambling.md b/docs/utilities/phenotype_scrambling.md
new file mode 100644
index 000000000..fd0f59aa5
--- /dev/null
+++ b/docs/utilities/phenotype_scrambling.md
@@ -0,0 +1,57 @@
+# Phenotype scrambling utilities
+
+This page documents utilities used to **introduce noise or perturbations into phenotype data**.
+These commands are typically used to assess the robustness and sensitivity of phenotype-driven prioritisation methods.
+
+They operate on existing phenotype data and do not execute tools or perform benchmarking directly.
+
+---
+
+## Purpose
+
+Phenotype scrambling utilities are used to:
+
+- Simulate noisy or incomplete phenotypic observations
+- Evaluate how sensitive methods are to phenotype quality
+- Test robustness under controlled perturbations
+
+These experiments are useful when comparing tools, parameterisations, or ontology versions.
+
+---
+
+## Scrambling phenopackets
+
+The `scramble-phenopackets` command generates perturbed versions of existing phenopackets.
+
+The scrambled phenopackets can then be used as inputs to runners for execution and benchmarking.
+
+### Example: scramble a phenopacket corpus
+
+Generate scrambled phenopackets from an existing corpus:
+
+```bash
+pheval-utils scramble-phenopackets \
+  --phenopacket-dir phenopackets/ \
+  --output-dir scrambled_phenopackets/ \
+  --scramble-factor 0.7 \
+  --local-ontology-cache ./hp.obo
+```
+
+!!! note "Notes"
+    - The original phenopackets are not modified.
+    - Scrambled outputs are written to a separate directory.
+    - The resulting phenopackets can be used directly with plugin-provided runners.
+
+---
+
+## How phenotype scrambling fits into a workflow
+
+A typical robustness experiment using phenotype scrambling might look like:
+
+1. Prepare a clean phenopacket corpus
+2. Generate scrambled phenopackets
+3. Run tools via plugin-provided runners using `pheval run`
+4. 
Benchmark and compare performance against the original results
+
+Scrambling utilities are optional and primarily used in experimental or methodological evaluations.
+
diff --git a/docs/utilities/resource_updates.md b/docs/utilities/resource_updates.md
new file mode 100644
index 000000000..c0c74dfe0
--- /dev/null
+++ b/docs/utilities/resource_updates.md
@@ -0,0 +1,95 @@
+# Resource updates
+
+This page documents utilities used to **download and update shared resources** required by PhEval workflows.
+
+---
+
+## Purpose
+
+Resource update utilities are used to:
+
+- Ensure identifier mappings are up to date
+- Maintain consistency across benchmarking runs
+- Reduce errors caused by outdated or missing reference data
+
+Keeping resources updated is recommended, particularly when running new experiments or comparing results across time.
+
+---
+
+## Updating shared resources
+
+The `update` command downloads and refreshes shared resources used by PhEval and its plugins.
+
+This includes:
+
+- The MONDO SSSOM mapping file from the Monarch Initiative
+- The HGNC complete gene set from the HGNC download site
+
+### Example: update all resources
+
+```bash
+pheval update
+```
+
+The command will download the latest versions of the supported resources and store them in PhEval’s configured data directory.
+
+---
+
+## When to run resource updates
+
+You should consider running `pheval update` when:
+
+- Installing PhEval for the first time
+- Starting a new benchmarking experiment
+- Updating PhEval or plugin versions
+
+Running updates explicitly helps ensure clarity about which resources are being used.
+
+---
+
+## Resource provenance and reproducibility
+
+When tools are executed via `pheval run`, information about shared resources is **automatically recorded** in the run metadata. 
+ +Each run produces a `results.yml` file that captures, among other details: + +- The tool and tool version +- The execution timestamp +- The corpus used +- The download dates of shared resources + +Example: + +```yaml +tool: +tool_version: +config: +run_timestamp: +corpus: +mondo_download_date: +hgnc_download_date: +tool_specific_configuration_options: null +``` + +By recording resource download timestamps alongside each run, PhEval enables: + +- Tracing which ontology and mapping versions were in use +- Comparison of results across runs performed at different times +- Transparent reporting and reproducibility + +As long as `results.yml` files are preserved alongside benchmarking outputs, manual tracking of resource versions is not required. + +Updating resources using `pheval update` should therefore be treated as an **explicit experimental choice** and interpreted in the context of the recorded run metadata. + +--- + +## How resource updates fit into a workflow + +A typical workflow involving resource updates might look like: + +1. Install PhEval +2. Update shared resources using `pheval update` +3. Prepare input data and corpora +4. Run tools via plugin-provided runners +5. Benchmark and analyse results + diff --git a/docs/utilities/variant_utilities.md b/docs/utilities/variant_utilities.md new file mode 100644 index 000000000..3e39953c7 --- /dev/null +++ b/docs/utilities/variant_utilities.md @@ -0,0 +1,85 @@ +# Variant utilities + +This page documents utilities used to work with **variant-level data** in PhEval workflows. +These commands are primarily used to construct or manipulate VCF inputs for **variant-based evaluation experiments**. + +They do not execute tools directly and do not perform benchmarking. 
+
+---
+
+## Purpose
+
+Variant utilities are used to:
+
+- Generate VCFs containing known ("spiked") variants
+- Support controlled variant-based evaluation experiments
+
+These utilities are typically used in conjunction with phenopacket data and plugin-provided runners.
+
+---
+
+## Creating spiked VCFs
+
+The `create-spiked-vcfs` command is used to generate VCF files containing known causal variants derived from phenopackets.
+
+This is particularly useful when:
+
+- Evaluating variant-based prioritisation methods
+- Simulating realistic diagnostic scenarios
+- Benchmarking tools that require both phenotypes and variants
+
+The command supports both single phenopackets and directories of phenopackets.
+
+---
+
+### Example: create spiked VCFs from a phenopacket directory (hg38)
+
+```bash
+pheval-utils create-spiked-vcfs \
+  --phenopacket-dir phenopackets/ \
+  --hg38-template-vcf hg38_template.vcf \
+  --output-dir spiked_vcfs/
+```
+
+---
+
+### Example: create a spiked VCF from a single phenopacket (hg19)
+
+```bash
+pheval-utils create-spiked-vcfs \
+  --phenopacket-path case_001.json \
+  --hg19-template-vcf hg19_template.vcf \
+  --output-dir spiked_vcf/
+```
+
+---
+
+### Example: use a directory of VCF templates
+
+Instead of a single template file, a directory of VCF templates can be provided:
+
+```bash
+pheval-utils create-spiked-vcfs \
+  --phenopacket-dir phenopackets/ \
+  --hg38-vcf-dir hg38_vcf_templates/ \
+  --output-dir spiked_vcfs/
+```
+
+---
+
+!!! note "Notes and constraints"
+    - Exactly one of `--phenopacket-path` or `--phenopacket-dir` must be provided.
+    - For each genome build, either a template VCF file or a directory of template VCFs must be supplied.
+    - The generated VCFs are written to the specified output directory.
+    - Spiked VCFs are typically consumed by runners that support variant-based analysis.
+
+---
+
+## How variant utilities fit into a workflow
+
+A typical variant-based evaluation workflow might look like:
+
+1. 
Prepare and normalise phenopackets +2. Generate spiked VCFs using variant utilities +3. Run tools via plugin-provided runners using `pheval run` +4. Benchmark and analyse variant-level results \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index ae707fec1..dcb0118ab 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,6 +1,11 @@ site_name: PhEval theme: + logo: images/pheval-logo-2025-01-30_icon white.svg name: material + features: + - navigation.tabs + - navigation.tabs.sticky + - navigation.top palette: - scheme: default primary: indigo @@ -20,15 +25,10 @@ markdown_extensions: - pymdownx.tilde - pymdownx.highlight - mkdocs-click - - pymdownx.superfences: - custom_fences: - - name: mermaid - class: mermaid - format: !!python/name:pymdownx.superfences.fence_div_format + - admonition + - pymdownx.details + - pymdownx.superfences -extra_css: - - https://unpkg.com/mermaid@8.5.1/dist/mermaid.css - - mermaid.css extra_javascript: - https://unpkg.com/mermaid@8.5.1/dist/mermaid.min.js @@ -43,18 +43,29 @@ watch: - src nav: - - "index.md" - - About: "about.md" - - Contact Us: "contact.md" - - API Documentation: api/pheval/ - - Resources for contributors: - - "developing_a_pheval_plugin.md" - - "contributing.md" - - "styleguide.md" - - "CODE_OF_CONDUCT.md" - - Plugins: "plugins.md" - - Executing a Benchmark: "executing_a_benchmark.md" - - "roadmap.md" + - Home: "index.md" + - Getting Started: + - Overview: "getting_started/getting_started.md" + - Installation: "getting_started/installation.md" + - Using PhEval: + - Plugins and runners: using_pheval/plugins_and_runners.md + - Utilities: + - Overview: utilities/index.md + - Data preparation: utilities/data_preparation.md + - Phenotype scrambling: utilities/phenotype_scrambling.md + - Variant utilities: utilities/variant_utilities.md + - Resource updates: utilities/resource_updates.md + - Benchmarking: + - Overview: benchmarking/index.md + - Executing a benchmark: benchmarking/executing_a_benchmark.md + - Plugins: + 
- Overview: plugins/index.md + - Developer docs: + - Developing a PhEval plugin: resources_for_contributors/developing_a_pheval_plugin.md + - Contributing: resources_for_contributors/contributions.md + - API reference: + - Overview: resources_for_contributors/api_reference/index.md + - API documentation: resources_for_contributors/api_reference/api/pheval site_url: https://monarch-initiative.github.io/pheval/ diff --git a/src/pheval/utils/docs_gen.py b/src/pheval/utils/docs_gen.py index 8efe63ba2..e47f0478a 100644 --- a/src/pheval/utils/docs_gen.py +++ b/src/pheval/utils/docs_gen.py @@ -37,7 +37,7 @@ def list_valid_files(): folder = "/".join(folder_parts[:-1]) basename = os.path.basename(file).split(".")[0] - docs_path = f"./docs/api/{folder.replace('src/', '')}/{basename}.md" + docs_path = f"./docs/resources_for_contributors/api_reference/api/{folder.replace('src/', '')}/{basename}.md" if basename in ignored_files: continue @@ -85,7 +85,7 @@ def print_cli_doc(file_item): def gen_docs(): """The main method for generating documentation""" - api_folder = f"{os.path.abspath(os.curdir)}/docs/api" + api_folder = f"{os.path.abspath(os.curdir)}/docs/resources_for_contributors/api_reference/api/pheval" print(api_folder) shutil.rmtree(api_folder, ignore_errors=True) valid_files = list_valid_files() diff --git a/src/pheval/utils/docs_gen.sh b/src/pheval/utils/docs_gen.sh deleted file mode 100755 index 37f61c950..000000000 --- a/src/pheval/utils/docs_gen.sh +++ /dev/null @@ -1,18 +0,0 @@ -#!/bin/bash - -# set -e -cd ../../../ -SOURCE_FOLDER='./src' -FILES=$(find $SOURCE_FOLDER -type f -iname '*.py' -not -iname '__init__.py' -not -empty) -rm -rf ./docs/api - -for f in $FILES -do - clean_dir=${f#./src/} - last_folder=`dirname $clean_dir` - full_fname="${f##*/}" - fname="${full_fname%%.*}" - mkdir -p ./docs/api/$last_folder - ref=$(echo $f | sed 's#/#.#g' | sed 's/..src/src/g' | sed 's/\.[^.]*$//') - echo ::: $ref >> ./docs/api/$last_folder/$fname.md -done \ No newline at end of 
file diff --git a/uv.lock b/uv.lock index 03dea220a..6996dcd0a 100644 --- a/uv.lock +++ b/uv.lock @@ -1914,7 +1914,7 @@ wheels = [ [[package]] name = "pheval" -version = "0.6.6" +version = "0.7.8" source = { editable = "." } dependencies = [ { name = "class-resolver" },