Commits
56 commits
f5ebde1
Explicitly add bfcl init files as they are static
Kipok Feb 11, 2026
c28972e
Refactor ruler into prepare_data + prepare_init
Kipok Feb 11, 2026
f49dc2a
Basic split into init and data prepare in pipeline
Kipok Feb 11, 2026
e3688de
Partial refactoring - remove module download logic
Kipok Feb 11, 2026
edab4db
Always specify metrics type in eval
Kipok Feb 11, 2026
27f24fb
Basic support for specifying external path
Kipok Feb 11, 2026
deb17ad
Add basic support for extra dataset map
Kipok Feb 11, 2026
b22930b
Add dynamic data prep
Kipok Feb 11, 2026
dfb40d2
Remove data group from inits
Kipok Feb 11, 2026
ec5353f
Clean up data groups and properly track external datasets
Kipok Feb 11, 2026
2b853ee
Remove logic to add lean headers
Kipok Feb 11, 2026
506d909
Add packaging for jsonl files in external repos
Kipok Feb 12, 2026
52cedd9
Add explicit jsonl files for packaging
Kipok Feb 12, 2026
98a7623
Fix packaging for external repos
Kipok Feb 12, 2026
bb25cfe
Tmp remove scicode
Kipok Feb 12, 2026
147a459
Add a way to pass evaluator as a string
Kipok Feb 12, 2026
cc2d744
Update to rglob for main datasets
Kipok Feb 12, 2026
d8d5444
Add relative path resolution
Kipok Feb 12, 2026
f5d79f8
Fix data_dir usage
Kipok Feb 12, 2026
643438f
Refactor prepare data logic to fix data dir issue
Kipok Feb 12, 2026
3b33b36
Unconditional trigger for init
Kipok Feb 12, 2026
fbef304
Fix resolve
Kipok Feb 12, 2026
72a2c82
Add docs
Kipok Feb 12, 2026
8fac5a8
Revert remove scicode
Kipok Feb 12, 2026
bd0bada
Small doc update
Kipok Feb 13, 2026
6fa989d
Add tests
Kipok Feb 13, 2026
0c946e2
Add license
Kipok Feb 13, 2026
00b98c9
Simplify test
Kipok Feb 13, 2026
e9280d2
Update api to add prompt format
Kipok Feb 13, 2026
04f5452
Fix test
Kipok Feb 13, 2026
ba8b4d0
Working through test issues
Kipok Feb 13, 2026
f5aa5cb
Fix more test errors
Kipok Feb 13, 2026
efd0629
Fix data dir handling
Kipok Feb 13, 2026
094ba06
Remove data dir from summarized results
Kipok Feb 13, 2026
5c1a29a
Refactor tests and start_server to be able to launch/clean server fro…
Kipok Feb 13, 2026
79e6d60
Make exception more explicit
Kipok Feb 13, 2026
1f55ac1
Fix metrics location
Kipok Feb 13, 2026
6e2ac12
Add word_count benchmark to tests as well
Kipok Feb 13, 2026
4a068c0
Fix installation issues
Kipok Feb 13, 2026
3f3570f
Simplify pyproject
Kipok Feb 13, 2026
b9f673e
Fix installation with sys.path
Kipok Feb 13, 2026
744aac0
Merge branch 'main' into igitman/benchmarks-plugin
Kipok Feb 13, 2026
d038da5
Fix review issues
Kipok Feb 13, 2026
61d8f93
Make datasets an explicit parameter
Kipok Feb 13, 2026
f382779
Update docs to mention editable install
Kipok Feb 13, 2026
f55bd96
Add packaging location to tests
Kipok Feb 13, 2026
221b073
Rollback context manager change
Kipok Feb 13, 2026
99bf5cb
Fix permissions
Kipok Feb 13, 2026
5c31a86
Patch wrap args
Kipok Feb 13, 2026
c069b43
Add copyright
Kipok Feb 14, 2026
350be38
Fix minor review issues
Kipok Feb 14, 2026
028fe5c
Add todo
Kipok Feb 14, 2026
aca2da1
Roll-back split to prepare data and prepare init as it's not needed
Kipok Feb 14, 2026
f682692
rollback api change for datasets
Kipok Feb 14, 2026
ca7241f
Bug fix for files check
Kipok Feb 14, 2026
9c7b433
Add uncommitted env var
Kipok Feb 14, 2026
3 changes: 1 addition & 2 deletions .gitignore
@@ -39,8 +39,7 @@ cluster_configs/*
!cluster_configs/example-*.yaml

nemo_skills/dataset/ruler/*/
nemo_skills/dataset/bfcl_v3/*/
nemo_skills/dataset/bfcl_v4/*/
nemo_skills/dataset/ruler2/*/
nemo_skills/dataset/aalcr/lcr/
.idea/
.idea/*
351 changes: 351 additions & 0 deletions docs/evaluation/custom-benchmarks.md
@@ -0,0 +1,351 @@
# Custom benchmarks

NeMo-Skills supports defining benchmarks in external repositories. This lets you
keep proprietary data private, iterate on benchmarks independently of NeMo-Skills
releases, and share team-owned benchmarks without modifying the main repository.

An external benchmark can customize every part of the evaluation pipeline:
dataset preparation, prompt template, generation logic, evaluator, and metrics.

## Quick start

1. **Create a repo** with `benchmark_map.json`, a dataset `__init__.py`, and a `prepare.py`.
2. **Set the env var** `NEMO_SKILLS_EXTRA_BENCHMARK_MAP` to point at your `benchmark_map.json` (`name -> path` structure).
3. **Install** the repo (`pip install -e .`) so that Python can import your modules.
4. **Run** `ns prepare_data <name>` and `ns eval --benchmarks=<name> ...` as usual.

The rest of this page walks through a complete example.

## Walkthrough: a "word_count" benchmark

We will build a small benchmark that asks a model to count the words in a sentence.
This is deliberately simple so the focus stays on the plugin wiring rather than
the task itself.

### Step 1 - Repository layout

```
my-benchmark-repo/
├── pyproject.toml
├── benchmark_map.json
└── my_benchmarks/
    ├── dataset/word_count/
    │   ├── __init__.py
    │   └── prepare.py
    ├── inference/word_count.py
    ├── evaluation/word_count.py
    ├── metrics/word_count.py
    └── prompt/eval/word_count/
        └── default.yaml
```

**`pyproject.toml`** - makes the repo installable so that `my_benchmarks.*` is
importable:

```toml title="pyproject.toml"
[project]
name = "my-benchmarks"
version = "0.1.0"
```

**`benchmark_map.json`** - maps short names to dataset directories (paths are
relative to this file):

```json title="benchmark_map.json"
{
"word_count": "./my_benchmarks/dataset/word_count"
}
```

### Step 2 - Dataset `__init__.py`

This file tells the eval pipeline which prompt, evaluator, metrics, and generation
module to use by default. All of these can still be overridden from the command line.

```python title="my_benchmarks/dataset/word_count/__init__.py"
from nemo_skills.pipeline.utils.packager import (
    register_external_repo,
    RepoMetadata,
)
from pathlib import Path

# Register repo so it gets packaged inside containers.
# ignore_if_registered avoids errors when the module is imported more than once.
register_external_repo(
    RepoMetadata(name="my_benchmarks", path=Path(__file__).parents[2]),
    ignore_if_registered=True,
)

# Metrics class - use module::Class format for custom metrics
METRICS_TYPE = "my_benchmarks.metrics.word_count::WordCountMetrics"

# Default generation arguments
# prompt_config ending in .yaml triggers absolute-path resolution;
# /nemo_run/code/ is the root where code is extracted inside the container
GENERATION_ARGS = (
    "++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/default.yaml "
    "++eval_type=my_benchmarks.evaluation.word_count::WordCountEvaluator"
)

# Custom generation module (optional - remove this line to use the default)
GENERATION_MODULE = "my_benchmarks.inference.word_count"
```

**Review thread** (on `ignore_if_registered=True`):

**Reviewer:** When do you anticipate this happening? If multiple custom benchmarks are used, or if you are writing to the same name as an existing dataset?

**Author:** If multiple benchmarks are used. I think the main use case will be a single internal repo with tens of internal benchmarks. Each of them has to have this register call, and if a few are specified together, it will fail if we don't ignore already-registered repos.

**Reviewer:** What do you mean by "specified together"? Is this targeted at clashing across namespaces (e.g., two different benchmarks register a "my_dataset" dataset)? I'm wary of ignores, because they can make it easy to do something different than you intend.

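Since everything in `__init__.py` is only a default, the same settings can be overridden per run. Below is a minimal sketch, assuming `ns eval` forwards `++`-style overrides to the generation step as it does for built-in benchmarks; the `concise.yaml` config is a hypothetical second prompt in your repo, and the cluster/server flags from Step 8 are omitted for brevity:

```bash
ns eval \
    --benchmarks=word_count \
    --output_dir=/workspace/test-eval \
    ++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/concise.yaml \
    ++eval_type=my_benchmarks.evaluation.word_count::WordCountEvaluator
```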

### Step 3 - `prepare.py`

This script creates the test data. It is called by `ns prepare_data word_count`.

```python title="my_benchmarks/dataset/word_count/prepare.py"
import json
from pathlib import Path

SAMPLES = [
{"sentence": "The quick brown fox", "expected_answer": 4},
{"sentence": "Hello world", "expected_answer": 2},
{"sentence": "NeMo Skills is great for evaluation", "expected_answer": 6},
{"sentence": "One", "expected_answer": 1},
{"sentence": "A B C D E F G", "expected_answer": 7},
]

if __name__ == "__main__":
data_dir = Path(__file__).absolute().parent
output_file = data_dir / "test.jsonl"
with open(output_file, "wt", encoding="utf-8") as fout:
for sample in SAMPLES:
fout.write(json.dumps(sample) + "\n")
```

### Step 4 - Prompt template

Prompt configs live in your external repo and are referenced by their full `.yaml`
path. The path must end with `.yaml` so that the framework treats it as an absolute
path rather than a built-in config name.
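For comparison, here is a sketch of the two forms side by side (`generic/math` is a built-in config that ships with NeMo-Skills, also used in the minimal example at the end of this page):

```
# built-in config, referenced by name
++prompt_config=generic/math
# external config, referenced by its full .yaml path inside the container
++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/default.yaml
```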

```yaml title="my_benchmarks/prompt/eval/word_count/default.yaml"
user: |-
  Count the number of words in the quoted sentence below.
  Put your final answer (just the number) inside \boxed{{}}.

  {sentence}
```

In `GENERATION_ARGS` this is referenced as:

```
++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/default.yaml
```

**Review thread** (on the prompt config path above):

**Reviewer:** Is there any value to allowing the benchmark_dataset.jsonl to account for prompt configs too? Or otherwise use the `::` syntax in some way?

**Author:** Not sure I fully understand, can you clarify this please? The prompt is just a yaml file, how would we use `::` syntax?

**Reviewer:** Sorry, I mean being able to specify it relative to the custom benchmark, like `my_benchmarks/prompt/eval/word_count/default.yaml`. It seems minor, but the `/nemo_run/code` mount causes a lot of friction for people, and I think allowing relative usage would make the interface a bit more ergonomic for a lot of users.

### Step 5 - Custom generation module (optional)

A custom generation module lets you change how the model is called - for example
to implement multi-step generation.

This example adds an optional **verify** step where the model is asked to double-check
its own answer.

```python title="my_benchmarks/inference/word_count.py"
import logging

import hydra

from nemo_skills.inference.generate import GenerationTask, GenerationTaskConfig
from nemo_skills.utils import nested_dataclass

LOG = logging.getLogger(__name__)


@nested_dataclass(kw_only=True)
class WordCountGenerationConfig(GenerationTaskConfig):
    # Add a custom flag that controls whether to do a verification step
    verify: bool = False


cs = hydra.core.config_store.ConfigStore.instance()
cs.store(name="base_generation_config", node=WordCountGenerationConfig)


class WordCountGenerationTask(GenerationTask):
    """Generation task with an optional verification step."""

    async def process_single_datapoint(self, data_point, all_data):
        # Step 1: normal generation
        result = await super().process_single_datapoint(data_point, all_data)

        if not self.cfg.verify:
            return result

        # Step 2: ask the model to verify its own answer
        verify_prompt = (
            f"You previously answered the following question:\n\n"
            f"{data_point['problem']}\n\n"
            f"Your answer was:\n{result['generation']}\n\n"
            f"Please verify this is correct. "
            f"If it is, repeat the same answer inside \\boxed{{}}. "
            f"If not, provide the corrected answer inside \\boxed{{}}."
        )
        new_data_point = [{"role": "user", "content": verify_prompt}]
        # We use prompt_format=openai as we already prepared the full message
        verify_result = await super().process_single_datapoint(
            new_data_point,
            all_data,
            prompt_format="openai",
        )
        # Replace the generation with the verified answer
        result["generation"] = verify_result["generation"]
        return result


GENERATION_TASK_CLASS = WordCountGenerationTask


@hydra.main(version_base=None, config_name="base_generation_config")
def generate(cfg: WordCountGenerationConfig):
    cfg = WordCountGenerationConfig(_init_nested=True, **cfg)
    LOG.info("Config used: %s", cfg)
    task = WordCountGenerationTask(cfg)
    task.generate()


if __name__ == "__main__":
    generate()
```
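The custom `verify` flag defined in `WordCountGenerationConfig` behaves like any other generation option, so it can be switched on per run without touching the code. Assuming, as above, that extra `++` overrides are forwarded to the generation step, you can append it to `GENERATION_ARGS` in `__init__.py` or pass it on the `ns eval` command line:

```
++verify=True
```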

If you don't need custom generation logic, simply remove the `GENERATION_MODULE`
line from `__init__.py` and the default
[`nemo_skills.inference.generate`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/generate.py)
module will be used.

### Step 6 - Custom evaluator

Here is an example of a basic evaluator class. The extra "predicted_answer" and "is_correct" fields will be added
to the output jsonl produced by the generation step.

```python title="my_benchmarks/evaluation/word_count.py"
import re

from nemo_skills.evaluation.evaluator.base import BaseEvaluator


class WordCountEvaluator(BaseEvaluator):
    async def eval_single(self, data_point):
        """Extract predicted answer and compare to expected."""
        match = re.search(r"\\boxed\{(\d+)\}", data_point["generation"])
        predicted = int(match.group(1)) if match else None

        return {
            "predicted_answer": predicted,
            "is_correct": predicted == data_point["expected_answer"],
        }
```
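For the toy dataset above, an evaluated output line would then look roughly like this (the `generation` text is illustrative, not real model output, and real output files may contain additional fields):

```json
{"sentence": "Hello world", "expected_answer": 2, "generation": "Two words, so the answer is \\boxed{2}.", "predicted_answer": 2, "is_correct": true}
```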

This is referenced in `GENERATION_ARGS` using the `module::Class` format:

```
++eval_type=my_benchmarks.evaluation.word_count::WordCountEvaluator
```

### Step 7 - Custom metrics

The metrics class reads the evaluated JSONL and computes summary statistics.

```python title="my_benchmarks/metrics/word_count.py"
from nemo_skills.evaluation.metrics.base import BaseMetrics


class WordCountMetrics(BaseMetrics):
    def _get_score_dict(self, prediction):
        return {"is_correct": prediction.get("is_correct", False)}

    def get_incorrect_sample(self, prediction):
        # used for automatically filtering data based on length
        # (we mark too-long examples as incorrect using this method)
        prediction = prediction.copy()
        prediction["is_correct"] = False
        prediction["predicted_answer"] = None
        return prediction

    def update(self, predictions):
        # the base class provides convenient helpers for calculating
        # common metrics like majority@k / pass@k
        super().update(predictions)
        predicted_answers = [pred["predicted_answer"] for pred in predictions]
        self._compute_pass_at_k(
            predictions=predictions,
            predicted_answers=predicted_answers,
        )
        self._compute_majority_at_k(
            predictions=predictions,
            predicted_answers=predicted_answers,
        )
```

Referenced in `__init__.py` as:

```
METRICS_TYPE = "my_benchmarks.metrics.word_count::WordCountMetrics"
```

### Step 8 - Running the benchmark

Install your repo and set the env var:

```bash
cd my-benchmark-repo
pip install -e .
export NEMO_SKILLS_EXTRA_BENCHMARK_MAP=$(pwd)/benchmark_map.json
```

Prepare the data:

```bash
ns prepare_data word_count
```

Run evaluation (using an API model as an example):

```bash
ns eval \
    --cluster=local \
    --server_type=openai \
    --model=nvidia/nemotron-3-nano-30b-a3b \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --benchmarks=word_count \
    --output_dir=/workspace/test-eval
```

View results:

```bash
ns summarize_results --cluster=local /workspace/test-eval
```


## Minimal example

If your benchmark can reuse built-in evaluation and metrics (e.g. the standard math
evaluator), you only need two files:

```python title="my_benchmarks/dataset/my_simple_bench/__init__.py"
METRICS_TYPE = "math"
GENERATION_ARGS = "++prompt_config=generic/math ++eval_type=math"
```

```python title="my_benchmarks/dataset/my_simple_bench/prepare.py"
import json
from pathlib import Path

if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    with open(data_dir / "test.jsonl", "wt", encoding="utf-8") as fout:
        fout.write(json.dumps({
            "problem": "What is 2 + 2?",
            "expected_answer": 4,
        }) + "\n")
```

And a `benchmark_map.json`:

```json
{
"my_simple_bench": "./my_benchmarks/dataset/my_simple_bench"
}
```

No custom generation module, evaluator, or metrics needed.
8 changes: 4 additions & 4 deletions docs/evaluation/index.md
@@ -233,16 +233,13 @@ Inside [`nemo_skills/dataset/gsm8k/__init__.py`](https://github.com/NVIDIA-NeMo/

```python
# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = 'math'
METRICS_TYPE = "math"
GENERATION_ARGS = "++eval_type=math ++prompt_config=generic/math"
```

The prompt config and default generation arguments are passed to the
[nemo_skills/inference/generate.py](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/generate.py).
The dataset group is used by [nemo_skills/dataset/prepare.py](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/prepare.py)
to help download only benchmarks from a particular group if `--dataset_groups` parameter is used.
Finally, the metrics type is used to pick a metrics class from [nemo_skills/evaluation/metrics/map_metrics.py](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py)
The metrics type is used to pick a metrics class from [nemo_skills/evaluation/metrics/map_metrics.py](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py)
which is called at the end of the evaluation to compute final scores.

## Adding new benchmarks
@@ -256,3 +253,6 @@ To create a new benchmark follow this process:
a fully custom generation module. See [scicode](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/scicode/__init__.py) or [swe-bench](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/swe-bench/__init__.py) for examples of this.
4. Create a new [evaluation class](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/evaluation/evaluator/__init__.py) (if cannot re-use existing one).
5. Create a new [metrics class](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py) ( if cannot re-use existing one).

You can also define benchmarks in a **separate git repository** without modifying NeMo-Skills.
See [Custom benchmarks](./custom-benchmarks.md) for a full walkthrough.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -86,6 +86,7 @@ nav:
- evaluation/vlm.md
- evaluation/other-benchmarks.md
- evaluation/robustness.md
- Custom benchmarks: evaluation/custom-benchmarks.md
- Agentic Inference:
- agentic_inference/parallel_thinking.md
- agentic_inference/tool_calling.md