Commits
56 commits
f5ebde1
Explicitly add bfcl init files as they are static
Kipok Feb 11, 2026
c28972e
Refactor ruler into prepare_data + prepare_init
Kipok Feb 11, 2026
f49dc2a
Basic split into init and data prepare in pipeline
Kipok Feb 11, 2026
e3688de
Partial refactoring - remove module download logic
Kipok Feb 11, 2026
edab4db
Always specify metrics type in eval
Kipok Feb 11, 2026
27f24fb
Basic support for specifying external path
Kipok Feb 11, 2026
deb17ad
Add basic support for extra dataset map
Kipok Feb 11, 2026
b22930b
Add dynamic data prep
Kipok Feb 11, 2026
dfb40d2
Remove data group from inits
Kipok Feb 11, 2026
ec5353f
Clean up data groups and properly track external datasets
Kipok Feb 11, 2026
2b853ee
Remove logic to add lean headers
Kipok Feb 11, 2026
506d909
Add packaging for jsonl files in external repos
Kipok Feb 12, 2026
52cedd9
Add explicit jsonl files for packaging
Kipok Feb 12, 2026
98a7623
Fix packaging for external repos
Kipok Feb 12, 2026
bb25cfe
Tmp remove scicode
Kipok Feb 12, 2026
147a459
Add a way to pass evaluator as a string
Kipok Feb 12, 2026
cc2d744
Update to rglob for main datasets
Kipok Feb 12, 2026
d8d5444
Add relative path resolution
Kipok Feb 12, 2026
f5d79f8
Fix data_dir usage
Kipok Feb 12, 2026
643438f
Refactor prepare data logic to fix data dir issue
Kipok Feb 12, 2026
3b33b36
Unconditional trigger for init
Kipok Feb 12, 2026
fbef304
Fix resolve
Kipok Feb 12, 2026
72a2c82
Add docs
Kipok Feb 12, 2026
8fac5a8
Revert remove scicode
Kipok Feb 12, 2026
bd0bada
Small doc update
Kipok Feb 13, 2026
6fa989d
Add tests
Kipok Feb 13, 2026
0c946e2
Add license
Kipok Feb 13, 2026
00b98c9
Simplify test
Kipok Feb 13, 2026
e9280d2
Update api to add prompt format
Kipok Feb 13, 2026
04f5452
Fix test
Kipok Feb 13, 2026
ba8b4d0
Working through test issues
Kipok Feb 13, 2026
f5aa5cb
Fix more test errors
Kipok Feb 13, 2026
efd0629
Fix data dir handling
Kipok Feb 13, 2026
094ba06
Remove data dir from summarized results
Kipok Feb 13, 2026
5c1a29a
Refactor tests and start_server to be able to launch/clean server fro…
Kipok Feb 13, 2026
79e6d60
Make exception more explicit
Kipok Feb 13, 2026
1f55ac1
Fix metrics location
Kipok Feb 13, 2026
6e2ac12
Add word_count benchmark to tests as well
Kipok Feb 13, 2026
4a068c0
Fix installation issues
Kipok Feb 13, 2026
3f3570f
Simplify pyproject
Kipok Feb 13, 2026
b9f673e
Fix installation with sys.path
Kipok Feb 13, 2026
744aac0
Merge branch 'main' into igitman/benchmarks-plugin
Kipok Feb 13, 2026
d038da5
Fix review issues
Kipok Feb 13, 2026
61d8f93
Make datasets an explicit parameter
Kipok Feb 13, 2026
f382779
Update docs to mention editable install
Kipok Feb 13, 2026
f55bd96
Add packaging location to tests
Kipok Feb 13, 2026
221b073
Rollback context manager change
Kipok Feb 13, 2026
99bf5cb
Fix permissions
Kipok Feb 13, 2026
5c31a86
Patch wrap args
Kipok Feb 13, 2026
c069b43
Add copyright
Kipok Feb 14, 2026
350be38
Fix minor review issues
Kipok Feb 14, 2026
028fe5c
Add todo
Kipok Feb 14, 2026
aca2da1
Roll-back split to prepare data and prepare init as it's not needed
Kipok Feb 14, 2026
f682692
rollback api change for datasets
Kipok Feb 14, 2026
ca7241f
Bug fix for files check
Kipok Feb 14, 2026
9c7b433
Add uncommitted env var
Kipok Feb 14, 2026
3 changes: 1 addition & 2 deletions .gitignore
@@ -39,8 +39,7 @@ cluster_configs/*
!cluster_configs/example-*.yaml

nemo_skills/dataset/ruler/*/
nemo_skills/dataset/bfcl_v3/*/
nemo_skills/dataset/bfcl_v4/*/
nemo_skills/dataset/ruler2/*/
nemo_skills/dataset/aalcr/lcr/
.idea/
.idea/*
351 changes: 351 additions & 0 deletions docs/evaluation/custom-benchmarks.md
@@ -0,0 +1,351 @@
# Custom benchmarks

NeMo-Skills supports defining benchmarks in external repositories. This lets you
keep proprietary data private, iterate on benchmarks independently of NeMo-Skills
releases, and share team-owned benchmarks without modifying the main repository.

An external benchmark can customize every part of the evaluation pipeline:
dataset preparation, prompt template, generation logic, evaluator, and metrics.

## Quick start

1. **Create a repo** with `benchmark_map.json`, a dataset `__init__.py`, and a `prepare.py`.
2. **Set the env var** `NEMO_SKILLS_EXTRA_BENCHMARK_MAP` to point at your `benchmark_map.json` (`name -> path` structure).
3. **Install** the repo (`pip install -e .`) so that Python can import your modules.
4. **Run** `ns prepare_data <name>` and `ns eval --benchmarks=<name> ...` as usual.

The rest of this page walks through a complete example.

## Walkthrough: a "word_count" benchmark

We will build a small benchmark that asks a model to count the words in a sentence.
This is deliberately simple so the focus stays on the plugin wiring rather than
the task itself.

### Step 1 - Repository layout

```
my-benchmark-repo/
├── pyproject.toml
├── benchmark_map.json
└── my_benchmarks/
    ├── dataset/word_count/
    │   ├── __init__.py
    │   └── prepare.py
    ├── inference/word_count.py
    ├── evaluation/word_count.py
    ├── metrics/word_count.py
    └── prompt/eval/word_count/
        └── default.yaml
```

**`pyproject.toml`** - makes the repo installable so that `my_benchmarks.*` is
importable:

```toml title="pyproject.toml"
[project]
name = "my-benchmarks"
version = "0.1.0"
```

**`benchmark_map.json`** - maps short names to dataset directories (paths are
relative to this file):

```json title="benchmark_map.json"
{
"word_count": "./my_benchmarks/dataset/word_count"
}
```

### Step 2 - Dataset `__init__.py`

This file tells the eval pipeline which prompt, evaluator, metrics, and generation
module to use by default. All of these can still be overridden from the command line.

```python title="my_benchmarks/dataset/word_count/__init__.py"
from nemo_skills.pipeline.utils.packager import (
    register_external_repo,
    RepoMetadata,
)
from pathlib import Path

# Register repo so it gets packaged inside containers.
# ignore_if_registered avoids errors when the module is imported more than once.
register_external_repo(
    RepoMetadata(name="my_benchmarks", path=Path(__file__).parents[2]),
    ignore_if_registered=True,
)

# Metrics class - use module::Class format for custom metrics
METRICS_TYPE = "my_benchmarks.metrics.word_count::WordCountMetrics"

# Default generation arguments
# prompt_config ending in .yaml triggers absolute-path resolution;
# /nemo_run/code/ is the root where code is extracted inside the container
GENERATION_ARGS = (
    "++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/default.yaml "
    "++eval_type=my_benchmarks.evaluation.word_count::WordCountEvaluator"
)

# Custom generation module (optional - remove this line to use the default)
GENERATION_MODULE = "my_benchmarks.inference.word_count"
```

**Review thread** (on `ignore_if_registered=True`):

**Reviewer:** When do you anticipate this happening? If multiple custom benchmarks are used, or if you are writing to the same name as an existing dataset?

**Author:** If multiple benchmarks are used. I think the main use case will be a single internal repo with tens of internal benchmarks. Each of them has to have this register call, and if a few are specified together, it will fail if we don't ignore already-registered repos.

**Reviewer:** What do you mean by "specified together"? Is this targeted at clashing across namespaces (e.g., two different benchmarks register a "my_dataset" dataset)? I'm wary of ignores, because they can make it easy to do something different than you intend.

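Since everything in `__init__.py` is only a default, the same settings can be overridden per run. Below is a minimal sketch, assuming `ns eval` forwards `++`-style overrides to the generation step as it does for built-in benchmarks; the `concise.yaml` config is a hypothetical second prompt in your repo, and the cluster/server flags from Step 8 are omitted for brevity:

```bash
ns eval \
    --benchmarks=word_count \
    --output_dir=/workspace/test-eval \
    ++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/concise.yaml \
    ++eval_type=my_benchmarks.evaluation.word_count::WordCountEvaluator
```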

### Step 3 - `prepare.py`

This script creates the test data. It is called by `ns prepare_data word_count`.

```python title="my_benchmarks/dataset/word_count/prepare.py"
import json
from pathlib import Path

SAMPLES = [
{"sentence": "The quick brown fox", "expected_answer": 4},
{"sentence": "Hello world", "expected_answer": 2},
{"sentence": "NeMo Skills is great for evaluation", "expected_answer": 6},
{"sentence": "One", "expected_answer": 1},
{"sentence": "A B C D E F G", "expected_answer": 7},
]

if __name__ == "__main__":
data_dir = Path(__file__).absolute().parent
output_file = data_dir / "test.jsonl"
with open(output_file, "wt", encoding="utf-8") as fout:
for sample in SAMPLES:
fout.write(json.dumps(sample) + "\n")
```

### Step 4 - Prompt template

Prompt configs live in your external repo and are referenced by their full `.yaml`
path. The path must end with `.yaml` so that the framework treats it as an absolute
path rather than a built-in config name.
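For comparison, here is a sketch of the two forms side by side (`generic/math` is a built-in config that ships with NeMo-Skills, also used in the minimal example at the end of this page):

```
# built-in config, referenced by name
++prompt_config=generic/math
# external config, referenced by its full .yaml path inside the container
++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/default.yaml
```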

```yaml title="my_benchmarks/prompt/eval/word_count/default.yaml"
user: |-
  Count the number of words in the quoted sentence below.
  Put your final answer (just the number) inside \boxed{{}}.

  {sentence}
```

In `GENERATION_ARGS` this is referenced as:

```
++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/default.yaml
```

**Review thread** (on the prompt config path above):

**Reviewer:** Is there any value to allowing the benchmark_dataset.jsonl to account for prompt configs too? Or otherwise use the `::` syntax in some way?

**Author:** Not sure I fully understand, can you clarify this please? The prompt is just a yaml file, how would we use `::` syntax?

**Reviewer:** Sorry, I mean being able to specify it relative to the custom benchmark, like `my_benchmarks/prompt/eval/word_count/default.yaml`. It seems minor, but the `/nemo_run/code` mount causes a lot of friction for people, and I think allowing relative usage would make the interface a bit more ergonomic for a lot of users.

### Step 5 - Custom generation module (optional)

A custom generation module lets you change how the model is called - for example
to implement multi-step generation.

This example adds an optional **verify** step where the model is asked to double-check
its own answer.

```python title="my_benchmarks/inference/word_count.py"
import logging

import hydra

from nemo_skills.inference.generate import GenerationTask, GenerationTaskConfig
from nemo_skills.utils import nested_dataclass

LOG = logging.getLogger(__name__)


@nested_dataclass(kw_only=True)
class WordCountGenerationConfig(GenerationTaskConfig):
    # Add a custom flag that controls whether to do a verification step
    verify: bool = False


cs = hydra.core.config_store.ConfigStore.instance()
cs.store(name="base_generation_config", node=WordCountGenerationConfig)


class WordCountGenerationTask(GenerationTask):
    """Generation task with an optional verification step."""

    async def process_single_datapoint(self, data_point, all_data):
        # Step 1: normal generation
        result = await super().process_single_datapoint(data_point, all_data)

        if not self.cfg.verify:
            return result

        # Step 2: ask the model to verify its own answer
        verify_prompt = (
            f"You previously answered the following question:\n\n"
            f"{data_point['problem']}\n\n"
            f"Your answer was:\n{result['generation']}\n\n"
            f"Please verify this is correct. "
            f"If it is, repeat the same answer inside \\boxed{{}}. "
            f"If not, provide the corrected answer inside \\boxed{{}}."
        )
        new_data_point = [{"role": "user", "content": verify_prompt}]
        # We use prompt_format=openai as we already prepared the full message
        verify_result = await super().process_single_datapoint(
            new_data_point,
            all_data,
            prompt_format="openai",
        )
        # Replace the generation with the verified answer
        result["generation"] = verify_result["generation"]
        return result


GENERATION_TASK_CLASS = WordCountGenerationTask


@hydra.main(version_base=None, config_name="base_generation_config")
def generate(cfg: WordCountGenerationConfig):
    cfg = WordCountGenerationConfig(_init_nested=True, **cfg)
    LOG.info("Config used: %s", cfg)
    task = WordCountGenerationTask(cfg)
    task.generate()


if __name__ == "__main__":
    generate()
```
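The custom `verify` flag defined in `WordCountGenerationConfig` behaves like any other generation option, so it can be switched on per run without touching the code. Assuming, as above, that extra `++` overrides are forwarded to the generation step, you can append it to `GENERATION_ARGS` in `__init__.py` or pass it on the `ns eval` command line:

```
++verify=True
```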

If you don't need custom generation logic, simply remove the `GENERATION_MODULE`
line from `__init__.py` and the default
[`nemo_skills.inference.generate`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/generate.py)
module will be used.

### Step 6 - Custom evaluator

Here is an example of a basic evaluator class. The extra "predicted_answer" and "is_correct" fields will be added
to the output jsonl produced by the generation step.

```python title="my_benchmarks/evaluation/word_count.py"
import re

from nemo_skills.evaluation.evaluator.base import BaseEvaluator


class WordCountEvaluator(BaseEvaluator):
    async def eval_single(self, data_point):
        """Extract predicted answer and compare to expected."""
        match = re.search(r"\\boxed\{(\d+)\}", data_point["generation"])
        predicted = int(match.group(1)) if match else None

        return {
            "predicted_answer": predicted,
            "is_correct": predicted == data_point["expected_answer"],
        }
```
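For the toy dataset above, an evaluated output line would then look roughly like this (the `generation` text is illustrative, not real model output, and real output files may contain additional fields):

```json
{"sentence": "Hello world", "expected_answer": 2, "generation": "Two words, so the answer is \\boxed{2}.", "predicted_answer": 2, "is_correct": true}
```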

This is referenced in `GENERATION_ARGS` using the `module::Class` format:

```
++eval_type=my_benchmarks.evaluation.word_count::WordCountEvaluator
```

### Step 7 - Custom metrics

The metrics class reads the evaluated JSONL and computes summary statistics.

```python title="my_benchmarks/metrics/word_count.py"
from nemo_skills.evaluation.metrics.base import BaseMetrics


class WordCountMetrics(BaseMetrics):
    def _get_score_dict(self, prediction):
        return {"is_correct": prediction.get("is_correct", False)}

    def get_incorrect_sample(self, prediction):
        # used for automatically filtering data based on length
        # (we mark too-long examples as incorrect using this method)
        prediction = prediction.copy()
        prediction["is_correct"] = False
        prediction["predicted_answer"] = None
        return prediction

    def update(self, predictions):
        # the base class provides convenient helpers for calculating
        # common metrics like majority@k / pass@k
        super().update(predictions)
        predicted_answers = [pred["predicted_answer"] for pred in predictions]
        self._compute_pass_at_k(
            predictions=predictions,
            predicted_answers=predicted_answers,
        )
        self._compute_majority_at_k(
            predictions=predictions,
            predicted_answers=predicted_answers,
        )
```

Referenced in `__init__.py` as:

```
METRICS_TYPE = "my_benchmarks.metrics.word_count::WordCountMetrics"
```

### Step 8 - Running the benchmark

Install your repo and set the env var:

```bash
cd my-benchmark-repo
pip install -e .
export NEMO_SKILLS_EXTRA_BENCHMARK_MAP=$(pwd)/benchmark_map.json
```

Prepare the data:

```bash
ns prepare_data word_count
```

Run evaluation (using an API model as an example):

```bash
ns eval \
    --cluster=local \
    --server_type=openai \
    --model=nvidia/nemotron-3-nano-30b-a3b \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --benchmarks=word_count \
    --output_dir=/workspace/test-eval
```

View results:

```bash
ns summarize_results --cluster=local /workspace/test-eval
```


## Minimal example

If your benchmark can reuse built-in evaluation and metrics (e.g. the standard math
evaluator), you only need two files:

```python title="my_benchmarks/dataset/my_simple_bench/__init__.py"
METRICS_TYPE = "math"
GENERATION_ARGS = "++prompt_config=generic/math ++eval_type=math"
```

```python title="my_benchmarks/dataset/my_simple_bench/prepare.py"
import json
from pathlib import Path

if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    with open(data_dir / "test.jsonl", "wt", encoding="utf-8") as fout:
        fout.write(json.dumps({
            "problem": "What is 2 + 2?",
            "expected_answer": 4,
        }) + "\n")
```

And a `benchmark_map.json`:

```json
{
"my_simple_bench": "./my_benchmarks/dataset/my_simple_bench"
}
```

No custom generation module, evaluator, or metrics needed.
8 changes: 4 additions & 4 deletions docs/evaluation/index.md
@@ -233,16 +233,13 @@ Inside [`nemo_skills/dataset/gsm8k/__init__.py`](https://github.com/NVIDIA-NeMo/

```python
# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = 'math'
METRICS_TYPE = "math"
GENERATION_ARGS = "++eval_type=math ++prompt_config=generic/math"
```

The prompt config and default generation arguments are passed to the
[nemo_skills/inference/generate.py](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/generate.py).
The dataset group is used by [nemo_skills/dataset/prepare.py](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/prepare.py)
to help download only benchmarks from a particular group if `--dataset_groups` parameter is used.
Finally, the metrics type is used to pick a metrics class from [nemo_skills/evaluation/metrics/map_metrics.py](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py)
The metrics type is used to pick a metrics class from [nemo_skills/evaluation/metrics/map_metrics.py](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py)
which is called at the end of the evaluation to compute final scores.

## Adding new benchmarks
@@ -256,3 +253,6 @@ To create a new benchmark follow this process:
a fully custom generation module. See [scicode](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/scicode/__init__.py) or [swe-bench](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/swe-bench/__init__.py) for examples of this.
4. Create a new [evaluation class](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/evaluation/evaluator/__init__.py) (if cannot re-use existing one).
5. Create a new [metrics class](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py) ( if cannot re-use existing one).

You can also define benchmarks in a **separate git repository** without modifying NeMo-Skills.
See [Custom benchmarks](./custom-benchmarks.md) for a full walkthrough.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -86,6 +86,7 @@ nav:
- evaluation/vlm.md
- evaluation/other-benchmarks.md
- evaluation/robustness.md
- Custom benchmarks: evaluation/custom-benchmarks.md
- Agentic Inference:
- agentic_inference/parallel_thinking.md
- agentic_inference/tool_calling.md