Adding an option to store benchmarks in external repo #1240

# Custom benchmarks

NeMo-Skills supports defining benchmarks in external repositories. This lets you keep proprietary data private, iterate on benchmarks independently of NeMo-Skills releases, and share team-owned benchmarks without modifying the main repository.

An external benchmark can customize every part of the evaluation pipeline: dataset preparation, prompt template, generation logic, evaluator, and metrics.

## Quick start

1. **Create a repo** with `benchmark_map.json`, a dataset `__init__.py`, and a `prepare.py`.
2. **Set the env var** `NEMO_SKILLS_EXTRA_BENCHMARK_MAP` to point at your `benchmark_map.json` (`name -> path` structure).
3. **Install** the repo (`pip install -e .`) so that Python can import your modules.
4. **Run** `ns prepare_data <name>` and `ns eval --benchmarks=<name> ...` as usual.

The rest of this page walks through a complete example.
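
Condensed into commands, the quick-start flow looks roughly like this (`my-benchmark-repo` and `word_count` are the example names used throughout this page; see Step 8 for a full `ns eval` invocation):

```bash
# install the external benchmark repo so its modules are importable
cd my-benchmark-repo
pip install -e .

# tell NeMo-Skills where to find the external benchmark map
export NEMO_SKILLS_EXTRA_BENCHMARK_MAP=$(pwd)/benchmark_map.json

# prepare the data and run the evaluation as usual
ns prepare_data word_count
ns eval --benchmarks=word_count --output_dir=/workspace/test-eval ...
```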

## Walkthrough: a "word_count" benchmark

We will build a small benchmark that asks a model to count the words in a sentence. This is deliberately simple so the focus stays on the plugin wiring rather than the task itself.

### Step 1 - Repository layout

```
my-benchmark-repo/
├── pyproject.toml
├── benchmark_map.json
└── my_benchmarks/
    ├── dataset/word_count/
    │   ├── __init__.py
    │   └── prepare.py
    ├── inference/word_count.py
    ├── evaluation/word_count.py
    ├── metrics/word_count.py
    └── prompt/eval/word_count/
        └── default.yaml
```

**`pyproject.toml`** - makes the repo installable so that `my_benchmarks.*` is importable:

```toml title="pyproject.toml"
[project]
name = "my-benchmarks"
version = "0.1.0"
```

**`benchmark_map.json`** - maps short names to dataset directories (paths are relative to this file):

```json title="benchmark_map.json"
{
    "word_count": "./my_benchmarks/dataset/word_count"
}
```

### Step 2 - Dataset `__init__.py`

This file tells the eval pipeline which prompt, evaluator, metrics, and generation module to use by default. All of these can still be overridden from the command line.

```python title="my_benchmarks/dataset/word_count/__init__.py"
from pathlib import Path

from nemo_skills.pipeline.utils.packager import (
    RepoMetadata,
    register_external_repo,
)

# Register repo so it gets packaged inside containers.
# ignore_if_registered avoids errors when the module is imported more than once.
register_external_repo(
    RepoMetadata(name="my_benchmarks", path=Path(__file__).parents[2]),
    ignore_if_registered=True,
)

# Metrics class - use module::Class format for custom metrics
METRICS_TYPE = "my_benchmarks.metrics.word_count::WordCountMetrics"

# Default generation arguments
# prompt_config ending in .yaml triggers absolute-path resolution;
# /nemo_run/code/ is the root where code is extracted inside the container
GENERATION_ARGS = (
    "++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/default.yaml "
    "++eval_type=my_benchmarks.evaluation.word_count::WordCountEvaluator"
)

# Custom generation module (optional - remove this line to use the default)
GENERATION_MODULE = "my_benchmarks.inference.word_count"
```
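
Since these are only defaults, a rough sketch of a command-line override could look like the following (this assumes extra `++...` arguments passed to `ns eval` are forwarded to the generation step; `custom.yaml` is a hypothetical alternative prompt config):

```bash
ns eval --benchmarks=word_count ... \
    ++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/custom.yaml
```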

### Step 3 - `prepare.py`

This script creates the test data. It is called by `ns prepare_data word_count`.

```python title="my_benchmarks/dataset/word_count/prepare.py"
import json
from pathlib import Path

SAMPLES = [
    {"sentence": "The quick brown fox", "expected_answer": 4},
    {"sentence": "Hello world", "expected_answer": 2},
    {"sentence": "NeMo Skills is great for evaluation", "expected_answer": 6},
    {"sentence": "One", "expected_answer": 1},
    {"sentence": "A B C D E F G", "expected_answer": 7},
]

if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    output_file = data_dir / "test.jsonl"
    with open(output_file, "wt", encoding="utf-8") as fout:
        for sample in SAMPLES:
            fout.write(json.dumps(sample) + "\n")
```
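
Running `ns prepare_data word_count` executes this script and writes `test.jsonl` with one JSON object per line, starting with:

```json
{"sentence": "The quick brown fox", "expected_answer": 4}
```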

### Step 4 - Prompt template

Prompt configs live in your external repo and are referenced by their full `.yaml` path. The path must end with `.yaml` so that the framework treats it as an absolute path rather than a built-in config name.

```yaml title="my_benchmarks/prompt/eval/word_count/default.yaml"
user: |-
  Count the number of words in the quoted sentence below.
  Put your final answer (just the number) inside \boxed{{}}.

  {sentence}
```
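
For the first sample, the rendered user message would look roughly like this (assuming Python-style `.format` substitution, where `{{}}` becomes a literal `{}` and `{sentence}` is filled in):

```
Count the number of words in the quoted sentence below.
Put your final answer (just the number) inside \boxed{}.

The quick brown fox
```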

In `GENERATION_ARGS` this is referenced as:

```
++prompt_config=/nemo_run/code/my_benchmarks/prompt/eval/word_count/default.yaml
```

> **Review discussion**
>
> **Collaborator:** Is there any value to allowing the benchmark_dataset.jsonl to account for prompt configs too? or otherwise use the
>
> **Author:** not sure I fully understand, can you clarify this please? Prompt is just a yaml file, how would we use
>
> **Collaborator:** Sorry, I mean more so is being able to specify it relative to the custom benchmark, like

### Step 5 - Custom generation module (optional)

A custom generation module lets you change how the model is called - for example to implement multi-step generation.

This example adds an optional **verify** step where the model is asked to double-check its own answer.

```python title="my_benchmarks/inference/word_count.py"
import logging

import hydra

from nemo_skills.inference.generate import GenerationTask, GenerationTaskConfig
from nemo_skills.utils import nested_dataclass

LOG = logging.getLogger(__name__)


@nested_dataclass(kw_only=True)
class WordCountGenerationConfig(GenerationTaskConfig):
    # Add a custom flag that controls whether to do a verification step
    verify: bool = False


cs = hydra.core.config_store.ConfigStore.instance()
cs.store(name="base_generation_config", node=WordCountGenerationConfig)


class WordCountGenerationTask(GenerationTask):
    """Generation task with an optional verification step."""

    async def process_single_datapoint(self, data_point, all_data):
        # Step 1: normal generation
        result = await super().process_single_datapoint(data_point, all_data)

        if not self.cfg.verify:
            return result

        # Step 2: ask the model to verify its own answer
        # (the "sentence" field comes from the dataset prepared in Step 3)
        verify_prompt = (
            f"You were asked to count the words in the following sentence:\n\n"
            f"{data_point['sentence']}\n\n"
            f"Your answer was:\n{result['generation']}\n\n"
            f"Please verify this is correct. "
            f"If it is, repeat the same answer inside \\boxed{{}}. "
            f"If not, provide the corrected answer inside \\boxed{{}}."
        )
        new_data_point = [{"role": "user", "content": verify_prompt}]
        # We use prompt_format=openai as we already prepared the full message
        verify_result = await super().process_single_datapoint(
            new_data_point,
            all_data,
            prompt_format="openai",
        )
        # Replace generation with the verified answer
        result["generation"] = verify_result["generation"]
        return result


GENERATION_TASK_CLASS = WordCountGenerationTask


@hydra.main(version_base=None, config_name="base_generation_config")
def generate(cfg: WordCountGenerationConfig):
    cfg = WordCountGenerationConfig(_init_nested=True, **cfg)
    LOG.info("Config used: %s", cfg)
    task = WordCountGenerationTask(cfg)
    task.generate()


if __name__ == "__main__":
    generate()
```

If you don't need custom generation logic, simply remove the `GENERATION_MODULE` line from `__init__.py` and the default [`nemo_skills.inference.generate`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/generate.py) module will be used.
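
With the custom module in place, the `verify` flag defined in `WordCountGenerationConfig` can be switched on at launch time. A hedged sketch, again assuming extra `++` arguments are forwarded to the generation step (the rest of the command mirrors the full example in Step 8):

```bash
ns eval \
    --cluster=local \
    --server_type=openai \
    --model=nvidia/nemotron-3-nano-30b-a3b \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --benchmarks=word_count \
    --output_dir=/workspace/test-eval \
    ++verify=True
```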

### Step 6 - Custom evaluator

Here is an example of a basic evaluator class. The extra `predicted_answer` and `is_correct` fields will be added to the output jsonl produced by the generation step.

```python title="my_benchmarks/evaluation/word_count.py"
import re

from nemo_skills.evaluation.evaluator.base import BaseEvaluator


class WordCountEvaluator(BaseEvaluator):
    async def eval_single(self, data_point):
        """Extract predicted answer and compare to expected."""
        match = re.search(r"\\boxed\{(\d+)\}", data_point["generation"])
        predicted = int(match.group(1)) if match else None

        return {
            "predicted_answer": predicted,
            "is_correct": predicted == data_point["expected_answer"],
        }
```
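
After evaluation, each line of the output jsonl keeps the original fields and gains the two returned by `eval_single`, roughly like this (other generation metadata omitted; the `generation` text is illustrative):

```json
{"sentence": "The quick brown fox", "expected_answer": 4, "generation": "There are four words, so the answer is \\boxed{4}.", "predicted_answer": 4, "is_correct": true}
```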

This is referenced in `GENERATION_ARGS` using the `module::Class` format:

```
++eval_type=my_benchmarks.evaluation.word_count::WordCountEvaluator
```

### Step 7 - Custom metrics

The metrics class reads the evaluated JSONL and computes summary statistics.

```python title="my_benchmarks/metrics/word_count.py"
from nemo_skills.evaluation.metrics.base import BaseMetrics


class WordCountMetrics(BaseMetrics):
    def _get_score_dict(self, prediction):
        return {"is_correct": prediction.get("is_correct", False)}

    def get_incorrect_sample(self, prediction):
        # used for automatically filtering data based on length
        # (we mark too-long examples as incorrect using this method)
        prediction = prediction.copy()
        prediction["is_correct"] = False
        prediction["predicted_answer"] = None
        return prediction

    def update(self, predictions):
        # the base class provides convenient helpers for calculating
        # common metrics like majority / pass
        super().update(predictions)
        predicted_answers = [pred["predicted_answer"] for pred in predictions]
        self._compute_pass_at_k(
            predictions=predictions,
            predicted_answers=predicted_answers,
        )
        self._compute_majority_at_k(
            predictions=predictions,
            predicted_answers=predicted_answers,
        )
```

Referenced in `__init__.py` as:

```
METRICS_TYPE = "my_benchmarks.metrics.word_count::WordCountMetrics"
```

### Step 8 - Running the benchmark

Install your repo and set the env var:

```bash
cd my-benchmark-repo
pip install -e .
export NEMO_SKILLS_EXTRA_BENCHMARK_MAP=$(pwd)/benchmark_map.json
```

Prepare the data:

```bash
ns prepare_data word_count
```

Run evaluation (using an API model as an example):

```bash
ns eval \
    --cluster=local \
    --server_type=openai \
    --model=nvidia/nemotron-3-nano-30b-a3b \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --benchmarks=word_count \
    --output_dir=/workspace/test-eval
```

View results:

```bash
ns summarize_results --cluster=local /workspace/test-eval
```

## Minimal example

If your benchmark can reuse built-in evaluation and metrics (e.g. the standard math evaluator), you only need two files:

```python title="my_benchmarks/dataset/my_simple_bench/__init__.py"
METRICS_TYPE = "math"
GENERATION_ARGS = "++prompt_config=generic/math ++eval_type=math"
```

```python title="my_benchmarks/dataset/my_simple_bench/prepare.py"
import json
from pathlib import Path

if __name__ == "__main__":
    data_dir = Path(__file__).absolute().parent
    with open(data_dir / "test.jsonl", "wt", encoding="utf-8") as fout:
        fout.write(json.dumps({
            "problem": "What is 2 + 2?",
            "expected_answer": 4,
        }) + "\n")
```

And a `benchmark_map.json`:

```json
{
    "my_simple_bench": "./my_benchmarks/dataset/my_simple_bench"
}
```

No custom generation module, evaluator, or metrics needed.

> **Review discussion** (on the `ignore_if_registered=True` argument in Step 2)
>
> **Reviewer:** when do you anticipate this happening? if multiple custom benchmarks are used? or if you are writing to the same name as an existing dataset?
>
> **Author:** if multiple benchmarks are used. I think the main use-case will be to have a single internal repo with 10s of internal benchmarks. Then each of them has to have this register call, and if a few are specified together, it will fail if we don't ignore registered.
>
> **Reviewer:** what do you mean by "specified together"? Is this targeted at clashing across namespaces (e.g., two different benchmarks register a "my_dataset" dataset)? I'm wary of ignores, because they can make it easy to do something different than you intend.