
Conversation

@Kipok Kipok (Collaborator) commented Feb 12, 2026

Will update in a bit

Will also add tests in a bit, but core logic should be good

Summary by CodeRabbit

  • New Features

    • Support for external/custom benchmarks (prepare, package, evaluate) and improved dataset path resolution.
    • Dynamic evaluator/metrics dispatch to load external implementations.
    • Optional prompt_format passthrough for generation/eval tasks.
  • Improvements

    • Simplified dataset preparation flow and packaging with clearer data-dir handling and container path resolution.
    • Reduced CLI surface for eval/summarize; streamlined metric initialization.
  • Documentation

    • Added a guide for creating and integrating external benchmarks.
  • Tests

    • New integration tests covering external benchmark prepare+eval.
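For a concrete picture of the external/custom benchmark support summarized above, here is a hedged sketch of what an external benchmark package's dataset __init__.py might declare. The constant names mirror the METRICS_TYPE / GENERATION_ARGS / GENERATION_MODULE constants this PR adds for bfcl; the package layout, the values, and the benchmark_map.json schema shown in comments are assumptions, not taken from the diff.

```python
# my_benchmarks/simple_bench/__init__.py  (hypothetical external benchmark package)
# Constant names follow the bfcl __init__.py files touched in this PR; whether
# external benchmarks use exactly this contract is an assumption.

METRICS_TYPE = "word_count"                      # which metrics implementation to load
GENERATION_MODULE = "my_benchmarks.generation"   # module driving generation for this benchmark
GENERATION_ARGS = "++eval_type=my_benchmarks.evaluator::evaluate"  # '::' selects a function/class

# A benchmark_map.json in the external repo (mentioned in the test fixtures) would then
# map a short name to this package, e.g. {"my_simple_bench": "my_benchmarks/simple_bench"},
# so the benchmark can be referenced either by name or by path.
```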

Kipok added 23 commits February 10, 2026 16:31
greptile-apps bot (Contributor) commented Feb 12, 2026

Too many files changed for review. (173 files found, 100 file limit)

coderabbitai bot (Contributor) commented Feb 12, 2026

📝 Walkthrough

Refactors dataset configuration (removes DATASET_GROUP widely; adds REQUIRES_DATA_DIR / HAS_DYNAMIC_INIT), modularizes RULER/RULER2 prepare flow, adds external benchmark support and tests, enhances evaluator dispatch and dataset resolution utilities, updates pipelines/packaging, and documents custom benchmarks.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dataset config removals / flags: nemo_skills/dataset/.../__init__.py (many files), nemo_skills/dataset/ruler2/__init__.py, nemo_skills/dataset/ruler/__init__.py | Removed DATASET_GROUP across ~70+ datasets; several modules now expose REQUIRES_DATA_DIR = True and/or HAS_DYNAMIC_INIT = True instead. |
| BFCL constants added: nemo_skills/dataset/bfcl_v3/.../__init__.py, nemo_skills/dataset/bfcl_v4/.../__init__.py, nemo_skills/dataset/bfcl_v3/prepare.py | Added METRICS_TYPE, GENERATION_ARGS, GENERATION_MODULE to many bfcl v3/v4 subpackages; removed DEFAULT_SETTINGS and stopped auto-writing init in bfcl_v3 prepare. |
| RULER modularization: nemo_skills/dataset/ruler/prepare.py, .../prepare_common.py, .../prepare_data.py, .../prepare_init.py | Split monolithic RULER prepare into smaller modules: common arg parsing, data preparation, and init generation; top-level prepare delegates to new scripts. |
| RULER2 modularization: nemo_skills/dataset/ruler2/prepare.py, .../prepare_common.py, .../prepare_data.py, .../prepare_init.py | Similar refactor for ruler2: replaced large prepare with delegated prepare_init.py and prepare_data.py plus helpers. |
| Dataset utils & external benchmarks: nemo_skills/dataset/utils.py | Introduced get_dataset_path, extra benchmark map loading, _load_external_dataset, and new dataset resolution logic supporting path/map/built-in lookups; removed cluster-based helpers. |
| Evaluation dispatch: nemo_skills/evaluation/evaluator/__init__.py | Added _resolve_eval_type supporting module/path imports and function/class dispatch; improved error messages and register_evaluator option. |
| Metrics & pipeline simplification: nemo_skills/evaluation/metrics/compute_metrics.py, nemo_skills/pipeline/summarize_results.py, nemo_skills/pipeline/eval.py | Removed data_dir/extra_datasets plumbing from ComputeMetrics and pipeline commands; summarize_results and eval no longer accept extra_datasets options. |
| Prepare flow & CLI: nemo_skills/dataset/prepare.py, nemo_skills/pipeline/prepare_data.py | Added parse_prepare_cli_arguments, changed prepare_datasets signature, and refactored prepare_data to handle external datasets, dynamic init, data_dir/container path resolution, and split vs non-split flows. |
| Generation / prompt_format plumbing: nemo_skills/inference/generate.py and many inference/eval modules (autoformalize.py, check_contamination.py, prover.py, inference/eval/*, recipes/*) | Added optional prompt_format parameter to fill_prompt/process_single_datapoint signatures and propagated it through generation/eval call sites. |
| Packaging / container path helpers: nemo_skills/pipeline/utils/packager.py, nemo_skills/pipeline/utils/eval.py, nemo_skills/pipeline/utils/__init__.py | Added resolve_external_data_path, updated register_external_repo to support ignore-on-register, added local_data_path and metrics_type handling in benchmark args, and re-exported resolve_external_data_path. |
| External benchmark test fixtures & implementations: tests/data/dummy_external_benchmark/**, tests/gpu-tests/test_external_benchmark_eval.py, tests/test_external_benchmarks.py | Added a dummy external-benchmark repo (prepare, dataset init.py, evaluator, metrics, prompts, benchmark_map.json) and comprehensive tests including a GPU/container integration test for external benchmark prepare+eval. |
| Docs, mkdocs, .gitignore: docs/evaluation/custom-benchmarks.md, docs/evaluation/index.md, mkdocs.yml, .gitignore | Added new custom-benchmarks doc and nav entry; updated .gitignore to keep ruler2 path and remove bfcl_v3/v4 ignore patterns. |
| Tests & cleanup: tests/test_datasets.py, tests/test_configs.py | Removed dataset DATASET_GROUP validation test; added small test config change (METRICS_TYPE in mock) and many new external-benchmark tests. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Add RULERv2 #1106 — touches RULER/ruler2 prepare scripts and dataset init logic (closely related to RULER modularization here).
  • BFCLv4 support #908 — BFCL-related changes and prepare logic overlaps with bfcl_v3/v4 edits in this PR.
  • Add compute eval #1158 — evaluator registration/dispatch changes; overlaps with dynamic evaluator resolution added here.

Suggested labels

enhancement, run GPU tests

Suggested reviewers

  • gwarmstrong
  • activatedgeek
  • ekmb
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.61%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Merge Conflict Detection | ⚠️ Warning | ❌ Merge conflicts detected (128 files), listed below. | Resolve conflicts locally and push changes to this branch. |

⚔️ .github/workflows/tests.yml (content)
⚔️ .gitignore (content)
⚔️ dockerfiles/Dockerfile.sandbox (content)
⚔️ docs/evaluation/index.md (content)
⚔️ docs/evaluation/scientific-knowledge.md (content)
⚔️ docs/index.md (content)
⚔️ docs/releases/openreasoning/training.md (content)
⚔️ mkdocs.yml (content)
⚔️ nemo_skills/dataset/aai/__init__.py (content)
⚔️ nemo_skills/dataset/aalcr/__init__.py (content)
⚔️ nemo_skills/dataset/aime24/__init__.py (content)
⚔️ nemo_skills/dataset/aime25/__init__.py (content)
⚔️ nemo_skills/dataset/algebra222/__init__.py (content)
⚔️ nemo_skills/dataset/amc23/__init__.py (content)
⚔️ nemo_skills/dataset/answer-judge/__init__.py (content)
⚔️ nemo_skills/dataset/apex-shortlist/__init__.py (content)
⚔️ nemo_skills/dataset/arena-hard-v2/__init__.py (content)
⚔️ nemo_skills/dataset/arena-hard/__init__.py (content)
⚔️ nemo_skills/dataset/asdiv/__init__.py (content)
⚔️ nemo_skills/dataset/asr-leaderboard/__init__.py (content)
⚔️ nemo_skills/dataset/audiobench/__init__.py (content)
⚔️ nemo_skills/dataset/audiobench/judge/__init__.py (content)
⚔️ nemo_skills/dataset/audiobench/nonjudge/__init__.py (content)
⚔️ nemo_skills/dataset/beyond-aime/__init__.py (content)
⚔️ nemo_skills/dataset/bfcl_v3/__init__.py (content)
⚔️ nemo_skills/dataset/bfcl_v3/prepare.py (content)
⚔️ nemo_skills/dataset/bfcl_v4/__init__.py (content)
⚔️ nemo_skills/dataset/bigcodebench/__init__.py (content)
⚔️ nemo_skills/dataset/birdbench/__init__.py (content)
⚔️ nemo_skills/dataset/brumo25/__init__.py (content)
⚔️ nemo_skills/dataset/challenge19/__init__.py (content)
⚔️ nemo_skills/dataset/college_math/__init__.py (content)
⚔️ nemo_skills/dataset/comp-math-24-25/__init__.py (content)
⚔️ nemo_skills/dataset/compute-eval/__init__.py (content)
⚔️ nemo_skills/dataset/flores200/__init__.py (content)
⚔️ nemo_skills/dataset/frontierscience-olympiad/__init__.py (content)
⚔️ nemo_skills/dataset/gaokao2023en/__init__.py (content)
⚔️ nemo_skills/dataset/gpqa/__init__.py (content)
⚔️ nemo_skills/dataset/gsm-plus/__init__.py (content)
⚔️ nemo_skills/dataset/gsm8k/__init__.py (content)
⚔️ nemo_skills/dataset/hendrycks_math/__init__.py (content)
⚔️ nemo_skills/dataset/hle/__init__.py (content)
⚔️ nemo_skills/dataset/hmmt_feb25/__init__.py (content)
⚔️ nemo_skills/dataset/hmmt_nov25/__init__.py (content)
⚔️ nemo_skills/dataset/human-eval-infilling/__init__.py (content)
⚔️ nemo_skills/dataset/human-eval/__init__.py (content)
⚔️ nemo_skills/dataset/icpc/__init__.py (content)
⚔️ nemo_skills/dataset/ifbench/__init__.py (content)
⚔️ nemo_skills/dataset/ifeval/__init__.py (content)
⚔️ nemo_skills/dataset/imo-answerbench/__init__.py (content)
⚔️ nemo_skills/dataset/imo-gradingbench/__init__.py (content)
⚔️ nemo_skills/dataset/imo-proofbench/__init__.py (content)
⚔️ nemo_skills/dataset/ioi/__init__.py (content)
⚔️ nemo_skills/dataset/librispeech-pc/__init__.py (content)
⚔️ nemo_skills/dataset/livebench-coding/__init__.py (content)
⚔️ nemo_skills/dataset/livecodebench-cpp/__init__.py (content)
⚔️ nemo_skills/dataset/livecodebench-cpp/prepare.py (content)
⚔️ nemo_skills/dataset/livecodebench-pro/__init__.py (content)
⚔️ nemo_skills/dataset/livecodebench-pro/prepare.py (content)
⚔️ nemo_skills/dataset/livecodebench/__init__.py (content)
⚔️ nemo_skills/dataset/livecodebench/prepare.py (content)
⚔️ nemo_skills/dataset/math-500/__init__.py (content)
⚔️ nemo_skills/dataset/math-odyssey/__init__.py (content)
⚔️ nemo_skills/dataset/mawps/__init__.py (content)
⚔️ nemo_skills/dataset/mbpp/__init__.py (content)
⚔️ nemo_skills/dataset/minerva_math/__init__.py (content)
⚔️ nemo_skills/dataset/minif2f/__init__.py (content)
⚔️ nemo_skills/dataset/mmau-pro/__init__.py (content)
⚔️ nemo_skills/dataset/mmau-pro/closed_form/__init__.py (content)
⚔️ nemo_skills/dataset/mmlu-pro/__init__.py (content)
⚔️ nemo_skills/dataset/mmlu-prox/__init__.py (content)
⚔️ nemo_skills/dataset/mmlu-redux/__init__.py (content)
⚔️ nemo_skills/dataset/mmlu/__init__.py (content)
⚔️ nemo_skills/dataset/mmmu-pro/__init__.py (content)
⚔️ nemo_skills/dataset/mobench/__init__.py (content)
⚔️ nemo_skills/dataset/mrcr/__init__.py (content)
⚔️ nemo_skills/dataset/musan/__init__.py (content)
⚔️ nemo_skills/dataset/olympiadbench/__init__.py (content)
⚔️ nemo_skills/dataset/omni-math/__init__.py (content)
⚔️ nemo_skills/dataset/omniscience/__init__.py (content)
⚔️ nemo_skills/dataset/open-proof-corpus-judge/__init__.py (content)
⚔️ nemo_skills/dataset/prepare.py (content)
⚔️ nemo_skills/dataset/proof-arena-judge/__init__.py (content)
⚔️ nemo_skills/dataset/proof-bench-judge/__init__.py (content)
⚔️ nemo_skills/dataset/proofnet/__init__.py (content)
⚔️ nemo_skills/dataset/putnam-bench/__init__.py (content)
⚔️ nemo_skills/dataset/ruler/__init__.py (content)
⚔️ nemo_skills/dataset/ruler/prepare.py (content)
⚔️ nemo_skills/dataset/ruler2/__init__.py (content)
⚔️ nemo_skills/dataset/ruler2/prepare.py (content)
⚔️ nemo_skills/dataset/scicode/__init__.py (content)
⚔️ nemo_skills/dataset/simpleqa/__init__.py (content)
⚔️ nemo_skills/dataset/supergpqa/__init__.py (content)
⚔️ nemo_skills/dataset/svamp/__init__.py (content)
⚔️ nemo_skills/dataset/swe-bench-multilingual/__init__.py (content)
⚔️ nemo_skills/dataset/swe-bench/__init__.py (content)
⚔️ nemo_skills/dataset/swe-rebench/__init__.py (content)
⚔️ nemo_skills/dataset/utils.py (content)
⚔️ nemo_skills/dataset/wmt24pp/__init__.py (content)
⚔️ nemo_skills/evaluation/evaluator/__init__.py (content)
⚔️ nemo_skills/evaluation/evaluator/livecodebench.py (content)
⚔️ nemo_skills/evaluation/metrics/compute_metrics.py (content)
⚔️ nemo_skills/evaluation/metrics/map_metrics.py (content)
⚔️ nemo_skills/evaluation/metrics/math_metrics.py (content)
⚔️ nemo_skills/inference/autoformalize.py (content)
⚔️ nemo_skills/inference/check_contamination.py (content)
⚔️ nemo_skills/inference/eval/arena_judge.py (content)
⚔️ nemo_skills/inference/eval/bfcl.py (content)
⚔️ nemo_skills/inference/eval/compute_eval.py (content)
⚔️ nemo_skills/inference/eval/scicode.py (content)
⚔️ nemo_skills/inference/eval/swebench.py (content)
⚔️ nemo_skills/inference/generate.py (content)
⚔️ nemo_skills/inference/prover.py (content)
⚔️ nemo_skills/pipeline/eval.py (content)
⚔️ nemo_skills/pipeline/prepare_data.py (content)
⚔️ nemo_skills/pipeline/summarize_results.py (content)
⚔️ nemo_skills/pipeline/utils/__init__.py (content)
⚔️ nemo_skills/pipeline/utils/eval.py (content)
⚔️ nemo_skills/pipeline/utils/packager.py (content)
⚔️ nemo_skills/prompt/config/gpt-oss/livecodebench.yaml (content)
⚔️ nemo_skills/prompt/config/judge/imo_answerbench.yaml (content)
⚔️ recipes/asr_tts/riva_generate.py (content)
⚔️ recipes/proof-gen-verification/scripts/script_generation.py (content)
⚔️ requirements/code_execution.txt (content)
⚔️ requirements/main.txt (content)
⚔️ tests/gpu-tests/run_qwen.sh (content)
⚔️ tests/test_configs.py (content)
⚔️ tests/test_datasets.py (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title 'Adding an option to store benchmarks in external repo' clearly describes the primary change: enabling external repository support for benchmark storage. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
docs/evaluation/custom-benchmarks.md (1)

319-323: Indented code block on line 322 doesn't match the fenced style used elsewhere.

The note block content on line 322 uses an indented code block. This is flagged by markdownlint (MD046) for inconsistent code block style. Consider using a fenced block instead for consistency.

tests/data/dummy_external_benchmark/my_benchmarks/metrics/word_count.py (1)

19-20: Use direct dictionary access for is_correct per coding guidelines.

Line 20 uses prediction.get("is_correct", False) — if is_correct is expected to be present on predictions (set by the evaluator), use direct access to fail fast on malformed data. Line 34 already uses direct access for predicted_answer.

Suggested fix
     def _get_score_dict(self, prediction):
-        return {"is_correct": prediction.get("is_correct", False)}
+        return {"is_correct": prediction["is_correct"]}

As per coding guidelines: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."

tests/gpu-tests/test_external_benchmark_eval.py (1)

112-129: Consider also asserting output for the path-based benchmark.

The eval runs two benchmarks (my_simple_bench and simple_bench_path) but only the map-name variant's output and metrics are validated. Adding a check for the path-based benchmark output would strengthen coverage.
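A rough sketch of what that extra assertion could look like is below; the directory layout, file names, and metric keys are assumptions for illustration, not taken from the actual test fixture.

```python
import json
from pathlib import Path


def check_simple_bench_path_output(output_dir: str) -> None:
    # Hypothetical layout: the real test may store results elsewhere.
    metrics_file = Path(output_dir) / "eval-results" / "simple_bench_path" / "metrics.json"
    assert metrics_file.exists(), f"missing output for path-based benchmark: {metrics_file}"
    metrics = json.loads(metrics_file.read_text())
    # At least one aggregation should be reported for the benchmark.
    assert metrics, "empty metrics for simple_bench_path"
```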

tests/test_external_benchmarks.py (2)

240-245: Use a raw string for the match regex pattern.

The .* in the match string contains regex metacharacters. While it works, using a raw string is more explicit and avoids the RUF043 warning.

Suggested fix
-        with pytest.raises(RuntimeError, match="Expected .* to exist"):
+        with pytest.raises(RuntimeError, match=r"Expected .* to exist"):

201-220: Nit: prefix unused data_path with _.

Multiple tests unpack data_path but never use it. Prefix with _ to signal intent and silence linter warnings (RUF059).

Example fix (apply similarly to lines 208, 214, 219)
-        module, data_path = get_dataset_module(word_count_path)
+        module, _data_path = get_dataset_module(word_count_path)


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/inference/generate.py (1)

583-591: ⚠️ Potential issue | 🔴 Critical

Breaking change: ArenaJudge will crash when calling super().process_single_datapoint().

ArenaJudge.process_single_datapoint calls super().process_single_datapoint(gen_base_data, all_data) at lines 152-153. This invokes the base class's process_single_datapoint, which then calls self.fill_prompt(data_point, all_data, prompt_format) as a positional call (line 692). Since self is an ArenaJudge instance, this attempts to pass 3 positional arguments to ArenaJudge.fill_prompt, which only accepts 2 parameters. This will raise TypeError: fill_prompt() takes 3 positional arguments but 4 were given.
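For intuition, here is a minimal standalone repro of that signature mismatch; the class and method names below are stand-ins, not the real ArenaJudge code.

```python
class Base:
    def fill_prompt(self, data_point, data, prompt_format=None):
        return "base prompt"

    def process(self, data_point, data, prompt_format=None):
        # Positional call, mirroring the base generation task's call site.
        return self.fill_prompt(data_point, data, prompt_format)


class Judge(Base):
    def fill_prompt(self, data_point, data):  # missing the new parameter
        return "judge prompt"


try:
    Judge().process({}, [], "openai")
except TypeError as err:
    print(err)  # e.g. "fill_prompt() takes 3 positional arguments but 4 were given"
```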

To fix this:

  1. Pass prompt_format as a keyword argument in the base class:
     -"prompt": self.fill_prompt(data_point, all_data, prompt_format),
     +"prompt": self.fill_prompt(data_point, all_data, prompt_format=prompt_format),
  2. Update ArenaJudge.fill_prompt to accept the new parameter:
     -def fill_prompt(self, data_point, data):
     +def fill_prompt(self, data_point, data, prompt_format=None):

Both changes are needed for full compatibility.

🤖 Fix all issues with AI agents
In `@docs/evaluation/custom-benchmarks.md`:
- Around line 198-208: The example calls LOG.info in the generate function but
never defines LOG; add the logger setup by importing logging and get_logger_name
(from nemo_skills.utils) and creating LOG =
logging.getLogger(get_logger_name(__file__)); ensure these imports and the LOG
definition appear near the top of the example so LOG is available when
generate(cfg: WordCountGenerationConfig) calls LOG.info.
- Around line 264-277: In the update method, fix the NameError by passing the
correct variable name to _compute_pass_at_k: replace the incorrect
predicted_answer with the defined predicted_answers; specifically, in the
update(self, predictions) body ensure the call to _compute_pass_at_k uses
predicted_answers (the list defined from predictions) so it matches the later
_compute_majority_at_k call and the local variable name.

In `@nemo_skills/dataset/bfcl_v3/parallel/__init__.py`:
- Around line 1-3: The file is missing the standard NVIDIA Apache 2.0 license
header; add the exact repo-standard copyright/license header comment block to
the top of this __init__.py (and any other new bfcl_v3 __init__.py files) above
the existing constants METRICS_TYPE, GENERATION_ARGS, and GENERATION_MODULE so
the file matches other files in the PR.

In `@nemo_skills/dataset/prepare.py`:
- Around line 36-49: The CLI flag --retries currently defaults to 0 while the
prepare_datasets function signature defaults to 3, causing inconsistent
behavior; pick one canonical default and make both match (either change the
parser add_argument call for "--retries" to default=3 or change the
prepare_datasets signature to retries=0) so callers get the same retry behavior
whether invoked from the CLI or as a library function; update the parser
add_argument("--retries", ...) and/or the prepare_datasets(...) retries
parameter accordingly.

In `@nemo_skills/dataset/ruler/prepare_common.py`:
- Around line 33-39: The message MISSING_RULER_ARGS_MESSAGE currently begins
with "ERROR:" but parse_args_and_prepare_args exits with SystemExit(0),
producing a success status while signaling an error; fix by making them
consistent: either remove the "ERROR:" prefix from MISSING_RULER_ARGS_MESSAGE if
skipping is intentional, or change the exit in parse_args_and_prepare_args to
SystemExit(1) to indicate failure; locate and update the symbols
MISSING_RULER_ARGS_MESSAGE and the exit call in parse_args_and_prepare_args
(also consider how prepare.py's subprocess `check=True` will interpret the
chosen exit code) so the log text and exit code match the intended behavior.

In `@nemo_skills/dataset/ruler/prepare_data.py`:
- Around line 49-57: The git-LFS check in the "if 'cwe' in tasks" block
currently only catches subprocess.CalledProcessError but will crash with a
FileNotFoundError when git is not installed; update the exception handling
around the subprocess.run(["git", "lfs", "--version"]) call in prepare_data.py
to catch both subprocess.CalledProcessError and FileNotFoundError (or a broad
OSError) and then print the existing friendly message and exit(1) so missing git
or missing git-lfs both produce the same helpful output.

In `@nemo_skills/dataset/ruler2/prepare_data.py`:
- Around line 25-438: Many near-duplicate functions (e.g.,
prepare_mk_niah_basic, prepare_mk_niah_easy, prepare_mv_niah_basic,
prepare_qa_hard, etc.) repeat subprocess.run logic with only module name
(prepare_niah, prepare_mmlu, prepare_qa) and a few flag values changing; replace
them with one generic runner (e.g., run_prepare_task) that accepts a task key
and common params (output_folder, tokenizer_type, tokenizer_path, length,
dataset_size) and use a declarative dict mapping task keys to module name plus
per-task flag/value dicts; the runner should build the argv list by starting
with ["python","-m", module] then iterating the flag dict to append "--flag",
str(value), call subprocess.run(..., check=True), and replace all prepare_*
callers with a single call to run_prepare_task(task_key, ...).
- Around line 479-491: The current main block uses parse_known_args and assigns
unknown to `_`, silently discarding unrecognized CLI flags; change this to
either call parser.parse_args() to let argparse reject unknown args, or keep
parse_known_args but immediately check the returned unknown list and raise a
clear error (or call parser.error()) if it's non-empty; update the block around
build_prepare_parser, parse_known_args, and the call site before prepare_dataset
to perform this validation so typos/unsupported flags are not ignored.

In `@nemo_skills/evaluation/evaluator/__init__.py`:
- Around line 174-177: The debug print in the class-instantiation branch should
be removed or converted to a proper logger call; replace the line
`print(f"evaluator: {evaluator}")` with either nothing or `LOG.debug("evaluator:
%s", evaluator)` (or equivalent) inside the block where `is_class` is true after
`evaluator = obj(eval_config)` so you don't emit noisy stdout during
`evaluator.eval_full()` runs.

In `@nemo_skills/pipeline/eval.py`:
- Around line 745-746: The current f-string will inject the literal "None" when
both metric_type and benchmark_args.metrics_type are None; compute an effective
value (e.g., effective_metric_type = metric_type or benchmark_args.metrics_type)
and only append the "--metric_type=..." flag to the command when
effective_metric_type is truthy, so summarize_results never receives the string
"None" as a metric type.

In `@nemo_skills/pipeline/utils/packager.py`:
- Around line 191-194: The loop that builds include_patterns iterates over
dataset_dir.rglob("*.jsonl") and currently calls
include_pattern_relative_paths.append(str(nemo_skills_dir.parent)) inside that
loop, creating duplicate entries; move the append call outside the for f in
dataset_dir.rglob("*.jsonl") loop (or append once conditionally after detecting
at least one JSONL) so include_pattern_relative_paths only adds
str(nemo_skills_dir.parent) a single time when dataset files exist; update the
block around dataset_dir, include_patterns and include_pattern_relative_paths in
packager.py accordingly.
🧹 Nitpick comments (17)
nemo_skills/inference/generate.py (1)

583-585: Add type hints for the new prompt_format parameters.

Per coding guidelines, simple types should have type hints.

-    def fill_prompt(self, data_point, data, prompt_format=None):
+    def fill_prompt(self, data_point, data, prompt_format: str | None = None):
-    async def process_single_datapoint(self, data_point, all_data, prompt_format=None):
+    async def process_single_datapoint(self, data_point, all_data, prompt_format: str | None = None):

As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code"

Also applies to: 680-681

nemo_skills/dataset/ruler/prepare_common.py (1)

89-94: Add return type hint.

Per project guidelines, use type hints for simple types.

Suggested fix
-def parse_args_and_prepare_args(parser: argparse.ArgumentParser):
+def parse_args_and_prepare_args(parser: argparse.ArgumentParser) -> tuple[argparse.Namespace, str]:
nemo_skills/dataset/ruler2/prepare_common.py (1)

75-77: Add return type hint.

Same as the ruler counterpart — add a return type hint for consistency.

Suggested fix
-def parse_known_args(parser: argparse.ArgumentParser):
+def parse_known_args(parser: argparse.ArgumentParser) -> tuple[argparse.Namespace, list[str]]:
nemo_skills/dataset/ruler/prepare_data.py (2)

60-60: Pass a string instead of a single-element list when using shell=True.

When shell=True, passing a list is misleading — the first element becomes the shell command string, and subsequent elements become arguments to the shell itself (not the command). Use a plain string here.

Suggested fix
-    subprocess.run(["pip install wonderwords html2text tenacity"], check=True, shell=True)
+    subprocess.run("pip install wonderwords html2text tenacity", check=True, shell=True)

62-69: Manual __enter__/__exit__ on TemporaryDirectory is fragile.

Calling dunder methods directly bypasses the context manager protocol's guarantees. Consider restructuring so the conditional temp directory uses a proper context manager or a simpler pattern:

Suggested refactor
-    if tmp_data_dir is not None:
-        tmpdirname = tmp_data_dir
-        Path(tmpdirname).mkdir(parents=True, exist_ok=True)
-        tmpdir_context = None
-    else:
-        tmpdir_context = tempfile.TemporaryDirectory()
-        tmpdirname = tmpdir_context.__enter__()
-
-    try:
+    with tempfile.TemporaryDirectory() as _tmpdir:
+        if tmp_data_dir is not None:
+            tmpdirname = tmp_data_dir
+            Path(tmpdirname).mkdir(parents=True, exist_ok=True)
+        else:
+            tmpdirname = _tmpdir
+

This removes the manual __enter__/__exit__ and the try/finally block entirely. The unused TemporaryDirectory is cheap when tmp_data_dir is provided.

nemo_skills/dataset/ruler2/prepare_init.py (1)

61-69: Silently discarding unknown CLI arguments may hide user errors.

Line 64 discards unknown args. Since prepare.py forwards all sys.argv to both prepare_init.py and prepare_data.py, this is understandable — init doesn't need data-prep args. However, if run standalone, typos or unsupported flags will be silently ignored.

Consider at minimum logging the discarded args for debuggability, or documenting that this script is intended to be invoked via prepare.py.
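A minimal self-contained version of the "log the discarded args" option is sketched below; the argument names and logger setup are made up for illustration, not taken from prepare_init.py.

```python
import argparse
import logging

LOG = logging.getLogger(__name__)

parser = argparse.ArgumentParser()
parser.add_argument("--setup")  # placeholder argument, not the real parser definition
args, unknown = parser.parse_known_args()
if unknown:
    # Surface what was discarded so typos are not silently ignored.
    LOG.warning("Ignoring unrecognized arguments: %s", unknown)
```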

nemo_skills/dataset/ruler2/prepare_data.py (3)

459-460: Remove commented-out code.

The commented-out pip install on line 460 is dead code. If the dependency installation is needed, it should be documented or handled in a setup step, not left as a comment in runtime code.


441-476: No validation of task names — KeyError with no helpful message.

If a user passes an invalid task name, prepare_task[task] on line 466 raises a bare KeyError. Consider validating upfront with a clear error message listing available tasks.

Proposed fix
+    invalid_tasks = set(tasks) - set(prepare_task.keys())
+    if invalid_tasks:
+        raise ValueError(f"Unknown tasks: {invalid_tasks}. Available: {list(prepare_task.keys())}")
+
     with concurrent.futures.ThreadPoolExecutor() as executor:

441-441: Add type hints to function signatures.

All public functions in this file lack type hints. At minimum, prepare_dataset and the individual prepare_* functions should annotate their parameters with basic types (str, int, list[str], etc.). As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."

nemo_skills/pipeline/utils/packager.py (1)

82-129: Potential relative_to failure when repo_path is a subdirectory of the git root.

Line 106 validates that local_data_path is relative to repo_meta.path, but line 119 calls local_data_path.relative_to(effective_root) where effective_root is the git root. If repo_meta.path is registered as a parent of the actual git root (unlikely but not prevented by RepoMetadata), or if the repo structure has symlinks that break the ancestor chain, this relative_to call would raise an unhandled ValueError.

The happy path (repo_path is at or below git_root) is safe because git_root is always an ancestor of repo_path, hence also of local_data_path. Just something to be aware of in edge cases.
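For reference, this is how pathlib behaves in the edge case described above (the paths are made up):

```python
from pathlib import Path

# relative_to succeeds only when the base is an ancestor of the path:
print(Path("/repos/skills/nemo_skills/dataset/x.jsonl").relative_to("/repos/skills"))

try:
    Path("/elsewhere/x.jsonl").relative_to("/repos/skills")
except ValueError as err:
    # This is the unhandled error the comment warns about.
    print(f"not an ancestor: {err}")
```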

nemo_skills/pipeline/prepare_data.py (2)

213-225: --prepare_entrypoint prepare_data.py is appended after _build_command which already joined prepare_unknown_args.

Line 225 appends --prepare_entrypoint prepare_data.py after the unknown args were already joined on line 113 (inside _build_command). The resulting command would look like:

python -m nemo_skills.dataset.prepare <datasets> <unknown_args> --prepare_entrypoint prepare_data.py

This works because argparse accepts named arguments in any position, but the resulting command string reads oddly. Consider passing prepare_entrypoint through the unknown args or appending it inside _build_command for clarity.


227-232: When executor == "none", _get_container_dataset_path returns container paths that won't exist locally.

After tracing the control flow, it appears that data_dir being set always implies containerized execution (lines 173-178 require cluster when data_dir is set, and line 181-185 only drops to executor="none" when data_dir is absent). So this path is safe.

However, this invariant is implicit and fragile — a future change to the early-exit logic could silently break the cp commands. Consider adding a defensive assertion or comment.

💡 Suggested comment for future maintainers
     if data_dir:
+        # data_dir implies containerized execution (executor != "none"),
+        # so container paths are valid for cp commands below.
         command += f" && mkdir -p {data_dir}"
nemo_skills/evaluation/evaluator/__init__.py (2)

92-117: _resolve_eval_type — clean centralized dispatch.

The dual-format support (built-in keys and :: path format) is well-structured. One minor note: getattr(module, attr_str) on line 109 will raise an AttributeError with a generic message if the attribute doesn't exist. Consider wrapping it to provide a more descriptive error mentioning the eval_type string.


147-154: supports_single_eval instantiates the evaluator just to check a capability flag.

obj(config) (line 153) constructs a full evaluator instance solely to call supports_single_eval(). If the constructor has side effects or is expensive, this is wasteful. Since supports_single_eval in BaseEvaluator only checks whether eval_single is overridden (a class-level property), this could be a static/class-level check instead. Low priority given this matches the previous pattern.
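A sketch of the class-level alternative suggested above; the method names follow the comment, while the rest (an async eval_single on the base class) is assumed.

```python
class BaseEvaluator:
    async def eval_single(self, data_point):
        raise NotImplementedError

    @classmethod
    def supports_single_eval(cls) -> bool:
        # True only if a subclass overrides eval_single; no instance is constructed.
        return cls.eval_single is not BaseEvaluator.eval_single


class MyEvaluator(BaseEvaluator):
    async def eval_single(self, data_point):
        return {"is_correct": True}


assert MyEvaluator.supports_single_eval() and not BaseEvaluator.supports_single_eval()
```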

docs/evaluation/custom-benchmarks.md (1)

27-27: Add language specifiers to fenced code blocks.

Lines 27, 135, 240, and 281 have fenced code blocks without a language identifier. Consider adding text or the appropriate language (e.g., bash) to satisfy linting and improve rendering.

nemo_skills/dataset/utils.py (2)

62-75: Consider caching get_extra_benchmark_map() to avoid repeated file I/O.

get_extra_benchmark_map() is called here and again inside get_dataset_module(). Each call re-reads and parses the JSON file. If these functions are called in a loop over multiple datasets, the map file will be loaded on every iteration. A simple @functools.lru_cache on get_extra_benchmark_map (keyed on the env var value) would eliminate redundant reads.
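One way to express the caching idea is sketched below; the environment variable name is a placeholder, not the one the code actually uses.

```python
import functools
import json
import os


@functools.lru_cache(maxsize=None)
def _load_benchmark_map(map_path: str) -> dict:
    # Cached per path, so a changed env var value still loads the new file.
    with open(map_path) as f:
        return json.load(f)


def get_extra_benchmark_map() -> dict:
    map_path = os.environ.get("NEMO_SKILLS_EXTRA_DATASETS_MAP", "")  # placeholder name
    # Return a copy so callers cannot mutate the cached dict.
    return dict(_load_benchmark_map(map_path)) if map_path else {}
```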


55-59: Add type hints to new functions.

Per coding guidelines, simple types should be annotated. All new public functions (get_dataset_name, get_dataset_path, get_extra_benchmark_map, get_dataset_module, get_default_dataset_module) and the internal _load_external_dataset lack parameter and return type hints.

For example:

-def get_dataset_name(dataset):
+def get_dataset_name(dataset: str) -> str:
-def get_dataset_path(dataset):
+def get_dataset_path(dataset: str) -> Path:
-def get_extra_benchmark_map():
+def get_extra_benchmark_map() -> dict[str, str]:

As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."

Also applies to: 62-75, 95-104, 107-111, 114-155

Comment on lines +198 to +208
```python
@hydra.main(version_base=None, config_name="base_generation_config")
def generate(cfg: WordCountGenerationConfig):
    cfg = WordCountGenerationConfig(_init_nested=True, **cfg)
    LOG.info("Config used: %s", cfg)
    task = WordCountGenerationTask(cfg)
    task.generate()


if __name__ == "__main__":
    generate()
```

⚠️ Potential issue | 🟡 Minor

LOG is not defined in this code example.

Line 201 uses LOG.info(...) but the example doesn't import or define LOG. Add the import for completeness:

from nemo_skills.utils import get_logger_name
import logging
LOG = logging.getLogger(get_logger_name(__file__))
🤖 Prompt for AI Agents
In `@docs/evaluation/custom-benchmarks.md` around lines 198 - 208, The example
calls LOG.info in the generate function but never defines LOG; add the logger
setup by importing logging and get_logger_name (from nemo_skills.utils) and
creating LOG = logging.getLogger(get_logger_name(__file__)); ensure these
imports and the LOG definition appear near the top of the example so LOG is
available when generate(cfg: WordCountGenerationConfig) calls LOG.info.

Comment on lines +264 to +277
```python
def update(self, predictions):
    # base class provides convenient helpers for calculating
    # common metrics like majority / pass
    super().update(predictions)
    predicted_answers = [pred["predicted_answer"] for pred in predictions]
    self._compute_pass_at_k(
        predictions=predictions,
        predicted_answers=predicted_answer,
    )
    self._compute_majority_at_k(
        predictions=predictions,
        predicted_answers=predicted_answers,
    )
```

⚠️ Potential issue | 🟡 Minor

Bug in example code: predicted_answer should be predicted_answers.

Line 271 references predicted_answer (singular) but the variable defined on line 268 is predicted_answers (plural). Users copying this example will get a NameError.

Proposed fix
         self._compute_pass_at_k(
             predictions=predictions,
-            predicted_answers=predicted_answer,
+            predicted_answers=predicted_answers,
         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
 def update(self, predictions):
     # base class provides convenient helpers for calculating
     # common metrics like majority / pass
     super().update(predictions)
     predicted_answers = [pred["predicted_answer"] for pred in predictions]
     self._compute_pass_at_k(
         predictions=predictions,
-        predicted_answers=predicted_answer,
+        predicted_answers=predicted_answers,
     )
     self._compute_majority_at_k(
         predictions=predictions,
         predicted_answers=predicted_answers,
     )
```
🤖 Prompt for AI Agents
In `@docs/evaluation/custom-benchmarks.md` around lines 264 - 277, In the update
method, fix the NameError by passing the correct variable name to
_compute_pass_at_k: replace the incorrect predicted_answer with the defined
predicted_answers; specifically, in the update(self, predictions) body ensure
the call to _compute_pass_at_k uses predicted_answers (the list defined from
predictions) so it matches the later _compute_majority_at_k call and the local
variable name.

Comment on lines +1 to +3
METRICS_TYPE = "bfcl"
GENERATION_ARGS = "++eval_type=bfcl"
GENERATION_MODULE = "nemo_skills.inference.eval.bfcl"

⚠️ Potential issue | 🟡 Minor

Missing copyright/license header.

Every other file in this PR includes the standard NVIDIA copyright and Apache 2.0 license header. This new file is missing it. The same likely applies to the other new bfcl_v3 __init__.py files in this PR.

Proposed fix
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 METRICS_TYPE = "bfcl"
 GENERATION_ARGS = "++eval_type=bfcl"
 GENERATION_MODULE = "nemo_skills.inference.eval.bfcl"
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/bfcl_v3/parallel/__init__.py` around lines 1 - 3, The
file is missing the standard NVIDIA Apache 2.0 license header; add the exact
repo-standard copyright/license header comment block to the top of this
__init__.py (and any other new bfcl_v3 __init__.py files) above the existing
constants METRICS_TYPE, GENERATION_ARGS, and GENERATION_MODULE so the file
matches other files in the PR.

Comment on lines +36 to 49

```python
        "--retries",
        type=int,
        default=0,
        help="Number of retries per dataset if preparation fails",
    )
    return parser.parse_known_args(args)


def prepare_datasets(
    datasets=None,
    dataset_groups=None,
    add_lean4_header=False,
    prepare_entrypoint="prepare.py",
    extra_args="",
    parallelism=20,
    retries=3,
```

⚠️ Potential issue | 🟡 Minor

Inconsistent default for retries: 0 in CLI vs 3 in function signature.

The CLI parser (line 38) defaults --retries to 0, but prepare_datasets (line 49) defaults retries to 3. When invoked from the command line, the CLI default wins (0 retries). When called as a library function without specifying retries, the function default wins (3 retries). This silent inconsistency is likely to confuse callers.

Proposed fix — align the defaults
 def prepare_datasets(
     datasets=None,
     prepare_entrypoint="prepare.py",
     extra_args="",
     parallelism=20,
-    retries=3,
+    retries=0,
 ):

Or change the CLI default to 3 if retries are desired by default.

🤖 Prompt for AI Agents
In `@nemo_skills/dataset/prepare.py` around lines 36 - 49, The CLI flag --retries
currently defaults to 0 while the prepare_datasets function signature defaults
to 3, causing inconsistent behavior; pick one canonical default and make both
match (either change the parser add_argument call for "--retries" to default=3
or change the prepare_datasets signature to retries=0) so callers get the same
retry behavior whether invoked from the CLI or as a library function; update the
parser add_argument("--retries", ...) and/or the prepare_datasets(...) retries
parameter accordingly.

Comment on lines +33 to +39
```python
MISSING_RULER_ARGS_MESSAGE = (
    "ERROR: Can't prepare ruler without arguments provided! "
    "Skipping the preparation step.\n"
    "Example ruler prepare command:\n"
    "ns prepare_data ruler --setup llama_128k "
    "--tokenizer_path meta-llama/Llama-3.1-8B-Instruct --max_seq_length 131072"
)
```

⚠️ Potential issue | 🟡 Minor

Message says "ERROR" but SystemExit(0) signals success.

The MISSING_RULER_ARGS_MESSAGE starts with "ERROR:" but parse_args_and_prepare_args exits with code 0. The calling prepare.py uses check=True on subprocess, so this "error" will be silently treated as success and the next script will still run. Either drop the "ERROR" prefix (if skipping is intentional) or use SystemExit(1) (if it's a real failure).

🤖 Prompt for AI Agents
In `@nemo_skills/dataset/ruler/prepare_common.py` around lines 33 - 39, The
message MISSING_RULER_ARGS_MESSAGE currently begins with "ERROR:" but
parse_args_and_prepare_args exits with SystemExit(0), producing a success status
while signaling an error; fix by making them consistent: either remove the
"ERROR:" prefix from MISSING_RULER_ARGS_MESSAGE if skipping is intentional, or
change the exit in parse_args_and_prepare_args to SystemExit(1) to indicate
failure; locate and update the symbols MISSING_RULER_ARGS_MESSAGE and the exit
call in parse_args_and_prepare_args (also consider how prepare.py's subprocess
`check=True` will interpret the chosen exit code) so the log text and exit code
match the intended behavior.

Comment on lines +25 to +438
def prepare_mk_niah_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_niah",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--num_needle_k",
"1",
"--num_needle_v",
"1",
"--num_needle_q",
"1",
"--type_haystack",
"needle",
"--type_needle_k",
"words",
"--type_needle_v",
"numbers",
"--num_digits_v",
"10",
],
check=True,
)


def prepare_mk_niah_easy(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_mmlu",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"mmlu",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--num_order",
"0",
"--task_type",
"retrieve",
"--algo_type",
"single",
],
check=True,
)


def prepare_mk_niah_medium(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_mmlu",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"mmlu",
"--fewshot",
"5",
"--prompt_type",
"instruct",
"--num_order",
"0",
"--task_type",
"solve",
"--algo_type",
"2steps",
],
check=True,
)


def prepare_mk_niah_hard(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_mmlu",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"mmlu",
"--fewshot",
"5",
"--prompt_type",
"instruct",
"--num_order",
"0",
"--task_type",
"solve",
"--algo_type",
"single",
],
check=True,
)


def prepare_mv_niah_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_niah",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--num_needle_k",
"1",
"--num_needle_v",
"4",
"--num_needle_q",
"1",
"--type_haystack",
"needle",
"--type_needle_k",
"words",
"--type_needle_v",
"numbers",
"--num_digits_v",
"10",
],
check=True,
)


def prepare_mv_niah_easy(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_mmlu",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"mmlu",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--num_order",
"4",
"--task_type",
"niah",
"--algo_type",
"single",
],
check=True,
)


def prepare_mv_niah_medium(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_mmlu",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"mmlu",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--num_order",
"4",
"--task_type",
"retrieve",
"--algo_type",
"2steps",
],
check=True,
)


def prepare_mv_niah_hard(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_mmlu",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"mmlu",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--num_order",
"4",
"--task_type",
"retrieve",
"--algo_type",
"single",
],
check=True,
)


def prepare_qa_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_qa",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"hotpotqa",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--task_type",
"retrieve",
"--query_type",
"doc",
],
check=True,
)


def prepare_qa_easy(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_qa",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"hotpotqa",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--task_type",
"retrieve",
"--query_type",
"question",
],
check=True,
)


def prepare_qa_medium(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_qa",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"hotpotqa",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--task_type",
"solve",
"--algo_type",
"2steps",
],
check=True,
)


def prepare_qa_hard(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_qa",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"hotpotqa",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--task_type",
"solve",
"--algo_type",
"single",
],
check=True,
)

🛠️ Refactor suggestion | 🟠 Major

Massive DRY violation — 12 near-identical functions should be collapsed into a data-driven dispatcher.

Every prepare_* function follows the same pattern: build a list of subprocess args from a fixed module name plus task-specific parameters, then call subprocess.run. The only differences are the target module (prepare_niah, prepare_mmlu, prepare_qa) and a handful of flag values. This results in ~400 lines that could be ~40.

♻️ Suggested refactor — declarative task config + single runner
-def prepare_mk_niah_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
-    subprocess.run(
-        [
-            "python",
-            "-m",
-            "nemo_skills.dataset.ruler2.prepare_niah",
-            "--output_folder",
-            output_folder,
-            ...
-        ],
-        check=True,
-    )
-
-# ... (11 more nearly identical functions)
+TASK_CONFIGS: dict[str, dict] = {
+    "mk_niah_basic": {
+        "module": "nemo_skills.dataset.ruler2.prepare_niah",
+        "extra_args": [
+            "--num_needle_k", "1", "--num_needle_v", "1", "--num_needle_q", "1",
+            "--type_haystack", "needle", "--type_needle_k", "words",
+            "--type_needle_v", "numbers", "--num_digits_v", "10",
+        ],
+    },
+    "mk_niah_easy": {
+        "module": "nemo_skills.dataset.ruler2.prepare_mmlu",
+        "extra_args": [
+            "--dataset", "mmlu", "--fewshot", "0", "--prompt_type", "instruct",
+            "--num_order", "0", "--task_type", "retrieve", "--algo_type", "single",
+        ],
+    },
+    # ... one entry per task ...
+}
+
+
+def _run_prepare(
+    task: str,
+    output_folder: str,
+    tokenizer_type: str,
+    tokenizer_path: str,
+    length: int,
+    dataset_size: int,
+) -> None:
+    cfg = TASK_CONFIGS[task]
+    subprocess.run(
+        [
+            "python", "-m", cfg["module"],
+            "--output_folder", output_folder,
+            "--tokenizer_type", tokenizer_type,
+            "--tokenizer_path", tokenizer_path,
+            "--max_seq_length", str(length),
+            "--num_samples", str(dataset_size),
+            "--random_seed", "42",
+            *cfg["extra_args"],
+        ],
+        check=True,
+    )

This eliminates ~350 lines of boilerplate and makes adding a new task a one-liner dict entry instead of a whole new function.

🧰 Tools
🪛 Ruff (0.15.0)

[error] 26-26: subprocess call: check for execution of untrusted input

(S603)


[error] 27-57: Starting a process with a partial executable path

(S607)


[error] 63-63: subprocess call: check for execution of untrusted input

(S603)


[error] 64-92: Starting a process with a partial executable path

(S607)


[error] 98-98: subprocess call: check for execution of untrusted input

(S603)


[error] 99-127: Starting a process with a partial executable path

(S607)


[error] 133-133: subprocess call: check for execution of untrusted input

(S603)


[error] 134-162: Starting a process with a partial executable path

(S607)


[error] 168-168: subprocess call: check for execution of untrusted input

(S603)


[error] 169-199: Starting a process with a partial executable path

(S607)


[error] 205-205: subprocess call: check for execution of untrusted input

(S603)


[error] 206-234: Starting a process with a partial executable path

(S607)


[error] 240-240: subprocess call: check for execution of untrusted input

(S603)


[error] 241-269: Starting a process with a partial executable path

(S607)


[error] 275-275: subprocess call: check for execution of untrusted input

(S603)


[error] 276-304: Starting a process with a partial executable path

(S607)


[error] 310-310: subprocess call: check for execution of untrusted input

(S603)


[error] 311-337: Starting a process with a partial executable path

(S607)


[error] 343-343: subprocess call: check for execution of untrusted input

(S603)


[error] 344-370: Starting a process with a partial executable path

(S607)


[error] 376-376: subprocess call: check for execution of untrusted input

(S603)


[error] 377-403: Starting a process with a partial executable path

(S607)


[error] 409-409: subprocess call: check for execution of untrusted input

(S603)


[error] 410-436: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
In `@nemo_skills/dataset/ruler2/prepare_data.py` around lines 25 - 438, Many
near-duplicate functions (e.g., prepare_mk_niah_basic, prepare_mk_niah_easy,
prepare_mv_niah_basic, prepare_qa_hard, etc.) repeat subprocess.run logic with
only module name (prepare_niah, prepare_mmlu, prepare_qa) and a few flag values
changing; replace them with one generic runner (e.g., run_prepare_task) that
accepts a task key and common params (output_folder, tokenizer_type,
tokenizer_path, length, dataset_size) and use a declarative dict mapping task
keys to module name plus per-task flag/value dicts; the runner should build the
argv list by starting with ["python","-m", module] then iterating the flag dict
to append "--flag", str(value), call subprocess.run(..., check=True), and
replace all prepare_* callers with a single call to run_prepare_task(task_key,
...).

Comment on lines +479 to +491
```python
if __name__ == "__main__":
    parser = build_prepare_parser(description="Prepare RULER2 dataset data.")
    args, unknown = parse_known_args(parser)
    _ = unknown
    prepare_dataset(
        args.tasks,
        args.setup,
        args.max_seq_length,
        args.tokenizer_type,
        args.tokenizer_path,
        args.dataset_size,
    )
    print("RULER2 data preparation completed.")
```

⚠️ Potential issue | 🟠 Major

Unknown CLI arguments are silently discarded.

Line 482 assigns unknown args to _, meaning any typo or unsupported flag passed by the user is silently ignored. This violates the coding guideline about not silently ignoring unused user-passed parameters. Use parser.parse_args() instead to reject unrecognized arguments, or at minimum raise an error if unknown is non-empty.

Proposed fix
-    args, unknown = parse_known_args(parser)
-    _ = unknown
+    args = parser.parse_args()

If parse_known_args is needed for other callers, add a guard here:

     args, unknown = parse_known_args(parser)
-    _ = unknown
+    if unknown:
+        parser.error(f"Unrecognized arguments: {unknown}")

As per coding guidelines: "Avoid silently ignoring unused user-passed parameters; the code should fail if a required argument is not specified or if unsupported arguments are provided."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
 if __name__ == "__main__":
     parser = build_prepare_parser(description="Prepare RULER2 dataset data.")
-    args, unknown = parse_known_args(parser)
-    _ = unknown
+    args = parser.parse_args()
     prepare_dataset(
         args.tasks,
         args.setup,
         args.max_seq_length,
         args.tokenizer_type,
         args.tokenizer_path,
         args.dataset_size,
     )
     print("RULER2 data preparation completed.")
```
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/ruler2/prepare_data.py` around lines 479 - 491, The
current main block uses parse_known_args and assigns unknown to `_`, silently
discarding unrecognized CLI flags; change this to either call
parser.parse_args() to let argparse reject unknown args, or keep
parse_known_args but immediately check the returned unknown list and raise a
clear error (or call parser.error()) if it's non-empty; update the block around
build_prepare_parser, parse_known_args, and the call site before prepare_dataset
to perform this validation so typos/unsupported flags are not ignored.

Comment on lines +174 to 177
```python
    if is_class:
        evaluator = obj(eval_config)
        print(f"evaluator: {evaluator}")
        return asyncio.run(evaluator.eval_full())
```

⚠️ Potential issue | 🟡 Minor

Remove debug print statement.

Line 176 (print(f"evaluator: {evaluator}")) appears to be a leftover debug statement. It will produce noisy output on every class-based evaluation run. Remove it or replace with LOG.debug(...) if the information is useful.

Proposed fix
     if is_class:
         evaluator = obj(eval_config)
-        print(f"evaluator: {evaluator}")
         return asyncio.run(evaluator.eval_full())
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
     if is_class:
         evaluator = obj(eval_config)
-        print(f"evaluator: {evaluator}")
         return asyncio.run(evaluator.eval_full())
```
🤖 Prompt for AI Agents
In `@nemo_skills/evaluation/evaluator/__init__.py` around lines 174 - 177, The
debug print in the class-instantiation branch should be removed or converted to
a proper logger call; replace the line `print(f"evaluator: {evaluator}")` with
either nothing or `LOG.debug("evaluator: %s", evaluator)` (or equivalent) inside
the block where `is_class` is true after `evaluator = obj(eval_config)` so you
don't emit noisy stdout during `evaluator.eval_full()` runs.

Comment on lines +745 to 746

```python
                    f"    --metric_type={metric_type or benchmark_args.metrics_type} "
                )
```

⚠️ Potential issue | 🟡 Minor

--metric_type=None will be passed as a literal string if both values are None.

If metric_type is None and benchmark_args.metrics_type is also None (e.g., a benchmark __init__.py that doesn't define METRICS_TYPE), the f-string produces --metric_type=None — the string "None" — which will likely cause a downstream error in summarize_results.

Consider guarding this:

Proposed fix
-                    f"    --metric_type={metric_type or benchmark_args.metrics_type} "
+                    f"    --metric_type={metric_type or benchmark_args.metrics_type or ''} "

Or better, conditionally append the flag only when a value exists:

effective_metric_type = metric_type or benchmark_args.metrics_type
...
if effective_metric_type:
    command += f" --metric_type={effective_metric_type} "
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-                    f"    --metric_type={metric_type or benchmark_args.metrics_type} "
+                    f"    --metric_type={metric_type or benchmark_args.metrics_type or ''} "
                 )
```
🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/eval.py` around lines 745 - 746, The current f-string
will inject the literal "None" when both metric_type and
benchmark_args.metrics_type are None; compute an effective value (e.g.,
effective_metric_type = metric_type or benchmark_args.metrics_type) and only
append the "--metric_type=..." flag to the command when effective_metric_type is
truthy, so summarize_results never receives the string "None" as a metric type.

Comment on lines +191 to 194
```python
dataset_dir = nemo_skills_dir / "dataset"
for f in dataset_dir.rglob("*.jsonl"):
    include_patterns.append(str(f))
    include_pattern_relative_paths.append(str(nemo_skills_dir.parent))
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Count how many .jsonl files exist under dataset/
find . -name "*.jsonl" -type f | head -20
echo "---"
find . -name "*.jsonl" -type f | wc -l

Repository: NVIDIA-NeMo/Skills

Length of output: 314


🏁 Script executed:

# Examine the packager.py file to understand the context
wc -l nemo_skills/pipeline/utils/packager.py
head -50 nemo_skills/pipeline/utils/packager.py

Repository: NVIDIA-NeMo/Skills

Length of output: 1564


🏁 Script executed:

# Look at the specific lines mentioned (191-194) and surrounding context
sed -n '170,220p' nemo_skills/pipeline/utils/packager.py

Repository: NVIDIA-NeMo/Skills

Length of output: 2295


🏁 Script executed:

# Search for how include_patterns and include_pattern_relative_paths are used
rg "include_patterns|include_pattern_relative_paths" nemo_skills/pipeline/utils/packager.py -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1487


Extract the relative path append from the loop to avoid duplicate entries.

The current code appends str(nemo_skills_dir.parent) once per JSONL file, resulting in duplicate identical entries in include_pattern_relative_paths. The relative path should be appended once outside the loop since it's the same for all files. Consider moving include_pattern_relative_paths.append(str(nemo_skills_dir.parent)) outside the for loop.
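A sketch of the suggested restructuring, wrapped in a function so it is self-contained; the parameter names mirror the quoted packager.py snippet, and this is not presented as the actual fix.

```python
from pathlib import Path


def add_dataset_includes(
    nemo_skills_dir: Path,
    include_patterns: list,
    include_pattern_relative_paths: list,
) -> None:
    dataset_dir = nemo_skills_dir / "dataset"
    jsonl_files = list(dataset_dir.rglob("*.jsonl"))
    include_patterns.extend(str(f) for f in jsonl_files)
    if jsonl_files:
        # The shared relative path is appended once, not once per file.
        include_pattern_relative_paths.append(str(nemo_skills_dir.parent))
```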

🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/utils/packager.py` around lines 191 - 194, The loop that
builds include_patterns iterates over dataset_dir.rglob("*.jsonl") and currently
calls include_pattern_relative_paths.append(str(nemo_skills_dir.parent)) inside
that loop, creating duplicate entries; move the append call outside the for f in
dataset_dir.rglob("*.jsonl") loop (or append once conditionally after detecting
at least one JSONL) so include_pattern_relative_paths only adds
str(nemo_skills_dir.parent) a single time when dataset files exist; update the
block around dataset_dir, include_patterns and include_pattern_relative_paths in
packager.py accordingly.
