
Conversation

@Kipok Kipok (Collaborator) commented Feb 12, 2026

Will update in a bit

Will also add tests in a bit, but core logic should be good

Summary by CodeRabbit

  • New Features

    • Support for external/custom benchmarks (prepare, package, evaluate) and improved dataset path resolution.
    • Dynamic evaluator/metrics dispatch to load external implementations.
    • Optional prompt_format passthrough for generation/eval tasks.
  • Improvements

    • Simplified dataset preparation flow and packaging with clearer data-dir handling and container path resolution.
    • Reduced CLI surface for eval/summarize; streamlined metric initialization.
  • Documentation

    • Added a guide for creating and integrating external benchmarks.
  • Tests

    • New integration tests covering external benchmark prepare+eval.
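For a concrete picture of the external/custom benchmark support summarized above, here is a hedged sketch of what an external benchmark package's dataset __init__.py might declare. The constant names mirror the METRICS_TYPE / GENERATION_ARGS / GENERATION_MODULE constants this PR adds for bfcl; the package layout, the values, and the benchmark_map.json schema shown in comments are assumptions, not taken from the diff.

```python
# my_benchmarks/simple_bench/__init__.py  (hypothetical external benchmark package)
# Constant names follow the bfcl __init__.py files touched in this PR; whether
# external benchmarks use exactly this contract is an assumption.

METRICS_TYPE = "word_count"                      # which metrics implementation to load
GENERATION_MODULE = "my_benchmarks.generation"   # module driving generation for this benchmark
GENERATION_ARGS = "++eval_type=my_benchmarks.evaluator::evaluate"  # '::' selects a function/class

# A benchmark_map.json in the external repo (mentioned in the test fixtures) would then
# map a short name to this package, e.g. {"my_simple_bench": "my_benchmarks/simple_bench"},
# so the benchmark can be referenced either by name or by path.
```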

Kipok added 23 commits February 10, 2026 16:31
greptile-apps bot (Contributor) commented Feb 12, 2026

Too many files changed for review. (173 files found, 100 file limit)

coderabbitai bot (Contributor) commented Feb 12, 2026

📝 Walkthrough

Refactors dataset configuration (removes DATASET_GROUP widely; adds REQUIRES_DATA_DIR / HAS_DYNAMIC_INIT), modularizes RULER/RULER2 prepare flow, adds external benchmark support and tests, enhances evaluator dispatch and dataset resolution utilities, updates pipelines/packaging, and documents custom benchmarks.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dataset config removals / flags: nemo_skills/dataset/.../__init__.py (many files), nemo_skills/dataset/ruler2/__init__.py, nemo_skills/dataset/ruler/__init__.py | Removed DATASET_GROUP across ~70+ datasets; several modules now expose REQUIRES_DATA_DIR = True and/or HAS_DYNAMIC_INIT = True instead. |
| BFCL constants added: nemo_skills/dataset/bfcl_v3/.../__init__.py, nemo_skills/dataset/bfcl_v4/.../__init__.py, nemo_skills/dataset/bfcl_v3/prepare.py | Added METRICS_TYPE, GENERATION_ARGS, GENERATION_MODULE to many bfcl v3/v4 subpackages; removed DEFAULT_SETTINGS and stopped auto-writing init in bfcl_v3 prepare. |
| RULER modularization: nemo_skills/dataset/ruler/prepare.py, .../prepare_common.py, .../prepare_data.py, .../prepare_init.py | Split monolithic RULER prepare into smaller modules: common arg parsing, data preparation, and init generation; top-level prepare delegates to new scripts. |
| RULER2 modularization: nemo_skills/dataset/ruler2/prepare.py, .../prepare_common.py, .../prepare_data.py, .../prepare_init.py | Similar refactor for ruler2: replaced large prepare with delegated prepare_init.py and prepare_data.py plus helpers. |
| Dataset utils & external benchmarks: nemo_skills/dataset/utils.py | Introduced get_dataset_path, extra benchmark map loading, _load_external_dataset, and new dataset resolution logic supporting path/map/built-in lookups; removed cluster-based helpers. |
| Evaluation dispatch: nemo_skills/evaluation/evaluator/__init__.py | Added _resolve_eval_type supporting module/path imports and function/class dispatch; improved error messages and register_evaluator option. |
| Metrics & pipeline simplification: nemo_skills/evaluation/metrics/compute_metrics.py, nemo_skills/pipeline/summarize_results.py, nemo_skills/pipeline/eval.py | Removed data_dir/extra_datasets plumbing from ComputeMetrics and pipeline commands; summarize_results and eval no longer accept extra_datasets options. |
| Prepare flow & CLI: nemo_skills/dataset/prepare.py, nemo_skills/pipeline/prepare_data.py | Added parse_prepare_cli_arguments, changed prepare_datasets signature, and refactored prepare_data to handle external datasets, dynamic init, data_dir/container path resolution, and split vs non-split flows. |
| Generation / prompt_format plumbing: nemo_skills/inference/generate.py and many inference/eval modules (autoformalize.py, check_contamination.py, prover.py, inference/eval/*, recipes/*) | Added optional prompt_format parameter to fill_prompt/process_single_datapoint signatures and propagated it through generation/eval call sites. |
| Packaging / container path helpers: nemo_skills/pipeline/utils/packager.py, nemo_skills/pipeline/utils/eval.py, nemo_skills/pipeline/utils/__init__.py | Added resolve_external_data_path, updated register_external_repo to support ignore-on-register, added local_data_path and metrics_type handling in benchmark args, and re-exported resolve_external_data_path. |
| External benchmark test fixtures & implementations: tests/data/dummy_external_benchmark/**, tests/gpu-tests/test_external_benchmark_eval.py, tests/test_external_benchmarks.py | Added a dummy external-benchmark repo (prepare, dataset init.py, evaluator, metrics, prompts, benchmark_map.json) and comprehensive tests including a GPU/container integration test for external benchmark prepare+eval. |
| Docs, mkdocs, .gitignore: docs/evaluation/custom-benchmarks.md, docs/evaluation/index.md, mkdocs.yml, .gitignore | Added new custom-benchmarks doc and nav entry; updated .gitignore to keep ruler2 path and remove bfcl_v3/v4 ignore patterns. |
| Tests & cleanup: tests/test_datasets.py, tests/test_configs.py | Removed dataset DATASET_GROUP validation test; added small test config change (METRICS_TYPE in mock) and many new external-benchmark tests. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Add RULERv2 #1106 — touches RULER/ruler2 prepare scripts and dataset init logic (closely related to RULER modularization here).
  • BFCLv4 support #908 — BFCL-related changes and prepare logic overlaps with bfcl_v3/v4 edits in this PR.
  • Add compute eval #1158 — evaluator registration/dispatch changes; overlaps with dynamic evaluator resolution added here.

Suggested labels

enhancement, run GPU tests

Suggested reviewers

  • gwarmstrong
  • activatedgeek
  • ekmb
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.61%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Merge Conflict Detection | ⚠️ Warning | ❌ Merge conflicts detected (128 files), listed below. | Resolve conflicts locally and push changes to this branch. |

⚔️ .github/workflows/tests.yml (content)
⚔️ .gitignore (content)
⚔️ dockerfiles/Dockerfile.sandbox (content)
⚔️ docs/evaluation/index.md (content)
⚔️ docs/evaluation/scientific-knowledge.md (content)
⚔️ docs/index.md (content)
⚔️ docs/releases/openreasoning/training.md (content)
⚔️ mkdocs.yml (content)
⚔️ nemo_skills/dataset/aai/__init__.py (content)
⚔️ nemo_skills/dataset/aalcr/__init__.py (content)
⚔️ nemo_skills/dataset/aime24/__init__.py (content)
⚔️ nemo_skills/dataset/aime25/__init__.py (content)
⚔️ nemo_skills/dataset/algebra222/__init__.py (content)
⚔️ nemo_skills/dataset/amc23/__init__.py (content)
⚔️ nemo_skills/dataset/answer-judge/__init__.py (content)
⚔️ nemo_skills/dataset/apex-shortlist/__init__.py (content)
⚔️ nemo_skills/dataset/arena-hard-v2/__init__.py (content)
⚔️ nemo_skills/dataset/arena-hard/__init__.py (content)
⚔️ nemo_skills/dataset/asdiv/__init__.py (content)
⚔️ nemo_skills/dataset/asr-leaderboard/__init__.py (content)
⚔️ nemo_skills/dataset/audiobench/__init__.py (content)
⚔️ nemo_skills/dataset/audiobench/judge/__init__.py (content)
⚔️ nemo_skills/dataset/audiobench/nonjudge/__init__.py (content)
⚔️ nemo_skills/dataset/beyond-aime/__init__.py (content)
⚔️ nemo_skills/dataset/bfcl_v3/__init__.py (content)
⚔️ nemo_skills/dataset/bfcl_v3/prepare.py (content)
⚔️ nemo_skills/dataset/bfcl_v4/__init__.py (content)
⚔️ nemo_skills/dataset/bigcodebench/__init__.py (content)
⚔️ nemo_skills/dataset/birdbench/__init__.py (content)
⚔️ nemo_skills/dataset/brumo25/__init__.py (content)
⚔️ nemo_skills/dataset/challenge19/__init__.py (content)
⚔️ nemo_skills/dataset/college_math/__init__.py (content)
⚔️ nemo_skills/dataset/comp-math-24-25/__init__.py (content)
⚔️ nemo_skills/dataset/compute-eval/__init__.py (content)
⚔️ nemo_skills/dataset/flores200/__init__.py (content)
⚔️ nemo_skills/dataset/frontierscience-olympiad/__init__.py (content)
⚔️ nemo_skills/dataset/gaokao2023en/__init__.py (content)
⚔️ nemo_skills/dataset/gpqa/__init__.py (content)
⚔️ nemo_skills/dataset/gsm-plus/__init__.py (content)
⚔️ nemo_skills/dataset/gsm8k/__init__.py (content)
⚔️ nemo_skills/dataset/hendrycks_math/__init__.py (content)
⚔️ nemo_skills/dataset/hle/__init__.py (content)
⚔️ nemo_skills/dataset/hmmt_feb25/__init__.py (content)
⚔️ nemo_skills/dataset/hmmt_nov25/__init__.py (content)
⚔️ nemo_skills/dataset/human-eval-infilling/__init__.py (content)
⚔️ nemo_skills/dataset/human-eval/__init__.py (content)
⚔️ nemo_skills/dataset/icpc/__init__.py (content)
⚔️ nemo_skills/dataset/ifbench/__init__.py (content)
⚔️ nemo_skills/dataset/ifeval/__init__.py (content)
⚔️ nemo_skills/dataset/imo-answerbench/__init__.py (content)
⚔️ nemo_skills/dataset/imo-gradingbench/__init__.py (content)
⚔️ nemo_skills/dataset/imo-proofbench/__init__.py (content)
⚔️ nemo_skills/dataset/ioi/__init__.py (content)
⚔️ nemo_skills/dataset/librispeech-pc/__init__.py (content)
⚔️ nemo_skills/dataset/livebench-coding/__init__.py (content)
⚔️ nemo_skills/dataset/livecodebench-cpp/__init__.py (content)
⚔️ nemo_skills/dataset/livecodebench-cpp/prepare.py (content)
⚔️ nemo_skills/dataset/livecodebench-pro/__init__.py (content)
⚔️ nemo_skills/dataset/livecodebench-pro/prepare.py (content)
⚔️ nemo_skills/dataset/livecodebench/__init__.py (content)
⚔️ nemo_skills/dataset/livecodebench/prepare.py (content)
⚔️ nemo_skills/dataset/math-500/__init__.py (content)
⚔️ nemo_skills/dataset/math-odyssey/__init__.py (content)
⚔️ nemo_skills/dataset/mawps/__init__.py (content)
⚔️ nemo_skills/dataset/mbpp/__init__.py (content)
⚔️ nemo_skills/dataset/minerva_math/__init__.py (content)
⚔️ nemo_skills/dataset/minif2f/__init__.py (content)
⚔️ nemo_skills/dataset/mmau-pro/__init__.py (content)
⚔️ nemo_skills/dataset/mmau-pro/closed_form/__init__.py (content)
⚔️ nemo_skills/dataset/mmlu-pro/__init__.py (content)
⚔️ nemo_skills/dataset/mmlu-prox/__init__.py (content)
⚔️ nemo_skills/dataset/mmlu-redux/__init__.py (content)
⚔️ nemo_skills/dataset/mmlu/__init__.py (content)
⚔️ nemo_skills/dataset/mmmu-pro/__init__.py (content)
⚔️ nemo_skills/dataset/mobench/__init__.py (content)
⚔️ nemo_skills/dataset/mrcr/__init__.py (content)
⚔️ nemo_skills/dataset/musan/__init__.py (content)
⚔️ nemo_skills/dataset/olympiadbench/__init__.py (content)
⚔️ nemo_skills/dataset/omni-math/__init__.py (content)
⚔️ nemo_skills/dataset/omniscience/__init__.py (content)
⚔️ nemo_skills/dataset/open-proof-corpus-judge/__init__.py (content)
⚔️ nemo_skills/dataset/prepare.py (content)
⚔️ nemo_skills/dataset/proof-arena-judge/__init__.py (content)
⚔️ nemo_skills/dataset/proof-bench-judge/__init__.py (content)
⚔️ nemo_skills/dataset/proofnet/__init__.py (content)
⚔️ nemo_skills/dataset/putnam-bench/__init__.py (content)
⚔️ nemo_skills/dataset/ruler/__init__.py (content)
⚔️ nemo_skills/dataset/ruler/prepare.py (content)
⚔️ nemo_skills/dataset/ruler2/__init__.py (content)
⚔️ nemo_skills/dataset/ruler2/prepare.py (content)
⚔️ nemo_skills/dataset/scicode/__init__.py (content)
⚔️ nemo_skills/dataset/simpleqa/__init__.py (content)
⚔️ nemo_skills/dataset/supergpqa/__init__.py (content)
⚔️ nemo_skills/dataset/svamp/__init__.py (content)
⚔️ nemo_skills/dataset/swe-bench-multilingual/__init__.py (content)
⚔️ nemo_skills/dataset/swe-bench/__init__.py (content)
⚔️ nemo_skills/dataset/swe-rebench/__init__.py (content)
⚔️ nemo_skills/dataset/utils.py (content)
⚔️ nemo_skills/dataset/wmt24pp/__init__.py (content)
⚔️ nemo_skills/evaluation/evaluator/__init__.py (content)
⚔️ nemo_skills/evaluation/evaluator/livecodebench.py (content)
⚔️ nemo_skills/evaluation/metrics/compute_metrics.py (content)
⚔️ nemo_skills/evaluation/metrics/map_metrics.py (content)
⚔️ nemo_skills/evaluation/metrics/math_metrics.py (content)
⚔️ nemo_skills/inference/autoformalize.py (content)
⚔️ nemo_skills/inference/check_contamination.py (content)
⚔️ nemo_skills/inference/eval/arena_judge.py (content)
⚔️ nemo_skills/inference/eval/bfcl.py (content)
⚔️ nemo_skills/inference/eval/compute_eval.py (content)
⚔️ nemo_skills/inference/eval/scicode.py (content)
⚔️ nemo_skills/inference/eval/swebench.py (content)
⚔️ nemo_skills/inference/generate.py (content)
⚔️ nemo_skills/inference/prover.py (content)
⚔️ nemo_skills/pipeline/eval.py (content)
⚔️ nemo_skills/pipeline/prepare_data.py (content)
⚔️ nemo_skills/pipeline/summarize_results.py (content)
⚔️ nemo_skills/pipeline/utils/__init__.py (content)
⚔️ nemo_skills/pipeline/utils/eval.py (content)
⚔️ nemo_skills/pipeline/utils/packager.py (content)
⚔️ nemo_skills/prompt/config/gpt-oss/livecodebench.yaml (content)
⚔️ nemo_skills/prompt/config/judge/imo_answerbench.yaml (content)
⚔️ recipes/asr_tts/riva_generate.py (content)
⚔️ recipes/proof-gen-verification/scripts/script_generation.py (content)
⚔️ requirements/code_execution.txt (content)
⚔️ requirements/main.txt (content)
⚔️ tests/gpu-tests/run_qwen.sh (content)
⚔️ tests/test_configs.py (content)
⚔️ tests/test_datasets.py (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | Title 'Adding an option to store benchmarks in external repo' clearly describes the primary change: enabling external repository support for benchmark storage. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
docs/evaluation/custom-benchmarks.md (1)

319-323: Indented code block on line 322 doesn't match the fenced style used elsewhere.

The note block content on line 322 uses an indented code block. This is flagged by markdownlint (MD046) for inconsistent code block style. Consider using a fenced block instead for consistency.

tests/data/dummy_external_benchmark/my_benchmarks/metrics/word_count.py (1)

19-20: Use direct dictionary access for is_correct per coding guidelines.

Line 20 uses prediction.get("is_correct", False) — if is_correct is expected to be present on predictions (set by the evaluator), use direct access to fail fast on malformed data. Line 34 already uses direct access for predicted_answer.

Suggested fix
     def _get_score_dict(self, prediction):
-        return {"is_correct": prediction.get("is_correct", False)}
+        return {"is_correct": prediction["is_correct"]}

As per coding guidelines: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."

tests/gpu-tests/test_external_benchmark_eval.py (1)

112-129: Consider also asserting output for the path-based benchmark.

The eval runs two benchmarks (my_simple_bench and simple_bench_path) but only the map-name variant's output and metrics are validated. Adding a check for the path-based benchmark output would strengthen coverage.
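A rough sketch of what that extra assertion could look like is below; the directory layout, file names, and metric keys are assumptions for illustration, not taken from the actual test fixture.

```python
import json
from pathlib import Path


def check_simple_bench_path_output(output_dir: str) -> None:
    # Hypothetical layout: the real test may store results elsewhere.
    metrics_file = Path(output_dir) / "eval-results" / "simple_bench_path" / "metrics.json"
    assert metrics_file.exists(), f"missing output for path-based benchmark: {metrics_file}"
    metrics = json.loads(metrics_file.read_text())
    # At least one aggregation should be reported for the benchmark.
    assert metrics, "empty metrics for simple_bench_path"
```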

tests/test_external_benchmarks.py (2)

240-245: Use a raw string for the match regex pattern.

The .* in the match string contains regex metacharacters. While it works, using a raw string is more explicit and avoids the RUF043 warning.

Suggested fix
-        with pytest.raises(RuntimeError, match="Expected .* to exist"):
+        with pytest.raises(RuntimeError, match=r"Expected .* to exist"):

201-220: Nit: prefix unused data_path with _.

Multiple tests unpack data_path but never use it. Prefix with _ to signal intent and silence linter warnings (RUF059).

Example fix (apply similarly to lines 208, 214, 219)
-        module, data_path = get_dataset_module(word_count_path)
+        module, _data_path = get_dataset_module(word_count_path)


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/inference/generate.py (1)

583-591: ⚠️ Potential issue | 🔴 Critical

Breaking change: ArenaJudge will crash when calling super().process_single_datapoint().

ArenaJudge.process_single_datapoint calls super().process_single_datapoint(gen_base_data, all_data) at lines 152-153. This invokes the base class's process_single_datapoint, which then calls self.fill_prompt(data_point, all_data, prompt_format) as a positional call (line 692). Since self is an ArenaJudge instance, this attempts to pass 3 positional arguments to ArenaJudge.fill_prompt, which only accepts 2 parameters. This will raise TypeError: fill_prompt() takes 3 positional arguments but 4 were given.
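For intuition, here is a minimal standalone repro of that signature mismatch; the class and method names below are stand-ins, not the real ArenaJudge code.

```python
class Base:
    def fill_prompt(self, data_point, data, prompt_format=None):
        return "base prompt"

    def process(self, data_point, data, prompt_format=None):
        # Positional call, mirroring the base generation task's call site.
        return self.fill_prompt(data_point, data, prompt_format)


class Judge(Base):
    def fill_prompt(self, data_point, data):  # missing the new parameter
        return "judge prompt"


try:
    Judge().process({}, [], "openai")
except TypeError as err:
    print(err)  # e.g. "fill_prompt() takes 3 positional arguments but 4 were given"
```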

To fix this:

  1. Pass prompt_format as a keyword argument in the base class:
     -"prompt": self.fill_prompt(data_point, all_data, prompt_format),
     +"prompt": self.fill_prompt(data_point, all_data, prompt_format=prompt_format),
  2. Update ArenaJudge.fill_prompt to accept the new parameter:
     -def fill_prompt(self, data_point, data):
     +def fill_prompt(self, data_point, data, prompt_format=None):

Both changes are needed for full compatibility.

🤖 Fix all issues with AI agents
In `@docs/evaluation/custom-benchmarks.md`:
- Around line 198-208: The example calls LOG.info in the generate function but
never defines LOG; add the logger setup by importing logging and get_logger_name
(from nemo_skills.utils) and creating LOG =
logging.getLogger(get_logger_name(__file__)); ensure these imports and the LOG
definition appear near the top of the example so LOG is available when
generate(cfg: WordCountGenerationConfig) calls LOG.info.
- Around line 264-277: In the update method, fix the NameError by passing the
correct variable name to _compute_pass_at_k: replace the incorrect
predicted_answer with the defined predicted_answers; specifically, in the
update(self, predictions) body ensure the call to _compute_pass_at_k uses
predicted_answers (the list defined from predictions) so it matches the later
_compute_majority_at_k call and the local variable name.

In `@nemo_skills/dataset/bfcl_v3/parallel/__init__.py`:
- Around line 1-3: The file is missing the standard NVIDIA Apache 2.0 license
header; add the exact repo-standard copyright/license header comment block to
the top of this __init__.py (and any other new bfcl_v3 __init__.py files) above
the existing constants METRICS_TYPE, GENERATION_ARGS, and GENERATION_MODULE so
the file matches other files in the PR.

In `@nemo_skills/dataset/prepare.py`:
- Around line 36-49: The CLI flag --retries currently defaults to 0 while the
prepare_datasets function signature defaults to 3, causing inconsistent
behavior; pick one canonical default and make both match (either change the
parser add_argument call for "--retries" to default=3 or change the
prepare_datasets signature to retries=0) so callers get the same retry behavior
whether invoked from the CLI or as a library function; update the parser
add_argument("--retries", ...) and/or the prepare_datasets(...) retries
parameter accordingly.

In `@nemo_skills/dataset/ruler/prepare_common.py`:
- Around line 33-39: The message MISSING_RULER_ARGS_MESSAGE currently begins
with "ERROR:" but parse_args_and_prepare_args exits with SystemExit(0),
producing a success status while signaling an error; fix by making them
consistent: either remove the "ERROR:" prefix from MISSING_RULER_ARGS_MESSAGE if
skipping is intentional, or change the exit in parse_args_and_prepare_args to
SystemExit(1) to indicate failure; locate and update the symbols
MISSING_RULER_ARGS_MESSAGE and the exit call in parse_args_and_prepare_args
(also consider how prepare.py's subprocess `check=True` will interpret the
chosen exit code) so the log text and exit code match the intended behavior.

In `@nemo_skills/dataset/ruler/prepare_data.py`:
- Around line 49-57: The git-LFS check in the "if 'cwe' in tasks" block
currently only catches subprocess.CalledProcessError but will crash with a
FileNotFoundError when git is not installed; update the exception handling
around the subprocess.run(["git", "lfs", "--version"]) call in prepare_data.py
to catch both subprocess.CalledProcessError and FileNotFoundError (or a broad
OSError) and then print the existing friendly message and exit(1) so missing git
or missing git-lfs both produce the same helpful output.

In `@nemo_skills/dataset/ruler2/prepare_data.py`:
- Around line 25-438: Many near-duplicate functions (e.g.,
prepare_mk_niah_basic, prepare_mk_niah_easy, prepare_mv_niah_basic,
prepare_qa_hard, etc.) repeat subprocess.run logic with only module name
(prepare_niah, prepare_mmlu, prepare_qa) and a few flag values changing; replace
them with one generic runner (e.g., run_prepare_task) that accepts a task key
and common params (output_folder, tokenizer_type, tokenizer_path, length,
dataset_size) and use a declarative dict mapping task keys to module name plus
per-task flag/value dicts; the runner should build the argv list by starting
with ["python","-m", module] then iterating the flag dict to append "--flag",
str(value), call subprocess.run(..., check=True), and replace all prepare_*
callers with a single call to run_prepare_task(task_key, ...).
- Around line 479-491: The current main block uses parse_known_args and assigns
unknown to `_`, silently discarding unrecognized CLI flags; change this to
either call parser.parse_args() to let argparse reject unknown args, or keep
parse_known_args but immediately check the returned unknown list and raise a
clear error (or call parser.error()) if it's non-empty; update the block around
build_prepare_parser, parse_known_args, and the call site before prepare_dataset
to perform this validation so typos/unsupported flags are not ignored.

In `@nemo_skills/evaluation/evaluator/__init__.py`:
- Around line 174-177: The debug print in the class-instantiation branch should
be removed or converted to a proper logger call; replace the line
`print(f"evaluator: {evaluator}")` with either nothing or `LOG.debug("evaluator:
%s", evaluator)` (or equivalent) inside the block where `is_class` is true after
`evaluator = obj(eval_config)` so you don't emit noisy stdout during
`evaluator.eval_full()` runs.

In `@nemo_skills/pipeline/eval.py`:
- Around line 745-746: The current f-string will inject the literal "None" when
both metric_type and benchmark_args.metrics_type are None; compute an effective
value (e.g., effective_metric_type = metric_type or benchmark_args.metrics_type)
and only append the "--metric_type=..." flag to the command when
effective_metric_type is truthy, so summarize_results never receives the string
"None" as a metric type.

In `@nemo_skills/pipeline/utils/packager.py`:
- Around line 191-194: The loop that builds include_patterns iterates over
dataset_dir.rglob("*.jsonl") and currently calls
include_pattern_relative_paths.append(str(nemo_skills_dir.parent)) inside that
loop, creating duplicate entries; move the append call outside the for f in
dataset_dir.rglob("*.jsonl") loop (or append once conditionally after detecting
at least one JSONL) so include_pattern_relative_paths only adds
str(nemo_skills_dir.parent) a single time when dataset files exist; update the
block around dataset_dir, include_patterns and include_pattern_relative_paths in
packager.py accordingly.
🧹 Nitpick comments (17)
nemo_skills/inference/generate.py (1)

583-585: Add type hints for the new prompt_format parameters.

Per coding guidelines, simple types should have type hints.

-    def fill_prompt(self, data_point, data, prompt_format=None):
+    def fill_prompt(self, data_point, data, prompt_format: str | None = None):
-    async def process_single_datapoint(self, data_point, all_data, prompt_format=None):
+    async def process_single_datapoint(self, data_point, all_data, prompt_format: str | None = None):

As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code"

Also applies to: 680-681

nemo_skills/dataset/ruler/prepare_common.py (1)

89-94: Add return type hint.

Per project guidelines, use type hints for simple types.

Suggested fix
-def parse_args_and_prepare_args(parser: argparse.ArgumentParser):
+def parse_args_and_prepare_args(parser: argparse.ArgumentParser) -> tuple[argparse.Namespace, str]:
nemo_skills/dataset/ruler2/prepare_common.py (1)

75-77: Add return type hint.

Same as the ruler counterpart — add a return type hint for consistency.

Suggested fix
-def parse_known_args(parser: argparse.ArgumentParser):
+def parse_known_args(parser: argparse.ArgumentParser) -> tuple[argparse.Namespace, list[str]]:
nemo_skills/dataset/ruler/prepare_data.py (2)

60-60: Pass a string instead of a single-element list when using shell=True.

When shell=True, passing a list is misleading — the first element becomes the shell command string, and subsequent elements become arguments to the shell itself (not the command). Use a plain string here.

Suggested fix
-    subprocess.run(["pip install wonderwords html2text tenacity"], check=True, shell=True)
+    subprocess.run("pip install wonderwords html2text tenacity", check=True, shell=True)

62-69: Manual __enter__/__exit__ on TemporaryDirectory is fragile.

Calling dunder methods directly bypasses the context manager protocol's guarantees. Consider restructuring so the conditional temp directory uses a proper context manager or a simpler pattern:

Suggested refactor
-    if tmp_data_dir is not None:
-        tmpdirname = tmp_data_dir
-        Path(tmpdirname).mkdir(parents=True, exist_ok=True)
-        tmpdir_context = None
-    else:
-        tmpdir_context = tempfile.TemporaryDirectory()
-        tmpdirname = tmpdir_context.__enter__()
-
-    try:
+    with tempfile.TemporaryDirectory() as _tmpdir:
+        if tmp_data_dir is not None:
+            tmpdirname = tmp_data_dir
+            Path(tmpdirname).mkdir(parents=True, exist_ok=True)
+        else:
+            tmpdirname = _tmpdir
+

This removes the manual __enter__/__exit__ and the try/finally block entirely. The unused TemporaryDirectory is cheap when tmp_data_dir is provided.

nemo_skills/dataset/ruler2/prepare_init.py (1)

61-69: Silently discarding unknown CLI arguments may hide user errors.

Line 64 discards unknown args. Since prepare.py forwards all sys.argv to both prepare_init.py and prepare_data.py, this is understandable — init doesn't need data-prep args. However, if run standalone, typos or unsupported flags will be silently ignored.

Consider at minimum logging the discarded args for debuggability, or documenting that this script is intended to be invoked via prepare.py.
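A minimal self-contained version of the "log the discarded args" option is sketched below; the argument names and logger setup are made up for illustration, not taken from prepare_init.py.

```python
import argparse
import logging

LOG = logging.getLogger(__name__)

parser = argparse.ArgumentParser()
parser.add_argument("--setup")  # placeholder argument, not the real parser definition
args, unknown = parser.parse_known_args()
if unknown:
    # Surface what was discarded so typos are not silently ignored.
    LOG.warning("Ignoring unrecognized arguments: %s", unknown)
```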

nemo_skills/dataset/ruler2/prepare_data.py (3)

459-460: Remove commented-out code.

The commented-out pip install on line 460 is dead code. If the dependency installation is needed, it should be documented or handled in a setup step, not left as a comment in runtime code.


441-476: No validation of task names — KeyError with no helpful message.

If a user passes an invalid task name, prepare_task[task] on line 466 raises a bare KeyError. Consider validating upfront with a clear error message listing available tasks.

Proposed fix
+    invalid_tasks = set(tasks) - set(prepare_task.keys())
+    if invalid_tasks:
+        raise ValueError(f"Unknown tasks: {invalid_tasks}. Available: {list(prepare_task.keys())}")
+
     with concurrent.futures.ThreadPoolExecutor() as executor:

441-441: Add type hints to function signatures.

All public functions in this file lack type hints. At minimum, prepare_dataset and the individual prepare_* functions should annotate their parameters with basic types (str, int, list[str], etc.). As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."

nemo_skills/pipeline/utils/packager.py (1)

82-129: Potential relative_to failure when repo_path is a subdirectory of the git root.

Line 106 validates that local_data_path is relative to repo_meta.path, but line 119 calls local_data_path.relative_to(effective_root) where effective_root is the git root. If repo_meta.path is registered as a parent of the actual git root (unlikely but not prevented by RepoMetadata), or if the repo structure has symlinks that break the ancestor chain, this relative_to call would raise an unhandled ValueError.

The happy path (repo_path is at or below git_root) is safe because git_root is always an ancestor of repo_path, hence also of local_data_path. Just something to be aware of in edge cases.
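For reference, this is how pathlib behaves in the edge case described above (the paths are made up):

```python
from pathlib import Path

# relative_to succeeds only when the base is an ancestor of the path:
print(Path("/repos/skills/nemo_skills/dataset/x.jsonl").relative_to("/repos/skills"))

try:
    Path("/elsewhere/x.jsonl").relative_to("/repos/skills")
except ValueError as err:
    # This is the unhandled error the comment warns about.
    print(f"not an ancestor: {err}")
```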

nemo_skills/pipeline/prepare_data.py (2)

213-225: --prepare_entrypoint prepare_data.py is appended after _build_command which already joined prepare_unknown_args.

Line 225 appends --prepare_entrypoint prepare_data.py after the unknown args were already joined on line 113 (inside _build_command). The resulting command would look like:

python -m nemo_skills.dataset.prepare <datasets> <unknown_args> --prepare_entrypoint prepare_data.py

This works because argparse accepts named arguments in any position, but the resulting command string reads oddly. Consider passing prepare_entrypoint through the unknown args or appending it inside _build_command for clarity.


227-232: When executor == "none", _get_container_dataset_path returns container paths that won't exist locally.

After tracing the control flow, it appears that data_dir being set always implies containerized execution (lines 173-178 require cluster when data_dir is set, and line 181-185 only drops to executor="none" when data_dir is absent). So this path is safe.

However, this invariant is implicit and fragile — a future change to the early-exit logic could silently break the cp commands. Consider adding a defensive assertion or comment.

💡 Suggested comment for future maintainers
     if data_dir:
+        # data_dir implies containerized execution (executor != "none"),
+        # so container paths are valid for cp commands below.
         command += f" && mkdir -p {data_dir}"
nemo_skills/evaluation/evaluator/__init__.py (2)

92-117: _resolve_eval_type — clean centralized dispatch.

The dual-format support (built-in keys and :: path format) is well-structured. One minor note: getattr(module, attr_str) on line 109 will raise an AttributeError with a generic message if the attribute doesn't exist. Consider wrapping it to provide a more descriptive error mentioning the eval_type string.


147-154: supports_single_eval instantiates the evaluator just to check a capability flag.

obj(config) (line 153) constructs a full evaluator instance solely to call supports_single_eval(). If the constructor has side effects or is expensive, this is wasteful. Since supports_single_eval in BaseEvaluator only checks whether eval_single is overridden (a class-level property), this could be a static/class-level check instead. Low priority given this matches the previous pattern.
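A sketch of the class-level alternative suggested above; the method names follow the comment, while the rest (an async eval_single on the base class) is assumed.

```python
class BaseEvaluator:
    async def eval_single(self, data_point):
        raise NotImplementedError

    @classmethod
    def supports_single_eval(cls) -> bool:
        # True only if a subclass overrides eval_single; no instance is constructed.
        return cls.eval_single is not BaseEvaluator.eval_single


class MyEvaluator(BaseEvaluator):
    async def eval_single(self, data_point):
        return {"is_correct": True}


assert MyEvaluator.supports_single_eval() and not BaseEvaluator.supports_single_eval()
```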

docs/evaluation/custom-benchmarks.md (1)

27-27: Add language specifiers to fenced code blocks.

Lines 27, 135, 240, and 281 have fenced code blocks without a language identifier. Consider adding text or the appropriate language (e.g., bash) to satisfy linting and improve rendering.

nemo_skills/dataset/utils.py (2)

62-75: Consider caching get_extra_benchmark_map() to avoid repeated file I/O.

get_extra_benchmark_map() is called here and again inside get_dataset_module(). Each call re-reads and parses the JSON file. If these functions are called in a loop over multiple datasets, the map file will be loaded on every iteration. A simple @functools.lru_cache on get_extra_benchmark_map (keyed on the env var value) would eliminate redundant reads.
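One way to express the caching idea is sketched below; the environment variable name is a placeholder, not the one the code actually uses.

```python
import functools
import json
import os


@functools.lru_cache(maxsize=None)
def _load_benchmark_map(map_path: str) -> dict:
    # Cached per path, so a changed env var value still loads the new file.
    with open(map_path) as f:
        return json.load(f)


def get_extra_benchmark_map() -> dict:
    map_path = os.environ.get("NEMO_SKILLS_EXTRA_DATASETS_MAP", "")  # placeholder name
    # Return a copy so callers cannot mutate the cached dict.
    return dict(_load_benchmark_map(map_path)) if map_path else {}
```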


55-59: Add type hints to new functions.

Per coding guidelines, simple types should be annotated. All new public functions (get_dataset_name, get_dataset_path, get_extra_benchmark_map, get_dataset_module, get_default_dataset_module) and the internal _load_external_dataset lack parameter and return type hints.

For example:

-def get_dataset_name(dataset):
+def get_dataset_name(dataset: str) -> str:
-def get_dataset_path(dataset):
+def get_dataset_path(dataset: str) -> Path:
-def get_extra_benchmark_map():
+def get_extra_benchmark_map() -> dict[str, str]:

As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."

Also applies to: 62-75, 95-104, 107-111, 114-155

Comment on lines +198 to +208
```python
@hydra.main(version_base=None, config_name="base_generation_config")
def generate(cfg: WordCountGenerationConfig):
    cfg = WordCountGenerationConfig(_init_nested=True, **cfg)
    LOG.info("Config used: %s", cfg)
    task = WordCountGenerationTask(cfg)
    task.generate()


if __name__ == "__main__":
    generate()
```

⚠️ Potential issue | 🟡 Minor

LOG is not defined in this code example.

Line 201 uses LOG.info(...) but the example doesn't import or define LOG. Add the import for completeness:

from nemo_skills.utils import get_logger_name
import logging
LOG = logging.getLogger(get_logger_name(__file__))
🤖 Prompt for AI Agents
In `@docs/evaluation/custom-benchmarks.md` around lines 198 - 208, The example
calls LOG.info in the generate function but never defines LOG; add the logger
setup by importing logging and get_logger_name (from nemo_skills.utils) and
creating LOG = logging.getLogger(get_logger_name(__file__)); ensure these
imports and the LOG definition appear near the top of the example so LOG is
available when generate(cfg: WordCountGenerationConfig) calls LOG.info.

Comment on lines +264 to +277
```python
def update(self, predictions):
    # base class provides convenient helpers for calculating
    # common metrics like majority / pass
    super().update(predictions)
    predicted_answers = [pred["predicted_answer"] for pred in predictions]
    self._compute_pass_at_k(
        predictions=predictions,
        predicted_answers=predicted_answer,
    )
    self._compute_majority_at_k(
        predictions=predictions,
        predicted_answers=predicted_answers,
    )
```

⚠️ Potential issue | 🟡 Minor

Bug in example code: predicted_answer should be predicted_answers.

Line 271 references predicted_answer (singular) but the variable defined on line 268 is predicted_answers (plural). Users copying this example will get a NameError.

Proposed fix
         self._compute_pass_at_k(
             predictions=predictions,
-            predicted_answers=predicted_answer,
+            predicted_answers=predicted_answers,
         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
 def update(self, predictions):
     # base class provides convenient helpers for calculating
     # common metrics like majority / pass
     super().update(predictions)
     predicted_answers = [pred["predicted_answer"] for pred in predictions]
     self._compute_pass_at_k(
         predictions=predictions,
-        predicted_answers=predicted_answer,
+        predicted_answers=predicted_answers,
     )
     self._compute_majority_at_k(
         predictions=predictions,
         predicted_answers=predicted_answers,
     )
```
🤖 Prompt for AI Agents
In `@docs/evaluation/custom-benchmarks.md` around lines 264 - 277, In the update
method, fix the NameError by passing the correct variable name to
_compute_pass_at_k: replace the incorrect predicted_answer with the defined
predicted_answers; specifically, in the update(self, predictions) body ensure
the call to _compute_pass_at_k uses predicted_answers (the list defined from
predictions) so it matches the later _compute_majority_at_k call and the local
variable name.

Comment on lines +1 to +3
METRICS_TYPE = "bfcl"
GENERATION_ARGS = "++eval_type=bfcl"
GENERATION_MODULE = "nemo_skills.inference.eval.bfcl"

⚠️ Potential issue | 🟡 Minor

Missing copyright/license header.

Every other file in this PR includes the standard NVIDIA copyright and Apache 2.0 license header. This new file is missing it. The same likely applies to the other new bfcl_v3 __init__.py files in this PR.

Proposed fix
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 METRICS_TYPE = "bfcl"
 GENERATION_ARGS = "++eval_type=bfcl"
 GENERATION_MODULE = "nemo_skills.inference.eval.bfcl"
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/bfcl_v3/parallel/__init__.py` around lines 1 - 3, The
file is missing the standard NVIDIA Apache 2.0 license header; add the exact
repo-standard copyright/license header comment block to the top of this
__init__.py (and any other new bfcl_v3 __init__.py files) above the existing
constants METRICS_TYPE, GENERATION_ARGS, and GENERATION_MODULE so the file
matches other files in the PR.

Comment on lines +36 to 49

```python
        "--retries",
        type=int,
        default=0,
        help="Number of retries per dataset if preparation fails",
    )
    return parser.parse_known_args(args)


def prepare_datasets(
    datasets=None,
    dataset_groups=None,
    add_lean4_header=False,
    prepare_entrypoint="prepare.py",
    extra_args="",
    parallelism=20,
    retries=3,
```

⚠️ Potential issue | 🟡 Minor

Inconsistent default for retries: 0 in CLI vs 3 in function signature.

The CLI parser (line 38) defaults --retries to 0, but prepare_datasets (line 49) defaults retries to 3. When invoked from the command line, the CLI default wins (0 retries). When called as a library function without specifying retries, the function default wins (3 retries). This silent inconsistency is likely to confuse callers.

Proposed fix — align the defaults
 def prepare_datasets(
     datasets=None,
     prepare_entrypoint="prepare.py",
     extra_args="",
     parallelism=20,
-    retries=3,
+    retries=0,
 ):

Or change the CLI default to 3 if retries are desired by default.

🤖 Prompt for AI Agents
In `@nemo_skills/dataset/prepare.py` around lines 36 - 49, The CLI flag --retries
currently defaults to 0 while the prepare_datasets function signature defaults
to 3, causing inconsistent behavior; pick one canonical default and make both
match (either change the parser add_argument call for "--retries" to default=3
or change the prepare_datasets signature to retries=0) so callers get the same
retry behavior whether invoked from the CLI or as a library function; update the
parser add_argument("--retries", ...) and/or the prepare_datasets(...) retries
parameter accordingly.

Comment on lines +33 to +39
```python
MISSING_RULER_ARGS_MESSAGE = (
    "ERROR: Can't prepare ruler without arguments provided! "
    "Skipping the preparation step.\n"
    "Example ruler prepare command:\n"
    "ns prepare_data ruler --setup llama_128k "
    "--tokenizer_path meta-llama/Llama-3.1-8B-Instruct --max_seq_length 131072"
)
```

⚠️ Potential issue | 🟡 Minor

Message says "ERROR" but SystemExit(0) signals success.

The MISSING_RULER_ARGS_MESSAGE starts with "ERROR:" but parse_args_and_prepare_args exits with code 0. The calling prepare.py uses check=True on subprocess, so this "error" will be silently treated as success and the next script will still run. Either drop the "ERROR" prefix (if skipping is intentional) or use SystemExit(1) (if it's a real failure).

🤖 Prompt for AI Agents
In `@nemo_skills/dataset/ruler/prepare_common.py` around lines 33 - 39, The
message MISSING_RULER_ARGS_MESSAGE currently begins with "ERROR:" but
parse_args_and_prepare_args exits with SystemExit(0), producing a success status
while signaling an error; fix by making them consistent: either remove the
"ERROR:" prefix from MISSING_RULER_ARGS_MESSAGE if skipping is intentional, or
change the exit in parse_args_and_prepare_args to SystemExit(1) to indicate
failure; locate and update the symbols MISSING_RULER_ARGS_MESSAGE and the exit
call in parse_args_and_prepare_args (also consider how prepare.py's subprocess
`check=True` will interpret the chosen exit code) so the log text and exit code
match the intended behavior.

Comment on lines +25 to +438
def prepare_mk_niah_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_niah",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--num_needle_k",
"1",
"--num_needle_v",
"1",
"--num_needle_q",
"1",
"--type_haystack",
"needle",
"--type_needle_k",
"words",
"--type_needle_v",
"numbers",
"--num_digits_v",
"10",
],
check=True,
)


def prepare_mk_niah_easy(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_mmlu",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"mmlu",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--num_order",
"0",
"--task_type",
"retrieve",
"--algo_type",
"single",
],
check=True,
)


def prepare_mk_niah_medium(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_mmlu",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"mmlu",
"--fewshot",
"5",
"--prompt_type",
"instruct",
"--num_order",
"0",
"--task_type",
"solve",
"--algo_type",
"2steps",
],
check=True,
)


def prepare_mk_niah_hard(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_mmlu",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"mmlu",
"--fewshot",
"5",
"--prompt_type",
"instruct",
"--num_order",
"0",
"--task_type",
"solve",
"--algo_type",
"single",
],
check=True,
)


def prepare_mv_niah_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_niah",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--num_needle_k",
"1",
"--num_needle_v",
"4",
"--num_needle_q",
"1",
"--type_haystack",
"needle",
"--type_needle_k",
"words",
"--type_needle_v",
"numbers",
"--num_digits_v",
"10",
],
check=True,
)


def prepare_mv_niah_easy(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_mmlu",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"mmlu",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--num_order",
"4",
"--task_type",
"niah",
"--algo_type",
"single",
],
check=True,
)


def prepare_mv_niah_medium(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_mmlu",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"mmlu",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--num_order",
"4",
"--task_type",
"retrieve",
"--algo_type",
"2steps",
],
check=True,
)


def prepare_mv_niah_hard(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_mmlu",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"mmlu",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--num_order",
"4",
"--task_type",
"retrieve",
"--algo_type",
"single",
],
check=True,
)


def prepare_qa_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_qa",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"hotpotqa",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--task_type",
"retrieve",
"--query_type",
"doc",
],
check=True,
)


def prepare_qa_easy(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_qa",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"hotpotqa",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--task_type",
"retrieve",
"--query_type",
"question",
],
check=True,
)


def prepare_qa_medium(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_qa",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"hotpotqa",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--task_type",
"solve",
"--algo_type",
"2steps",
],
check=True,
)


def prepare_qa_hard(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python",
"-m",
"nemo_skills.dataset.ruler2.prepare_qa",
"--output_folder",
output_folder,
"--tokenizer_type",
tokenizer_type,
"--tokenizer_path",
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),
"--random_seed",
"42",
"--dataset",
"hotpotqa",
"--fewshot",
"0",
"--prompt_type",
"instruct",
"--task_type",
"solve",
"--algo_type",
"single",
],
check=True,
)

🛠️ Refactor suggestion | 🟠 Major

Massive DRY violation — 12 near-identical functions should be collapsed into a data-driven dispatcher.

Every prepare_* function follows the same pattern: build a list of subprocess args from a fixed module name plus task-specific parameters, then call subprocess.run. The only differences are the target module (prepare_niah, prepare_mmlu, prepare_qa) and a handful of flag values. This results in ~400 lines that could be ~40.

♻️ Suggested refactor — declarative task config + single runner
-def prepare_mk_niah_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
-    subprocess.run(
-        [
-            "python",
-            "-m",
-            "nemo_skills.dataset.ruler2.prepare_niah",
-            "--output_folder",
-            output_folder,
-            ...
-        ],
-        check=True,
-    )
-
-# ... (11 more nearly identical functions)
+TASK_CONFIGS: dict[str, dict] = {
+    "mk_niah_basic": {
+        "module": "nemo_skills.dataset.ruler2.prepare_niah",
+        "extra_args": [
+            "--num_needle_k", "1", "--num_needle_v", "1", "--num_needle_q", "1",
+            "--type_haystack", "needle", "--type_needle_k", "words",
+            "--type_needle_v", "numbers", "--num_digits_v", "10",
+        ],
+    },
+    "mk_niah_easy": {
+        "module": "nemo_skills.dataset.ruler2.prepare_mmlu",
+        "extra_args": [
+            "--dataset", "mmlu", "--fewshot", "0", "--prompt_type", "instruct",
+            "--num_order", "0", "--task_type", "retrieve", "--algo_type", "single",
+        ],
+    },
+    # ... one entry per task ...
+}
+
+
+def _run_prepare(
+    task: str,
+    output_folder: str,
+    tokenizer_type: str,
+    tokenizer_path: str,
+    length: int,
+    dataset_size: int,
+) -> None:
+    cfg = TASK_CONFIGS[task]
+    subprocess.run(
+        [
+            "python", "-m", cfg["module"],
+            "--output_folder", output_folder,
+            "--tokenizer_type", tokenizer_type,
+            "--tokenizer_path", tokenizer_path,
+            "--max_seq_length", str(length),
+            "--num_samples", str(dataset_size),
+            "--random_seed", "42",
+            *cfg["extra_args"],
+        ],
+        check=True,
+    )

This eliminates ~350 lines of boilerplate and makes adding a new task a one-liner dict entry instead of a whole new function.

🧰 Tools
🪛 Ruff (0.15.0)

[error] 26-26: subprocess call: check for execution of untrusted input

(S603)


[error] 27-57: Starting a process with a partial executable path

(S607)


[error] 63-63: subprocess call: check for execution of untrusted input

(S603)


[error] 64-92: Starting a process with a partial executable path

(S607)


[error] 98-98: subprocess call: check for execution of untrusted input

(S603)


[error] 99-127: Starting a process with a partial executable path

(S607)


[error] 133-133: subprocess call: check for execution of untrusted input

(S603)


[error] 134-162: Starting a process with a partial executable path

(S607)


[error] 168-168: subprocess call: check for execution of untrusted input

(S603)


[error] 169-199: Starting a process with a partial executable path

(S607)


[error] 205-205: subprocess call: check for execution of untrusted input

(S603)


[error] 206-234: Starting a process with a partial executable path

(S607)


[error] 240-240: subprocess call: check for execution of untrusted input

(S603)


[error] 241-269: Starting a process with a partial executable path

(S607)


[error] 275-275: subprocess call: check for execution of untrusted input

(S603)


[error] 276-304: Starting a process with a partial executable path

(S607)


[error] 310-310: subprocess call: check for execution of untrusted input

(S603)


[error] 311-337: Starting a process with a partial executable path

(S607)


[error] 343-343: subprocess call: check for execution of untrusted input

(S603)


[error] 344-370: Starting a process with a partial executable path

(S607)


[error] 376-376: subprocess call: check for execution of untrusted input

(S603)


[error] 377-403: Starting a process with a partial executable path

(S607)


[error] 409-409: subprocess call: check for execution of untrusted input

(S603)


[error] 410-436: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
In `@nemo_skills/dataset/ruler2/prepare_data.py` around lines 25 - 438, Many
near-duplicate functions (e.g., prepare_mk_niah_basic, prepare_mk_niah_easy,
prepare_mv_niah_basic, prepare_qa_hard, etc.) repeat subprocess.run logic with
only module name (prepare_niah, prepare_mmlu, prepare_qa) and a few flag values
changing; replace them with one generic runner (e.g., run_prepare_task) that
accepts a task key and common params (output_folder, tokenizer_type,
tokenizer_path, length, dataset_size) and use a declarative dict mapping task
keys to module name plus per-task flag/value dicts; the runner should build the
argv list by starting with ["python","-m", module] then iterating the flag dict
to append "--flag", str(value), call subprocess.run(..., check=True), and
replace all prepare_* callers with a single call to run_prepare_task(task_key,
...).

Comment on lines +479 to +491
```python
if __name__ == "__main__":
    parser = build_prepare_parser(description="Prepare RULER2 dataset data.")
    args, unknown = parse_known_args(parser)
    _ = unknown
    prepare_dataset(
        args.tasks,
        args.setup,
        args.max_seq_length,
        args.tokenizer_type,
        args.tokenizer_path,
        args.dataset_size,
    )
    print("RULER2 data preparation completed.")
```

⚠️ Potential issue | 🟠 Major

Unknown CLI arguments are silently discarded.

Line 482 assigns unknown args to _, meaning any typo or unsupported flag passed by the user is silently ignored. This violates the coding guideline about not silently ignoring unused user-passed parameters. Use parser.parse_args() instead to reject unrecognized arguments, or at minimum raise an error if unknown is non-empty.

Proposed fix
-    args, unknown = parse_known_args(parser)
-    _ = unknown
+    args = parser.parse_args()

If parse_known_args is needed for other callers, add a guard here:

     args, unknown = parse_known_args(parser)
-    _ = unknown
+    if unknown:
+        parser.error(f"Unrecognized arguments: {unknown}")

As per coding guidelines: "Avoid silently ignoring unused user-passed parameters; the code should fail if a required argument is not specified or if unsupported arguments are provided."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
 if __name__ == "__main__":
     parser = build_prepare_parser(description="Prepare RULER2 dataset data.")
-    args, unknown = parse_known_args(parser)
-    _ = unknown
+    args = parser.parse_args()
     prepare_dataset(
         args.tasks,
         args.setup,
         args.max_seq_length,
         args.tokenizer_type,
         args.tokenizer_path,
         args.dataset_size,
     )
     print("RULER2 data preparation completed.")
```
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/ruler2/prepare_data.py` around lines 479 - 491, The
current main block uses parse_known_args and assigns unknown to `_`, silently
discarding unrecognized CLI flags; change this to either call
parser.parse_args() to let argparse reject unknown args, or keep
parse_known_args but immediately check the returned unknown list and raise a
clear error (or call parser.error()) if it's non-empty; update the block around
build_prepare_parser, parse_known_args, and the call site before prepare_dataset
to perform this validation so typos/unsupported flags are not ignored.

Comment on lines +174 to 177
```python
    if is_class:
        evaluator = obj(eval_config)
        print(f"evaluator: {evaluator}")
        return asyncio.run(evaluator.eval_full())
```

⚠️ Potential issue | 🟡 Minor

Remove debug print statement.

Line 176 (print(f"evaluator: {evaluator}")) appears to be a leftover debug statement. It will produce noisy output on every class-based evaluation run. Remove it or replace with LOG.debug(...) if the information is useful.

Proposed fix
     if is_class:
         evaluator = obj(eval_config)
-        print(f"evaluator: {evaluator}")
         return asyncio.run(evaluator.eval_full())
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
     if is_class:
         evaluator = obj(eval_config)
-        print(f"evaluator: {evaluator}")
         return asyncio.run(evaluator.eval_full())
```
🤖 Prompt for AI Agents
In `@nemo_skills/evaluation/evaluator/__init__.py` around lines 174 - 177, The
debug print in the class-instantiation branch should be removed or converted to
a proper logger call; replace the line `print(f"evaluator: {evaluator}")` with
either nothing or `LOG.debug("evaluator: %s", evaluator)` (or equivalent) inside
the block where `is_class` is true after `evaluator = obj(eval_config)` so you
don't emit noisy stdout during `evaluator.eval_full()` runs.

Comment on lines +745 to 746

```python
                    f"    --metric_type={metric_type or benchmark_args.metrics_type} "
                )
```

⚠️ Potential issue | 🟡 Minor

--metric_type=None will be passed as a literal string if both values are None.

If metric_type is None and benchmark_args.metrics_type is also None (e.g., a benchmark __init__.py that doesn't define METRICS_TYPE), the f-string produces --metric_type=None — the string "None" — which will likely cause a downstream error in summarize_results.

Consider guarding this:

Proposed fix
-                    f"    --metric_type={metric_type or benchmark_args.metrics_type} "
+                    f"    --metric_type={metric_type or benchmark_args.metrics_type or ''} "

Or better, conditionally append the flag only when a value exists:

effective_metric_type = metric_type or benchmark_args.metrics_type
...
if effective_metric_type:
    command += f" --metric_type={effective_metric_type} "
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-                    f"    --metric_type={metric_type or benchmark_args.metrics_type} "
+                    f"    --metric_type={metric_type or benchmark_args.metrics_type or ''} "
                 )
```
🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/eval.py` around lines 745 - 746, The current f-string
will inject the literal "None" when both metric_type and
benchmark_args.metrics_type are None; compute an effective value (e.g.,
effective_metric_type = metric_type or benchmark_args.metrics_type) and only
append the "--metric_type=..." flag to the command when effective_metric_type is
truthy, so summarize_results never receives the string "None" as a metric type.

Comment on lines +191 to 194
```python
dataset_dir = nemo_skills_dir / "dataset"
for f in dataset_dir.rglob("*.jsonl"):
    include_patterns.append(str(f))
    include_pattern_relative_paths.append(str(nemo_skills_dir.parent))
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Count how many .jsonl files exist under dataset/
find . -name "*.jsonl" -type f | head -20
echo "---"
find . -name "*.jsonl" -type f | wc -l

Repository: NVIDIA-NeMo/Skills

Length of output: 314


🏁 Script executed:

# Examine the packager.py file to understand the context
wc -l nemo_skills/pipeline/utils/packager.py
head -50 nemo_skills/pipeline/utils/packager.py

Repository: NVIDIA-NeMo/Skills

Length of output: 1564


🏁 Script executed:

# Look at the specific lines mentioned (191-194) and surrounding context
sed -n '170,220p' nemo_skills/pipeline/utils/packager.py

Repository: NVIDIA-NeMo/Skills

Length of output: 2295


🏁 Script executed:

# Search for how include_patterns and include_pattern_relative_paths are used
rg "include_patterns|include_pattern_relative_paths" nemo_skills/pipeline/utils/packager.py -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1487


Extract the relative path append from the loop to avoid duplicate entries.

The current code appends str(nemo_skills_dir.parent) once per JSONL file, resulting in duplicate identical entries in include_pattern_relative_paths. The relative path should be appended once outside the loop since it's the same for all files. Consider moving include_pattern_relative_paths.append(str(nemo_skills_dir.parent)) outside the for loop.
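A sketch of the suggested restructuring, wrapped in a function so it is self-contained; the parameter names mirror the quoted packager.py snippet, and this is not presented as the actual fix.

```python
from pathlib import Path


def add_dataset_includes(
    nemo_skills_dir: Path,
    include_patterns: list,
    include_pattern_relative_paths: list,
) -> None:
    dataset_dir = nemo_skills_dir / "dataset"
    jsonl_files = list(dataset_dir.rglob("*.jsonl"))
    include_patterns.extend(str(f) for f in jsonl_files)
    if jsonl_files:
        # The shared relative path is appended once, not once per file.
        include_pattern_relative_paths.append(str(nemo_skills_dir.parent))
```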

🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/utils/packager.py` around lines 191 - 194, The loop that
builds include_patterns iterates over dataset_dir.rglob("*.jsonl") and currently
calls include_pattern_relative_paths.append(str(nemo_skills_dir.parent)) inside
that loop, creating duplicate entries; move the append call outside the for f in
dataset_dir.rglob("*.jsonl") loop (or append once conditionally after detecting
at least one JSONL) so include_pattern_relative_paths only adds
str(nemo_skills_dir.parent) a single time when dataset files exist; update the
block around dataset_dir, include_patterns and include_pattern_relative_paths in
packager.py accordingly.
