Adding an option to store benchmarks in external repo #1240
base: main
Conversation
Too many files changed for review.
📝 Walkthrough

Refactors dataset configuration (removes DATASET_GROUP widely; adds REQUIRES_DATA_DIR / HAS_DYNAMIC_INIT), modularizes the RULER/RULER2 prepare flow, adds external benchmark support and tests, enhances evaluator dispatch and dataset resolution utilities, updates pipelines/packaging, and documents custom benchmarks.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed, ❌ 2 failed (2 warnings)
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/inference/generate.py (1)
583-591: ⚠️ Potential issue | 🔴 Critical

Breaking change: ArenaJudge will crash when calling `super().process_single_datapoint()`.

`ArenaJudge.process_single_datapoint` calls `super().process_single_datapoint(gen_base_data, all_data)` at lines 152-153. This invokes the base class's `process_single_datapoint`, which then calls `self.fill_prompt(data_point, all_data, prompt_format)` as a positional call (line 692). Since `self` is an ArenaJudge instance, this attempts to pass 3 positional arguments to `ArenaJudge.fill_prompt`, which only accepts 2 parameters. This will raise `TypeError: fill_prompt() takes 3 positional arguments but 4 were given`. To fix this, either:

- Pass `prompt_format` as a keyword argument in the base class:

  -"prompt": self.fill_prompt(data_point, all_data, prompt_format),
  +"prompt": self.fill_prompt(data_point, all_data, prompt_format=prompt_format),

- Update `ArenaJudge.fill_prompt` to accept the new parameter:

  -def fill_prompt(self, data_point, data):
  +def fill_prompt(self, data_point, data, prompt_format=None):

Both changes are needed for full compatibility.
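To make the failure mode concrete, here is a minimal standalone reproduction with toy classes (illustrative stand-ins, not the actual nemo_skills implementations):

```python
# Toy reproduction of the positional-argument mismatch described above.
class Base:
    def fill_prompt(self, data_point, data, prompt_format=None):
        return f"{data_point} (format={prompt_format})"

    def process_single_datapoint(self, data_point, all_data, prompt_format=None):
        # Positional call: breaks if a subclass defines a narrower fill_prompt.
        return self.fill_prompt(data_point, all_data, prompt_format)


class Judge(Base):
    def fill_prompt(self, data_point, data):  # narrower signature than Base
        return f"judge: {data_point}"


try:
    Judge().process_single_datapoint("dp", ["dp"], prompt_format="chat")
except TypeError as err:
    print(err)  # fill_prompt() takes 3 positional arguments but 4 were given
```

Passing `prompt_format` as a keyword in the base class and widening the subclass signature (as in the diffs above) both make this call succeed.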
🤖 Fix all issues with AI agents
In `@docs/evaluation/custom-benchmarks.md`:
- Around line 198-208: The example calls LOG.info in the generate function but
never defines LOG; add the logger setup by importing logging and get_logger_name
(from nemo_skills.utils) and creating LOG =
logging.getLogger(get_logger_name(__file__)); ensure these imports and the LOG
definition appear near the top of the example so LOG is available when
generate(cfg: WordCountGenerationConfig) calls LOG.info.
- Around line 264-277: In the update method, fix the NameError by passing the
correct variable name to _compute_pass_at_k: replace the incorrect
predicted_answer with the defined predicted_answers; specifically, in the
update(self, predictions) body ensure the call to _compute_pass_at_k uses
predicted_answers (the list defined from predictions) so it matches the later
_compute_majority_at_k call and the local variable name.
In `@nemo_skills/dataset/bfcl_v3/parallel/__init__.py`:
- Around line 1-3: The file is missing the standard NVIDIA Apache 2.0 license
header; add the exact repo-standard copyright/license header comment block to
the top of this __init__.py (and any other new bfcl_v3 __init__.py files) above
the existing constants METRICS_TYPE, GENERATION_ARGS, and GENERATION_MODULE so
the file matches other files in the PR.
In `@nemo_skills/dataset/prepare.py`:
- Around line 36-49: The CLI flag --retries currently defaults to 0 while the
prepare_datasets function signature defaults to 3, causing inconsistent
behavior; pick one canonical default and make both match (either change the
parser add_argument call for "--retries" to default=3 or change the
prepare_datasets signature to retries=0) so callers get the same retry behavior
whether invoked from the CLI or as a library function; update the parser
add_argument("--retries", ...) and/or the prepare_datasets(...) retries
parameter accordingly.
In `@nemo_skills/dataset/ruler/prepare_common.py`:
- Around line 33-39: The message MISSING_RULER_ARGS_MESSAGE currently begins
with "ERROR:" but parse_args_and_prepare_args exits with SystemExit(0),
producing a success status while signaling an error; fix by making them
consistent: either remove the "ERROR:" prefix from MISSING_RULER_ARGS_MESSAGE if
skipping is intentional, or change the exit in parse_args_and_prepare_args to
SystemExit(1) to indicate failure; locate and update the symbols
MISSING_RULER_ARGS_MESSAGE and the exit call in parse_args_and_prepare_args
(also consider how prepare.py's subprocess `check=True` will interpret the
chosen exit code) so the log text and exit code match the intended behavior.
In `@nemo_skills/dataset/ruler/prepare_data.py`:
- Around line 49-57: The git-LFS check in the "if 'cwe' in tasks" block
currently only catches subprocess.CalledProcessError but will crash with a
FileNotFoundError when git is not installed; update the exception handling
around the subprocess.run(["git", "lfs", "--version"]) call in prepare_data.py
to catch both subprocess.CalledProcessError and FileNotFoundError (or a broad
OSError) and then print the existing friendly message and exit(1) so missing git
or missing git-lfs both produce the same helpful output.
In `@nemo_skills/dataset/ruler2/prepare_data.py`:
- Around line 25-438: Many near-duplicate functions (e.g.,
prepare_mk_niah_basic, prepare_mk_niah_easy, prepare_mv_niah_basic,
prepare_qa_hard, etc.) repeat subprocess.run logic with only module name
(prepare_niah, prepare_mmlu, prepare_qa) and a few flag values changing; replace
them with one generic runner (e.g., run_prepare_task) that accepts a task key
and common params (output_folder, tokenizer_type, tokenizer_path, length,
dataset_size) and use a declarative dict mapping task keys to module name plus
per-task flag/value dicts; the runner should build the argv list by starting
with ["python","-m", module] then iterating the flag dict to append "--flag",
str(value), call subprocess.run(..., check=True), and replace all prepare_*
callers with a single call to run_prepare_task(task_key, ...).
- Around line 479-491: The current main block uses parse_known_args and assigns
unknown to `_`, silently discarding unrecognized CLI flags; change this to
either call parser.parse_args() to let argparse reject unknown args, or keep
parse_known_args but immediately check the returned unknown list and raise a
clear error (or call parser.error()) if it's non-empty; update the block around
build_prepare_parser, parse_known_args, and the call site before prepare_dataset
to perform this validation so typos/unsupported flags are not ignored.
In `@nemo_skills/evaluation/evaluator/__init__.py`:
- Around line 174-177: The debug print in the class-instantiation branch should
be removed or converted to a proper logger call; replace the line
`print(f"evaluator: {evaluator}")` with either nothing or `LOG.debug("evaluator:
%s", evaluator)` (or equivalent) inside the block where `is_class` is true after
`evaluator = obj(eval_config)` so you don't emit noisy stdout during
`evaluator.eval_full()` runs.
In `@nemo_skills/pipeline/eval.py`:
- Around line 745-746: The current f-string will inject the literal "None" when
both metric_type and benchmark_args.metrics_type are None; compute an effective
value (e.g., effective_metric_type = metric_type or benchmark_args.metrics_type)
and only append the "--metric_type=..." flag to the command when
effective_metric_type is truthy, so summarize_results never receives the string
"None" as a metric type.
In `@nemo_skills/pipeline/utils/packager.py`:
- Around line 191-194: The loop that builds include_patterns iterates over
dataset_dir.rglob("*.jsonl") and currently calls
include_pattern_relative_paths.append(str(nemo_skills_dir.parent)) inside that
loop, creating duplicate entries; move the append call outside the for f in
dataset_dir.rglob("*.jsonl") loop (or append once conditionally after detecting
at least one JSONL) so include_pattern_relative_paths only adds
str(nemo_skills_dir.parent) a single time when dataset files exist; update the
block around dataset_dir, include_patterns and include_pattern_relative_paths in
packager.py accordingly.
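For the ruler `prepare_data.py` git-lfs item in the list above, a minimal sketch of the broadened exception handling could look like the following (the helper name and message text are illustrative, not the repo's exact wording):

```python
import subprocess
import sys


def ensure_git_lfs_installed() -> None:
    """Exit with a friendly message if git or git-lfs is unavailable (illustrative helper)."""
    try:
        subprocess.run(["git", "lfs", "--version"], check=True, capture_output=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        # CalledProcessError: git is present but the lfs extension is not installed.
        # FileNotFoundError: the git executable itself is missing.
        print("The 'cwe' task requires git and git-lfs. Please install both and retry.")
        sys.exit(1)
```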
🧹 Nitpick comments (17)
nemo_skills/inference/generate.py (1)
583-585: Add type hints for the new `prompt_format` parameters. Per coding guidelines, simple types should have type hints.

- def fill_prompt(self, data_point, data, prompt_format=None):
+ def fill_prompt(self, data_point, data, prompt_format: str | None = None):

- async def process_single_datapoint(self, data_point, all_data, prompt_format=None):
+ async def process_single_datapoint(self, data_point, all_data, prompt_format: str | None = None):

As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code"
Also applies to: 680-681
nemo_skills/dataset/ruler/prepare_common.py (1)
89-94: Add return type hint. Per project guidelines, use type hints for simple types.

Suggested fix

-def parse_args_and_prepare_args(parser: argparse.ArgumentParser):
+def parse_args_and_prepare_args(parser: argparse.ArgumentParser) -> tuple[argparse.Namespace, str]:

nemo_skills/dataset/ruler2/prepare_common.py (1)
75-77: Add return type hint. Same as the ruler counterpart — add a return type hint for consistency.

Suggested fix

-def parse_known_args(parser: argparse.ArgumentParser):
+def parse_known_args(parser: argparse.ArgumentParser) -> tuple[argparse.Namespace, list[str]]:

nemo_skills/dataset/ruler/prepare_data.py (2)
60-60: Pass a string instead of a single-element list when using `shell=True`.

When `shell=True`, passing a list is misleading — the first element becomes the shell command string, and subsequent elements become arguments to the shell itself (not the command). Use a plain string here.

Suggested fix

- subprocess.run(["pip install wonderwords html2text tenacity"], check=True, shell=True)
+ subprocess.run("pip install wonderwords html2text tenacity", check=True, shell=True)
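An alternative sketch that avoids `shell=True` entirely is to call pip through the running interpreter (package list copied from the diff above; whether that is preferable here is a design choice):

```python
import subprocess
import sys

# No shell involved: arguments are passed directly, and the packages land in the
# same Python environment that is executing this script.
subprocess.run(
    [sys.executable, "-m", "pip", "install", "wonderwords", "html2text", "tenacity"],
    check=True,
)
```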
62-69: Manual `__enter__`/`__exit__` on `TemporaryDirectory` is fragile.

Calling dunder methods directly bypasses the context manager protocol's guarantees. Consider restructuring so the conditional temp directory uses a proper context manager or a simpler pattern:
Suggested refactor
-    if tmp_data_dir is not None:
-        tmpdirname = tmp_data_dir
-        Path(tmpdirname).mkdir(parents=True, exist_ok=True)
-        tmpdir_context = None
-    else:
-        tmpdir_context = tempfile.TemporaryDirectory()
-        tmpdirname = tmpdir_context.__enter__()
-
-    try:
+    with tempfile.TemporaryDirectory() as _tmpdir:
+        if tmp_data_dir is not None:
+            tmpdirname = tmp_data_dir
+            Path(tmpdirname).mkdir(parents=True, exist_ok=True)
+        else:
+            tmpdirname = _tmpdir

This removes the manual `__enter__`/`__exit__` and the `try`/`finally` block entirely. The unused `TemporaryDirectory` is cheap when `tmp_data_dir` is provided.

nemo_skills/dataset/ruler2/prepare_init.py (1)
61-69: Silently discarding unknown CLI arguments may hide user errors.

Line 64 discards `unknown` args. Since `prepare.py` forwards all `sys.argv` to both `prepare_init.py` and `prepare_data.py`, this is understandable — init doesn't need data-prep args. However, if run standalone, typos or unsupported flags will be silently ignored.

Consider at minimum logging the discarded args for debuggability, or documenting that this script is intended to be invoked via `prepare.py`.
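A minimal sketch of the "log what was discarded" option (argument names and logger setup are assumptions for illustration, not the file's actual code):

```python
import argparse
import logging

LOG = logging.getLogger(__name__)

parser = argparse.ArgumentParser(description="RULER2 init step (illustrative).")
parser.add_argument("--setup", default=None)
args, unknown = parser.parse_known_args()
if unknown:
    # Don't fail, since prepare.py forwards data-prep flags too, but leave a trace
    # so standalone users can spot typos.
    LOG.warning("Ignoring arguments not used by prepare_init.py: %s", unknown)
```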
nemo_skills/dataset/ruler2/prepare_data.py (3)

459-460: Remove commented-out code.

The commented-out `pip install` on line 460 is dead code. If the dependency installation is needed, it should be documented or handled in a setup step, not left as a comment in runtime code.
441-476: No validation of task names — `KeyError` with no helpful message.

If a user passes an invalid task name, `prepare_task[task]` on line 466 raises a bare `KeyError`. Consider validating upfront with a clear error message listing available tasks.

Proposed fix

+ invalid_tasks = set(tasks) - set(prepare_task.keys())
+ if invalid_tasks:
+     raise ValueError(f"Unknown tasks: {invalid_tasks}. Available: {list(prepare_task.keys())}")
+
  with concurrent.futures.ThreadPoolExecutor() as executor:
441-441: Add type hints to function signatures.

All public functions in this file lack type hints. At minimum, `prepare_dataset` and the individual `prepare_*` functions should annotate their parameters with basic types (`str`, `int`, `list[str]`, etc.). As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."
82-129: Potentialrelative_tofailure whenrepo_pathis a subdirectory of the git root.Line 106 validates that
local_data_pathis relative torepo_meta.path, but line 119 callslocal_data_path.relative_to(effective_root)whereeffective_rootis the git root. Ifrepo_meta.pathis registered as a parent of the actual git root (unlikely but not prevented byRepoMetadata), or if the repo structure has symlinks that break the ancestor chain, thisrelative_tocall would raise an unhandledValueError.The happy path (repo_path is at or below git_root) is safe because git_root is always an ancestor of repo_path, hence also of local_data_path. Just something to be aware of in edge cases.
nemo_skills/pipeline/prepare_data.py (2)
213-225:--prepare_entrypoint prepare_data.pyis appended after_build_commandwhich already joinedprepare_unknown_args.Line 225 appends
213-225: `--prepare_entrypoint prepare_data.py` is appended after `_build_command`, which already joined `prepare_unknown_args`.

Line 225 appends `--prepare_entrypoint prepare_data.py` after the unknown args were already joined on line 113 (inside `_build_command`). The resulting command would look like:

python -m nemo_skills.dataset.prepare <datasets> <unknown_args> --prepare_entrypoint prepare_data.py

This works because argparse parses named arguments positionally-independently, but the command string looks a bit odd. Consider passing `prepare_entrypoint` through the unknown args or appending it inside `_build_command` for clarity.
227-232: When `executor == "none"`, `_get_container_dataset_path` returns container paths that won't exist locally.

After tracing the control flow, it appears that
data_dirbeing set always implies containerized execution (lines 173-178 requireclusterwhendata_diris set, and line 181-185 only drops toexecutor="none"whendata_diris absent). So this path is safe.However, this invariant is implicit and fragile — a future change to the early-exit logic could silently break the
cpcommands. Consider adding a defensive assertion or comment.💡 Suggested comment for future maintainers
  if data_dir:
+     # data_dir implies containerized execution (executor != "none"),
+     # so container paths are valid for cp commands below.
      command += f" && mkdir -p {data_dir}"

nemo_skills/evaluation/evaluator/__init__.py (2)
92-117: `_resolve_eval_type` — clean centralized dispatch.

The dual-format support (built-in keys and
`::path` format) is well-structured. One minor note: `getattr(module, attr_str)` on line 109 will raise an `AttributeError` with a generic message if the attribute doesn't exist. Consider wrapping it to provide a more descriptive error mentioning the eval_type string.
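A sketch of what the wrapped lookup could look like, written as a self-contained helper rather than the repo's actual `_resolve_eval_type`:

```python
import importlib


def resolve_eval_attr(eval_type: str):
    """Resolve a 'package.module::attr' eval_type with a descriptive error (illustrative)."""
    module_path, _, attr_str = eval_type.partition("::")
    module = importlib.import_module(module_path)
    try:
        return getattr(module, attr_str)
    except AttributeError as err:
        raise ValueError(
            f"eval_type '{eval_type}' points to '{attr_str}', "
            f"which does not exist in module '{module_path}'"
        ) from err
```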
147-154: `supports_single_eval` instantiates the evaluator just to check a capability flag.
`obj(config)` (line 153) constructs a full evaluator instance solely to call `supports_single_eval()`. If the constructor has side effects or is expensive, this is wasteful. Since `supports_single_eval` in `BaseEvaluator` only checks whether `eval_single` is overridden (a class-level property), this could be a static/class-level check instead. Low priority given this matches the previous pattern.
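A sketch of a class-level capability check that avoids constructing the evaluator; `BaseEvaluator` and `eval_single` here are stand-ins mirroring the comment, not the real classes:

```python
class BaseEvaluator:
    async def eval_single(self, data_point):
        raise NotImplementedError

    @classmethod
    def supports_single_eval(cls) -> bool:
        # True only when a subclass actually overrides eval_single;
        # no instance (and no constructor side effects) is needed.
        return cls.eval_single is not BaseEvaluator.eval_single


class MyEvaluator(BaseEvaluator):
    async def eval_single(self, data_point):
        return {"score": 1.0}


print(BaseEvaluator.supports_single_eval())  # False
print(MyEvaluator.supports_single_eval())    # True
```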
docs/evaluation/custom-benchmarks.md (1)

27-27: Add language specifiers to fenced code blocks.

Lines 27, 135, 240, and 281 have fenced code blocks without a language identifier. Consider adding
`text` or the appropriate language (e.g., `bash`) to satisfy linting and improve rendering.

nemo_skills/dataset/utils.py (2)
62-75: Consider caching `get_extra_benchmark_map()` to avoid repeated file I/O.
`get_extra_benchmark_map()` is called here and again inside `get_dataset_module()`. Each call re-reads and parses the JSON file. If these functions are called in a loop over multiple datasets, the map file will be loaded on every iteration. A simple `@functools.lru_cache` on `get_extra_benchmark_map` (keyed on the env var value) would eliminate redundant reads.
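A sketch of the cached variant; the environment variable name and JSON shape are assumptions for illustration:

```python
import functools
import json
import os


@functools.lru_cache(maxsize=None)
def _load_benchmark_map(map_path: str) -> dict:
    # Cached per path, so the JSON file is parsed at most once per value.
    with open(map_path) as f:
        return json.load(f)


def get_extra_benchmark_map() -> dict:
    # Keying on the env var's current value keeps behavior correct if it changes
    # between calls. "NEMO_SKILLS_EXTRA_DATASETS" is a placeholder name here.
    map_path = os.environ.get("NEMO_SKILLS_EXTRA_DATASETS", "")
    return _load_benchmark_map(map_path) if map_path else {}
```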
55-59: Add type hints to new functions.

Per coding guidelines, simple types should be annotated. All new public functions (
`get_dataset_name`, `get_dataset_path`, `get_extra_benchmark_map`, `get_dataset_module`, `get_default_dataset_module`) and the internal `_load_external_dataset` lack parameter and return type hints. For example:
-def get_dataset_name(dataset):
+def get_dataset_name(dataset: str) -> str:

-def get_dataset_path(dataset):
+def get_dataset_path(dataset: str) -> Path:

-def get_extra_benchmark_map():
+def get_extra_benchmark_map() -> dict[str, str]:

As per coding guidelines: "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."
Also applies to: 62-75, 95-104, 107-111, 114-155
| @hydra.main(version_base=None, config_name="base_generation_config") | ||
| def generate(cfg: WordCountGenerationConfig): | ||
| cfg = WordCountGenerationConfig(_init_nested=True, **cfg) | ||
| LOG.info("Config used: %s", cfg) | ||
| task = WordCountGenerationTask(cfg) | ||
| task.generate() | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| generate() | ||
| ``` |
LOG is not defined in this code example.
Line 201 uses LOG.info(...) but the example doesn't import or define LOG. Add the import for completeness:
from nemo_skills.utils import get_logger_name
import logging
LOG = logging.getLogger(get_logger_name(__file__))
| def update(self, predictions): | ||
| # base class provides convenient helpers for calculating | ||
| # common metrics like majority / pass | ||
| super().update(predictions) | ||
| predicted_answers = [pred["predicted_answer"] for pred in predictions] | ||
| self._compute_pass_at_k( | ||
| predictions=predictions, | ||
| predicted_answers=predicted_answer, | ||
| ) | ||
| self._compute_majority_at_k( | ||
| predictions=predictions, | ||
| predicted_answers=predicted_answers, | ||
| ) | ||
| ``` |
Bug in example code: predicted_answer should be predicted_answers.
Line 271 references predicted_answer (singular) but the variable defined on line 268 is predicted_answers (plural). Users copying this example will get a NameError.
Proposed fix
self._compute_pass_at_k(
predictions=predictions,
- predicted_answers=predicted_answer,
+ predicted_answers=predicted_answers,
)
| METRICS_TYPE = "bfcl" | ||
| GENERATION_ARGS = "++eval_type=bfcl" | ||
| GENERATION_MODULE = "nemo_skills.inference.eval.bfcl" |
Missing copyright/license header.
Every other file in this PR includes the standard NVIDIA copyright and Apache 2.0 license header. This new file is missing it. The same likely applies to the other new bfcl_v3 __init__.py files in this PR.
Proposed fix
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
METRICS_TYPE = "bfcl"
GENERATION_ARGS = "++eval_type=bfcl"
GENERATION_MODULE = "nemo_skills.inference.eval.bfcl"
| "--retries", | ||
| type=int, | ||
| default=0, | ||
| help="Number of retries per dataset if preparation fails", | ||
| ) | ||
| return parser.parse_known_args(args) | ||
|
|
||
|
|
||
| def prepare_datasets( | ||
| datasets=None, | ||
| dataset_groups=None, | ||
| add_lean4_header=False, | ||
| prepare_entrypoint="prepare.py", | ||
| extra_args="", | ||
| parallelism=20, | ||
| retries=3, |
Inconsistent default for retries: 0 in CLI vs 3 in function signature.
The CLI parser (line 38) defaults --retries to 0, but prepare_datasets (line 49) defaults retries to 3. When invoked from the command line, the CLI default wins (0 retries). When called as a library function without specifying retries, the function default wins (3 retries). This silent inconsistency is likely to confuse callers.
Proposed fix — align the defaults
def prepare_datasets(
datasets=None,
prepare_entrypoint="prepare.py",
extra_args="",
parallelism=20,
- retries=3,
+ retries=0,
):

Or change the CLI default to 3 if retries are desired by default.
| MISSING_RULER_ARGS_MESSAGE = ( | ||
| "ERROR: Can't prepare ruler without arguments provided! " | ||
| "Skipping the preparation step.\n" | ||
| "Example ruler prepare command:\n" | ||
| "ns prepare_data ruler --setup llama_128k " | ||
| "--tokenizer_path meta-llama/Llama-3.1-8B-Instruct --max_seq_length 131072" | ||
| ) |
Message says "ERROR" but SystemExit(0) signals success.
The MISSING_RULER_ARGS_MESSAGE starts with "ERROR:" but parse_args_and_prepare_args exits with code 0. The calling prepare.py uses check=True on subprocess, so this "error" will be silently treated as success and the next script will still run. Either drop the "ERROR" prefix (if skipping is intentional) or use SystemExit(1) (if it's a real failure).
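A sketch of the two consistent outcomes; the function below is illustrative and does not reproduce the module's actual structure:

```python
import sys


def exit_for_missing_ruler_args(treat_as_failure: bool) -> None:
    if treat_as_failure:
        # Real failure: keep the ERROR wording and exit non-zero, which
        # prepare.py's subprocess.run(..., check=True) will surface as an error.
        print("ERROR: Can't prepare ruler without the required arguments.")
        sys.exit(1)
    # Intentional skip: drop the ERROR prefix and keep the success exit code.
    print("Can't prepare ruler without arguments provided; skipping the preparation step.")
    sys.exit(0)
```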
| def prepare_mk_niah_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size): | ||
| subprocess.run( | ||
| [ | ||
| "python", | ||
| "-m", | ||
| "nemo_skills.dataset.ruler2.prepare_niah", | ||
| "--output_folder", | ||
| output_folder, | ||
| "--tokenizer_type", | ||
| tokenizer_type, | ||
| "--tokenizer_path", | ||
| tokenizer_path, | ||
| "--max_seq_length", | ||
| str(length), | ||
| "--num_samples", | ||
| str(dataset_size), | ||
| "--random_seed", | ||
| "42", | ||
| "--num_needle_k", | ||
| "1", | ||
| "--num_needle_v", | ||
| "1", | ||
| "--num_needle_q", | ||
| "1", | ||
| "--type_haystack", | ||
| "needle", | ||
| "--type_needle_k", | ||
| "words", | ||
| "--type_needle_v", | ||
| "numbers", | ||
| "--num_digits_v", | ||
| "10", | ||
| ], | ||
| check=True, | ||
| ) | ||
|
|
||
|
|
||
| def prepare_mk_niah_easy(output_folder, tokenizer_type, tokenizer_path, length, dataset_size): | ||
| subprocess.run( | ||
| [ | ||
| "python", | ||
| "-m", | ||
| "nemo_skills.dataset.ruler2.prepare_mmlu", | ||
| "--output_folder", | ||
| output_folder, | ||
| "--tokenizer_type", | ||
| tokenizer_type, | ||
| "--tokenizer_path", | ||
| tokenizer_path, | ||
| "--max_seq_length", | ||
| str(length), | ||
| "--num_samples", | ||
| str(dataset_size), | ||
| "--random_seed", | ||
| "42", | ||
| "--dataset", | ||
| "mmlu", | ||
| "--fewshot", | ||
| "0", | ||
| "--prompt_type", | ||
| "instruct", | ||
| "--num_order", | ||
| "0", | ||
| "--task_type", | ||
| "retrieve", | ||
| "--algo_type", | ||
| "single", | ||
| ], | ||
| check=True, | ||
| ) | ||
|
|
||
|
|
||
| def prepare_mk_niah_medium(output_folder, tokenizer_type, tokenizer_path, length, dataset_size): | ||
| subprocess.run( | ||
| [ | ||
| "python", | ||
| "-m", | ||
| "nemo_skills.dataset.ruler2.prepare_mmlu", | ||
| "--output_folder", | ||
| output_folder, | ||
| "--tokenizer_type", | ||
| tokenizer_type, | ||
| "--tokenizer_path", | ||
| tokenizer_path, | ||
| "--max_seq_length", | ||
| str(length), | ||
| "--num_samples", | ||
| str(dataset_size), | ||
| "--random_seed", | ||
| "42", | ||
| "--dataset", | ||
| "mmlu", | ||
| "--fewshot", | ||
| "5", | ||
| "--prompt_type", | ||
| "instruct", | ||
| "--num_order", | ||
| "0", | ||
| "--task_type", | ||
| "solve", | ||
| "--algo_type", | ||
| "2steps", | ||
| ], | ||
| check=True, | ||
| ) | ||
|
|
||
|
|
||
| def prepare_mk_niah_hard(output_folder, tokenizer_type, tokenizer_path, length, dataset_size): | ||
| subprocess.run( | ||
| [ | ||
| "python", | ||
| "-m", | ||
| "nemo_skills.dataset.ruler2.prepare_mmlu", | ||
| "--output_folder", | ||
| output_folder, | ||
| "--tokenizer_type", | ||
| tokenizer_type, | ||
| "--tokenizer_path", | ||
| tokenizer_path, | ||
| "--max_seq_length", | ||
| str(length), | ||
| "--num_samples", | ||
| str(dataset_size), | ||
| "--random_seed", | ||
| "42", | ||
| "--dataset", | ||
| "mmlu", | ||
| "--fewshot", | ||
| "5", | ||
| "--prompt_type", | ||
| "instruct", | ||
| "--num_order", | ||
| "0", | ||
| "--task_type", | ||
| "solve", | ||
| "--algo_type", | ||
| "single", | ||
| ], | ||
| check=True, | ||
| ) | ||
|
|
||
|
|
||
| def prepare_mv_niah_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size): | ||
| subprocess.run( | ||
| [ | ||
| "python", | ||
| "-m", | ||
| "nemo_skills.dataset.ruler2.prepare_niah", | ||
| "--output_folder", | ||
| output_folder, | ||
| "--tokenizer_type", | ||
| tokenizer_type, | ||
| "--tokenizer_path", | ||
| tokenizer_path, | ||
| "--max_seq_length", | ||
| str(length), | ||
| "--num_samples", | ||
| str(dataset_size), | ||
| "--random_seed", | ||
| "42", | ||
| "--num_needle_k", | ||
| "1", | ||
| "--num_needle_v", | ||
| "4", | ||
| "--num_needle_q", | ||
| "1", | ||
| "--type_haystack", | ||
| "needle", | ||
| "--type_needle_k", | ||
| "words", | ||
| "--type_needle_v", | ||
| "numbers", | ||
| "--num_digits_v", | ||
| "10", | ||
| ], | ||
| check=True, | ||
| ) | ||
|
|
||
|
|
||
| def prepare_mv_niah_easy(output_folder, tokenizer_type, tokenizer_path, length, dataset_size): | ||
| subprocess.run( | ||
| [ | ||
| "python", | ||
| "-m", | ||
| "nemo_skills.dataset.ruler2.prepare_mmlu", | ||
| "--output_folder", | ||
| output_folder, | ||
| "--tokenizer_type", | ||
| tokenizer_type, | ||
| "--tokenizer_path", | ||
| tokenizer_path, | ||
| "--max_seq_length", | ||
| str(length), | ||
| "--num_samples", | ||
| str(dataset_size), | ||
| "--random_seed", | ||
| "42", | ||
| "--dataset", | ||
| "mmlu", | ||
| "--fewshot", | ||
| "0", | ||
| "--prompt_type", | ||
| "instruct", | ||
| "--num_order", | ||
| "4", | ||
| "--task_type", | ||
| "niah", | ||
| "--algo_type", | ||
| "single", | ||
| ], | ||
| check=True, | ||
| ) | ||
|
|
||
|
|
||
| def prepare_mv_niah_medium(output_folder, tokenizer_type, tokenizer_path, length, dataset_size): | ||
| subprocess.run( | ||
| [ | ||
| "python", | ||
| "-m", | ||
| "nemo_skills.dataset.ruler2.prepare_mmlu", | ||
| "--output_folder", | ||
| output_folder, | ||
| "--tokenizer_type", | ||
| tokenizer_type, | ||
| "--tokenizer_path", | ||
| tokenizer_path, | ||
| "--max_seq_length", | ||
| str(length), | ||
| "--num_samples", | ||
| str(dataset_size), | ||
| "--random_seed", | ||
| "42", | ||
| "--dataset", | ||
| "mmlu", | ||
| "--fewshot", | ||
| "0", | ||
| "--prompt_type", | ||
| "instruct", | ||
| "--num_order", | ||
| "4", | ||
| "--task_type", | ||
| "retrieve", | ||
| "--algo_type", | ||
| "2steps", | ||
| ], | ||
| check=True, | ||
| ) | ||
|
|
||
|
|
||
| def prepare_mv_niah_hard(output_folder, tokenizer_type, tokenizer_path, length, dataset_size): | ||
| subprocess.run( | ||
| [ | ||
| "python", | ||
| "-m", | ||
| "nemo_skills.dataset.ruler2.prepare_mmlu", | ||
| "--output_folder", | ||
| output_folder, | ||
| "--tokenizer_type", | ||
| tokenizer_type, | ||
| "--tokenizer_path", | ||
| tokenizer_path, | ||
| "--max_seq_length", | ||
| str(length), | ||
| "--num_samples", | ||
| str(dataset_size), | ||
| "--random_seed", | ||
| "42", | ||
| "--dataset", | ||
| "mmlu", | ||
| "--fewshot", | ||
| "0", | ||
| "--prompt_type", | ||
| "instruct", | ||
| "--num_order", | ||
| "4", | ||
| "--task_type", | ||
| "retrieve", | ||
| "--algo_type", | ||
| "single", | ||
| ], | ||
| check=True, | ||
| ) | ||
|
|
||
|
|
||
| def prepare_qa_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size): | ||
| subprocess.run( | ||
| [ | ||
| "python", | ||
| "-m", | ||
| "nemo_skills.dataset.ruler2.prepare_qa", | ||
| "--output_folder", | ||
| output_folder, | ||
| "--tokenizer_type", | ||
| tokenizer_type, | ||
| "--tokenizer_path", | ||
| tokenizer_path, | ||
| "--max_seq_length", | ||
| str(length), | ||
| "--num_samples", | ||
| str(dataset_size), | ||
| "--random_seed", | ||
| "42", | ||
| "--dataset", | ||
| "hotpotqa", | ||
| "--fewshot", | ||
| "0", | ||
| "--prompt_type", | ||
| "instruct", | ||
| "--task_type", | ||
| "retrieve", | ||
| "--query_type", | ||
| "doc", | ||
| ], | ||
| check=True, | ||
| ) | ||
|
|
||
|
|
||
| def prepare_qa_easy(output_folder, tokenizer_type, tokenizer_path, length, dataset_size): | ||
| subprocess.run( | ||
| [ | ||
| "python", | ||
| "-m", | ||
| "nemo_skills.dataset.ruler2.prepare_qa", | ||
| "--output_folder", | ||
| output_folder, | ||
| "--tokenizer_type", | ||
| tokenizer_type, | ||
| "--tokenizer_path", | ||
| tokenizer_path, | ||
| "--max_seq_length", | ||
| str(length), | ||
| "--num_samples", | ||
| str(dataset_size), | ||
| "--random_seed", | ||
| "42", | ||
| "--dataset", | ||
| "hotpotqa", | ||
| "--fewshot", | ||
| "0", | ||
| "--prompt_type", | ||
| "instruct", | ||
| "--task_type", | ||
| "retrieve", | ||
| "--query_type", | ||
| "question", | ||
| ], | ||
| check=True, | ||
| ) | ||
|
|
||
|
|
||
| def prepare_qa_medium(output_folder, tokenizer_type, tokenizer_path, length, dataset_size): | ||
| subprocess.run( | ||
| [ | ||
| "python", | ||
| "-m", | ||
| "nemo_skills.dataset.ruler2.prepare_qa", | ||
| "--output_folder", | ||
| output_folder, | ||
| "--tokenizer_type", | ||
| tokenizer_type, | ||
| "--tokenizer_path", | ||
| tokenizer_path, | ||
| "--max_seq_length", | ||
| str(length), | ||
| "--num_samples", | ||
| str(dataset_size), | ||
| "--random_seed", | ||
| "42", | ||
| "--dataset", | ||
| "hotpotqa", | ||
| "--fewshot", | ||
| "0", | ||
| "--prompt_type", | ||
| "instruct", | ||
| "--task_type", | ||
| "solve", | ||
| "--algo_type", | ||
| "2steps", | ||
| ], | ||
| check=True, | ||
| ) | ||
|
|
||
|
|
||
| def prepare_qa_hard(output_folder, tokenizer_type, tokenizer_path, length, dataset_size): | ||
| subprocess.run( | ||
| [ | ||
| "python", | ||
| "-m", | ||
| "nemo_skills.dataset.ruler2.prepare_qa", | ||
| "--output_folder", | ||
| output_folder, | ||
| "--tokenizer_type", | ||
| tokenizer_type, | ||
| "--tokenizer_path", | ||
| tokenizer_path, | ||
| "--max_seq_length", | ||
| str(length), | ||
| "--num_samples", | ||
| str(dataset_size), | ||
| "--random_seed", | ||
| "42", | ||
| "--dataset", | ||
| "hotpotqa", | ||
| "--fewshot", | ||
| "0", | ||
| "--prompt_type", | ||
| "instruct", | ||
| "--task_type", | ||
| "solve", | ||
| "--algo_type", | ||
| "single", | ||
| ], | ||
| check=True, | ||
| ) |
🛠️ Refactor suggestion | 🟠 Major
Massive DRY violation — 12 near-identical functions should be collapsed into a data-driven dispatcher.
Every prepare_* function follows the same pattern: build a list of subprocess args from a fixed module name plus task-specific parameters, then call subprocess.run. The only differences are the target module (prepare_niah, prepare_mmlu, prepare_qa) and a handful of flag values. This results in ~400 lines that could be ~40.
♻️ Suggested refactor — declarative task config + single runner
-def prepare_mk_niah_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
- subprocess.run(
- [
- "python",
- "-m",
- "nemo_skills.dataset.ruler2.prepare_niah",
- "--output_folder",
- output_folder,
- ...
- ],
- check=True,
- )
-
-# ... (11 more nearly identical functions)
+TASK_CONFIGS: dict[str, dict] = {
+ "mk_niah_basic": {
+ "module": "nemo_skills.dataset.ruler2.prepare_niah",
+ "extra_args": [
+ "--num_needle_k", "1", "--num_needle_v", "1", "--num_needle_q", "1",
+ "--type_haystack", "needle", "--type_needle_k", "words",
+ "--type_needle_v", "numbers", "--num_digits_v", "10",
+ ],
+ },
+ "mk_niah_easy": {
+ "module": "nemo_skills.dataset.ruler2.prepare_mmlu",
+ "extra_args": [
+ "--dataset", "mmlu", "--fewshot", "0", "--prompt_type", "instruct",
+ "--num_order", "0", "--task_type", "retrieve", "--algo_type", "single",
+ ],
+ },
+ # ... one entry per task ...
+}
+
+
+def _run_prepare(
+ task: str,
+ output_folder: str,
+ tokenizer_type: str,
+ tokenizer_path: str,
+ length: int,
+ dataset_size: int,
+) -> None:
+ cfg = TASK_CONFIGS[task]
+ subprocess.run(
+ [
+ "python", "-m", cfg["module"],
+ "--output_folder", output_folder,
+ "--tokenizer_type", tokenizer_type,
+ "--tokenizer_path", tokenizer_path,
+ "--max_seq_length", str(length),
+ "--num_samples", str(dataset_size),
+ "--random_seed", "42",
+ *cfg["extra_args"],
+ ],
+ check=True,
+    )

This eliminates ~350 lines of boilerplate and makes adding a new task a one-liner dict entry instead of a whole new function.
🧰 Tools
🪛 Ruff (0.15.0)
[error] 26-26: subprocess call: check for execution of untrusted input
(S603)
[error] 27-57: Starting a process with a partial executable path
(S607)
[error] 63-63: subprocess call: check for execution of untrusted input
(S603)
[error] 64-92: Starting a process with a partial executable path
(S607)
[error] 98-98: subprocess call: check for execution of untrusted input
(S603)
[error] 99-127: Starting a process with a partial executable path
(S607)
[error] 133-133: subprocess call: check for execution of untrusted input
(S603)
[error] 134-162: Starting a process with a partial executable path
(S607)
[error] 168-168: subprocess call: check for execution of untrusted input
(S603)
[error] 169-199: Starting a process with a partial executable path
(S607)
[error] 205-205: subprocess call: check for execution of untrusted input
(S603)
[error] 206-234: Starting a process with a partial executable path
(S607)
[error] 240-240: subprocess call: check for execution of untrusted input
(S603)
[error] 241-269: Starting a process with a partial executable path
(S607)
[error] 275-275: subprocess call: check for execution of untrusted input
(S603)
[error] 276-304: Starting a process with a partial executable path
(S607)
[error] 310-310: subprocess call: check for execution of untrusted input
(S603)
[error] 311-337: Starting a process with a partial executable path
(S607)
[error] 343-343: subprocess call: check for execution of untrusted input
(S603)
[error] 344-370: Starting a process with a partial executable path
(S607)
[error] 376-376: subprocess call: check for execution of untrusted input
(S603)
[error] 377-403: Starting a process with a partial executable path
(S607)
[error] 409-409: subprocess call: check for execution of untrusted input
(S603)
[error] 410-436: Starting a process with a partial executable path
(S607)
| if __name__ == "__main__": | ||
| parser = build_prepare_parser(description="Prepare RULER2 dataset data.") | ||
| args, unknown = parse_known_args(parser) | ||
| _ = unknown | ||
| prepare_dataset( | ||
| args.tasks, | ||
| args.setup, | ||
| args.max_seq_length, | ||
| args.tokenizer_type, | ||
| args.tokenizer_path, | ||
| args.dataset_size, | ||
| ) | ||
| print("RULER2 data preparation completed.") |
Unknown CLI arguments are silently discarded.
Line 482 assigns unknown args to _, meaning any typo or unsupported flag passed by the user is silently ignored. This violates the coding guideline about not silently ignoring unused user-passed parameters. Use parser.parse_args() instead to reject unrecognized arguments, or at minimum raise an error if unknown is non-empty.
Proposed fix
- args, unknown = parse_known_args(parser)
- _ = unknown
+ args = parser.parse_args()

If parse_known_args is needed for other callers, add a guard here:
args, unknown = parse_known_args(parser)
- _ = unknown
+ if unknown:
+ parser.error(f"Unrecognized arguments: {unknown}")

As per coding guidelines: "Avoid silently ignoring unused user-passed parameters; the code should fail if a required argument is not specified or if unsupported arguments are provided."
| if is_class: | ||
| evaluator = obj(eval_config) | ||
| print(f"evaluator: {evaluator}") | ||
| return asyncio.run(evaluator.eval_full()) |
Remove debug print statement.
Line 176 (print(f"evaluator: {evaluator}")) appears to be a leftover debug statement. It will produce noisy output on every class-based evaluation run. Remove it or replace with LOG.debug(...) if the information is useful.
Proposed fix
if is_class:
evaluator = obj(eval_config)
- print(f"evaluator: {evaluator}")
    return asyncio.run(evaluator.eval_full())
| f" --metric_type={metric_type or benchmark_args.metrics_type} " | ||
| ) |
--metric_type=None will be passed as a literal string if both values are None.
If metric_type is None and benchmark_args.metrics_type is also None (e.g., a benchmark __init__.py that doesn't define METRICS_TYPE), the f-string produces --metric_type=None — the string "None" — which will likely cause a downstream error in summarize_results.
Consider guarding this:
Proposed fix
- f" --metric_type={metric_type or benchmark_args.metrics_type} "
+ f" --metric_type={metric_type or benchmark_args.metrics_type or ''} "Or better, conditionally append the flag only when a value exists:
effective_metric_type = metric_type or benchmark_args.metrics_type
...
if effective_metric_type:
    command += f" --metric_type={effective_metric_type} "
| dataset_dir = nemo_skills_dir / "dataset" | ||
| for f in dataset_dir.rglob("*.jsonl"): | ||
| include_patterns.append(str(f)) | ||
| include_pattern_relative_paths.append(str(nemo_skills_dir.parent)) |
🧩 Analysis chain
🏁 Script executed:
# Count how many .jsonl files exist under dataset/
find . -name "*.jsonl" -type f | head -20
echo "---"
find . -name "*.jsonl" -type f | wc -l

Repository: NVIDIA-NeMo/Skills
Length of output: 314
🏁 Script executed:
# Examine the packager.py file to understand the context
wc -l nemo_skills/pipeline/utils/packager.py
head -50 nemo_skills/pipeline/utils/packager.py

Repository: NVIDIA-NeMo/Skills
Length of output: 1564
🏁 Script executed:
# Look at the specific lines mentioned (191-194) and surrounding context
sed -n '170,220p' nemo_skills/pipeline/utils/packager.py

Repository: NVIDIA-NeMo/Skills
Length of output: 2295
🏁 Script executed:
# Search for how include_patterns and include_pattern_relative_paths are used
rg "include_patterns|include_pattern_relative_paths" nemo_skills/pipeline/utils/packager.py -B 2 -A 2Repository: NVIDIA-NeMo/Skills
Length of output: 1487
Extract the relative path append from the loop to avoid duplicate entries.
The current code appends str(nemo_skills_dir.parent) once per JSONL file, resulting in duplicate identical entries in include_pattern_relative_paths. The relative path should be appended once outside the loop since it's the same for all files. Consider moving include_pattern_relative_paths.append(str(nemo_skills_dir.parent)) outside the for loop.
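A de-duplicated sketch of that block, written as a hypothetical helper (the real code works on module-level lists rather than a function):

```python
from pathlib import Path


def collect_dataset_patterns(dataset_dir: Path, nemo_skills_dir: Path) -> tuple[list[str], list[str]]:
    include_patterns: list[str] = []
    include_pattern_relative_paths: list[str] = []
    jsonl_files = list(dataset_dir.rglob("*.jsonl"))
    for f in jsonl_files:
        include_patterns.append(str(f))
    if jsonl_files:
        # The relative path is identical for every file, so append it exactly once.
        include_pattern_relative_paths.append(str(nemo_skills_dir.parent))
    return include_patterns, include_pattern_relative_paths
```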
Will update in a bit
Will also add tests in a bit, but core logic should be good