Conversation

@Jorjeous Jorjeous commented Feb 12, 2026

Adds two new generation modules for running NeMo Skills benchmarks via VLMEvalKit's MultiModalMCore inference engine:

  • eval_kit: uses VLMEvalKit's native dataset/evaluation pipeline with NeMo Skills output formatting
  • mcore_skills: reads NeMo Skills JSONL data, runs inference through MultiModalMCore in-process with data parallelism, computes metrics using VLMEvalKit functions (e.g. asr_wer), and writes results in NeMo Skills format
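
For context, here is a minimal sketch of the record shapes implied above, assuming the field names that appear in this PR's review snippets (question/expected_answer/generation for the NeMo Skills JSONL entries, gt/pred pairs for VLMEvalKit's asr_wer). This is illustrative code, not part of the PR:

import json

def read_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def to_asr_pairs(entries: list[dict]) -> list[dict]:
    # asr_wer consumes a list of {"gt": reference, "pred": hypothesis} dicts,
    # mirroring how the review snippets later in this thread build their results list.
    return [
        {"gt": e.get("expected_answer", ""), "pred": e.get("generation", "")}
        for e in entries
    ]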

Key changes:

  • New inference modules: eval_kit.py, mcore_skills.py
  • New dataset module: eval_kit/ for VLMEvalKit benchmark resolution
  • New metrics class: EvalKitMetrics for pre-computed VLMEvalKit metrics
  • Pipeline extensions: self-contained task support, METRICS_TYPE_OVERRIDE, eval_kit container/mounts handling
  • Lazy imports for sacrebleu to avoid missing-dep crashes
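
The lazy sacrebleu import in the last bullet amounts to moving the import from module scope into the function that needs it. A minimal sketch of the pattern, with a hypothetical function name:

def compute_corpus_bleu(hypotheses: list[str], references: list[list[str]]) -> float:
    # Importing here keeps the metrics module importable when sacrebleu is not installed;
    # only actual BLEU computation requires the dependency.
    from sacrebleu import corpus_bleu

    return corpus_bleu(hypotheses, references).score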

Summary by CodeRabbit

Release Notes

  • New Features

    • Added VLMEvalKit benchmark evaluation support with automatic metrics computation.
    • Introduced in-process generation modes (mcore_skills, eval_kit) as alternatives to server-based workflows.
    • Enhanced audio evaluation with expanded task type coverage including ASR variants and translation types.
  • Improvements

    • Improved generation command flexibility with optional input file handling.
    • Enhanced multi-GPU task orchestration and per-benchmark GPU allocation.
    • Made ComputeEvalEvaluator optional for graceful dependency handling.

Signed-off-by: George Zelenfroind <[email protected]>
@Jorjeous Jorjeous changed the title from "Adds two new generation modules for running NeMo Skills benchmarks via" to "Eval kit support" on Feb 12, 2026
@Jorjeous Jorjeous marked this pull request as draft February 12, 2026 18:01

@greptile-apps greptile-apps bot left a comment

17 files reviewed, 4 comments


# avoid 8x duplicate output in logs.
_real_stdout = sys.stdout
if dp_rank != 0:
    sys.stdout = open(os.devnull, "w")

The opened file is never closed if an exception occurs during generation.

Suggested change
-    sys.stdout = open(os.devnull, "w")
+    sys.stdout = open(os.devnull, "w", encoding="utf-8")

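Independent of the encoding tweak suggested above, the unclosed handle could be addressed by keeping a reference and restoring/closing it in a finally block. A sketch only, with dp_rank and the generation callback standing in for the surrounding code:

import os
import sys

def run_with_rank_silence(dp_rank: int, run_generation) -> None:
    # Suppress stdout on non-primary data-parallel ranks, restoring and closing it afterwards.
    real_stdout = sys.stdout
    devnull = None
    if dp_rank != 0:
        devnull = open(os.devnull, "w", encoding="utf-8")
        sys.stdout = devnull
    try:
        run_generation()
    finally:
        sys.stdout = real_stdout
        if devnull is not None:
            devnull.close()
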
Comment on lines +476 to +486
if os.path.exists(result_file):
    df = vlm_load(result_file)
    with open(self.cfg.output_file, 'w', encoding='utf-8') as f:
        for _, row in df.iterrows():
            entry = {
                "generation": str(row["prediction"]) if "prediction" in row.index else "",
                "expected_answer": str(row["answer"]) if "answer" in row.index else "",
                "question": str(row["question"]) if "question" in row.index else "",
            }
            f.write(json.dumps(entry) + "\n")
    LOG.info("Wrote final ordered JSONL to %s (%d entries)", self.cfg.output_file, len(df))

Violates the data-loss protection guideline from CONTRIBUTING.md: the DataFrame is loaded and the output file is immediately opened for writing.

If vlm_load() completes but the code crashes before the write finishes, the async writer's partial data is lost.

Load all data into memory first, then write atomically:

if os.path.exists(result_file):
    df = vlm_load(result_file)
    entries = []
    for _, row in df.iterrows():
        entries.append({
            "generation": str(row["prediction"]) if "prediction" in row.index else "",
            "expected_answer": str(row["answer"]) if "answer" in row.index else "",
            "question": str(row["question"]) if "question" in row.index else "",
        })
    with open(self.cfg.output_file, 'w', encoding='utf-8') as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
    LOG.info("Wrote final ordered JSONL to %s (%d entries)", self.cfg.output_file, len(df))

Comment on lines +500 to +503
# Re-write output.jsonl with cleaned generations
with open(output_file, "w", encoding="utf-8") as fout:
    for entry in entries:
        fout.write(json.dumps(entry) + "\n")

Same data-loss issue: entries are computed and the file is immediately opened for writing.

Compute all entries first, then write atomically (see the sketch below for one way to read "atomically").

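"Write atomically" can be read as writing to a temporary file and renaming it over the target, so a crash never leaves a truncated output file. A generic sketch of that pattern, not code from this PR:

import json
import os

def write_jsonl_atomic(entries: list[dict], output_file: str) -> None:
    tmp_path = output_file + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fout:
        for entry in entries:
            fout.write(json.dumps(entry) + "\n")
    # os.replace is atomic on the same filesystem, so readers see either the old or the new file.
    os.replace(tmp_path, output_file)
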
coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

This pull request introduces comprehensive VLMEvalKit integration into NeMo Skills, adding two new in-process generation modes (Megatron MCore and vLLM-based), metrics aggregation from pre-computed evaluation results, and pipeline orchestration for self-contained tasks with GPU allocation per benchmark.

Changes

  • VLMEvalKit Integration Core: nemo_skills/dataset/eval_kit/__init__.py, nemo_skills/inference/eval/eval_kit.py
    Adds the eval_kit dataset module with generation configuration constants and a helper function for dataset name extraction. Introduces the EvalKitConfig dataclass and EvalKitGenerationTask, supporting mcore (in-process MultiModalMCore) and vllm (server-based) modes with async JSONL result writing, multi-process synchronization, dataset building, and an evaluation pipeline with optional judge model support.
  • Megatron MCore In-Process Generation: nemo_skills/inference/mcore_skills.py
    Implements MegatronMCoreConfig and MegatronMCoreGenerationTask for end-to-end in-process generation from JSONL inputs, with optional prompt filling, multi-GPU parallelization via DP ranks, OpenAI-format message construction, per-sample generation with thinking-tag stripping, and optional in-process ASR metrics computation.
  • Metrics Infrastructure: nemo_skills/evaluation/metrics/eval_kit_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py
    Introduces the EvalKitMetrics class to read pre-computed VLMEvalKit aggregates from eval_kit_metrics.json, flatten nested values, and expose metrics in NeMo Skills format. Registers EvalKitMetrics in METRICS_MAP.
  • Dataset and Utility Updates: nemo_skills/dataset/utils.py, nemo_skills/evaluation/metrics/compute_metrics.py, nemo_skills/evaluation/metrics/translation_metrics.py
    Updates get_default_dataset_module and get_dataset_module to return a (module, data_path, is_on_cluster) tuple with early-path handling for eval_kit datasets. Adjusts compute_metrics to unpack that result flexibly. Moves the corpus_bleu import to local scope in translation_metrics.
  • Evaluator Changes: nemo_skills/evaluation/evaluator/__init__.py, nemo_skills/evaluation/evaluator/audio.py
    Makes the ComputeEvalEvaluator import optional. Expands audio.py task-type handling with internal _ASR_TYPES and _TRANSLATION_TYPES groupings, broadens ASR/translation coverage including ST- variants, adds a MathQA exact string match branch, and refines generation-presence gating.
  • Generation Framework Extensions: nemo_skills/inference/generate.py, nemo_skills/inference/factory.py
    Adds declarative task attributes (CONTAINER_KEY, USE_TORCHRUN) and class methods (is_self_contained, get_env_prefix, get_extra_package_dirs) to GenerationTask for self-managed models. Adds the mcore_skills enum member to GenerationType and GENERATION_MODULE_MAP.
  • Pipeline Orchestration: nemo_skills/pipeline/eval.py, nemo_skills/pipeline/utils/eval.py, nemo_skills/pipeline/utils/generation.py
    Introduces _apply_task_overrides for environment/torchrun/container configuration. Extends BenchmarkArgs with metric_type, self_contained_task, and num_gpus fields plus task-class resolution. Adds self-contained task detection to bypass the server model and allocate per-benchmark GPUs. Updates command generation to conditionally include input_file and to support a uv fallback for venv creation.
  • Training Enhancements: nemo_skills/training/nemo_rl/start_grpo.py
    Implements dynamic Python executable selection from config and conditional unpacking of setup() results, supporting both 10-element (main) and 11-element (Nano/Super) configurations.
  • Configuration: requirements/eval-kit.txt
    Adds explanatory notes that VLMEvalKit is installed at job start via installation_command.

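The Generation Framework Extensions entry above is easiest to picture as a task subclass declaring the new hooks. A minimal sketch, assuming placeholder values (the container key, env prefix, and return values below are illustrative, not the PR's actual ones):

from nemo_skills.inference.generate import GenerationTask  # base class extended in this PR


class ExampleSelfContainedTask(GenerationTask):
    # Looked up in cluster_config["containers"]; "vlm-eval" is a placeholder key here.
    CONTAINER_KEY = "vlm-eval"
    # When True, the pipeline rewrites `python -m ...` into `torchrun --nproc_per_node N -m ...`.
    USE_TORCHRUN = True

    @classmethod
    def is_self_contained(cls, extra_args: str) -> bool:
        # Self-contained tasks skip the shared server and get per-benchmark GPU allocation.
        return True

    @classmethod
    def get_env_prefix(cls) -> str:
        # Placeholder environment prefix prepended to the launch command.
        return "EXAMPLE_ENV_VAR=1 "

    @classmethod
    def get_extra_package_dirs(cls) -> list[str]:
        # Placeholder extra package directories made available to the job.
        return []

The pipeline's _apply_task_overrides helper (shown later in this review) consumes exactly these attributes when assembling the launch command.
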
Sequence Diagram(s)

sequenceDiagram
    participant User as Evaluation<br/>Pipeline
    participant Benchmark as Benchmark<br/>Module
    participant TaskFactory as Task<br/>Factory
    participant GenTask as EvalKit/<br/>MCore Task
    participant Dataset as VLMEvalKit<br/>Dataset
    participant Model as Generation<br/>Model
    participant Metrics as Metrics<br/>Collector
    participant Results as Result<br/>Writer

    User->>Benchmark: Load benchmark config
    Benchmark-->>User: GENERATION_MODULE, METRICS_TYPE, GENERATION_ARGS
    
    User->>TaskFactory: is_self_contained(extra_args)
    alt Self-Contained (eval_kit/mcore)
        TaskFactory-->>User: True - allocate GPUs per benchmark
        User->>GenTask: __init__(config)
        GenTask->>Dataset: Build with dataset kwargs
        Dataset-->>GenTask: Dataset prepared
        GenTask->>Model: Create (mcore/vllm)
        Model-->>GenTask: Model ready
    else Server-Based
        TaskFactory-->>User: False - use shared vLLM server
    end
    
    User->>GenTask: generate()
    GenTask->>GenTask: _start_async_writer()
    GenTask->>Dataset: Iterate samples
    loop Per Sample
        Dataset->>GenTask: Next sample
        GenTask->>Model: Inference (text/image/video)
        Model-->>GenTask: Generation result
        GenTask->>Results: Write incremental JSONL
    end
    GenTask->>GenTask: _stop_async_writer()
    
    GenTask->>Metrics: Evaluate results
    Metrics-->>GenTask: eval_kit_metrics.json
    GenTask->>Results: Write final results + done file
    Results-->>User: Evaluation complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • NVIDIA-NeMo/Skills#1103 — Modifies nemo_skills/evaluation/evaluator/audio.py task-type handling in the same file with overlapping logic for ASR and translation type checking.
  • NVIDIA-NeMo/Skills#1220 — Shares identical changes in nemo_skills/training/nemo_rl/start_grpo.py for dynamic Python executable selection and conditional setup() result unpacking (10 vs 11 items).
  • NVIDIA-NeMo/Skills#1158 — Adds benchmark integrations with overlapping changes to evaluator registration points and METRICS_MAP entries.

Suggested labels

enhancement

Suggested reviewers

  • gwarmstrong
  • melllinia
  • Kipok
🚥 Pre-merge checks (✅ 1 passed, ❌ 2 failed)

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 59.68%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title 'Eval kit support' is vague and generic; it does not convey the specific nature of the substantial changes (two new generation modules, dataset integration, metrics class, pipeline extensions). Resolution: consider a more descriptive title that highlights the main contribution, such as 'Add VLMEvalKit integration with mcore and eval_kit generation modules' or 'Integrate VLMEvalKit MultiModalMCore for NeMo Skills evaluation'.

✅ Passed checks (1 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)

131-131: ⚠️ Potential issue | 🟡 Minor

Remove debug print statement.

This looks like a leftover debug artifact. Either remove it or replace with proper logging.

-        print(f"evaluator: {evaluator}")
🤖 Fix all issues with AI agents
In `@nemo_skills/evaluation/evaluator/__init__.py`:
- Around line 30-33: The current top-level try/except hides import errors for
ComputeEvalEvaluator; instead capture the ImportError into a module-level
variable (e.g. _compute_eval_import_error) and set ComputeEvalEvaluator = None,
then in get_evaluator_class (or the evaluator registration lookup) check if
eval_type == "compute-eval" and if ComputeEvalEvaluator is None raise a clear
ImportError that includes _compute_eval_import_error; this defers the failure to
the point of use and gives an actionable message when someone requests the
"compute-eval" evaluator.

In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 264-275: In _flush_pkl_to_jsonl, replace the broad "except
Exception" around pickle.load with a narrower handler for the transient errors
that indicate a mid-write pickle (e.g., (EOFError, pickle.UnpicklingError,
BlockingIOError)) so those are safely skipped; for other unexpected errors
(e.g., PermissionError, MemoryError) let them propagate or log them explicitly
before re-raising so they aren't silently swallowed—update the exception clause
in _flush_pkl_to_jsonl accordingly and include a clear processLogger.error(...)
call when re-raising non-transient exceptions.

In `@nemo_skills/inference/mcore_skills.py`:
- Around line 467-522: _evaluate_results currently always imports
vlmeval.dataset.avlm.utils.asr_wer and writes eval_kit_metrics.json with only
WER which is wrong for non-ASR datasets; update the method to consult a config
flag (e.g., self.cfg.metrics_type or self.cfg.dataset_type) or a new
self.cfg.eval_function setting before importing/using asr_wer so you only
compute WER when the dataset is ASR-type, otherwise skip creating
eval_kit_metrics.json or call a configurable evaluator; reference the
METRICS_TYPE_OVERRIDE constant and the EvalKitMetrics consumer to ensure the
written file matches the selected evaluator, and add a clear guard in
_evaluate_results (or load evaluator via a registry/lookup) to avoid the broad
Exception path producing meaningless metrics.

In `@nemo_skills/pipeline/eval.py`:
- Around line 40-67: The code mixes direct access and .get() for
cluster_config["containers"] in _apply_task_overrides which is inconsistent;
update the membership check to use direct access so failures are explicit.
Replace cluster_config.get("containers", {}) with cluster_config["containers"]
in the for-loop that selects container (i.e., change if key and key in
cluster_config.get("containers", {}): to if key and key in
cluster_config["containers"]:) while keeping the initial container =
cluster_config["containers"]["nemo-skills"] and the references to CONTAINER_KEY
and task_classes unchanged.

In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 32-42: The _resolve_generation_task_class function currently
swallows all exceptions; change its error handling to only catch ImportError and
ModuleNotFoundError (so syntax/runtime errors in the module propagate) and, when
catching these import-related errors, log a warning including the exception
details and the module_name; keep the rest of the logic (import_from_path vs
importlib.import_module and returning getattr(..., "GENERATION_TASK_CLASS",
None)) unchanged and use the module logger (e.g., logging.getLogger(__name__))
or an existing logger in the file for the warning message.
🧹 Nitpick comments (13)
nemo_skills/evaluation/evaluator/audio.py (2)

499-516: Mutable _TRANSLATION_TYPES set is fragile if ever hoisted to module scope.

_TRANSLATION_TYPES is mutated on line 504 (_TRANSLATION_TYPES.add(task_type)). This works correctly today because it's re-created on every call, but the _ALL_CAPS naming convention strongly suggests a module-level constant. If a future refactor moves it to module scope (like _FAILURE_RESPONSES or VALID_NORMALIZATION_MODES above), the set would accumulate task types across calls — a subtle, hard-to-detect bug.

Consider making the intent clearer by either:

  1. Keeping the sets local but using lowercase naming (asr_types, translation_types), or
  2. Making the module-level sets truly immutable (frozensets) and building the union inline.
Option 2: immutable module-level sets + inline union

Define at module level:

_ASR_TYPES = frozenset({"ASR", "ASR-ZH", "ASR-PC", "ASR_LEADERBOARD"})
_TRANSLATION_TYPES = frozenset({"AST", "Translation"})

Then inside evaluate_sample:

-    _ASR_TYPES = {"ASR", "ASR-ZH", "ASR-PC", "ASR_LEADERBOARD"}
-    _TRANSLATION_TYPES = {"AST", "Translation"}
-    # AudioBench speech translation types: ST-{src}-{tgt}
-    if task_type.startswith("ST-"):
-        _TRANSLATION_TYPES.add(task_type)
-
-    if task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not generation:
+    is_translation = task_type in _TRANSLATION_TYPES or task_type.startswith("ST-")
+    is_asr = task_type in _ASR_TYPES
+
+    if (is_asr or is_translation or task_type == "CER") and not generation:

And similarly replace task_type in _TRANSLATION_TYPES checks with is_translation.


557-562: MathQA exact-match may be too strict for numerical answers.

Pure string equality after lowercasing won't match equivalent representations like "3.0" vs "3", "1/2" vs "0.5", or whitespace/formatting differences in expressions. If the AudioBench MathQA dataset guarantees canonical answer forms, this is fine — but worth a brief comment in the code to document that assumption.
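
If canonical answer forms are not guaranteed, one option is a numeric-tolerant comparison that falls back to exact match. A sketch only, not what the PR implements:

from fractions import Fraction

def mathqa_match(pred: str, gold: str) -> bool:
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if pred == gold:
        return True
    try:
        # Treats "3" and "3.0", or "1/2" and "0.5", as equal; non-numeric answers
        # have already failed the exact-match check above.
        return Fraction(pred) == Fraction(gold)
    except (ValueError, ZeroDivisionError):
        return False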

nemo_skills/inference/generate.py (1)

296-306: Consider extracting the shared get_env_prefix / get_extra_package_dirs logic into a mixin or the base class.

Both mcore_skills.py (lines 143–164) and eval_kit.py (lines 128–149) have identical implementations of get_env_prefix() and get_extra_package_dirs(). Since the base class already defines these hooks, the shared VLMEvalKit environment setup could live in a common mixin (e.g., VLMEvalKitMixin) or as a utility that both subclasses delegate to, avoiding the copy-paste.
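
One possible shape for such a mixin, with placeholder return values since the duplicated implementations are not shown in this diff:

class VLMEvalKitMixin:
    """Shared VLMEvalKit environment hooks for the two in-process generation tasks (sketch)."""

    @classmethod
    def get_env_prefix(cls) -> str:
        # Placeholder: whatever environment prefix the duplicated methods currently build.
        return ""

    @classmethod
    def get_extra_package_dirs(cls) -> list[str]:
        # Placeholder: whatever extra package directories the duplicated methods currently return.
        return []

# Both task classes would then inherit the shared hooks, e.g.:
# class EvalKitGenerationTask(VLMEvalKitMixin, GenerationTask): ...
# class MegatronMCoreGenerationTask(VLMEvalKitMixin, GenerationTask): ...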

nemo_skills/dataset/eval_kit/__init__.py (1)

33-42: Add type hints to the function signature.

As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code".

Suggested fix
-def get_extra_generation_args(benchmark):
+def get_extra_generation_args(benchmark: str) -> str:
nemo_skills/evaluation/metrics/eval_kit_metrics.py (3)

42-44: **kwargs is silently swallowed — unsupported arguments won't raise errors.

get_metrics() in map_metrics.py forwards **kwargs from the user's metrics_kwargs. Silently discarding them here means typos or invalid metric options will be ignored rather than failing. Consider either validating that no unexpected kwargs are passed or removing **kwargs entirely if no extra arguments are needed. As per coding guidelines, "Avoid silently ignoring unused user-passed parameters".

Suggested fix
-    def __init__(self, **kwargs):
+    def __init__(self, **kwargs):
+        if kwargs:
+            LOG.warning("EvalKitMetrics ignores extra kwargs: %s", list(kwargs.keys()))
         super().__init__(compute_no_answer=False)

Or, if no extra kwargs should ever be passed:

-    def __init__(self, **kwargs):
+    def __init__(self):
         super().__init__(compute_no_answer=False)

56-58: update() skips super().update(predictions) — intentional but worth documenting inline.

BaseMetrics.update() handles token counting, timing stats, and max_k tracking. Skipping it means those metrics will be absent. This is correct for pre-computed VLMEvalKit aggregates, but a brief inline comment would help future maintainers understand the deliberate deviation.


38-40: Class-level mutable state _shared_metrics_file persists across test runs.

_shared_metrics_file is a class variable that survives across multiple instantiations and test cases. If tests create EvalKitMetrics instances for different benchmarks, a stale path from a prior test could leak. Consider resetting it in __init__ or setup() when an instance-level path is found, or documenting the expected lifecycle.

nemo_skills/pipeline/utils/eval.py (2)

365-379: Duplicate module import logic — _resolve_generation_task_class vs lines 461-474.

_resolve_generation_task_class (lines 32-42) performs the same import-and-get-GENERATION_TASK_CLASS as lines 461-474 later in prepare_eval_commands. The inline version raises on missing GENERATION_TASK_CLASS, while the helper silently returns None. Consider consolidating into one function with an option to raise or return None.


413-417: Forcing num_jobs = total_evals when self-contained tasks are present may over-parallelize mixed workloads.

If the benchmark list contains both self-contained and server-based benchmarks, this forces every benchmark into its own job. The comment at line 413-414 explains the rationale for self-contained tasks, but it may be an unexpected side effect for the non-self-contained benchmarks in the same eval run.

Consider documenting this behavior in the --num_jobs help text or logging which benchmarks are affected.

nemo_skills/inference/eval/eval_kit.py (2)

162-227: Dataset is built twice on rank 0 when world_size > 1.

Lines 218 and 221 both call build_dataset(cfg.vlm_dataset, **dataset_kwargs). The first is rank-0 only (for download), the second is on all ranks. Rank 0 redundantly builds the dataset a second time. If dataset construction is expensive (beyond just downloading), consider caching the result from the first call.

Suggested optimization
         if world_size > 1:
             import torch.distributed as dist

             if rank == 0:
-                build_dataset(cfg.vlm_dataset, **dataset_kwargs)
+                self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
             dist.barrier()
+            if rank != 0:
+                self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
+        else:
+            self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)

-        self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)

337-361: Hardcoded dataset name lists will silently become stale as VLMEvalKit evolves.

These lists mirror VLMEvalKit/run.py but must be manually kept in sync. If VLMEvalKit adds a new dataset that needs special kwargs (e.g., nframe), it won't get them here. Consider importing or referencing these lists from VLMEvalKit if possible, or adding a comment with the VLMEvalKit version/commit these were copied from for traceability.

nemo_skills/inference/mcore_skills.py (2)

126-165: get_env_prefix() and get_extra_package_dirs() are duplicated verbatim from eval_kit.py.

Both EvalKitGenerationTask (in eval_kit.py, lines 128-150) and MegatronMCoreGenerationTask share identical implementations for get_env_prefix() and get_extra_package_dirs(). Extract these into a shared mixin or utility to avoid drift.


406-428: Redirecting sys.stdout to /dev/null is fragile — prefer suppressing at the logging/tqdm level.

Replacing sys.stdout globally on non-primary DP ranks suppresses all stdout, including potential error messages or debug info that isn't from VLMEvalKit. This pattern can also confuse debuggers and profilers. Consider using contextlib.redirect_stdout for a cleaner scope, or configuring VLMEvalKit's verbosity directly if possible.

That said, the finally cleanup at lines 425-428 is correct and ensures sys.stdout is restored even on exceptions.
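
A scoped variant along the lines suggested, using contextlib.redirect_stdout; the generation callback is a placeholder, not the PR's code:

import contextlib
import os

def generate_on_rank(dp_rank: int, run_generation) -> None:
    if dp_rank != 0:
        # Suppression is limited to this block and the devnull handle is closed automatically.
        with open(os.devnull, "w", encoding="utf-8") as devnull:
            with contextlib.redirect_stdout(devnull):
                run_generation()
    else:
        run_generation()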

Comment on lines +30 to +33
try:
    from nemo_skills.evaluation.evaluator.compute_eval import ComputeEvalEvaluator
except ImportError:
    ComputeEvalEvaluator = None

⚠️ Potential issue | 🟡 Minor

Silent import suppression will produce a confusing error when compute-eval is actually requested.

If the dependency is missing and a user requests the "compute-eval" evaluator, they'll see "Evaluator class not found for type: compute-eval" (from get_evaluator_class) instead of a clear message about the missing import. Consider deferring the error to the point of use so it's actionable.

Suggested approach
-try:
-    from nemo_skills.evaluation.evaluator.compute_eval import ComputeEvalEvaluator
-except ImportError:
-    ComputeEvalEvaluator = None
+_compute_eval_import_error = None
+try:
+    from nemo_skills.evaluation.evaluator.compute_eval import ComputeEvalEvaluator
+except ImportError as e:
+    ComputeEvalEvaluator = None
+    _compute_eval_import_error = e

Then in get_evaluator_class (or at registration lookup time), surface _compute_eval_import_error when eval_type == "compute-eval" and the class is None.
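
A rough sketch of that point-of-use check, assuming the get_evaluator_class signature and relying on the module-level names introduced in the suggestion above:

def get_evaluator_class(eval_type: str):
    if eval_type == "compute-eval" and ComputeEvalEvaluator is None:
        raise ImportError(
            "The 'compute-eval' evaluator was requested, but its optional dependencies "
            "could not be imported."
        ) from _compute_eval_import_error
    # ... existing lookup logic for the remaining evaluator types continues here ...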

As per coding guidelines, "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving."

🤖 Prompt for AI Agents
In `@nemo_skills/evaluation/evaluator/__init__.py` around lines 30 - 33, The
current top-level try/except hides import errors for ComputeEvalEvaluator;
instead capture the ImportError into a module-level variable (e.g.
_compute_eval_import_error) and set ComputeEvalEvaluator = None, then in
get_evaluator_class (or the evaluator registration lookup) check if eval_type ==
"compute-eval" and if ComputeEvalEvaluator is None raise a clear ImportError
that includes _compute_eval_import_error; this defers the failure to the point
of use and gives an actionable message when someone requests the "compute-eval"
evaluator.

Comment on lines +264 to +275
def _flush_pkl_to_jsonl(self, pkl_path, index_meta, output_path):
    """Read the pkl, find new entries, append them to the JSONL file."""
    if not os.path.exists(pkl_path):
        return
    try:
        with open(pkl_path, "rb") as f:
            data = pickle.load(f)
    except Exception:
        # pkl may be mid-write; skip this cycle
        return
    if not isinstance(data, dict):
        return

⚠️ Potential issue | 🟡 Minor

Broad except Exception when loading pickle silently hides non-transient errors.

The comment says "pkl may be mid-write; skip this cycle", but this also swallows PermissionError, MemoryError, EOFError, etc. Consider narrowing to the specific exceptions that a mid-write pickle can raise.

Suggested fix
-        except Exception:
+        except (OSError, EOFError, pickle.UnpicklingError):
             # pkl may be mid-write; skip this cycle
             return
🧰 Tools
🪛 Ruff (0.15.0)

[error] 270-270: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)


[warning] 271-271: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@nemo_skills/inference/eval/eval_kit.py` around lines 264 - 275, In
_flush_pkl_to_jsonl, replace the broad "except Exception" around pickle.load
with a narrower handler for the transient errors that indicate a mid-write
pickle (e.g., (EOFError, pickle.UnpicklingError, BlockingIOError)) so those are
safely skipped; for other unexpected errors (e.g., PermissionError, MemoryError)
let them propagate or log them explicitly before re-raising so they aren't
silently swallowed—update the exception clause in _flush_pkl_to_jsonl
accordingly and include a clear processLogger.error(...) call when re-raising
non-transient exceptions.

Comment on lines +467 to +522
def _evaluate_results(self):
    """Compute metrics using VLMEvalKit's evaluation functions.

    Uses the same asr_wer() that eval_kit.py calls via dataset.evaluate(),
    so metrics are identical. Saves eval_kit_metrics.json (consumed by
    EvalKitMetrics in the summarize step).
    """
    output_file = self.cfg.output_file
    if not output_file or not Path(output_file).exists():
        return

    output_path = Path(output_file)

    try:
        from vlmeval.dataset.avlm.utils import asr_wer

        # Read entries and build VLMEvalKit-format results list
        entries = []
        results = []
        with open(output_file, "rt", encoding="utf-8") as fin:
            for line in fin:
                entry = json.loads(line)
                # Strip leftover <think> tags (older runs may have them)
                gen = entry.get("generation", "")
                cleaned = self._strip_thinking_tags(gen)
                if cleaned != gen:
                    entry["generation"] = cleaned
                entries.append(entry)
                results.append({
                    "gt": entry.get("expected_answer", ""),
                    "pred": entry["generation"],
                })

        # Re-write output.jsonl with cleaned generations
        with open(output_file, "w", encoding="utf-8") as fout:
            for entry in entries:
                fout.write(json.dumps(entry) + "\n")

        # Compute WER using VLMEvalKit (same function as eval_kit path)
        wer_score = asr_wer(results)
        LOG.info("ASR WER: %.2f%%", wer_score)

        # Save as eval_kit_metrics.json (same format eval_kit.py writes)
        metrics = {"wer": wer_score}
        metrics_file = output_path.parent / "eval_kit_metrics.json"
        with open(metrics_file, "w", encoding="utf-8") as f:
            json.dump(metrics, f, indent=2)
        LOG.info("Metrics saved to %s", metrics_file)

    except ImportError:
        LOG.warning(
            "VLMEvalKit asr_wer not available — skipping eval-kit-style metrics. "
            "The summarize_results job will compute metrics separately."
        )
    except Exception:
        LOG.exception("Inline metrics computation failed")

⚠️ Potential issue | 🟠 Major

_evaluate_results assumes ASR WER metric — won't work for non-ASR benchmarks.

This method hardcodes from vlmeval.dataset.avlm.utils import asr_wer and always computes WER. For non-ASR benchmarks processed through mcore_skills.py, this will either fail (caught by the broad except Exception) or produce meaningless metrics. The METRICS_TYPE_OVERRIDE = "eval_kit" at line 136 means EvalKitMetrics will always read eval_kit_metrics.json, which will contain only WER.

Consider either:

  1. Making the evaluation function configurable (e.g., via config), or
  2. Guarding this with a dataset-type check, or
  3. Documenting that mcore_skills.py is currently ASR-only.
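
For options 1 and 2, a small guard consulted at the top of _evaluate_results could be enough. A sketch with a hypothetical config field (metrics_type is assumed here, not an existing setting in this PR):

def should_compute_inline_wer(cfg) -> bool:
    # Hypothetical guard: only run the inline asr_wer path for ASR-style benchmarks.
    # `metrics_type` is an assumed config field, not something this PR defines.
    return getattr(cfg, "metrics_type", None) == "asr"

_evaluate_results would call this before importing asr_wer and return early, without writing eval_kit_metrics.json, when it is False.
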
🤖 Prompt for AI Agents
In `@nemo_skills/inference/mcore_skills.py` around lines 467 - 522,
_evaluate_results currently always imports vlmeval.dataset.avlm.utils.asr_wer
and writes eval_kit_metrics.json with only WER which is wrong for non-ASR
datasets; update the method to consult a config flag (e.g.,
self.cfg.metrics_type or self.cfg.dataset_type) or a new self.cfg.eval_function
setting before importing/using asr_wer so you only compute WER when the dataset
is ASR-type, otherwise skip creating eval_kit_metrics.json or call a
configurable evaluator; reference the METRICS_TYPE_OVERRIDE constant and the
EvalKitMetrics consumer to ensure the written file matches the selected
evaluator, and add a clear guard in _evaluate_results (or load evaluator via a
registry/lookup) to avoid the broad Exception path producing meaningless
metrics.

Comment on lines +40 to +67
def _apply_task_overrides(combined_cmd, task_classes, job_num_gpus, cluster_config):
    """Apply env/torchrun/container overrides declared by generation task classes.

    Returns (modified_cmd, container).
    """
    # Environment prefix (first non-empty wins; jobs are not mixed across task types)
    for tc in task_classes:
        prefix = tc.get_env_prefix() if hasattr(tc, "get_env_prefix") else ""
        if prefix:
            combined_cmd = f'{prefix}{combined_cmd}'
            break

    # Torchrun for multi-GPU data-parallel inference
    if any(getattr(tc, "USE_TORCHRUN", False) for tc in task_classes):
        if job_num_gpus and int(job_num_gpus) > 1:
            combined_cmd = combined_cmd.replace(
                'python -m ', f'torchrun --nproc_per_node {job_num_gpus} -m ', 1
            )

    # Container selection (task class CONTAINER_KEY, falling back to nemo-skills default)
    container = cluster_config["containers"]["nemo-skills"]
    for tc in task_classes:
        key = getattr(tc, "CONTAINER_KEY", None)
        if key and key in cluster_config.get("containers", {}):
            container = cluster_config["containers"][key]
            break

    return combined_cmd, container

⚠️ Potential issue | 🟡 Minor

Inconsistent use of .get() vs direct access for cluster_config["containers"]

Line 60 accesses cluster_config["containers"] directly (which will raise KeyError if missing), but line 63 uses cluster_config.get("containers", {}). If "containers" is absent, line 60 will crash before line 63 is reached, making the .get() misleading. Use direct access consistently. As per coding guidelines, "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors".

Suggested fix
-        if key and key in cluster_config.get("containers", {}):
+        if key and key in cluster_config["containers"]:
🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/eval.py` around lines 40 - 67, The code mixes direct
access and .get() for cluster_config["containers"] in _apply_task_overrides
which is inconsistent; update the membership check to use direct access so
failures are explicit. Replace cluster_config.get("containers", {}) with
cluster_config["containers"] in the for-loop that selects container (i.e.,
change if key and key in cluster_config.get("containers", {}): to if key and key
in cluster_config["containers"]:) while keeping the initial container =
cluster_config["containers"]["nemo-skills"] and the references to CONTAINER_KEY
and task_classes unchanged.

Comment on lines +32 to +42
def _resolve_generation_task_class(module_name: str):
    """Import a generation module and return its GENERATION_TASK_CLASS, or None."""
    try:
        if module_name.endswith(".py") or os.sep in module_name:
            path_suffix = ".py" if not module_name.endswith(".py") else ""
            mod = import_from_path(module_name + path_suffix)
        else:
            mod = importlib.import_module(module_name)
        return getattr(mod, "GENERATION_TASK_CLASS", None)
    except Exception:
        return None

⚠️ Potential issue | 🟠 Major

Bare except Exception silently swallows import/syntax errors in generation modules.

If the generation module has a bug, misspelled path, or missing dependency, this returns None silently. Downstream code then treats the benchmark as having no task class, potentially skipping self-contained mode, env prefixes, extra package dirs, and metric type overrides — all without any error or warning. This violates the coding guideline: "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving".

At minimum, log a warning so users know resolution failed. Better yet, only catch ImportError/ModuleNotFoundError and let other exceptions propagate.

Suggested fix
 def _resolve_generation_task_class(module_name: str):
     """Import a generation module and return its GENERATION_TASK_CLASS, or None."""
     try:
         if module_name.endswith(".py") or os.sep in module_name:
             path_suffix = ".py" if not module_name.endswith(".py") else ""
             mod = import_from_path(module_name + path_suffix)
         else:
             mod = importlib.import_module(module_name)
         return getattr(mod, "GENERATION_TASK_CLASS", None)
-    except Exception:
+    except (ImportError, ModuleNotFoundError):
+        LOG.debug("Could not resolve GENERATION_TASK_CLASS from %s", module_name)
         return None
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 41-41: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/utils/eval.py` around lines 32 - 42, The
_resolve_generation_task_class function currently swallows all exceptions;
change its error handling to only catch ImportError and ModuleNotFoundError (so
syntax/runtime errors in the module propagate) and, when catching these
import-related errors, log a warning including the exception details and the
module_name; keep the rest of the logic (import_from_path vs
importlib.import_module and returning getattr(..., "GENERATION_TASK_CLASS",
None)) unchanged and use the module logger (e.g., logging.getLogger(__name__))
or an existing logger in the file for the warning message.

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: George <[email protected]>