Conversation

@Jorjeous Jorjeous commented Feb 12, 2026

Adds two new generation modules for running NeMo Skills benchmarks via VLMEvalKit's MultiModalMCore inference engine:

  • eval_kit: uses VLMEvalKit's native dataset/evaluation pipeline with NeMo Skills output formatting
  • mcore_skills: reads NeMo Skills JSONL data, runs inference through MultiModalMCore in-process with data parallelism, computes metrics using VLMEvalKit functions (e.g. asr_wer), and writes results in NeMo Skills format
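
For context, here is a minimal sketch of the record shapes implied above, assuming the field names that appear in this PR's review snippets (question/expected_answer/generation for the NeMo Skills JSONL entries, gt/pred pairs for VLMEvalKit's asr_wer). This is illustrative code, not part of the PR:

import json

def read_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def to_asr_pairs(entries: list[dict]) -> list[dict]:
    # asr_wer consumes a list of {"gt": reference, "pred": hypothesis} dicts,
    # mirroring how the review snippets later in this thread build their results list.
    return [
        {"gt": e.get("expected_answer", ""), "pred": e.get("generation", "")}
        for e in entries
    ]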

Key changes:

  • New inference modules: eval_kit.py, mcore_skills.py
  • New dataset module: eval_kit/ for VLMEvalKit benchmark resolution
  • New metrics class: EvalKitMetrics for pre-computed VLMEvalKit metrics
  • Pipeline extensions: self-contained task support, METRICS_TYPE_OVERRIDE, eval_kit container/mounts handling
  • Lazy imports for sacrebleu to avoid missing-dep crashes
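
The lazy sacrebleu import in the last bullet amounts to moving the import from module scope into the function that needs it. A minimal sketch of the pattern, with a hypothetical function name:

def compute_corpus_bleu(hypotheses: list[str], references: list[list[str]]) -> float:
    # Importing here keeps the metrics module importable when sacrebleu is not installed;
    # only actual BLEU computation requires the dependency.
    from sacrebleu import corpus_bleu

    return corpus_bleu(hypotheses, references).score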

Summary by CodeRabbit

Release Notes

  • New Features

    • Added VLMEvalKit benchmark evaluation support with automatic metrics computation.
    • Introduced in-process generation modes (mcore_skills, eval_kit) as alternatives to server-based workflows.
    • Enhanced audio evaluation with expanded task type coverage including ASR variants and translation types.
  • Improvements

    • Improved generation command flexibility with optional input file handling.
    • Enhanced multi-GPU task orchestration and per-benchmark GPU allocation.
    • Made ComputeEvalEvaluator optional for graceful dependency handling.

Signed-off-by: George Zelenfroind <[email protected]>
@Jorjeous Jorjeous changed the title from "Adds two new generation modules for running NeMo Skills benchmarks via" to "Eval kit support" on Feb 12, 2026
@Jorjeous Jorjeous marked this pull request as draft February 12, 2026 18:01

@greptile-apps greptile-apps bot left a comment

17 files reviewed, 4 comments


# avoid 8x duplicate output in logs.
_real_stdout = sys.stdout
if dp_rank != 0:
    sys.stdout = open(os.devnull, "w")

The opened file is never closed if an exception occurs during generation.

Suggested change
-    sys.stdout = open(os.devnull, "w")
+    sys.stdout = open(os.devnull, "w", encoding="utf-8")

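Independent of the encoding tweak suggested above, the unclosed handle could be addressed by keeping a reference and restoring/closing it in a finally block. A sketch only, with dp_rank and the generation callback standing in for the surrounding code:

import os
import sys

def run_with_rank_silence(dp_rank: int, run_generation) -> None:
    # Suppress stdout on non-primary data-parallel ranks, restoring and closing it afterwards.
    real_stdout = sys.stdout
    devnull = None
    if dp_rank != 0:
        devnull = open(os.devnull, "w", encoding="utf-8")
        sys.stdout = devnull
    try:
        run_generation()
    finally:
        sys.stdout = real_stdout
        if devnull is not None:
            devnull.close()
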
Comment on lines +476 to +486
if os.path.exists(result_file):
    df = vlm_load(result_file)
    with open(self.cfg.output_file, 'w', encoding='utf-8') as f:
        for _, row in df.iterrows():
            entry = {
                "generation": str(row["prediction"]) if "prediction" in row.index else "",
                "expected_answer": str(row["answer"]) if "answer" in row.index else "",
                "question": str(row["question"]) if "question" in row.index else "",
            }
            f.write(json.dumps(entry) + "\n")
    LOG.info("Wrote final ordered JSONL to %s (%d entries)", self.cfg.output_file, len(df))

Violates the data-loss protection guideline from CONTRIBUTING.md: the DataFrame is loaded and the output file is immediately opened for writing.

If vlm_load() completes but the code crashes before the write finishes, the async writer's partial data is lost.

Load all data into memory first, then write atomically:

if os.path.exists(result_file):
    df = vlm_load(result_file)
    entries = []
    for _, row in df.iterrows():
        entries.append({
            "generation": str(row["prediction"]) if "prediction" in row.index else "",
            "expected_answer": str(row["answer"]) if "answer" in row.index else "",
            "question": str(row["question"]) if "question" in row.index else "",
        })
    with open(self.cfg.output_file, 'w', encoding='utf-8') as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
    LOG.info("Wrote final ordered JSONL to %s (%d entries)", self.cfg.output_file, len(df))

Comment on lines +500 to +503
# Re-write output.jsonl with cleaned generations
with open(output_file, "w", encoding="utf-8") as fout:
    for entry in entries:
        fout.write(json.dumps(entry) + "\n")

Same data-loss issue: entries are computed and the file is immediately opened for writing.

Compute all entries first, then write atomically (see the sketch below for one way to read "atomically").

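"Write atomically" can be read as writing to a temporary file and renaming it over the target, so a crash never leaves a truncated output file. A generic sketch of that pattern, not code from this PR:

import json
import os

def write_jsonl_atomic(entries: list[dict], output_file: str) -> None:
    tmp_path = output_file + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fout:
        for entry in entries:
            fout.write(json.dumps(entry) + "\n")
    # os.replace is atomic on the same filesystem, so readers see either the old or the new file.
    os.replace(tmp_path, output_file)
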
coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

This pull request introduces comprehensive VLMEvalKit integration into NeMo Skills, adding two new in-process generation modes (Megatron MCore and vLLM-based), metrics aggregation from pre-computed evaluation results, and pipeline orchestration for self-contained tasks with GPU allocation per benchmark.

Changes

  • VLMEvalKit Integration Core: nemo_skills/dataset/eval_kit/__init__.py, nemo_skills/inference/eval/eval_kit.py
    Adds the eval_kit dataset module with generation configuration constants and a helper function for dataset name extraction. Introduces the EvalKitConfig dataclass and EvalKitGenerationTask, supporting mcore (in-process MultiModalMCore) and vllm (server-based) modes with async JSONL result writing, multi-process synchronization, dataset building, and an evaluation pipeline with optional judge model support.
  • Megatron MCore In-Process Generation: nemo_skills/inference/mcore_skills.py
    Implements MegatronMCoreConfig and MegatronMCoreGenerationTask for end-to-end in-process generation from JSONL inputs, with optional prompt filling, multi-GPU parallelization via DP ranks, OpenAI-format message construction, per-sample generation with thinking-tag stripping, and optional in-process ASR metrics computation.
  • Metrics Infrastructure: nemo_skills/evaluation/metrics/eval_kit_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py
    Introduces the EvalKitMetrics class to read pre-computed VLMEvalKit aggregates from eval_kit_metrics.json, flatten nested values, and expose metrics in NeMo Skills format. Registers EvalKitMetrics in METRICS_MAP.
  • Dataset and Utility Updates: nemo_skills/dataset/utils.py, nemo_skills/evaluation/metrics/compute_metrics.py, nemo_skills/evaluation/metrics/translation_metrics.py
    Updates get_default_dataset_module and get_dataset_module to return a (module, data_path, is_on_cluster) tuple with early-path handling for eval_kit datasets. Adjusts compute_metrics to unpack that result flexibly. Moves the corpus_bleu import to local scope in translation_metrics.
  • Evaluator Changes: nemo_skills/evaluation/evaluator/__init__.py, nemo_skills/evaluation/evaluator/audio.py
    Makes the ComputeEvalEvaluator import optional. Expands audio.py task-type handling with internal _ASR_TYPES and _TRANSLATION_TYPES groupings, broadens ASR/translation coverage including ST- variants, adds a MathQA exact string match branch, and refines generation-presence gating.
  • Generation Framework Extensions: nemo_skills/inference/generate.py, nemo_skills/inference/factory.py
    Adds declarative task attributes (CONTAINER_KEY, USE_TORCHRUN) and class methods (is_self_contained, get_env_prefix, get_extra_package_dirs) to GenerationTask for self-managed models. Adds the mcore_skills enum member to GenerationType and GENERATION_MODULE_MAP.
  • Pipeline Orchestration: nemo_skills/pipeline/eval.py, nemo_skills/pipeline/utils/eval.py, nemo_skills/pipeline/utils/generation.py
    Introduces _apply_task_overrides for environment/torchrun/container configuration. Extends BenchmarkArgs with metric_type, self_contained_task, and num_gpus fields plus task-class resolution. Adds self-contained task detection to bypass the server model and allocate per-benchmark GPUs. Updates command generation to conditionally include input_file and to support a uv fallback for venv creation.
  • Training Enhancements: nemo_skills/training/nemo_rl/start_grpo.py
    Implements dynamic Python executable selection from config and conditional unpacking of setup() results, supporting both 10-element (main) and 11-element (Nano/Super) configurations.
  • Configuration: requirements/eval-kit.txt
    Adds explanatory notes that VLMEvalKit is installed at job start via installation_command.

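The Generation Framework Extensions entry above is easiest to picture as a task subclass declaring the new hooks. A minimal sketch, assuming placeholder values (the container key, env prefix, and return values below are illustrative, not the PR's actual ones):

from nemo_skills.inference.generate import GenerationTask  # base class extended in this PR


class ExampleSelfContainedTask(GenerationTask):
    # Looked up in cluster_config["containers"]; "vlm-eval" is a placeholder key here.
    CONTAINER_KEY = "vlm-eval"
    # When True, the pipeline rewrites `python -m ...` into `torchrun --nproc_per_node N -m ...`.
    USE_TORCHRUN = True

    @classmethod
    def is_self_contained(cls, extra_args: str) -> bool:
        # Self-contained tasks skip the shared server and get per-benchmark GPU allocation.
        return True

    @classmethod
    def get_env_prefix(cls) -> str:
        # Placeholder environment prefix prepended to the launch command.
        return "EXAMPLE_ENV_VAR=1 "

    @classmethod
    def get_extra_package_dirs(cls) -> list[str]:
        # Placeholder extra package directories made available to the job.
        return []

The pipeline's _apply_task_overrides helper (shown later in this review) consumes exactly these attributes when assembling the launch command.
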
Sequence Diagram(s)

sequenceDiagram
    participant User as Evaluation<br/>Pipeline
    participant Benchmark as Benchmark<br/>Module
    participant TaskFactory as Task<br/>Factory
    participant GenTask as EvalKit/<br/>MCore Task
    participant Dataset as VLMEvalKit<br/>Dataset
    participant Model as Generation<br/>Model
    participant Metrics as Metrics<br/>Collector
    participant Results as Result<br/>Writer

    User->>Benchmark: Load benchmark config
    Benchmark-->>User: GENERATION_MODULE, METRICS_TYPE, GENERATION_ARGS
    
    User->>TaskFactory: is_self_contained(extra_args)
    alt Self-Contained (eval_kit/mcore)
        TaskFactory-->>User: True - allocate GPUs per benchmark
        User->>GenTask: __init__(config)
        GenTask->>Dataset: Build with dataset kwargs
        Dataset-->>GenTask: Dataset prepared
        GenTask->>Model: Create (mcore/vllm)
        Model-->>GenTask: Model ready
    else Server-Based
        TaskFactory-->>User: False - use shared vLLM server
    end
    
    User->>GenTask: generate()
    GenTask->>GenTask: _start_async_writer()
    GenTask->>Dataset: Iterate samples
    loop Per Sample
        Dataset->>GenTask: Next sample
        GenTask->>Model: Inference (text/image/video)
        Model-->>GenTask: Generation result
        GenTask->>Results: Write incremental JSONL
    end
    GenTask->>GenTask: _stop_async_writer()
    
    GenTask->>Metrics: Evaluate results
    Metrics-->>GenTask: eval_kit_metrics.json
    GenTask->>Results: Write final results + done file
    Results-->>User: Evaluation complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • NVIDIA-NeMo/Skills#1103 — Modifies nemo_skills/evaluation/evaluator/audio.py task-type handling in the same file with overlapping logic for ASR and translation type checking.
  • NVIDIA-NeMo/Skills#1220 — Shares identical changes in nemo_skills/training/nemo_rl/start_grpo.py for dynamic Python executable selection and conditional setup() result unpacking (10 vs 11 items).
  • NVIDIA-NeMo/Skills#1158 — Adds benchmark integrations with overlapping changes to evaluator registration points and METRICS_MAP entries.

Suggested labels

enhancement

Suggested reviewers

  • gwarmstrong
  • melllinia
  • Kipok
🚥 Pre-merge checks (✅ 1 passed, ❌ 2 failed)

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 59.68%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title 'Eval kit support' is vague and generic; it does not convey the specific nature of the substantial changes (two new generation modules, dataset integration, metrics class, pipeline extensions). Resolution: consider a more descriptive title that highlights the main contribution, such as 'Add VLMEvalKit integration with mcore and eval_kit generation modules' or 'Integrate VLMEvalKit MultiModalMCore for NeMo Skills evaluation'.

✅ Passed checks (1 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)

131-131: ⚠️ Potential issue | 🟡 Minor

Remove debug print statement.

This looks like a leftover debug artifact. Either remove it or replace with proper logging.

-        print(f"evaluator: {evaluator}")
🤖 Fix all issues with AI agents
In `@nemo_skills/evaluation/evaluator/__init__.py`:
- Around line 30-33: The current top-level try/except hides import errors for
ComputeEvalEvaluator; instead capture the ImportError into a module-level
variable (e.g. _compute_eval_import_error) and set ComputeEvalEvaluator = None,
then in get_evaluator_class (or the evaluator registration lookup) check if
eval_type == "compute-eval" and if ComputeEvalEvaluator is None raise a clear
ImportError that includes _compute_eval_import_error; this defers the failure to
the point of use and gives an actionable message when someone requests the
"compute-eval" evaluator.

In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 264-275: In _flush_pkl_to_jsonl, replace the broad "except
Exception" around pickle.load with a narrower handler for the transient errors
that indicate a mid-write pickle (e.g., (EOFError, pickle.UnpicklingError,
BlockingIOError)) so those are safely skipped; for other unexpected errors
(e.g., PermissionError, MemoryError) let them propagate or log them explicitly
before re-raising so they aren't silently swallowed—update the exception clause
in _flush_pkl_to_jsonl accordingly and include a clear processLogger.error(...)
call when re-raising non-transient exceptions.

In `@nemo_skills/inference/mcore_skills.py`:
- Around line 467-522: _evaluate_results currently always imports
vlmeval.dataset.avlm.utils.asr_wer and writes eval_kit_metrics.json with only
WER which is wrong for non-ASR datasets; update the method to consult a config
flag (e.g., self.cfg.metrics_type or self.cfg.dataset_type) or a new
self.cfg.eval_function setting before importing/using asr_wer so you only
compute WER when the dataset is ASR-type, otherwise skip creating
eval_kit_metrics.json or call a configurable evaluator; reference the
METRICS_TYPE_OVERRIDE constant and the EvalKitMetrics consumer to ensure the
written file matches the selected evaluator, and add a clear guard in
_evaluate_results (or load evaluator via a registry/lookup) to avoid the broad
Exception path producing meaningless metrics.

In `@nemo_skills/pipeline/eval.py`:
- Around line 40-67: The code mixes direct access and .get() for
cluster_config["containers"] in _apply_task_overrides which is inconsistent;
update the membership check to use direct access so failures are explicit.
Replace cluster_config.get("containers", {}) with cluster_config["containers"]
in the for-loop that selects container (i.e., change if key and key in
cluster_config.get("containers", {}): to if key and key in
cluster_config["containers"]:) while keeping the initial container =
cluster_config["containers"]["nemo-skills"] and the references to CONTAINER_KEY
and task_classes unchanged.

In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 32-42: The _resolve_generation_task_class function currently
swallows all exceptions; change its error handling to only catch ImportError and
ModuleNotFoundError (so syntax/runtime errors in the module propagate) and, when
catching these import-related errors, log a warning including the exception
details and the module_name; keep the rest of the logic (import_from_path vs
importlib.import_module and returning getattr(..., "GENERATION_TASK_CLASS",
None)) unchanged and use the module logger (e.g., logging.getLogger(__name__))
or an existing logger in the file for the warning message.
🧹 Nitpick comments (13)
nemo_skills/evaluation/evaluator/audio.py (2)

499-516: Mutable _TRANSLATION_TYPES set is fragile if ever hoisted to module scope.

_TRANSLATION_TYPES is mutated on line 504 (_TRANSLATION_TYPES.add(task_type)). This works correctly today because it's re-created on every call, but the _ALL_CAPS naming convention strongly suggests a module-level constant. If a future refactor moves it to module scope (like _FAILURE_RESPONSES or VALID_NORMALIZATION_MODES above), the set would accumulate task types across calls — a subtle, hard-to-detect bug.

Consider making the intent clearer by either:

  1. Keeping the sets local but using lowercase naming (asr_types, translation_types), or
  2. Making the module-level sets truly immutable (frozensets) and building the union inline.
Option 2: immutable module-level sets + inline union

Define at module level:

_ASR_TYPES = frozenset({"ASR", "ASR-ZH", "ASR-PC", "ASR_LEADERBOARD"})
_TRANSLATION_TYPES = frozenset({"AST", "Translation"})

Then inside evaluate_sample:

-    _ASR_TYPES = {"ASR", "ASR-ZH", "ASR-PC", "ASR_LEADERBOARD"}
-    _TRANSLATION_TYPES = {"AST", "Translation"}
-    # AudioBench speech translation types: ST-{src}-{tgt}
-    if task_type.startswith("ST-"):
-        _TRANSLATION_TYPES.add(task_type)
-
-    if task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not generation:
+    is_translation = task_type in _TRANSLATION_TYPES or task_type.startswith("ST-")
+    is_asr = task_type in _ASR_TYPES
+
+    if (is_asr or is_translation or task_type == "CER") and not generation:

And similarly replace task_type in _TRANSLATION_TYPES checks with is_translation.


557-562: MathQA exact-match may be too strict for numerical answers.

Pure string equality after lowercasing won't match equivalent representations like "3.0" vs "3", "1/2" vs "0.5", or whitespace/formatting differences in expressions. If the AudioBench MathQA dataset guarantees canonical answer forms, this is fine — but worth a brief comment in the code to document that assumption.
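
If canonical answer forms are not guaranteed, one option is a numeric-tolerant comparison that falls back to exact match. A sketch only, not what the PR implements:

from fractions import Fraction

def mathqa_match(pred: str, gold: str) -> bool:
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if pred == gold:
        return True
    try:
        # Treats "3" and "3.0", or "1/2" and "0.5", as equal; non-numeric answers
        # have already failed the exact-match check above.
        return Fraction(pred) == Fraction(gold)
    except (ValueError, ZeroDivisionError):
        return False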

nemo_skills/inference/generate.py (1)

296-306: Consider extracting the shared get_env_prefix / get_extra_package_dirs logic into a mixin or the base class.

Both mcore_skills.py (lines 143–164) and eval_kit.py (lines 128–149) have identical implementations of get_env_prefix() and get_extra_package_dirs(). Since the base class already defines these hooks, the shared VLMEvalKit environment setup could live in a common mixin (e.g., VLMEvalKitMixin) or as a utility that both subclasses delegate to, avoiding the copy-paste.
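
One possible shape for such a mixin, with placeholder return values since the duplicated implementations are not shown in this diff:

class VLMEvalKitMixin:
    """Shared VLMEvalKit environment hooks for the two in-process generation tasks (sketch)."""

    @classmethod
    def get_env_prefix(cls) -> str:
        # Placeholder: whatever environment prefix the duplicated methods currently build.
        return ""

    @classmethod
    def get_extra_package_dirs(cls) -> list[str]:
        # Placeholder: whatever extra package directories the duplicated methods currently return.
        return []

# Both task classes would then inherit the shared hooks, e.g.:
# class EvalKitGenerationTask(VLMEvalKitMixin, GenerationTask): ...
# class MegatronMCoreGenerationTask(VLMEvalKitMixin, GenerationTask): ...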

nemo_skills/dataset/eval_kit/__init__.py (1)

33-42: Add type hints to the function signature.

As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code".

Suggested fix
-def get_extra_generation_args(benchmark):
+def get_extra_generation_args(benchmark: str) -> str:
nemo_skills/evaluation/metrics/eval_kit_metrics.py (3)

42-44: **kwargs is silently swallowed — unsupported arguments won't raise errors.

get_metrics() in map_metrics.py forwards **kwargs from the user's metrics_kwargs. Silently discarding them here means typos or invalid metric options will be ignored rather than failing. Consider either validating that no unexpected kwargs are passed or removing **kwargs entirely if no extra arguments are needed. As per coding guidelines, "Avoid silently ignoring unused user-passed parameters".

Suggested fix
-    def __init__(self, **kwargs):
+    def __init__(self, **kwargs):
+        if kwargs:
+            LOG.warning("EvalKitMetrics ignores extra kwargs: %s", list(kwargs.keys()))
         super().__init__(compute_no_answer=False)

Or, if no extra kwargs should ever be passed:

-    def __init__(self, **kwargs):
+    def __init__(self):
         super().__init__(compute_no_answer=False)

56-58: update() skips super().update(predictions) — intentional but worth documenting inline.

BaseMetrics.update() handles token counting, timing stats, and max_k tracking. Skipping it means those metrics will be absent. This is correct for pre-computed VLMEvalKit aggregates, but a brief inline comment would help future maintainers understand the deliberate deviation.


38-40: Class-level mutable state _shared_metrics_file persists across test runs.

_shared_metrics_file is a class variable that survives across multiple instantiations and test cases. If tests create EvalKitMetrics instances for different benchmarks, a stale path from a prior test could leak. Consider resetting it in __init__ or setup() when an instance-level path is found, or documenting the expected lifecycle.

nemo_skills/pipeline/utils/eval.py (2)

365-379: Duplicate module import logic — _resolve_generation_task_class vs lines 461-474.

_resolve_generation_task_class (lines 32-42) performs the same import-and-get-GENERATION_TASK_CLASS as lines 461-474 later in prepare_eval_commands. The inline version raises on missing GENERATION_TASK_CLASS, while the helper silently returns None. Consider consolidating into one function with an option to raise or return None.


413-417: Forcing num_jobs = total_evals when self-contained tasks are present may over-parallelize mixed workloads.

If the benchmark list contains both self-contained and server-based benchmarks, this forces every benchmark into its own job. The comment at line 413-414 explains the rationale for self-contained tasks, but it may be an unexpected side effect for the non-self-contained benchmarks in the same eval run.

Consider documenting this behavior in the --num_jobs help text or logging which benchmarks are affected.

nemo_skills/inference/eval/eval_kit.py (2)

162-227: Dataset is built twice on rank 0 when world_size > 1.

Lines 218 and 221 both call build_dataset(cfg.vlm_dataset, **dataset_kwargs). The first is rank-0 only (for download), the second is on all ranks. Rank 0 redundantly builds the dataset a second time. If dataset construction is expensive (beyond just downloading), consider caching the result from the first call.

Suggested optimization
         if world_size > 1:
             import torch.distributed as dist

             if rank == 0:
-                build_dataset(cfg.vlm_dataset, **dataset_kwargs)
+                self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
             dist.barrier()
+            if rank != 0:
+                self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
+        else:
+            self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)

-        self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)

337-361: Hardcoded dataset name lists will silently become stale as VLMEvalKit evolves.

These lists mirror VLMEvalKit/run.py but must be manually kept in sync. If VLMEvalKit adds a new dataset that needs special kwargs (e.g., nframe), it won't get them here. Consider importing or referencing these lists from VLMEvalKit if possible, or adding a comment with the VLMEvalKit version/commit these were copied from for traceability.

nemo_skills/inference/mcore_skills.py (2)

126-165: get_env_prefix() and get_extra_package_dirs() are duplicated verbatim from eval_kit.py.

Both EvalKitGenerationTask (in eval_kit.py, lines 128-150) and MegatronMCoreGenerationTask share identical implementations for get_env_prefix() and get_extra_package_dirs(). Extract these into a shared mixin or utility to avoid drift.


406-428: Redirecting sys.stdout to /dev/null is fragile — prefer suppressing at the logging/tqdm level.

Replacing sys.stdout globally on non-primary DP ranks suppresses all stdout, including potential error messages or debug info that isn't from VLMEvalKit. This pattern can also confuse debuggers and profilers. Consider using contextlib.redirect_stdout for a cleaner scope, or configuring VLMEvalKit's verbosity directly if possible.

That said, the finally cleanup at lines 425-428 is correct and ensures sys.stdout is restored even on exceptions.
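
A scoped variant along the lines suggested, using contextlib.redirect_stdout; the generation callback is a placeholder, not the PR's code:

import contextlib
import os

def generate_on_rank(dp_rank: int, run_generation) -> None:
    if dp_rank != 0:
        # Suppression is limited to this block and the devnull handle is closed automatically.
        with open(os.devnull, "w", encoding="utf-8") as devnull:
            with contextlib.redirect_stdout(devnull):
                run_generation()
    else:
        run_generation()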

Comment on lines +30 to +33
try:
    from nemo_skills.evaluation.evaluator.compute_eval import ComputeEvalEvaluator
except ImportError:
    ComputeEvalEvaluator = None

⚠️ Potential issue | 🟡 Minor

Silent import suppression will produce a confusing error when compute-eval is actually requested.

If the dependency is missing and a user requests the "compute-eval" evaluator, they'll see "Evaluator class not found for type: compute-eval" (from get_evaluator_class) instead of a clear message about the missing import. Consider deferring the error to the point of use so it's actionable.

Suggested approach
-try:
-    from nemo_skills.evaluation.evaluator.compute_eval import ComputeEvalEvaluator
-except ImportError:
-    ComputeEvalEvaluator = None
+_compute_eval_import_error = None
+try:
+    from nemo_skills.evaluation.evaluator.compute_eval import ComputeEvalEvaluator
+except ImportError as e:
+    ComputeEvalEvaluator = None
+    _compute_eval_import_error = e

Then in get_evaluator_class (or at registration lookup time), surface _compute_eval_import_error when eval_type == "compute-eval" and the class is None.
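
A rough sketch of that point-of-use check, assuming the get_evaluator_class signature and relying on the module-level names introduced in the suggestion above:

def get_evaluator_class(eval_type: str):
    if eval_type == "compute-eval" and ComputeEvalEvaluator is None:
        raise ImportError(
            "The 'compute-eval' evaluator was requested, but its optional dependencies "
            "could not be imported."
        ) from _compute_eval_import_error
    # ... existing lookup logic for the remaining evaluator types continues here ...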

As per coding guidelines, "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving."

🤖 Prompt for AI Agents
In `@nemo_skills/evaluation/evaluator/__init__.py` around lines 30 - 33, The
current top-level try/except hides import errors for ComputeEvalEvaluator;
instead capture the ImportError into a module-level variable (e.g.
_compute_eval_import_error) and set ComputeEvalEvaluator = None, then in
get_evaluator_class (or the evaluator registration lookup) check if eval_type ==
"compute-eval" and if ComputeEvalEvaluator is None raise a clear ImportError
that includes _compute_eval_import_error; this defers the failure to the point
of use and gives an actionable message when someone requests the "compute-eval"
evaluator.

Comment on lines +264 to +275
def _flush_pkl_to_jsonl(self, pkl_path, index_meta, output_path):
    """Read the pkl, find new entries, append them to the JSONL file."""
    if not os.path.exists(pkl_path):
        return
    try:
        with open(pkl_path, "rb") as f:
            data = pickle.load(f)
    except Exception:
        # pkl may be mid-write; skip this cycle
        return
    if not isinstance(data, dict):
        return

⚠️ Potential issue | 🟡 Minor

Broad except Exception when loading pickle silently hides non-transient errors.

The comment says "pkl may be mid-write; skip this cycle", but this also swallows PermissionError, MemoryError, EOFError, etc. Consider narrowing to the specific exceptions that a mid-write pickle can raise.

Suggested fix
-        except Exception:
+        except (OSError, EOFError, pickle.UnpicklingError):
             # pkl may be mid-write; skip this cycle
             return
🧰 Tools
🪛 Ruff (0.15.0)

[error] 270-270: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)


[warning] 271-271: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@nemo_skills/inference/eval/eval_kit.py` around lines 264 - 275, In
_flush_pkl_to_jsonl, replace the broad "except Exception" around pickle.load
with a narrower handler for the transient errors that indicate a mid-write
pickle (e.g., (EOFError, pickle.UnpicklingError, BlockingIOError)) so those are
safely skipped; for other unexpected errors (e.g., PermissionError, MemoryError)
let them propagate or log them explicitly before re-raising so they aren't
silently swallowed—update the exception clause in _flush_pkl_to_jsonl
accordingly and include a clear processLogger.error(...) call when re-raising
non-transient exceptions.

Comment on lines +467 to +522
def _evaluate_results(self):
    """Compute metrics using VLMEvalKit's evaluation functions.

    Uses the same asr_wer() that eval_kit.py calls via dataset.evaluate(),
    so metrics are identical. Saves eval_kit_metrics.json (consumed by
    EvalKitMetrics in the summarize step).
    """
    output_file = self.cfg.output_file
    if not output_file or not Path(output_file).exists():
        return

    output_path = Path(output_file)

    try:
        from vlmeval.dataset.avlm.utils import asr_wer

        # Read entries and build VLMEvalKit-format results list
        entries = []
        results = []
        with open(output_file, "rt", encoding="utf-8") as fin:
            for line in fin:
                entry = json.loads(line)
                # Strip leftover <think> tags (older runs may have them)
                gen = entry.get("generation", "")
                cleaned = self._strip_thinking_tags(gen)
                if cleaned != gen:
                    entry["generation"] = cleaned
                entries.append(entry)
                results.append({
                    "gt": entry.get("expected_answer", ""),
                    "pred": entry["generation"],
                })

        # Re-write output.jsonl with cleaned generations
        with open(output_file, "w", encoding="utf-8") as fout:
            for entry in entries:
                fout.write(json.dumps(entry) + "\n")

        # Compute WER using VLMEvalKit (same function as eval_kit path)
        wer_score = asr_wer(results)
        LOG.info("ASR WER: %.2f%%", wer_score)

        # Save as eval_kit_metrics.json (same format eval_kit.py writes)
        metrics = {"wer": wer_score}
        metrics_file = output_path.parent / "eval_kit_metrics.json"
        with open(metrics_file, "w", encoding="utf-8") as f:
            json.dump(metrics, f, indent=2)
        LOG.info("Metrics saved to %s", metrics_file)

    except ImportError:
        LOG.warning(
            "VLMEvalKit asr_wer not available — skipping eval-kit-style metrics. "
            "The summarize_results job will compute metrics separately."
        )
    except Exception:
        LOG.exception("Inline metrics computation failed")

⚠️ Potential issue | 🟠 Major

_evaluate_results assumes ASR WER metric — won't work for non-ASR benchmarks.

This method hardcodes from vlmeval.dataset.avlm.utils import asr_wer and always computes WER. For non-ASR benchmarks processed through mcore_skills.py, this will either fail (caught by the broad except Exception) or produce meaningless metrics. The METRICS_TYPE_OVERRIDE = "eval_kit" at line 136 means EvalKitMetrics will always read eval_kit_metrics.json, which will contain only WER.

Consider either:

  1. Making the evaluation function configurable (e.g., via config), or
  2. Guarding this with a dataset-type check, or
  3. Documenting that mcore_skills.py is currently ASR-only.
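
For options 1 and 2, a small guard consulted at the top of _evaluate_results could be enough. A sketch with a hypothetical config field (metrics_type is assumed here, not an existing setting in this PR):

def should_compute_inline_wer(cfg) -> bool:
    # Hypothetical guard: only run the inline asr_wer path for ASR-style benchmarks.
    # `metrics_type` is an assumed config field, not something this PR defines.
    return getattr(cfg, "metrics_type", None) == "asr"

_evaluate_results would call this before importing asr_wer and return early, without writing eval_kit_metrics.json, when it is False.
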
🤖 Prompt for AI Agents
In `@nemo_skills/inference/mcore_skills.py` around lines 467 - 522,
_evaluate_results currently always imports vlmeval.dataset.avlm.utils.asr_wer
and writes eval_kit_metrics.json with only WER which is wrong for non-ASR
datasets; update the method to consult a config flag (e.g.,
self.cfg.metrics_type or self.cfg.dataset_type) or a new self.cfg.eval_function
setting before importing/using asr_wer so you only compute WER when the dataset
is ASR-type, otherwise skip creating eval_kit_metrics.json or call a
configurable evaluator; reference the METRICS_TYPE_OVERRIDE constant and the
EvalKitMetrics consumer to ensure the written file matches the selected
evaluator, and add a clear guard in _evaluate_results (or load evaluator via a
registry/lookup) to avoid the broad Exception path producing meaningless
metrics.

Comment on lines +40 to +67
def _apply_task_overrides(combined_cmd, task_classes, job_num_gpus, cluster_config):
    """Apply env/torchrun/container overrides declared by generation task classes.

    Returns (modified_cmd, container).
    """
    # Environment prefix (first non-empty wins; jobs are not mixed across task types)
    for tc in task_classes:
        prefix = tc.get_env_prefix() if hasattr(tc, "get_env_prefix") else ""
        if prefix:
            combined_cmd = f'{prefix}{combined_cmd}'
            break

    # Torchrun for multi-GPU data-parallel inference
    if any(getattr(tc, "USE_TORCHRUN", False) for tc in task_classes):
        if job_num_gpus and int(job_num_gpus) > 1:
            combined_cmd = combined_cmd.replace(
                'python -m ', f'torchrun --nproc_per_node {job_num_gpus} -m ', 1
            )

    # Container selection (task class CONTAINER_KEY, falling back to nemo-skills default)
    container = cluster_config["containers"]["nemo-skills"]
    for tc in task_classes:
        key = getattr(tc, "CONTAINER_KEY", None)
        if key and key in cluster_config.get("containers", {}):
            container = cluster_config["containers"][key]
            break

    return combined_cmd, container

⚠️ Potential issue | 🟡 Minor

Inconsistent use of .get() vs direct access for cluster_config["containers"]

Line 60 accesses cluster_config["containers"] directly (which will raise KeyError if missing), but line 63 uses cluster_config.get("containers", {}). If "containers" is absent, line 60 will crash before line 63 is reached, making the .get() misleading. Use direct access consistently. As per coding guidelines, "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors".

Suggested fix
-        if key and key in cluster_config.get("containers", {}):
+        if key and key in cluster_config["containers"]:
🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/eval.py` around lines 40 - 67, The code mixes direct
access and .get() for cluster_config["containers"] in _apply_task_overrides
which is inconsistent; update the membership check to use direct access so
failures are explicit. Replace cluster_config.get("containers", {}) with
cluster_config["containers"] in the for-loop that selects container (i.e.,
change if key and key in cluster_config.get("containers", {}): to if key and key
in cluster_config["containers"]:) while keeping the initial container =
cluster_config["containers"]["nemo-skills"] and the references to CONTAINER_KEY
and task_classes unchanged.

Comment on lines +32 to +42
def _resolve_generation_task_class(module_name: str):
    """Import a generation module and return its GENERATION_TASK_CLASS, or None."""
    try:
        if module_name.endswith(".py") or os.sep in module_name:
            path_suffix = ".py" if not module_name.endswith(".py") else ""
            mod = import_from_path(module_name + path_suffix)
        else:
            mod = importlib.import_module(module_name)
        return getattr(mod, "GENERATION_TASK_CLASS", None)
    except Exception:
        return None

⚠️ Potential issue | 🟠 Major

Bare except Exception silently swallows import/syntax errors in generation modules.

If the generation module has a bug, misspelled path, or missing dependency, this returns None silently. Downstream code then treats the benchmark as having no task class, potentially skipping self-contained mode, env prefixes, extra package dirs, and metric type overrides — all without any error or warning. This violates the coding guideline: "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving".

At minimum, log a warning so users know resolution failed. Better yet, only catch ImportError/ModuleNotFoundError and let other exceptions propagate.

Suggested fix
 def _resolve_generation_task_class(module_name: str):
     """Import a generation module and return its GENERATION_TASK_CLASS, or None."""
     try:
         if module_name.endswith(".py") or os.sep in module_name:
             path_suffix = ".py" if not module_name.endswith(".py") else ""
             mod = import_from_path(module_name + path_suffix)
         else:
             mod = importlib.import_module(module_name)
         return getattr(mod, "GENERATION_TASK_CLASS", None)
-    except Exception:
+    except (ImportError, ModuleNotFoundError):
+        LOG.debug("Could not resolve GENERATION_TASK_CLASS from %s", module_name)
         return None
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 41-41: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/utils/eval.py` around lines 32 - 42, The
_resolve_generation_task_class function currently swallows all exceptions;
change its error handling to only catch ImportError and ModuleNotFoundError (so
syntax/runtime errors in the module propagate) and, when catching these
import-related errors, log a warning including the exception details and the
module_name; keep the rest of the logic (import_from_path vs
importlib.import_module and returning getattr(..., "GENERATION_TASK_CLASS",
None)) unchanged and use the module logger (e.g., logging.getLogger(__name__))
or an existing logger in the file for the warning message.

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: George <[email protected]>