Eval kit support #1239
Conversation
Adds two generation modes that integrate NeMo Skills with VLMEvalKit and its MultiModalMCore inference engine:
- eval_kit: uses VLMEvalKit's native dataset/evaluation pipeline with NeMo Skills output formatting
- mcore_skills: reads NeMo Skills JSONL data, runs inference through MultiModalMCore in-process with data parallelism, computes metrics using VLMEvalKit functions (e.g. asr_wer), and writes results in NeMo Skills format

Key changes:
- New inference modules: eval_kit.py, mcore_skills.py
- New dataset module: eval_kit/ for VLMEvalKit benchmark resolution
- New metrics class: EvalKitMetrics for pre-computed VLMEvalKit metrics
- Pipeline extensions: self-contained task support, METRICS_TYPE_OVERRIDE, eval_kit container/mounts handling
- Lazy imports for sacrebleu to avoid missing-dep crashes

Signed-off-by: George Zelenfroind <[email protected]>
17 files reviewed, 4 comments
# avoid 8x duplicate output in logs.
_real_stdout = sys.stdout
if dp_rank != 0:
    sys.stdout = open(os.devnull, "w")
opened file is never closed if exception occurs during generation
-    sys.stdout = open(os.devnull, "w")
+    sys.stdout = open(os.devnull, "w", encoding="utf-8")
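To also close the handle that the comment above flags, a minimal try/finally sketch is shown below. dp_rank mirrors the snippet under review; run_generation is a hypothetical stand-in for the actual inference call, not a function from this PR.

import os
import sys
from typing import Callable


def run_rank_quiet(dp_rank: int, run_generation: Callable[[], None]) -> None:
    """Suppress stdout on non-zero data-parallel ranks, always restoring
    sys.stdout and closing the devnull handle (sketch only)."""
    real_stdout = sys.stdout
    devnull = None
    if dp_rank != 0:
        devnull = open(os.devnull, "w", encoding="utf-8")
        sys.stdout = devnull
    try:
        run_generation()
    finally:
        sys.stdout = real_stdout
        if devnull is not None:
            devnull.close()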
if os.path.exists(result_file):
    df = vlm_load(result_file)
    with open(self.cfg.output_file, 'w', encoding='utf-8') as f:
        for _, row in df.iterrows():
            entry = {
                "generation": str(row["prediction"]) if "prediction" in row.index else "",
                "expected_answer": str(row["answer"]) if "answer" in row.index else "",
                "question": str(row["question"]) if "question" in row.index else "",
            }
            f.write(json.dumps(entry) + "\n")
    LOG.info("Wrote final ordered JSONL to %s (%d entries)", self.cfg.output_file, len(df))
violates data loss protection guideline from CONTRIBUTING.md - loads df then immediately opens output file for writing
if vlm_load() completes but code crashes before write finishes, async writer's partial data is lost
load all data into memory first, then write atomically:
if os.path.exists(result_file):
    df = vlm_load(result_file)
    entries = []
    for _, row in df.iterrows():
        entries.append({
            "generation": str(row["prediction"]) if "prediction" in row.index else "",
            "expected_answer": str(row["answer"]) if "answer" in row.index else "",
            "question": str(row["question"]) if "question" in row.index else "",
        })
    with open(self.cfg.output_file, 'w', encoding='utf-8') as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
    LOG.info("Wrote final ordered JSONL to %s (%d entries)", self.cfg.output_file, len(df))
# Re-write output.jsonl with cleaned generations
with open(output_file, "w", encoding="utf-8") as fout:
    for entry in entries:
        fout.write(json.dumps(entry) + "\n")
same data loss issue - loads entries then immediately opens file for writing
compute all entries first, then write atomically
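For both rewrite sites, one way to make the final write atomic at the filesystem level is to write to a temp file in the same directory and swap it into place. This is a sketch; output_file and entries are assumed to be the same objects as in the snippets above.

import json
import os
import tempfile


def rewrite_jsonl_atomically(output_file: str, entries: list[dict]) -> None:
    """Write all entries to a temp file, then atomically replace output_file,
    so a crash mid-write never leaves a truncated JSONL behind (sketch)."""
    out_dir = os.path.dirname(output_file) or "."
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".jsonl.tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fout:
            for entry in entries:
                fout.write(json.dumps(entry) + "\n")
        os.replace(tmp_path, output_file)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up the temp file if the swap never happened
        raise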
📝 Walkthrough
This pull request introduces comprehensive VLMEvalKit integration into NeMo Skills, adding two new in-process generation modes (Megatron MCore and vLLM-based), metrics aggregation from pre-computed evaluation results, and pipeline orchestration for self-contained tasks with GPU allocation per benchmark.
Sequence Diagram(s)
sequenceDiagram
participant User as Evaluation<br/>Pipeline
participant Benchmark as Benchmark<br/>Module
participant TaskFactory as Task<br/>Factory
participant GenTask as EvalKit/<br/>MCore Task
participant Dataset as VLMEvalKit<br/>Dataset
participant Model as Generation<br/>Model
participant Metrics as Metrics<br/>Collector
participant Results as Result<br/>Writer
User->>Benchmark: Load benchmark config
Benchmark-->>User: GENERATION_MODULE, METRICS_TYPE, GENERATION_ARGS
User->>TaskFactory: is_self_contained(extra_args)
alt Self-Contained (eval_kit/mcore)
TaskFactory-->>User: True - allocate GPUs per benchmark
User->>GenTask: __init__(config)
GenTask->>Dataset: Build with dataset kwargs
Dataset-->>GenTask: Dataset prepared
GenTask->>Model: Create (mcore/vllm)
Model-->>GenTask: Model ready
else Server-Based
TaskFactory-->>User: False - use shared vLLM server
end
User->>GenTask: generate()
GenTask->>GenTask: _start_async_writer()
GenTask->>Dataset: Iterate samples
loop Per Sample
Dataset->>GenTask: Next sample
GenTask->>Model: Inference (text/image/video)
Model-->>GenTask: Generation result
GenTask->>Results: Write incremental JSONL
end
GenTask->>GenTask: _stop_async_writer()
GenTask->>Metrics: Evaluate results
Metrics-->>GenTask: eval_kit_metrics.json
GenTask->>Results: Write final results + done file
Results-->>User: Evaluation complete
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)
131-131: ⚠️ Potential issue | 🟡 Minor
Remove debug print
This looks like a leftover debug artifact. Either remove it or replace with proper logging.
- print(f"evaluator: {evaluator}")
🧹 Nitpick comments (13)
nemo_skills/evaluation/evaluator/audio.py (2)
499-516: Mutable _TRANSLATION_TYPES set is fragile if ever hoisted to module scope.
_TRANSLATION_TYPES is mutated on line 504 (_TRANSLATION_TYPES.add(task_type)). This works correctly today because it's re-created on every call, but the _ALL_CAPS naming convention strongly suggests a module-level constant. If a future refactor moves it to module scope (like _FAILURE_RESPONSES or VALID_NORMALIZATION_MODES above), the set would accumulate task types across calls — a subtle, hard-to-detect bug.
Consider making the intent clearer by either:
- Keeping the sets local but using lowercase naming (asr_types, translation_types), or
- Making the module-level sets truly immutable (frozensets) and building the union inline.
Option 2: immutable module-level sets + inline union
Define at module level:
_ASR_TYPES = frozenset({"ASR", "ASR-ZH", "ASR-PC", "ASR_LEADERBOARD"})
_TRANSLATION_TYPES = frozenset({"AST", "Translation"})
Then inside evaluate_sample:
-    _ASR_TYPES = {"ASR", "ASR-ZH", "ASR-PC", "ASR_LEADERBOARD"}
-    _TRANSLATION_TYPES = {"AST", "Translation"}
-    # AudioBench speech translation types: ST-{src}-{tgt}
-    if task_type.startswith("ST-"):
-        _TRANSLATION_TYPES.add(task_type)
-
-    if task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not generation:
+    is_translation = task_type in _TRANSLATION_TYPES or task_type.startswith("ST-")
+    is_asr = task_type in _ASR_TYPES
+
+    if (is_asr or is_translation or task_type == "CER") and not generation:
And similarly replace task_type in _TRANSLATION_TYPES checks with is_translation.
557-562: MathQA exact-match may be too strict for numerical answers.
Pure string equality after lowercasing won't match equivalent representations like "3.0" vs "3", "1/2" vs "0.5", or whitespace/formatting differences in expressions. If the AudioBench MathQA dataset guarantees canonical answer forms, this is fine — but worth a brief comment in the code to document that assumption.
nemo_skills/inference/generate.py (1)
296-306: Consider extracting the shared get_env_prefix / get_extra_package_dirs logic into a mixin or the base class.
Both mcore_skills.py (lines 143–164) and eval_kit.py (lines 128–149) have identical implementations of get_env_prefix() and get_extra_package_dirs(). Since the base class already defines these hooks, the shared VLMEvalKit environment setup could live in a common mixin (e.g., VLMEvalKitMixin) or as a utility that both subclasses delegate to, avoiding the copy-paste.
nemo_skills/dataset/eval_kit/__init__.py (1)
33-42: Add type hints to the function signature.
As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code".
Suggested fix
-def get_extra_generation_args(benchmark):
+def get_extra_generation_args(benchmark: str) -> str:
nemo_skills/evaluation/metrics/eval_kit_metrics.py (3)
42-44: **kwargs is silently swallowed — unsupported arguments won't raise errors.
get_metrics() in map_metrics.py forwards **kwargs from the user's metrics_kwargs. Silently discarding them here means typos or invalid metric options will be ignored rather than failing. Consider either validating that no unexpected kwargs are passed or removing **kwargs entirely if no extra arguments are needed. As per coding guidelines, "Avoid silently ignoring unused user-passed parameters".
Suggested fix
-    def __init__(self, **kwargs):
+    def __init__(self, **kwargs):
+        if kwargs:
+            LOG.warning("EvalKitMetrics ignores extra kwargs: %s", list(kwargs.keys()))
         super().__init__(compute_no_answer=False)
Or, if no extra kwargs should ever be passed:
-    def __init__(self, **kwargs):
+    def __init__(self):
         super().__init__(compute_no_answer=False)
56-58: update() skips super().update(predictions) — intentional but worth documenting inline.
BaseMetrics.update() handles token counting, timing stats, and max_k tracking. Skipping it means those metrics will be absent. This is correct for pre-computed VLMEvalKit aggregates, but a brief inline comment would help future maintainers understand the deliberate deviation.
38-40: Class-level mutable state _shared_metrics_file persists across test runs.
_shared_metrics_file is a class variable that survives across multiple instantiations and test cases. If tests create EvalKitMetrics instances for different benchmarks, a stale path from a prior test could leak. Consider resetting it in __init__ or setup() when an instance-level path is found, or documenting the expected lifecycle.
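A sketch of the reset-in-__init__ option is below; BaseMetrics here is a stub stand-in so the snippet is self-contained, and the real class keeps its existing behavior otherwise.

from typing import Optional


class BaseMetrics:  # stand-in for the real base class, only for this sketch
    def __init__(self, compute_no_answer: bool = True):
        self.compute_no_answer = compute_no_answer


class EvalKitMetrics(BaseMetrics):
    _shared_metrics_file: Optional[str] = None  # class-level cache

    def __init__(self, **kwargs):
        super().__init__(compute_no_answer=False)
        # Drop any path cached by a previous instance so runs for different
        # benchmarks never pick up a stale metrics file.
        type(self)._shared_metrics_file = None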
nemo_skills/pipeline/utils/eval.py (2)
365-379: Duplicate module import logic — _resolve_generation_task_class vs lines 461-474.
_resolve_generation_task_class (lines 32-42) performs the same import-and-get-GENERATION_TASK_CLASS as lines 461-474 later in prepare_eval_commands. The inline version raises on missing GENERATION_TASK_CLASS, while the helper silently returns None. Consider consolidating into one function with an option to raise or return None.
413-417: Forcing num_jobs = total_evals when self-contained tasks are present may over-parallelize mixed workloads.
If the benchmark list contains both self-contained and server-based benchmarks, this forces every benchmark into its own job. The comment at line 413-414 explains the rationale for self-contained tasks, but it may be an unexpected side effect for the non-self-contained benchmarks in the same eval run.
Consider documenting this behavior in the --num_jobs help text or logging which benchmarks are affected.
nemo_skills/inference/eval/eval_kit.py (2)
162-227: Dataset is built twice on rank 0 when world_size > 1.
Lines 218 and 221 both call build_dataset(cfg.vlm_dataset, **dataset_kwargs). The first is rank-0 only (for download), the second is on all ranks. Rank 0 redundantly builds the dataset a second time. If dataset construction is expensive (beyond just downloading), consider caching the result from the first call.
Suggested optimization
 if world_size > 1:
     import torch.distributed as dist

     if rank == 0:
-        build_dataset(cfg.vlm_dataset, **dataset_kwargs)
+        self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
     dist.barrier()
+    if rank != 0:
+        self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
+else:
+    self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)

-self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
337-361: Hardcoded dataset name lists will silently become stale as VLMEvalKit evolves.
These lists mirror VLMEvalKit/run.py but must be manually kept in sync. If VLMEvalKit adds a new dataset that needs special kwargs (e.g., nframe), it won't get them here. Consider importing or referencing these lists from VLMEvalKit if possible, or adding a comment with the VLMEvalKit version/commit these were copied from for traceability.
nemo_skills/inference/mcore_skills.py (2)
126-165: get_env_prefix() and get_extra_package_dirs() are duplicated verbatim from eval_kit.py.
Both EvalKitGenerationTask (in eval_kit.py, lines 128-150) and MegatronMCoreGenerationTask share identical implementations for get_env_prefix() and get_extra_package_dirs(). Extract these into a shared mixin or utility to avoid drift.
406-428: Redirecting sys.stdout to /dev/null is fragile — prefer suppressing at the logging/tqdm level.
Replacing sys.stdout globally on non-primary DP ranks suppresses all stdout, including potential error messages or debug info that isn't from VLMEvalKit. This pattern can also confuse debuggers and profilers. Consider using contextlib.redirect_stdout for a cleaner scope, or configuring VLMEvalKit's verbosity directly if possible.
That said, the finally cleanup at lines 425-428 is correct and ensures sys.stdout is restored even on exceptions.
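A scoped sketch of the contextlib.redirect_stdout alternative; dp_rank is as in the code under review and run_generation is a stand-in for the inference loop.

import contextlib
import os
from typing import Callable


def run_rank_quiet_scoped(dp_rank: int, run_generation: Callable[[], None]) -> None:
    """Same idea as the manual swap, but the context managers restore sys.stdout
    and close the devnull handle automatically, even on exceptions (sketch)."""
    if dp_rank == 0:
        run_generation()
        return
    with open(os.devnull, "w", encoding="utf-8") as devnull:
        with contextlib.redirect_stdout(devnull):
            run_generation()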
try:
    from nemo_skills.evaluation.evaluator.compute_eval import ComputeEvalEvaluator
except ImportError:
    ComputeEvalEvaluator = None
Silent import suppression will produce a confusing error when compute-eval is actually requested.
If the dependency is missing and a user requests the "compute-eval" evaluator, they'll see "Evaluator class not found for type: compute-eval" (from get_evaluator_class) instead of a clear message about the missing import. Consider deferring the error to the point of use so it's actionable.
Suggested approach
-try:
-    from nemo_skills.evaluation.evaluator.compute_eval import ComputeEvalEvaluator
-except ImportError:
-    ComputeEvalEvaluator = None
+_compute_eval_import_error = None
+try:
+    from nemo_skills.evaluation.evaluator.compute_eval import ComputeEvalEvaluator
+except ImportError as e:
+    ComputeEvalEvaluator = None
+    _compute_eval_import_error = e
Then in get_evaluator_class (or at registration lookup time), surface _compute_eval_import_error when eval_type == "compute-eval" and the class is None.
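One possible shape for that point-of-use check; this is a sketch, and the real get_evaluator_class signature and registry lookup are simplified here.

_compute_eval_import_error = None
try:
    from nemo_skills.evaluation.evaluator.compute_eval import ComputeEvalEvaluator
except ImportError as e:
    ComputeEvalEvaluator = None
    _compute_eval_import_error = e


def get_evaluator_class(eval_type: str):
    """Sketch: only the compute-eval branch is shown; the registry lookup is omitted."""
    if eval_type == "compute-eval":
        if ComputeEvalEvaluator is None:
            raise ImportError(
                "The 'compute-eval' evaluator was requested but its dependency failed to import."
            ) from _compute_eval_import_error
        return ComputeEvalEvaluator
    raise ValueError(f"Evaluator class not found for type: {eval_type}")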
As per coding guidelines, "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving."
🤖 Prompt for AI Agents
In `@nemo_skills/evaluation/evaluator/__init__.py` around lines 30 - 33, The
current top-level try/except hides import errors for ComputeEvalEvaluator;
instead capture the ImportError into a module-level variable (e.g.
_compute_eval_import_error) and set ComputeEvalEvaluator = None, then in
get_evaluator_class (or the evaluator registration lookup) check if eval_type ==
"compute-eval" and if ComputeEvalEvaluator is None raise a clear ImportError
that includes _compute_eval_import_error; this defers the failure to the point
of use and gives an actionable message when someone requests the "compute-eval"
evaluator.
def _flush_pkl_to_jsonl(self, pkl_path, index_meta, output_path):
    """Read the pkl, find new entries, append them to the JSONL file."""
    if not os.path.exists(pkl_path):
        return
    try:
        with open(pkl_path, "rb") as f:
            data = pickle.load(f)
    except Exception:
        # pkl may be mid-write; skip this cycle
        return
    if not isinstance(data, dict):
        return
Broad except Exception when loading pickle silently hides non-transient errors.
The comment says "pkl may be mid-write; skip this cycle", but this also swallows PermissionError, MemoryError, EOFError, etc. Consider narrowing to the specific exceptions that a mid-write pickle can raise.
Suggested fix
-        except Exception:
+        except (OSError, EOFError, pickle.UnpicklingError):
             # pkl may be mid-write; skip this cycle
             return
🧰 Tools
🪛 Ruff (0.15.0)
[error] 270-270: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
[warning] 271-271: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@nemo_skills/inference/eval/eval_kit.py` around lines 264 - 275, In
_flush_pkl_to_jsonl, replace the broad "except Exception" around pickle.load
with a narrower handler for the transient errors that indicate a mid-write
pickle (e.g., (EOFError, pickle.UnpicklingError, BlockingIOError)) so those are
safely skipped; for other unexpected errors (e.g., PermissionError, MemoryError)
let them propagate or log them explicitly before re-raising so they aren't
silently swallowed—update the exception clause in _flush_pkl_to_jsonl
accordingly and include a clear processLogger.error(...) call when re-raising
non-transient exceptions.
def _evaluate_results(self):
    """Compute metrics using VLMEvalKit's evaluation functions.

    Uses the same asr_wer() that eval_kit.py calls via dataset.evaluate(),
    so metrics are identical. Saves eval_kit_metrics.json (consumed by
    EvalKitMetrics in the summarize step).
    """
    output_file = self.cfg.output_file
    if not output_file or not Path(output_file).exists():
        return

    output_path = Path(output_file)

    try:
        from vlmeval.dataset.avlm.utils import asr_wer

        # Read entries and build VLMEvalKit-format results list
        entries = []
        results = []
        with open(output_file, "rt", encoding="utf-8") as fin:
            for line in fin:
                entry = json.loads(line)
                # Strip leftover <think> tags (older runs may have them)
                gen = entry.get("generation", "")
                cleaned = self._strip_thinking_tags(gen)
                if cleaned != gen:
                    entry["generation"] = cleaned
                entries.append(entry)
                results.append({
                    "gt": entry.get("expected_answer", ""),
                    "pred": entry["generation"],
                })

        # Re-write output.jsonl with cleaned generations
        with open(output_file, "w", encoding="utf-8") as fout:
            for entry in entries:
                fout.write(json.dumps(entry) + "\n")

        # Compute WER using VLMEvalKit (same function as eval_kit path)
        wer_score = asr_wer(results)
        LOG.info("ASR WER: %.2f%%", wer_score)

        # Save as eval_kit_metrics.json (same format eval_kit.py writes)
        metrics = {"wer": wer_score}
        metrics_file = output_path.parent / "eval_kit_metrics.json"
        with open(metrics_file, "w", encoding="utf-8") as f:
            json.dump(metrics, f, indent=2)
        LOG.info("Metrics saved to %s", metrics_file)

    except ImportError:
        LOG.warning(
            "VLMEvalKit asr_wer not available — skipping eval-kit-style metrics. "
            "The summarize_results job will compute metrics separately."
        )
    except Exception:
        LOG.exception("Inline metrics computation failed")
_evaluate_results assumes ASR WER metric — won't work for non-ASR benchmarks.
This method hardcodes from vlmeval.dataset.avlm.utils import asr_wer and always computes WER. For non-ASR benchmarks processed through mcore_skills.py, this will either fail (caught by the broad except Exception) or produce meaningless metrics. The METRICS_TYPE_OVERRIDE = "eval_kit" at line 136 means EvalKitMetrics will always read eval_kit_metrics.json, which will contain only WER.
Consider either:
- Making the evaluation function configurable (e.g., via config; a rough sketch follows below), or
- Guarding this with a dataset-type check, or
- Documenting that mcore_skills.py is currently ASR-only.
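A rough sketch of the configurable-function option is below; eval_function is a hypothetical config field, and asr_wer is the only VLMEvalKit metric function this PR is known to use.

import logging
from typing import Optional

LOG = logging.getLogger(__name__)


def compute_inline_metrics(results: list[dict], eval_function: Optional[str]) -> Optional[dict]:
    """Guarded, config-driven metric dispatch (sketch). Returns None when no
    inline metric applies, so no eval_kit_metrics.json is written in that case."""
    if eval_function != "asr_wer":
        LOG.info("eval_function=%r has no inline metric; deferring to summarize_results.", eval_function)
        return None
    # Same function the eval_kit path uses; imported lazily like the current code.
    from vlmeval.dataset.avlm.utils import asr_wer
    return {"wer": asr_wer(results)}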
🤖 Prompt for AI Agents
In `@nemo_skills/inference/mcore_skills.py` around lines 467 - 522,
_evaluate_results currently always imports vlmeval.dataset.avlm.utils.asr_wer
and writes eval_kit_metrics.json with only WER which is wrong for non-ASR
datasets; update the method to consult a config flag (e.g.,
self.cfg.metrics_type or self.cfg.dataset_type) or a new self.cfg.eval_function
setting before importing/using asr_wer so you only compute WER when the dataset
is ASR-type, otherwise skip creating eval_kit_metrics.json or call a
configurable evaluator; reference the METRICS_TYPE_OVERRIDE constant and the
EvalKitMetrics consumer to ensure the written file matches the selected
evaluator, and add a clear guard in _evaluate_results (or load evaluator via a
registry/lookup) to avoid the broad Exception path producing meaningless
metrics.
def _apply_task_overrides(combined_cmd, task_classes, job_num_gpus, cluster_config):
    """Apply env/torchrun/container overrides declared by generation task classes.

    Returns (modified_cmd, container).
    """
    # Environment prefix (first non-empty wins; jobs are not mixed across task types)
    for tc in task_classes:
        prefix = tc.get_env_prefix() if hasattr(tc, "get_env_prefix") else ""
        if prefix:
            combined_cmd = f'{prefix}{combined_cmd}'
            break

    # Torchrun for multi-GPU data-parallel inference
    if any(getattr(tc, "USE_TORCHRUN", False) for tc in task_classes):
        if job_num_gpus and int(job_num_gpus) > 1:
            combined_cmd = combined_cmd.replace(
                'python -m ', f'torchrun --nproc_per_node {job_num_gpus} -m ', 1
            )

    # Container selection (task class CONTAINER_KEY, falling back to nemo-skills default)
    container = cluster_config["containers"]["nemo-skills"]
    for tc in task_classes:
        key = getattr(tc, "CONTAINER_KEY", None)
        if key and key in cluster_config.get("containers", {}):
            container = cluster_config["containers"][key]
            break

    return combined_cmd, container
Inconsistent use of .get() vs direct access for cluster_config["containers"]
Line 60 accesses cluster_config["containers"] directly (which will raise KeyError if missing), but line 63 uses cluster_config.get("containers", {}). If "containers" is absent, line 60 will crash before line 63 is reached, making the .get() misleading. Use direct access consistently. As per coding guidelines, "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors".
Suggested fix
-        if key and key in cluster_config.get("containers", {}):
+        if key and key in cluster_config["containers"]:
🤖 Prompt for AI Agents
+ if key and key in cluster_config["containers"]:🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/eval.py` around lines 40 - 67, The code mixes direct
access and .get() for cluster_config["containers"] in _apply_task_overrides
which is inconsistent; update the membership check to use direct access so
failures are explicit. Replace cluster_config.get("containers", {}) with
cluster_config["containers"] in the for-loop that selects container (i.e.,
change if key and key in cluster_config.get("containers", {}): to if key and key
in cluster_config["containers"]:) while keeping the initial container =
cluster_config["containers"]["nemo-skills"] and the references to CONTAINER_KEY
and task_classes unchanged.
def _resolve_generation_task_class(module_name: str):
    """Import a generation module and return its GENERATION_TASK_CLASS, or None."""
    try:
        if module_name.endswith(".py") or os.sep in module_name:
            path_suffix = ".py" if not module_name.endswith(".py") else ""
            mod = import_from_path(module_name + path_suffix)
        else:
            mod = importlib.import_module(module_name)
        return getattr(mod, "GENERATION_TASK_CLASS", None)
    except Exception:
        return None
Bare except Exception silently swallows import/syntax errors in generation modules.
If the generation module has a bug, misspelled path, or missing dependency, this returns None silently. Downstream code then treats the benchmark as having no task class, potentially skipping self-contained mode, env prefixes, extra package dirs, and metric type overrides — all without any error or warning. This violates the coding guideline: "Do not catch exceptions when they are not normally expected to be raised; let code fail with clear errors instead of silently misbehaving".
At minimum, log a warning so users know resolution failed. Better yet, only catch ImportError/ModuleNotFoundError and let other exceptions propagate.
Suggested fix
def _resolve_generation_task_class(module_name: str):
    """Import a generation module and return its GENERATION_TASK_CLASS, or None."""
    try:
        if module_name.endswith(".py") or os.sep in module_name:
            path_suffix = ".py" if not module_name.endswith(".py") else ""
            mod = import_from_path(module_name + path_suffix)
        else:
            mod = importlib.import_module(module_name)
        return getattr(mod, "GENERATION_TASK_CLASS", None)
-    except Exception:
+    except (ImportError, ModuleNotFoundError):
+        LOG.debug("Could not resolve GENERATION_TASK_CLASS from %s", module_name)
        return None
🪛 Ruff (0.15.0)
[warning] 41-41: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/utils/eval.py` around lines 32 - 42, The
_resolve_generation_task_class function currently swallows all exceptions;
change its error handling to only catch ImportError and ModuleNotFoundError (so
syntax/runtime errors in the module propagate) and, when catching these
import-related errors, log a warning including the exception details and the
module_name; keep the rest of the logic (import_from_path vs
importlib.import_module and returning getattr(..., "GENERATION_TASK_CLASS",
None)) unchanged and use the module logger (e.g., logging.getLogger(__name__))
or an existing logger in the file for the warning message.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: George <[email protected]>