
Conversation

@blahblahasdf (Collaborator) commented Dec 17, 2025

Overview

This PR integrates ComputeEval - a benchmark for evaluating LLMs on CUDA code generation - into NeMo-Skills.

ComputeEval features:

  1. 406 (and growing) CUDA programming challenges
  2. Functional correctness evaluation via compilation and test execution
  3. Multiple release versions (2025-1, 2025-2, 2025-3)

Changes

  1. Dataset Preparation - download ComputeEval problems from HuggingFace (gated dataset)
  2. Generation Task - implement GenerationTask to generate CUDA solutions using ComputeEval's reference generation logic
  3. Evaluator - implement BaseEvaluator using ComputeEval's evaluation logic
  4. Metrics - compute pass@k metrics

Dependencies

Adds compute-eval package (compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@2d14770)
Note: compute-eval is a public repository but has not been published to PyPI yet

Summary by CodeRabbit

Release Notes

  • New Features
    • Added support for compute-eval dataset from HuggingFace with optional release specifications.
    • Introduced compute-eval evaluator for CUDA code assessment with automatic CTK version detection.
    • Added compute-eval metrics for computing evaluation scores including pass-at-k performance.
    • Integrated compute-eval generation task into the inference pipeline.
    • Registered compute-eval as a new evaluation type across the framework.


@coderabbitai bot (Contributor) commented Dec 17, 2025

📝 Walkthrough


This PR adds support for NVIDIA's compute-eval dataset and evaluation framework to NeMo Skills. It introduces a new compute-eval evaluator type, corresponding metrics, a dataset preparation script, a generation task for model inference, and registers these components in the appropriate registries.

Changes

  • Dataset Package Initialization (nemo_skills/dataset/compute-eval/__init__.py): Introduces package constants EVAL_SPLIT, DATASET_GROUP, METRICS_TYPE, GENERATION_MODULE, and GENERATION_ARGS for compute-eval configuration.
  • Dataset Preparation (nemo_skills/dataset/compute-eval/prepare.py): New script to download the NVIDIA compute-eval dataset from HuggingFace with optional release specification, token authentication via the HF_TOKEN environment variable, and output formatting to eval.jsonl.
  • Evaluator Implementation (nemo_skills/evaluation/evaluator/compute_eval.py, nemo_skills/evaluation/evaluator/__init__.py): Introduces the ComputeEvalEvaluator class with NVCC version detection, pydantic validation for problem/solution, and thread-pool execution of evaluation; registers it in EVALUATOR_MAP and EVALUATOR_CLASS_MAP.
  • Metrics Implementation (nemo_skills/evaluation/metrics/code_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py): Introduces the ComputeEvalMetrics class implementing the score dict, incorrect sample, and update with pass-at-k computation; registers it in METRICS_MAP.
  • Generation Task (nemo_skills/inference/eval/compute_eval.py): Introduces ComputeEvalGenerationTask for model inference with server configuration validation, problem type validation, and solution generation; provides the Hydra entry point run_compute_eval.
  • Dependencies (requirements/main.txt): Adds a git-based dependency for the compute-eval package from a specified commit.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • NVCC version detection and error handling in ComputeEvalEvaluator.__init__: verify compatibility logic and RuntimeError conditions (a sketch of this detection pattern follows this list)
  • Thread pool execution in ComputeEvalEvaluator.eval_single: confirm proper async/threading pattern and exception handling
  • Type validation with discriminated unions in both evaluator and generation task: verify pydantic TypeAdapter configuration for CudaCppProblem | CudaPythonProblem
  • Configuration validation in ComputeEvalGenerationTask.__init__: ensure model and base_url enforcement is correct
  • Git dependency specification in requirements/main.txt: verify commit hash and repository URL are correct
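
For the first bullet, a minimal sketch of this kind of NVCC detection (illustrative only; the function name, messages, and exact parsing here are assumptions, not the PR's actual implementation):

import re
import shutil
import subprocess


def get_nvcc_version_sketch() -> tuple[int, int]:
    """Return the installed CUDA Toolkit version as (major, minor) by parsing `nvcc --version`."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        raise RuntimeError("nvcc not found on PATH; a CUDA Toolkit installation is required for compute-eval.")
    output = subprocess.run([nvcc, "--version"], capture_output=True, text=True, check=True).stdout
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        raise RuntimeError(f"Could not parse the CUDA Toolkit version from nvcc output:\n{output}")
    return int(match.group(1)), int(match.group(2))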

Suggested labels

run GPU tests

Suggested reviewers

  • Kipok
  • gwarmstrong

🚥 Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title 'Add ComputeEval Dataset Support' accurately describes the main objective of the PR, which integrates ComputeEval into NeMo-Skills with dataset preparation, generation tasks, evaluators, and metrics.
  • Description check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

🧹 Nitpick comments (4)
nemo_skills/dataset/compute-eval/prepare.py (2)

29-34: Consider narrowing the exception type or adding a brief comment.

The bare Exception catch works but obscures intent. A comment explaining that any failure during whoami() indicates the user isn't logged in would improve clarity.

     # noinspection PyBroadException
     try:
         api = HfApi()
         api.whoami()
-        return None
+        return None  # User is logged in; HF will use cached credentials
-    except Exception:
+    except Exception:  # Any failure means user is not authenticated
         return None

48-54: Add validation for the expected dataset split.

If the dataset doesn't contain an "eval" split (e.g., due to an invalid release), the script will raise an unclear KeyError. Consider validating the split exists.

     dataset = load_dataset("nvidia/compute-eval", args.release, token=_get_hf_token())
+    if "eval" not in dataset:
+        raise ValueError(f"Dataset does not contain 'eval' split. Available splits: {list(dataset.keys())}")
     data_dir = Path(__file__).absolute().parent
nemo_skills/inference/eval/compute_eval.py (2)

54-67: Consider using pass for no-op methods and prefixing unused parameters.

The no-op lifecycle methods use return statements, but pass is more idiomatic for intentionally empty methods. Additionally, the data parameter in log_example_prompt is flagged as unused by static analysis - prefixing it with an underscore would signal it's intentionally unused.

Apply this diff:

-    def log_example_prompt(self, data):
-        return
+    def log_example_prompt(self, _data):
+        pass
 
     def setup_prompt(self):
-        return
+        pass
 
     def setup_llm(self):
-        return
+        pass
 
     def setup_litellm_cache(self):
-        return
+        pass
 
     def cleanup_litellm_cache(self):
-        return
+        pass

69-69: Prefix unused parameter with underscore.

The data parameter is flagged as unused by static analysis. If it's part of the interface but not needed in this implementation, prefix it with an underscore to signal it's intentionally unused.

Apply this diff:

-    async def process_single_datapoint(self, data_point, data):
+    async def process_single_datapoint(self, data_point, _data):
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e16e1e and 67e8149.

📒 Files selected for processing (8)
  • nemo_skills/dataset/compute-eval/__init__.py (1 hunks)
  • nemo_skills/dataset/compute-eval/prepare.py (1 hunks)
  • nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
  • nemo_skills/evaluation/evaluator/compute_eval.py (1 hunks)
  • nemo_skills/evaluation/metrics/code_metrics.py (1 hunks)
  • nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
  • nemo_skills/inference/eval/compute_eval.py (1 hunks)
  • requirements/main.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-23T17:56:57.556Z
Learnt from: Glorf
Repo: NVIDIA-NeMo/Skills PR: 908
File: requirements/main.txt:16-16
Timestamp: 2025-11-23T17:56:57.556Z
Learning: faiss-cpu must be explicitly listed in requirements/main.txt for BFCLv4 memory evaluations (memory_kv, memory_vector, memory_rec_sum) as it is an optional dependency of sentence_transformers that is required for vector similarity search functionality in the memory backends.

Applied to files:

  • requirements/main.txt
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/compute_eval.py (1)
  • ComputeEvalEvaluator (31-63)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/code_metrics.py (1)
  • ComputeEvalMetrics (126-135)
nemo_skills/evaluation/evaluator/compute_eval.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluator (34-91)
nemo_skills/utils.py (1)
  • get_logger_name (39-43)
nemo_skills/evaluation/metrics/code_metrics.py (7)
nemo_skills/evaluation/metrics/base.py (5)
  • BaseMetrics (23-434)
  • _get_score_dict (124-143)
  • get_incorrect_sample (200-206)
  • update (145-189)
  • _compute_pass_at_k (352-423)
nemo_skills/evaluation/metrics/if_metrics.py (3)
  • _get_score_dict (24-28)
  • get_incorrect_sample (30-33)
  • update (35-48)
nemo_skills/evaluation/metrics/answer_judgement_metrics.py (3)
  • _get_score_dict (35-39)
  • get_incorrect_sample (41-47)
  • update (121-132)
nemo_skills/evaluation/metrics/lean4_metrics.py (3)
  • _get_score_dict (23-24)
  • get_incorrect_sample (26-29)
  • update (46-48)
nemo_skills/evaluation/metrics/math_metrics.py (3)
  • _get_score_dict (67-79)
  • get_incorrect_sample (81-88)
  • update (90-120)
nemo_skills/evaluation/metrics/ruler_metrics.py (3)
  • _get_score_dict (19-20)
  • get_incorrect_sample (26-29)
  • update (22-24)
nemo_skills/evaluation/metrics/arena_metrics.py (2)
  • get_incorrect_sample (36-40)
  • update (42-84)
🪛 Ruff (0.14.8)
nemo_skills/inference/eval/compute_eval.py

54-54: Unused method argument: data

(ARG002)


69-69: Unused method argument: data

(ARG002)

nemo_skills/dataset/compute-eval/prepare.py

32-32: Consider moving this statement to an else block

(TRY300)


33-33: Do not catch blind exception: Exception

(BLE001)

nemo_skills/evaluation/metrics/code_metrics.py

130-130: Unused method argument: prediction

(ARG002)

🔇 Additional comments (10)
nemo_skills/evaluation/metrics/code_metrics.py (1)

126-135: LGTM - Implementation follows established patterns.

The ComputeEvalMetrics class correctly mirrors the structure of HumanEvalInfillingMetrics (lines 114-123). The unused prediction parameter in get_incorrect_sample is required by the base class interface, so the static analysis warning is a false positive.
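
For readers less familiar with this pattern, a rough sketch of the shape such a metrics class takes (the method names come from the BaseMetrics references in the code graph above; the field names and bodies here are assumptions, not the PR's code, and pass@k aggregation is handled by BaseMetrics helpers):

from nemo_skills.evaluation.metrics.base import BaseMetrics


class ComputeEvalMetricsSketch(BaseMetrics):
    """Illustrative only; mirrors the structure described in this review."""

    def _get_score_dict(self, prediction: dict) -> dict:
        # The evaluator writes a boolean "passed" field for each generation.
        return {"passed": prediction["passed"]}

    def get_incorrect_sample(self, prediction: dict) -> dict:
        # Fill in missing or failed generations as not passed.
        prediction = dict(prediction)
        prediction["passed"] = False
        return prediction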

nemo_skills/evaluation/metrics/map_metrics.py (2)

26-26: LGTM - Import correctly added.


72-72: LGTM - Registry entry properly added.

The key "compute-eval" correctly matches METRICS_TYPE defined in nemo_skills/dataset/compute-eval/__init__.py.

nemo_skills/evaluation/evaluator/__init__.py (2)

29-29: LGTM - Import correctly placed.


71-71: LGTM - Evaluator correctly registered in class map only.

The ComputeEvalEvaluator is correctly placed in EVALUATOR_CLASS_MAP (not EVALUATOR_MAP) since it implements eval_single for async single-point evaluation.

nemo_skills/evaluation/evaluator/compute_eval.py (1)

48-62: LGTM - Correct use of asyncio.to_thread for blocking evaluation.

Running evaluate_solution in a thread pool correctly avoids blocking the event loop during CUDA compilation and test execution.
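
The pattern being praised here is roughly the following (a sketch; evaluate_solution is passed in as a callable rather than imported, since the exact compute-eval import path is not shown in this diff):

import asyncio


async def eval_single_sketch(evaluate_solution, problem, solution, ctk_version):
    # evaluate_solution compiles and runs CUDA tests, so it must not block the event loop.
    # asyncio.to_thread runs the blocking call in the default thread pool and awaits the result.
    return await asyncio.to_thread(evaluate_solution, problem, solution, ctk_version)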

requirements/main.txt (1)

17-17: LGTM - Git commit pinning ensures reproducibility.

The dependency is properly pinned to a specific commit (991b47c, currently the main branch HEAD) and correctly sorted alphabetically. The short hash is valid and unambiguous for this repository. Using the full 40-character SHA would provide additional robustness against theoretical hash collisions, but the current approach is acceptable for practical purposes.

nemo_skills/dataset/compute-eval/__init__.py (1)

15-19: LGTM - Dataset constants properly defined.

Constants are internally consistent: METRICS_TYPE matches the registry key in map_metrics.py (line 72), and EVAL_SPLIT matches the split accessed in prepare.py (lines 52-53). The GENERATION_MODULE path correctly points to the existing nemo_skills.inference.eval.compute_eval module.
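
Putting those constants together, the file described here looks roughly like this (a reconstruction from the snippets quoted later in this PR, not a verbatim copy; the EVAL_SPLIT value is inferred from the "eval" split used in prepare.py):

# nemo_skills/dataset/compute-eval/__init__.py (reconstructed from the review snippets)
EVAL_SPLIT = "eval"
DATASET_GROUP = "code"
METRICS_TYPE = "compute-eval"
GENERATION_MODULE = "nemo_skills.inference.eval.compute_eval"
GENERATION_ARGS = "++eval_type=compute-eval"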

nemo_skills/inference/eval/compute_eval.py (2)

1-34: LGTM!

The imports and module-level setup are well-structured. The TypeAdapter with a discriminated union for problem validation is a clean approach.


86-102: LGTM!

The module exports and main entry point are well-structured. The help message handling and logging setup follow good practices.

@gwarmstrong (Collaborator) left a comment

Looks pretty good, with a couple high level questions/concerns


# noinspection PyBroadException
try:
api = HfApi()
Collaborator:

what is the purpose of this?

Collaborator (Author):

This allows a user to access the gated HuggingFace data (you have to be allow-listed to have access to the ComputeEval data today) either via an environment variable (HF_TOKEN) or by logging in with the HuggingFace CLI.

Collaborator (Author):

Actually, after looking closer we don't need to access the API at all for this to work. I'll just clean this up.

Collaborator:

yeah, we should just need a load_dataset call; it will handle the token if it's already in your env variables. For example, this one is gated: https://github.com/blahblahasdf/Skills/blob/cb7829094799f9b5967e260d30e39c2535f67bab/nemo_skills/dataset/flores200/prepare.py#L49
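
A minimal sketch of what that looks like for this dataset (mirroring the load_dataset call quoted later in this PR; passing a None token simply falls back to cached CLI credentials):

import os

from datasets import load_dataset

# For a gated dataset, either `huggingface-cli login` or an HF_TOKEN env var is enough;
# load_dataset forwards the token, or uses cached credentials when the env var is unset.
dataset = load_dataset("nvidia/compute-eval", "2025-1", token=os.getenv("HF_TOKEN"))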

generate_model_completions,
system_prompt=self._system_prompt,
problem=problem,
model=self._model,
Collaborator:

Looks like this is applying this function:
https://github.com/NVIDIA/compute-eval/blob/main/compute_eval/generate_completions.py#L31-L84

Which appears to use the ComputeEval clients?
What I would recommend we do instead, is write a version of that function that uses NeMo-Skills' process_single_datapoint--so this probably looks something like

    result = super().process_single_datapoint(data_point, data)
    ...

    completion = ...
    return FileSolution(
        task_id=problem.task_id,
        files=_parse_solution(completion),
        **debug_info,
    )

this will make sure the prompt logic is consistent with typical NeMo-Skills and that the eval prompt logic is the same as normal generations.

Collaborator:

Then the methods above that are overridden would just use the default implementations rather than overriding them to return nothing.

Collaborator (Author):

Thanks for the quick feedback! Your suggestion is to call process_single_datapoint and then use the ComputeEval post-processing to convert the model's response to a FileSolution? The "ComputeEval clients" are just a pass-through to the OpenAI client. I think, as is, if you were to point it at a local model (or a model hosted somewhere else) it would just work, but I agree it doesn't fit super cleanly with the NeMo-Skills abstractions.

Honestly, the generation side of the ComputeEval repo is generally intended to be a reference implementation for how other groups/companies would set up a generator scaffold to solve the ComputeEval problems. The reference implementation's prompt and post-processing logic are pretty coupled.

@gwarmstrong (Collaborator) commented Dec 19, 2025:

The logic for calling the openai client is here, right? https://github.com/NVIDIA/compute-eval/blob/main/compute_eval/models/model_interface.py#L100-L154
Seems pretty straightforward and is pretty much a subset of the functionality our clients have. And if we do it the way it's currently being done, you lose a lot of things we support in NeMo-Skills (like tool-calling and pretty much all sampling parameters not directly specified in the compute-eval function), and the supported server types would be quite limited.

Collaborator (Author):

Thanks for the pointers and happy New Year! I see where you're coming from on fitting this in with the NeMo-Skills generation more cleanly. What is the expectation for documenting this GenerationTask's content format? I was able to get this working locally, but there is definitely some coupling between the expected data format and the system and user prompts required in a ++prompt_config. I would guess that is normal and just gets documented?

Collaborator:

Yeah that's pretty typical. We don't really have it documented on a case-by-case basis--it's pretty easy to infer from the prompt config itself, and we have a generic section on understanding the interaction between data and prompt_config here: https://nvidia-nemo.github.io/Skills/basics/prompt-format/#prompt-api

@greptile-apps bot (Contributor) commented Jan 5, 2026

Greptile Summary

This PR integrates ComputeEval, a CUDA code generation benchmark, into the NeMo-Skills framework. The implementation adds dataset preparation from HuggingFace, a generation task for producing CUDA solutions, an evaluator that compiles and tests generated code, and metrics computation including pass@k scores.

Key changes:

  • Dataset preparation downloads problems from nvidia/compute-eval on HuggingFace with optional release version support
  • Generation task extends GenerationTask and parses LLM-generated CUDA code into structured solutions
  • Evaluator automatically detects installed CUDA Toolkit version and runs async evaluation using the compute-eval library
  • Metrics implementation follows existing code benchmark patterns (similar to HumanEvalInfillingMetrics)
  • All integration points properly registered in the framework's evaluator and metrics maps

Note: The compute-eval dependency version was updated from 991b47c to 4221fc2 in the final commit.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk, pending awareness of the protected member usage
  • The implementation follows existing patterns in the codebase consistently, integrates cleanly with the framework's architecture, and has proper error handling. Score reduced by one point due to the use of protected member _parse_solution from the compute-eval library, which was already flagged in previous review threads but represents a potential maintenance risk if the external library changes its internal API
  • Pay attention to nemo_skills/inference/eval/compute_eval.py due to its dependency on a protected member from the external library

Important Files Changed

  • nemo_skills/dataset/compute-eval/prepare.py: Implements the dataset preparation script to download and format compute-eval problems from HuggingFace
  • nemo_skills/evaluation/evaluator/compute_eval.py: Implements the CUDA code evaluator with automatic CTK version detection and async execution
  • nemo_skills/inference/eval/compute_eval.py: Implements the generation task that parses CUDA solutions using the compute-eval library (uses a protected member)
  • requirements/main.txt: Adds the compute-eval dependency pinned to commit 4221fc2

Sequence Diagram

sequenceDiagram
    participant User
    participant Prepare as prepare.py
    participant HF as HuggingFace
    participant Gen as ComputeEvalGenerationTask
    participant LLM as Language Model
    participant Eval as ComputeEvalEvaluator
    participant CE as compute-eval library
    participant Metrics as ComputeEvalMetrics

    User->>Prepare: Run dataset preparation
    Prepare->>HF: load_dataset("nvidia/compute-eval", release)
    HF-->>Prepare: Return dataset with problems
    Prepare->>Prepare: Format context_files_block
    Prepare->>Prepare: Write eval.jsonl with problem data
    
    User->>Gen: Generate solutions
    Gen->>Gen: Read eval.jsonl
    loop For each problem
        Gen->>LLM: Send problem prompt
        LLM-->>Gen: Return generated code
        Gen->>CE: _parse_solution(generation)
        CE-->>Gen: Parsed file solutions
        Gen->>Gen: Create FileSolution object
        Gen->>Gen: Write solution to output
    end
    
    User->>Eval: Evaluate solutions
    Eval->>Eval: get_nvcc_version() and parse_semver()
    loop For each data_point
        Eval->>Eval: Validate problem and solution
        Eval->>CE: evaluate_solution(problem, solution, ctk_version)
        CE->>CE: Compile and test CUDA code
        CE-->>Eval: Return graded result (passed, skipped, etc.)
        Eval->>Eval: Return evaluation dict
    end
    
    User->>Metrics: Compute metrics
    Metrics->>Metrics: Aggregate accuracy scores
    Metrics->>Metrics: Compute pass@k metrics
    Metrics-->>User: Return final metrics

@greptile-apps bot (Contributor) left a comment

Additional Comments (1)

  1. nemo_skills/inference/eval/compute_eval.py, line 21 (link)

    style: Using protected member _parse_solution. Consider requesting public API when compute-eval stabilizes.


8 files reviewed, 1 comment


@greptile-apps bot (Contributor) left a comment

8 files reviewed, 1 comment


from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
Contributor (bot):

style: accessing protected member _parse_solution from external library

If _parse_solution becomes unavailable in future versions of compute-eval, this will break. Consider requesting a public API from the compute-eval library.


@greptile-apps bot (Contributor) commented Jan 7, 2026

Greptile found no issues!


@greptile-apps bot (Contributor) left a comment

Greptile Overview

Greptile Summary

This PR integrates ComputeEval benchmark for evaluating CUDA code generation into NeMo-Skills.

Major Changes

  • Added dataset preparation script that downloads 406 CUDA problems from gated HuggingFace dataset
  • Implemented ComputeEvalGenerationTask that generates CUDA solutions and parses them into FileSolution format
  • Implemented ComputeEvalEvaluator that compiles and tests CUDA code using NVCC with automatic toolkit version detection
  • Added ComputeEvalMetrics class for computing accuracy and pass@k performance metrics
  • Registered compute-eval across all integration points (evaluators, metrics, dataset configs)

Issues Found

  • Dependency version mismatch: PR description specifies commit 991b47c but requirements/main.txt references @main branch, which may introduce breaking changes
  • Fragile external API dependency: Uses protected member _parse_solution from compute-eval library which could break if the library refactors its internal APIs

Architecture Assessment

The implementation follows NeMo-Skills patterns consistently - uses BaseEvaluator async evaluation, BaseMetrics for metrics computation, and proper registration in all required maps. The integration is clean and well-structured.

Confidence Score: 4/5

  • Safe to merge after fixing the dependency version mismatch in requirements.txt
  • Score reflects well-structured implementation following framework patterns, but critical dependency version mismatch needs resolution. The code is functional and integrates cleanly, but the @main reference instead of pinned commit could cause instability.
  • Pay attention to requirements/main.txt - must fix version mismatch before merging

Important Files Changed

File Analysis

  • requirements/main.txt (3/5): Added compute-eval dependency but references @main instead of the specific commit @991b47c mentioned in the PR description
  • nemo_skills/evaluation/evaluator/compute_eval.py (5/5): Implements an async evaluator that validates problems/solutions and runs CUDA compilation and tests with NVCC version checking
  • nemo_skills/inference/eval/compute_eval.py (4/5): Generation task that parses LLM output into FileSolution format; relies on the protected _parse_solution API from the external library

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HF as HuggingFace
    participant GenerationTask as ComputeEvalGenerationTask
    participant LLM as Language Model
    participant Evaluator as ComputeEvalEvaluator
    participant ComputeEval as compute-eval Library
    participant NVCC as CUDA Toolkit
    participant Metrics as ComputeEvalMetrics

    Note over User,HF: 1. Dataset Preparation Phase
    User->>PrepareScript: Run prepare.py --release 2025-1
    PrepareScript->>HF: load_dataset("nvidia/compute-eval")
    HF-->>PrepareScript: Return 406 CUDA problems
    PrepareScript->>PrepareScript: Format context files
    PrepareScript->>PrepareScript: Write eval.jsonl

    Note over User,Metrics: 2. Generation Phase
    User->>GenerationTask: Run compute_eval inference
    GenerationTask->>GenerationTask: Load eval.jsonl
    loop For each problem
        GenerationTask->>LLM: Send prompt with context
        LLM-->>GenerationTask: Return CUDA code generation
        GenerationTask->>ComputeEval: _parse_solution(generation)
        ComputeEval-->>GenerationTask: FileSolution object
        GenerationTask->>GenerationTask: Save solution + generation
    end

    Note over User,Metrics: 3. Evaluation Phase
    User->>Evaluator: Run evaluation
    Evaluator->>NVCC: get_nvcc_version()
    NVCC-->>Evaluator: CTK version (e.g., 12.3)
    loop For each solution (async)
        Evaluator->>ComputeEval: evaluate_solution(problem, solution)
        ComputeEval->>NVCC: Compile CUDA code
        NVCC-->>ComputeEval: Build result
        ComputeEval->>ComputeEval: Run tests
        ComputeEval-->>Evaluator: Graded result (passed/failed)
        Evaluator->>Evaluator: Update data with pass/fail
    end

    Note over User,Metrics: 4. Metrics Computation
    User->>Metrics: Compute metrics
    Metrics->>Metrics: Calculate accuracy
    Metrics->>Metrics: Compute pass@k scores
    Metrics-->>User: Final metrics report


bs4
click < 8.2.0 # https://github.com/ai-dynamo/dynamo/issues/1039
compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@main
Contributor (bot):

PR description specifies commit 991b47c but requirements file references @main branch

Suggested change
compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@main
compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@991b47c

from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
Contributor (bot):

accessing protected member _parse_solution from external library - fragile dependency that may break if compute-eval refactors internal APIs

@greptile-apps bot (Contributor) left a comment

Greptile Overview

Greptile Summary

This PR successfully integrates ComputeEval, a CUDA code generation benchmark, into NeMo-Skills following established patterns in the codebase.

Major changes:

  • Added dataset preparation script to download and format problems from HuggingFace's gated nvidia/compute-eval dataset
  • Implemented ComputeEvalEvaluator that compiles and executes CUDA code with automatic CTK version detection
  • Created ComputeEvalGenerationTask to generate CUDA solutions using the LLM inference pipeline
  • Added ComputeEvalMetrics class to compute pass@k performance metrics
  • Registered compute-eval across all integration points (evaluator map, metrics map, dataset config)

Integration quality:
The implementation follows NeMo-Skills conventions consistently - the evaluator extends BaseEvaluator, metrics extend BaseMetrics with _compute_pass_at_k, and the generation task extends GenerationTask. The dataset configuration uses standard constants (EVAL_SPLIT, DATASET_GROUP, METRICS_TYPE).

Minor concerns:

  • Uses a protected API (_parse_solution) from compute-eval that could break in future versions
  • Broad exception handling in evaluator could mask debugging issues
  • Dependency commit hash differs from PR description (appears intentional from git history)

Confidence Score: 4/5

  • This PR is safe to merge with minor style improvements recommended
  • The implementation is solid and follows established patterns throughout the codebase. The core logic correctly integrates compute-eval's evaluation pipeline with async support. The main concerns are style-related: reliance on a protected API and broad exception handling. These won't cause runtime failures but could impact maintainability. NVCC detection at initialization is appropriate.
  • Pay attention to nemo_skills/inference/eval/compute_eval.py due to protected API usage, and verify the compute-eval dependency version matches expectations

Important Files Changed

File Analysis

  • nemo_skills/dataset/compute-eval/prepare.py (5/5): Dataset preparation script to download and transform ComputeEval problems from HuggingFace
  • nemo_skills/evaluation/evaluator/compute_eval.py (4/5): Evaluator implementation that validates solutions against CUDA problems with CTK version checks
  • nemo_skills/inference/eval/compute_eval.py (3/5): Generation task that parses generated CUDA code and creates solutions, but uses a protected API
  • requirements/main.txt (4/5): Added compute-eval dependency from GitHub at commit 2d14770 (differs from the PR description)

Sequence Diagram

sequenceDiagram
    participant User
    participant prepare.py
    participant HuggingFace
    participant GenerationTask
    participant LLM
    participant Evaluator
    participant compute_eval
    participant Metrics

    User->>prepare.py: Run dataset preparation
    prepare.py->>HuggingFace: load_dataset("nvidia/compute-eval", release)
    HuggingFace-->>prepare.py: Return problems with task_id, prompt, context_files
    prepare.py->>prepare.py: Format context_files with language fences
    prepare.py-->>User: Save to eval.jsonl

    User->>GenerationTask: Run inference (ComputeEvalGenerationTask)
    GenerationTask->>GenerationTask: Load eval.jsonl
    loop For each problem
        GenerationTask->>LLM: Generate CUDA solution
        LLM-->>GenerationTask: Generated code
        GenerationTask->>GenerationTask: Parse solution with _parse_solution()
        GenerationTask->>GenerationTask: Create FileSolution object
    end
    GenerationTask-->>User: Save generations with solutions

    User->>Evaluator: Run evaluation (ComputeEvalEvaluator)
    Evaluator->>Evaluator: Detect NVCC version and CTK
    loop For each generated solution
        Evaluator->>Evaluator: Validate problem and solution schemas
        Evaluator->>compute_eval: evaluate_solution(problem, solution, ctk_version)
        compute_eval->>compute_eval: Compile and run CUDA code
        compute_eval-->>Evaluator: Return passed, skipped, elapsed_time, outputs
    end
    Evaluator-->>User: Save evaluation results

    User->>Metrics: Compute metrics (ComputeEvalMetrics)
    Metrics->>Metrics: Calculate accuracy from passed field
    Metrics->>Metrics: Compute pass@k scores
    Metrics-->>User: Display results

from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
Contributor (bot):

Uses protected member _parse_solution from compute_eval.generate_completions. Protected members (prefixed with _) are not part of the public API and may change without notice. Consider requesting a public API from the compute-eval package or copying the implementation.
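
If the "copy the implementation" route were taken, a local helper might look something like this (a hypothetical sketch of a fenced-code-block parser, not compute-eval's actual _parse_solution; the list-of-strings return shape is an assumption about what FileSolution.files expects):

import re

# Matches fenced code blocks; the language tag after the opening fence is optional.
_CODE_BLOCK_RE = re.compile(r"```[a-zA-Z0-9_+-]*\n(.*?)```", re.DOTALL)


def parse_solution_locally(generation: str) -> list[str]:
    """Hypothetical stand-in for compute_eval.generate_completions._parse_solution."""
    blocks = [match.group(1).strip() for match in _CODE_BLOCK_RE.finditer(generation)]
    # Fall back to the raw completion when the model did not use code fences.
    return blocks if blocks else [generation.strip()]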


Comment on lines +46 to +47
# noinspection PyBroadException
try:
Contributor (bot):

Using except Exception catches all exceptions broadly. While the error is logged and returned in the result, this could mask unexpected issues during development (e.g., KeyError from missing data_point["problem"]). Consider catching more specific exceptions or at least logging the traceback for debugging.

blahblahasdf and others added 20 commits January 8, 2026 12:52
Signed-off-by: George Zelenfroind <[email protected]>
Signed-off-by: George <[email protected]>
Signed-off-by: dlord <[email protected]>
Signed-off-by: SeanNaren <[email protected]>
Co-authored-by: Mehrzad Samadi <[email protected]>
Signed-off-by: dlord <[email protected]>
…A-NeMo#1119)

Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Signed-off-by: George Armstrong <[email protected]>
Co-authored-by: George Armstrong <[email protected]>
Signed-off-by: dlord <[email protected]>
… move `ComputeEvalGenerationTask` to be more on the rails of the `process_single_datapoint` approach.

Signed-off-by: dlord <[email protected]>
Signed-off-by: Wei Du <[email protected]>
Signed-off-by: dlord <[email protected]>
Signed-off-by: George Armstrong <[email protected]>
Signed-off-by: dlord <[email protected]>
)

Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Co-authored-by: George Armstrong <[email protected]>
Signed-off-by: dlord <[email protected]>
wasiahmad and others added 11 commits January 8, 2026 12:52
Signed-off-by: wasiahmad <[email protected]>
Signed-off-by: Nikolai Ludwig <[email protected]>
Signed-off-by: George Armstrong <[email protected]>
Signed-off-by: i-vainn <[email protected]>
Signed-off-by: Grigor Nalbandyan <[email protected]>
Signed-off-by: bzantium <[email protected]>
Signed-off-by: Stephen Ge <[email protected]>
Signed-off-by: Jiacheng Xu <[email protected]>
Signed-off-by: George Zelenfroind <[email protected]>
Co-authored-by: Nick Ludwig <[email protected]>
Co-authored-by: George Armstrong <[email protected]>
Co-authored-by: Ivan <[email protected]>
Co-authored-by: Wojciech Prazuch <[email protected]>
Co-authored-by: gnalbandyan <[email protected]>
Co-authored-by: Minho Ryu <[email protected]>
Co-authored-by: Stephen Ge <[email protected]>
Co-authored-by: Jiacheng Xu <[email protected]>
Co-authored-by: Jiacheng Xu <[email protected]>
Co-authored-by: George <[email protected]>
Co-authored-by: Sanyam Kapoor <[email protected]>
Signed-off-by: dlord <[email protected]>
Signed-off-by: fzyzcjy <[email protected]>
Signed-off-by: Igor Gitman <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Signed-off-by: dlord <[email protected]>
…eMo#1136)

Signed-off-by: Valentin Mendelev <[email protected]>
Co-authored-by: Nikolay Karpov <[email protected]>
Signed-off-by: dlord <[email protected]>
Signed-off-by: George Armstrong <[email protected]>
Signed-off-by: dlord <[email protected]>
Signed-off-by: Jocelyn Huang <[email protected]>
Signed-off-by: dlord <[email protected]>

# Conflicts:
#	nemo_skills/evaluation/evaluator/__init__.py
Signed-off-by: bzantium <[email protected]>
Co-authored-by: Igor Gitman <[email protected]>
Signed-off-by: dlord <[email protected]>
@greptile-apps bot (Contributor) left a comment

Greptile Overview

Greptile Summary

This PR integrates ComputeEval, a CUDA code generation benchmark, into NeMo-Skills. The implementation adds dataset preparation, generation task, evaluator, and metrics following the framework's patterns.

Major Issues Found:

  • Missing ++prompt_config in GENERATION_ARGS will cause validation failures during generation
  • Dataset preparation lacks error handling for authentication failures and schema validation
  • Evaluator catches all exceptions indiscriminately and returns inconsistent error response fields
  • Generation task uses protected API (_parse_solution) from upstream library without error handling

Architecture:
The integration correctly follows NeMo-Skills patterns by:

  • Registering the evaluator in EVALUATOR_CLASS_MAP
  • Implementing BaseEvaluator with async eval_single method
  • Using TypeAdapter for Pydantic validation of compute-eval data models
  • Computing pass@k metrics consistent with other code benchmarks

Dependencies:
Adds compute-eval package pinned to commit 2d14770. Note this is a public but not yet PyPI-published dependency.

Confidence Score: 3/5

  • This PR has several critical issues that will cause runtime failures in common scenarios
  • Score reflects multiple logic errors that will cause failures: missing prompt_config will fail validation, missing error handling will cause unclear crashes, inconsistent error responses could break downstream processing, and use of protected APIs creates fragility
  • Pay close attention to nemo_skills/dataset/compute-eval/__init__.py (missing prompt_config), prepare.py (no error handling), and both evaluator files (error handling issues)

Important Files Changed

File Analysis

  • nemo_skills/dataset/compute-eval/__init__.py (3/5): Dataset configuration file missing required prompt_config in GENERATION_ARGS, which will cause a validation failure during generation
  • nemo_skills/dataset/compute-eval/prepare.py (3/5): Dataset preparation script lacks error handling for authentication failures and dataset schema validation
  • nemo_skills/evaluation/evaluator/compute_eval.py (3/5): Evaluator catches all exceptions indiscriminately and returns inconsistent error fields
  • nemo_skills/inference/eval/compute_eval.py (3/5): Generation task uses a protected API from the upstream library and lacks error handling in data processing

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HF as HuggingFace
    participant GenTask as ComputeEvalGenerationTask
    participant LLM as Language Model
    participant Evaluator as ComputeEvalEvaluator
    participant ComputeEval as compute-eval Library
    participant Metrics as ComputeEvalMetrics

    User->>PrepareScript: Run dataset preparation
    PrepareScript->>HF: load_dataset("nvidia/compute-eval", release)
    HF-->>PrepareScript: Return dataset with problems
    PrepareScript->>PrepareScript: Format context_files_block
    PrepareScript->>PrepareScript: Write eval.jsonl with task_id, prompt, etc.

    User->>GenTask: Generate solutions
    GenTask->>GenTask: Read eval.jsonl data
    loop For each problem
        GenTask->>LLM: Send prompt with problem context
        LLM-->>GenTask: Return CUDA code generation
        GenTask->>ComputeEval: _parse_solution(generation)
        ComputeEval-->>GenTask: Parsed FileSolution
        GenTask->>GenTask: Create solution record with model_dump()
    end

    User->>Evaluator: Evaluate solutions
    loop For each solution (async)
        Evaluator->>Evaluator: validate_python(problem & solution)
        Evaluator->>ComputeEval: evaluate_solution(problem, solution, ctk_version)
        ComputeEval->>ComputeEval: Compile CUDA code
        ComputeEval->>ComputeEval: Execute tests
        ComputeEval-->>Evaluator: Return graded result (passed, skipped, etc.)
        Evaluator->>Evaluator: Write evaluation to file
    end

    User->>Metrics: Compute metrics
    Metrics->>Metrics: Calculate accuracy from "passed" field
    Metrics->>Metrics: Compute pass@k statistics
    Metrics-->>User: Return final metrics

Comment on lines 66 to 71
except Exception as e:
    _LOG.error(f"Error during evaluation: {e}")
    return {
        "passed": False,
        "error": str(e),
    }
Contributor (bot):

catching all exceptions masks critical errors and returns inconsistent fields - successful evaluation returns skipped, elapsed_time, build_output, test_output but error case only returns passed and error

Suggested change

-except Exception as e:
-    _LOG.error(f"Error during evaluation: {e}")
-    return {
-        "passed": False,
-        "error": str(e),
-    }
+except KeyError as e:
+    _LOG.error(f"Missing required field in data_point: {e}")
+    return {
+        "passed": False,
+        "skipped": False,
+        "elapsed_time": 0.0,
+        "build_output": "",
+        "test_output": "",
+        "error": f"Missing required field: {e}",
+    }
+except Exception as e:
+    _LOG.error(f"Error during evaluation: {e}")
+    return {
+        "passed": False,
+        "skipped": False,
+        "elapsed_time": 0.0,
+        "build_output": "",
+        "test_output": "",
+        "error": str(e),
+    }

Comment on lines +20 to +21
# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
Contributor (bot):

accessing protected member _parse_solution from external library - upstream can break this without warning

Comment on lines 38 to 48
async def process_single_datapoint(self, data_point, data):
    res = await super().process_single_datapoint(data_point, data)

    solution = FileSolution(
        task_id=data_point["task_id"],
        files=_parse_solution(res["generation"]),
    )
    return {
        "solution": solution.model_dump(),
        "generation": res["generation"],
    }
Contributor (bot):

missing error handling - _parse_solution can fail, data_point might not have task_id, res might not have generation

Suggested change

-async def process_single_datapoint(self, data_point, data):
-    res = await super().process_single_datapoint(data_point, data)
-    solution = FileSolution(
-        task_id=data_point["task_id"],
-        files=_parse_solution(res["generation"]),
-    )
-    return {
-        "solution": solution.model_dump(),
-        "generation": res["generation"],
-    }
+async def process_single_datapoint(self, data_point, data):
+    res = await super().process_single_datapoint(data_point, data)
+    try:
+        solution = FileSolution(
+            task_id=data_point["task_id"],
+            files=_parse_solution(res["generation"]),
+        )
+        return {
+            "solution": solution.model_dump(),
+            "generation": res["generation"],
+        }
+    except KeyError as e:
+        _LOG.error(f"Missing required field: {e}")
+        raise
+    except Exception as e:
+        _LOG.error(f"Failed to parse solution: {e}")
+        raise

DATASET_GROUP = "code"
METRICS_TYPE = "compute-eval"
GENERATION_MODULE = "nemo_skills.inference.eval.compute_eval"
GENERATION_ARGS = "++eval_type=compute-eval"
Contributor (bot):

missing ++prompt_config in GENERATION_ARGS - will fail validation unless user specifies ++prompt_format=openai (see nemo_skills/inference/generate.py line 242 which requires prompt_config when prompt_format=='ns')

Suggested change
GENERATION_ARGS = "++eval_type=compute-eval"
GENERATION_ARGS = "++prompt_config=generic/codegen ++eval_type=compute-eval"


args = parser.parse_args()

dataset = load_dataset("nvidia/compute-eval", args.release, token=os.getenv("HF_TOKEN", None))
Contributor (bot):

missing error handling for gated dataset authentication and network failures - will crash with unclear error if authentication fails or network is down

"task_id": item["task_id"],
"problem_prompt": item["prompt"],
"build_command": item["build_command"],
"context_files_block": _format_context_files_block(item["context_files"]),
Contributor (bot):

no validation that context_files list items have required path and content keys - will raise KeyError if dataset schema changes
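
A hedged sketch of the validation this comment is asking for (the helper name comes from the quoted code above; the path/content key names and the fence formatting are assumptions about the dataset schema):

_FENCE = "```"


def _format_context_files_block(context_files: list[dict]) -> str:
    """Render context files as fenced blocks, failing loudly if the dataset schema changes."""
    blocks = []
    for i, context_file in enumerate(context_files):
        missing = {"path", "content"} - set(context_file)
        if missing:
            raise ValueError(f"context_files[{i}] is missing required keys: {sorted(missing)}")
        blocks.append(f"// {context_file['path']}\n{_FENCE}\n{context_file['content']}\n{_FENCE}")
    return "\n\n".join(blocks)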

@greptile-apps bot (Contributor) left a comment

Greptile Overview

Greptile Summary

This PR integrates ComputeEval, a CUDA code generation benchmark with 406+ problems, into NeMo-Skills. The implementation follows established patterns and includes dataset preparation, custom generation task, evaluator, and metrics computation.

Key Changes:

  • Added dataset preparation script to download from gated HuggingFace dataset
  • Implemented ComputeEvalGenerationTask using compute-eval's solution parsing
  • Created ComputeEvalEvaluator with NVCC version detection and async evaluation
  • Added ComputeEvalMetrics for pass@k computation
  • Registered evaluator in EVALUATOR_CLASS_MAP and metrics in METRICS_MAP
  • Pinned compute-eval dependency to specific commit

Issues Found:

  • Missing documentation for compute-eval benchmark (critical for users)
  • Missing test coverage - compute-eval not added to DATASETS list in tests/test_datasets.py
  • Using protected member _parse_solution from compute-eval package could break without warning
  • No validation for required HF_TOKEN environment variable before attempting dataset download
  • Overly broad exception handling in evaluator loses important error context
  • Missing explicit validation for required data_point keys

Architecture Notes:
The implementation correctly uses a custom generation module (nemo_skills.inference.eval.compute_eval) specified in the dataset __init__.py, which differs from standard generation. The evaluator properly checks for NVCC installation at initialization and uses TypeAdapters for discriminated union validation of problem/solution types.

Confidence Score: 3/5

  • Safe to merge with documentation and test coverage improvements needed
  • The core implementation is solid and follows established patterns in the codebase. The evaluator, metrics, and generation task are well-structured. However, the missing documentation is a significant gap that will impact usability, and the lack of test coverage means the new dataset won't be validated by CI. The use of a protected API from the external package and overly broad exception handling are code quality concerns but not blockers. No critical bugs or security issues were found.
  • Pay close attention to docs/evaluation/code.md (needs compute-eval documentation), tests/test_datasets.py (needs dataset entry), and nemo_skills/dataset/compute-eval/prepare.py (needs HF_TOKEN validation)

Important Files Changed

File Analysis

  • tests/test_datasets.py (3/5): Missing compute-eval dataset entry in the DATASETS list for test coverage
  • nemo_skills/dataset/compute-eval/prepare.py (4/5): Dataset preparation script added for downloading ComputeEval from HuggingFace
  • nemo_skills/evaluation/evaluator/compute_eval.py (4/5): Evaluator implementation added with NVCC version detection and async evaluation
  • nemo_skills/inference/eval/compute_eval.py (3/5): Generation task implementation uses the protected member _parse_solution from the compute_eval package
  • requirements/main.txt (4/5): Added compute-eval dependency pinned to a specific commit from GitHub
  • docs/evaluation/code.md (3/5): No documentation added for the compute-eval benchmark despite adding IOI and BIRD docs

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HF as HuggingFace
    participant DataDir as Data Directory
    participant GenTask as ComputeEvalGenerationTask
    participant LLM as Language Model
    participant Evaluator as ComputeEvalEvaluator
    participant NVCC as CUDA Toolkit
    participant Metrics as ComputeEvalMetrics

    Note over User,DataDir: Data Preparation Phase
    User->>PrepareScript: ns prepare_data compute-eval --release=2025-1
    PrepareScript->>HF: load_dataset with auth token
    HF-->>PrepareScript: Return dataset["eval"]
    PrepareScript->>PrepareScript: Format context_files_block for each problem
    PrepareScript->>DataDir: Write eval.jsonl with problem, task_id, prompt, etc.

    Note over User,Metrics: Generation & Evaluation Phase
    User->>GenTask: ns eval --benchmarks=compute-eval
    GenTask->>GenTask: Load eval.jsonl
    GenTask->>LLM: Generate CUDA code solution
    LLM-->>GenTask: Return generation text
    GenTask->>GenTask: Parse solution via compute_eval._parse_solution()
    GenTask->>GenTask: Create FileSolution with task_id and parsed files
    GenTask->>DataDir: Save generation with solution field

    GenTask->>Evaluator: eval_single(data_point)
    Evaluator->>NVCC: get_nvcc_version() (on init)
    NVCC-->>Evaluator: Return CTK version
    Evaluator->>Evaluator: validate_python(problem) using TypeAdapter
    Evaluator->>Evaluator: validate_python(solution) using TypeAdapter
    Evaluator->>NVCC: evaluate_solution(problem, solution, CTK version)
    NVCC-->>Evaluator: Return graded result (passed, skipped, elapsed_time, outputs)
    Evaluator-->>GenTask: Return evaluation results

    Note over Metrics: Metrics Computation Phase
    GenTask->>Metrics: Compute metrics from results
    Metrics->>Metrics: Extract prediction["passed"] for each result
    Metrics->>Metrics: Compute pass@k statistics
    Metrics-->>User: Display accuracy and pass@k metrics

from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
Contributor (bot):

using protected member _parse_solution from external package

accessing _parse_solution with # noinspection PyProtectedMember bypasses the protection. if this is an internal API in compute-eval, it could break without warning. consider:

  1. requesting compute-eval to expose this as public API
  2. or implement solution parsing directly in nemo-skills


bs4
click < 8.2.0 # https://github.com/ai-dynamo/dynamo/issues/1039
compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@2d14770
Contributor (bot):

dependency pinned to specific commit without version tag

compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@2d14770 is pinned to a commit hash. consider:

  1. using a tagged release if available
  2. documenting why this specific commit is required
  3. monitoring for when compute-eval is published to PyPI

this makes dependency management harder and doesn't benefit from semantic versioning



args = parser.parse_args()

dataset = load_dataset("nvidia/compute-eval", args.release, token=os.getenv("HF_TOKEN", None))
Contributor (bot):

potential failure if HF_TOKEN not set for gated dataset

load_dataset("nvidia/compute-eval", ...) will fail if the dataset is gated and no token is provided. add validation to check if the token is set and provide a helpful error message directing users to obtain a HuggingFace token

Comment on lines 66 to 71
except Exception as e:
    _LOG.error(f"Error during evaluation: {e}")
    return {
        "passed": False,
        "error": str(e),
    }
Contributor (bot):

overly broad exception handling loses error context

catching all exceptions with except Exception and only logging masks validation errors, type errors, and other important failures. the metrics will treat this as a simple failure (passed: False) without distinguishing between:

  • invalid problem/solution format (validation error)
  • missing CUDA dependencies (runtime error)
  • actual test execution failure

consider catching specific exception types or at least including exception type in the error field
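
A minimal version of the second suggestion, keeping the broad catch but preserving the exception type and traceback (a sketch against the except block quoted above; _LOG is the module logger from the diff):

    except Exception as e:
        # _LOG.exception records the full traceback; the error field keeps the exception type.
        _LOG.exception("Error during evaluation")
        return {
            "passed": False,
            "error": f"{type(e).__name__}: {e}",
        }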

Comment on lines +48 to +49
problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])
Contributor (bot):

missing error handling for missing keys

if data_point["problem"] or data_point["solution"] keys don't exist, this raises KeyError which gets caught by the broad exception handler. add explicit validation with helpful error messages

"build_command": item["build_command"],
"context_files_block": _format_context_files_block(item["context_files"]),
}
f.write(json.dumps(record, default=str) + "\n")
Contributor (bot):

using default=str in json.dumps without clear reason

json.dumps(record, default=str) will convert any non-serializable objects to strings. this could hide data issues. since all fields should already be JSON-serializable (strings, dicts, lists), consider removing default=str or documenting why it's needed


@greptile-apps bot (Contributor) commented Jan 8, 2026

Additional Comments (2)

tests/test_datasets.py
missing compute-eval dataset entry

this list should include ("compute-eval", ["eval"]) to ensure the new dataset is tested via test_dataset_init_defaults()


docs/evaluation/code.md
missing documentation for compute-eval benchmark

this PR adds compute-eval support but doesn't document it. users need to know:

  • what compute-eval is
  • how to prepare data (ns prepare_data compute-eval --release=2025-1)
  • that it requires HF_TOKEN env var for gated dataset access
  • how to run evaluation
  • hardware requirements (needs CUDA toolkit/nvcc installed)

see other benchmarks in this file for documentation structure


@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR integrates ComputeEval, a CUDA code generation benchmark, into NeMo-Skills by implementing dataset preparation, generation tasks, evaluation logic, and metrics computation.

Key changes:

  • Dataset preparation downloads problems from HuggingFace gated dataset and formats them with context files
  • Generation task uses compute-eval's _parse_solution to parse LLM-generated CUDA code into structured FileSolution objects
  • Evaluator runs async compilation and test execution with automatic NVCC version detection
  • Metrics compute pass@k scores using the existing framework patterns (see the pass@k sketch after this list)
  • Comprehensive documentation added with security warnings about executing machine-generated CUDA code
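
For reference, pass@k is usually computed with the unbiased estimator from the HumanEval paper; the sketch below shows that formula and is not necessarily the exact code inside ComputeEvalMetrics:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n generations,
    of which c passed, is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))  # 0.3 when 3 of 10 generations pass
```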

Technical implementation:

  • Follows existing NeMo-Skills patterns for dataset integration (similar to human-eval, scicode)
  • Proper error handling throughout with try-catch blocks and graceful degradation
  • Async evaluation with configurable parallelism (default 10 concurrent requests)
  • Uses Pydantic TypeAdapters for robust deserialization of problem and solution objects
  • All integration points properly registered in __init__.py files and METRICS_MAP/EVALUATOR_CLASS_MAP

Confidence Score: 4/5

  • This PR is safe to merge with minor considerations about the protected API dependency
  • The implementation is solid and follows established patterns in the codebase. All components (dataset prep, generation, evaluation, metrics) are properly integrated and include appropriate error handling. The main concern is the dependency on a protected member (_parse_solution) from the external compute-eval package, which could break if that library changes its internal API. However, since compute-eval is also maintained by NVIDIA and pinned to a specific commit, this risk is manageable. The code is well-documented, includes security warnings for executing generated CUDA code, and properly handles edge cases.
  • nemo_skills/inference/eval/compute_eval.py relies on protected API from compute-eval

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| requirements/main.txt | 5/5 | Added compute-eval dependency pinned to specific commit |
| nemo_skills/dataset/compute-eval/prepare.py | 5/5 | Implements dataset download from HuggingFace with proper error handling and formatting |
| nemo_skills/inference/eval/compute_eval.py | 4/5 | Generation task with protected member access; includes proper error handling |
| nemo_skills/evaluation/evaluator/compute_eval.py | 5/5 | Evaluator with proper async execution, NVCC version detection, and comprehensive error handling |
| nemo_skills/evaluation/metrics/code_metrics.py | 5/5 | Added ComputeEvalMetrics class following existing patterns for pass@k metrics |

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HuggingFace
    participant GenerationTask as ComputeEvalGenerationTask
    participant LLM
    participant ParseSolution as _parse_solution
    participant Evaluator as ComputeEvalEvaluator
    participant ComputeEval as compute-eval Library
    participant NVCC
    participant Metrics as ComputeEvalMetrics

    Note over User,HuggingFace: Data Preparation Phase
    User->>PrepareScript: ns prepare_data compute-eval
    PrepareScript->>HuggingFace: load_dataset("nvidia/compute-eval")
    HuggingFace-->>PrepareScript: Dataset with CUDA problems
    PrepareScript->>PrepareScript: Format context_files_block
    PrepareScript->>PrepareScript: Write eval.jsonl

    Note over User,LLM: Generation Phase
    User->>GenerationTask: ns eval (with inference config)
    GenerationTask->>GenerationTask: Load eval.jsonl
    loop For each problem
        GenerationTask->>LLM: Generate CUDA solution (with prompt)
        LLM-->>GenerationTask: Raw code generation
        GenerationTask->>ParseSolution: _parse_solution(generation)
        ParseSolution-->>GenerationTask: Parsed files structure
        GenerationTask->>GenerationTask: Create FileSolution object
    end
    GenerationTask->>GenerationTask: Write output.jsonl

    Note over User,NVCC: Evaluation Phase
    User->>Evaluator: Evaluate generated solutions
    Evaluator->>NVCC: get_nvcc_version()
    NVCC-->>Evaluator: CTK version info
    loop For each solution (async)
        Evaluator->>ComputeEval: evaluate_solution(problem, solution, ctk_version)
        ComputeEval->>ComputeEval: Compile CUDA code
        ComputeEval->>ComputeEval: Execute tests
        ComputeEval-->>Evaluator: GradedResult (passed, skipped, outputs)
        Evaluator->>Evaluator: Write evaluation results
    end

    Note over User,Metrics: Metrics Computation Phase
    User->>Metrics: Compute final metrics
    Metrics->>Metrics: Calculate accuracy (passed rate)
    Metrics->>Metrics: Compute pass@k metrics
    Metrics-->>User: Final metrics.json

from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution

accessing the protected member _parse_solution from an external package; if compute-eval changes this internal API, the code could break
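
One way to contain that risk until a public API exists is to fail loudly at import time; a sketch (the error message and structure are illustrative):

```python
try:
    # noinspection PyProtectedMember
    from compute_eval.generate_completions import _parse_solution
except ImportError as err:
    raise ImportError(
        "The pinned compute-eval commit no longer exposes _parse_solution; "
        "update the pin or switch to a public parsing API once one is published."
    ) from err
```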

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR successfully integrates the ComputeEval benchmark for evaluating LLMs on CUDA code generation into NeMo-Skills.

Major changes:

  • Added dataset preparation script that downloads problems from HuggingFace with proper authentication and formats context files
  • Implemented ComputeEvalGenerationTask to generate CUDA solutions using the LLM
  • Implemented ComputeEvalEvaluator that compiles and executes generated CUDA code against test suites
  • Added ComputeEvalMetrics to compute accuracy and pass@k performance metrics
  • Registered compute-eval across all framework maps (evaluators, metrics, datasets)
  • Added comprehensive documentation with setup instructions and security warnings
  • Pinned compute-eval dependency to specific GitHub commit (2d14770)

Minor issue:

  • The generation task imports a protected member (_parse_solution) from the compute-eval library, which is not ideal but acceptable given the library hasn't published a stable API yet

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk
  • The implementation follows existing patterns in the codebase consistently, includes comprehensive error handling, and provides thorough documentation. The only concern is the use of a protected member from the external library, but this is acknowledged with a suppression comment and is a pragmatic choice given the library's pre-release state. All integration points are properly registered.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| nemo_skills/dataset/compute-eval/prepare.py | 5/5 | adds dataset download and preparation logic with proper formatting for context files |
| nemo_skills/evaluation/evaluator/compute_eval.py | 4/5 | implements evaluator with async support, includes comprehensive error handling |
| nemo_skills/inference/eval/compute_eval.py | 4/5 | implements generation task with solution parsing, uses protected member from external lib |
| nemo_skills/evaluation/metrics/code_metrics.py | 5/5 | adds ComputeEvalMetrics class following existing patterns for pass@k metrics |
| requirements/main.txt | 5/5 | adds compute-eval dependency from GitHub at specific commit |

Sequence Diagram

sequenceDiagram
    participant User
    participant ns_prepare_data as ns prepare_data
    participant HuggingFace as HuggingFace Dataset
    participant prepare_py as prepare.py
    participant ns_eval as ns eval
    participant GenerationTask as ComputeEvalGenerationTask
    participant LLM as LLM Server
    participant Evaluator as ComputeEvalEvaluator
    participant compute_eval as compute-eval Library
    participant NVCC as CUDA Toolkit
    participant Metrics as ComputeEvalMetrics

    User->>ns_prepare_data: ns prepare_data compute-eval --release 2025-1
    ns_prepare_data->>prepare_py: Execute prepare.py
    prepare_py->>HuggingFace: load_dataset with auth token
    HuggingFace-->>prepare_py: Return dataset with CUDA problems
    prepare_py->>prepare_py: Format context files with code fences
    prepare_py->>prepare_py: Write eval.jsonl with problem data
    prepare_py-->>User: Dataset prepared

    User->>ns_eval: ns eval --benchmarks=compute-eval
    ns_eval->>GenerationTask: Initialize ComputeEvalGenerationTask
    ns_eval->>GenerationTask: Load eval.jsonl data
    
    loop For each problem
        GenerationTask->>LLM: Send prompt with problem description
        LLM-->>GenerationTask: Return generated CUDA code
        GenerationTask->>compute_eval: _parse_solution(generation)
        compute_eval-->>GenerationTask: Parsed FileSolution object
        GenerationTask->>GenerationTask: Store solution in output
    end

    ns_eval->>Evaluator: Initialize ComputeEvalEvaluator
    Evaluator->>NVCC: get_nvcc_version()
    NVCC-->>Evaluator: Return CUDA Toolkit version
    Evaluator->>Evaluator: Store CTK major/minor version
    
    loop For each solution (async)
        Evaluator->>compute_eval: evaluate_solution(problem, solution, ctk_version)
        compute_eval->>compute_eval: Compile CUDA code
        compute_eval->>compute_eval: Execute test suite
        compute_eval-->>Evaluator: Return GradedSolution (passed, build_output, test_output)
        Evaluator->>Evaluator: Write evaluation results to file
    end

    ns_eval->>Metrics: Compute metrics
    Metrics->>Metrics: Calculate accuracy and pass@k
    Metrics-->>User: Display metrics.json with results

from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution

importing the private function _parse_solution from an external library. Consider requesting that the compute-eval package export this as a public API, since it's needed for the integration.


@blahblahasdf
Collaborator Author

Closing this PR in favor of one directly from a branch on NeMo Skills.

@gwarmstrong gwarmstrong mentioned this pull request Jan 10, 2026