
Conversation

@blahblahasdf (Collaborator) commented Dec 17, 2025

Overview

This PR integrates ComputeEval - a benchmark for evaluating LLMs on CUDA code generation - into NeMo-Skills.

ComputeEval features:

  1. 406 (and growing) CUDA programming challenges
  2. Functional correctness evaluation via compilation and test execution
  3. Multiple release versions (2025-1, 2025-2, 2025-3)

Changes

  1. Dataset Preparation - download ComputeEval problems from HuggingFace (gated dataset)
  2. Generation Task - implement GenerationTask to generate CUDA solutions using ComputeEval's reference generation logic
  3. Evaluator - implement BaseEvaluator using ComputeEval's evaluation logic
  4. Metrics - compute pass@k metrics

Dependencies

Adds compute-eval package (compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@2d14770)
Note: compute-eval is a public repository but has not been published to PyPI yet

Summary by CodeRabbit

Release Notes

  • New Features
    • Added support for compute-eval dataset from HuggingFace with optional release specifications.
    • Introduced compute-eval evaluator for CUDA code assessment with automatic CTK version detection.
    • Added compute-eval metrics for computing evaluation scores including pass-at-k performance.
    • Integrated compute-eval generation task into the inference pipeline.
    • Registered compute-eval as a new evaluation type across the framework.


@coderabbitai bot (Contributor) commented Dec 17, 2025

📝 Walkthrough


This PR adds support for NVIDIA's compute-eval dataset and evaluation framework to NeMo Skills. It introduces a new compute-eval evaluator type, corresponding metrics, a dataset preparation script, a generation task for model inference, and registers these components in the appropriate registries.

Changes

  • Dataset Package Initialization (nemo_skills/dataset/compute-eval/__init__.py): Introduces package constants EVAL_SPLIT, DATASET_GROUP, METRICS_TYPE, GENERATION_MODULE, and GENERATION_ARGS for compute-eval configuration.
  • Dataset Preparation (nemo_skills/dataset/compute-eval/prepare.py): New script to download the NVIDIA compute-eval dataset from HuggingFace with optional release specification, token authentication via the HF_TOKEN environment variable, and output formatting to eval.jsonl.
  • Evaluator Implementation (nemo_skills/evaluation/evaluator/compute_eval.py, nemo_skills/evaluation/evaluator/__init__.py): Introduces the ComputeEvalEvaluator class with NVCC version detection, pydantic validation for problem/solution, and thread-pool execution of evaluation; registers it in EVALUATOR_MAP and EVALUATOR_CLASS_MAP.
  • Metrics Implementation (nemo_skills/evaluation/metrics/code_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py): Introduces the ComputeEvalMetrics class implementing the score dict, incorrect sample, and update with pass-at-k computation; registers it in METRICS_MAP.
  • Generation Task (nemo_skills/inference/eval/compute_eval.py): Introduces ComputeEvalGenerationTask for model inference with server configuration validation, problem type validation, and solution generation; provides the Hydra entry point run_compute_eval.
  • Dependencies (requirements/main.txt): Adds a git-based dependency for the compute-eval package from a specified commit.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • NVCC version detection and error handling in ComputeEvalEvaluator.__init__: verify compatibility logic and RuntimeError conditions (a sketch of this detection pattern follows this list)
  • Thread pool execution in ComputeEvalEvaluator.eval_single: confirm proper async/threading pattern and exception handling
  • Type validation with discriminated unions in both evaluator and generation task: verify pydantic TypeAdapter configuration for CudaCppProblem | CudaPythonProblem
  • Configuration validation in ComputeEvalGenerationTask.__init__: ensure model and base_url enforcement is correct
  • Git dependency specification in requirements/main.txt: verify commit hash and repository URL are correct
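
For the first bullet, a minimal sketch of this kind of NVCC detection (illustrative only; the function name, messages, and exact parsing here are assumptions, not the PR's actual implementation):

import re
import shutil
import subprocess


def get_nvcc_version_sketch() -> tuple[int, int]:
    """Return the installed CUDA Toolkit version as (major, minor) by parsing `nvcc --version`."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        raise RuntimeError("nvcc not found on PATH; a CUDA Toolkit installation is required for compute-eval.")
    output = subprocess.run([nvcc, "--version"], capture_output=True, text=True, check=True).stdout
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        raise RuntimeError(f"Could not parse the CUDA Toolkit version from nvcc output:\n{output}")
    return int(match.group(1)), int(match.group(2))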

Suggested labels

run GPU tests

Suggested reviewers

  • Kipok
  • gwarmstrong

🚥 Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title 'Add ComputeEval Dataset Support' accurately describes the main objective of the PR, which integrates ComputeEval into NeMo-Skills with dataset preparation, generation tasks, evaluators, and metrics.
  • Description check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

🧹 Nitpick comments (4)
nemo_skills/dataset/compute-eval/prepare.py (2)

29-34: Consider narrowing the exception type or adding a brief comment.

The bare Exception catch works but obscures intent. A comment explaining that any failure during whoami() indicates the user isn't logged in would improve clarity.

     # noinspection PyBroadException
     try:
         api = HfApi()
         api.whoami()
-        return None
+        return None  # User is logged in; HF will use cached credentials
-    except Exception:
+    except Exception:  # Any failure means user is not authenticated
         return None

48-54: Add validation for the expected dataset split.

If the dataset doesn't contain an "eval" split (e.g., due to an invalid release), the script will raise an unclear KeyError. Consider validating the split exists.

     dataset = load_dataset("nvidia/compute-eval", args.release, token=_get_hf_token())
+    if "eval" not in dataset:
+        raise ValueError(f"Dataset does not contain 'eval' split. Available splits: {list(dataset.keys())}")
     data_dir = Path(__file__).absolute().parent
nemo_skills/inference/eval/compute_eval.py (2)

54-67: Consider using pass for no-op methods and prefixing unused parameters.

The no-op lifecycle methods use return statements, but pass is more idiomatic for intentionally empty methods. Additionally, the data parameter in log_example_prompt is flagged as unused by static analysis - prefixing it with an underscore would signal it's intentionally unused.

Apply this diff:

-    def log_example_prompt(self, data):
-        return
+    def log_example_prompt(self, _data):
+        pass
 
     def setup_prompt(self):
-        return
+        pass
 
     def setup_llm(self):
-        return
+        pass
 
     def setup_litellm_cache(self):
-        return
+        pass
 
     def cleanup_litellm_cache(self):
-        return
+        pass

69-69: Prefix unused parameter with underscore.

The data parameter is flagged as unused by static analysis. If it's part of the interface but not needed in this implementation, prefix it with an underscore to signal it's intentionally unused.

Apply this diff:

-    async def process_single_datapoint(self, data_point, data):
+    async def process_single_datapoint(self, data_point, _data):
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e16e1e and 67e8149.

📒 Files selected for processing (8)
  • nemo_skills/dataset/compute-eval/__init__.py (1 hunks)
  • nemo_skills/dataset/compute-eval/prepare.py (1 hunks)
  • nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
  • nemo_skills/evaluation/evaluator/compute_eval.py (1 hunks)
  • nemo_skills/evaluation/metrics/code_metrics.py (1 hunks)
  • nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
  • nemo_skills/inference/eval/compute_eval.py (1 hunks)
  • requirements/main.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-23T17:56:57.556Z
Learnt from: Glorf
Repo: NVIDIA-NeMo/Skills PR: 908
File: requirements/main.txt:16-16
Timestamp: 2025-11-23T17:56:57.556Z
Learning: faiss-cpu must be explicitly listed in requirements/main.txt for BFCLv4 memory evaluations (memory_kv, memory_vector, memory_rec_sum) as it is an optional dependency of sentence_transformers that is required for vector similarity search functionality in the memory backends.

Applied to files:

  • requirements/main.txt
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/compute_eval.py (1)
  • ComputeEvalEvaluator (31-63)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/code_metrics.py (1)
  • ComputeEvalMetrics (126-135)
nemo_skills/evaluation/evaluator/compute_eval.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluator (34-91)
nemo_skills/utils.py (1)
  • get_logger_name (39-43)
nemo_skills/evaluation/metrics/code_metrics.py (7)
nemo_skills/evaluation/metrics/base.py (5)
  • BaseMetrics (23-434)
  • _get_score_dict (124-143)
  • get_incorrect_sample (200-206)
  • update (145-189)
  • _compute_pass_at_k (352-423)
nemo_skills/evaluation/metrics/if_metrics.py (3)
  • _get_score_dict (24-28)
  • get_incorrect_sample (30-33)
  • update (35-48)
nemo_skills/evaluation/metrics/answer_judgement_metrics.py (3)
  • _get_score_dict (35-39)
  • get_incorrect_sample (41-47)
  • update (121-132)
nemo_skills/evaluation/metrics/lean4_metrics.py (3)
  • _get_score_dict (23-24)
  • get_incorrect_sample (26-29)
  • update (46-48)
nemo_skills/evaluation/metrics/math_metrics.py (3)
  • _get_score_dict (67-79)
  • get_incorrect_sample (81-88)
  • update (90-120)
nemo_skills/evaluation/metrics/ruler_metrics.py (3)
  • _get_score_dict (19-20)
  • get_incorrect_sample (26-29)
  • update (22-24)
nemo_skills/evaluation/metrics/arena_metrics.py (2)
  • get_incorrect_sample (36-40)
  • update (42-84)
🪛 Ruff (0.14.8)
nemo_skills/inference/eval/compute_eval.py

54-54: Unused method argument: data

(ARG002)


69-69: Unused method argument: data

(ARG002)

nemo_skills/dataset/compute-eval/prepare.py

32-32: Consider moving this statement to an else block

(TRY300)


33-33: Do not catch blind exception: Exception

(BLE001)

nemo_skills/evaluation/metrics/code_metrics.py

130-130: Unused method argument: prediction

(ARG002)

🔇 Additional comments (10)
nemo_skills/evaluation/metrics/code_metrics.py (1)

126-135: LGTM - Implementation follows established patterns.

The ComputeEvalMetrics class correctly mirrors the structure of HumanEvalInfillingMetrics (lines 114-123). The unused prediction parameter in get_incorrect_sample is required by the base class interface, so the static analysis warning is a false positive.
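
For readers less familiar with this pattern, a rough sketch of the shape such a metrics class takes (the method names come from the BaseMetrics references in the code graph above; the field names and bodies here are assumptions, not the PR's code, and pass@k aggregation is handled by BaseMetrics helpers):

from nemo_skills.evaluation.metrics.base import BaseMetrics


class ComputeEvalMetricsSketch(BaseMetrics):
    """Illustrative only; mirrors the structure described in this review."""

    def _get_score_dict(self, prediction: dict) -> dict:
        # The evaluator writes a boolean "passed" field for each generation.
        return {"passed": prediction["passed"]}

    def get_incorrect_sample(self, prediction: dict) -> dict:
        # Fill in missing or failed generations as not passed.
        prediction = dict(prediction)
        prediction["passed"] = False
        return prediction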

nemo_skills/evaluation/metrics/map_metrics.py (2)

26-26: LGTM - Import correctly added.


72-72: LGTM - Registry entry properly added.

The key "compute-eval" correctly matches METRICS_TYPE defined in nemo_skills/dataset/compute-eval/__init__.py.

nemo_skills/evaluation/evaluator/__init__.py (2)

29-29: LGTM - Import correctly placed.


71-71: LGTM - Evaluator correctly registered in class map only.

The ComputeEvalEvaluator is correctly placed in EVALUATOR_CLASS_MAP (not EVALUATOR_MAP) since it implements eval_single for async single-point evaluation.

nemo_skills/evaluation/evaluator/compute_eval.py (1)

48-62: LGTM - Correct use of asyncio.to_thread for blocking evaluation.

Running evaluate_solution in a thread pool correctly avoids blocking the event loop during CUDA compilation and test execution.
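
The pattern being praised here is roughly the following (a sketch; evaluate_solution is passed in as a callable rather than imported, since the exact compute-eval import path is not shown in this diff):

import asyncio


async def eval_single_sketch(evaluate_solution, problem, solution, ctk_version):
    # evaluate_solution compiles and runs CUDA tests, so it must not block the event loop.
    # asyncio.to_thread runs the blocking call in the default thread pool and awaits the result.
    return await asyncio.to_thread(evaluate_solution, problem, solution, ctk_version)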

requirements/main.txt (1)

17-17: LGTM - Git commit pinning ensures reproducibility.

The dependency is properly pinned to a specific commit (991b47c, currently the main branch HEAD) and correctly sorted alphabetically. The short hash is valid and unambiguous for this repository. Using the full 40-character SHA would provide additional robustness against theoretical hash collisions, but the current approach is acceptable for practical purposes.

nemo_skills/dataset/compute-eval/__init__.py (1)

15-19: LGTM - Dataset constants properly defined.

Constants are internally consistent: METRICS_TYPE matches the registry key in map_metrics.py (line 72), and EVAL_SPLIT matches the split accessed in prepare.py (lines 52-53). The GENERATION_MODULE path correctly points to the existing nemo_skills.inference.eval.compute_eval module.
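
Putting those constants together, the file described here looks roughly like this (a reconstruction from the snippets quoted later in this PR, not a verbatim copy; the EVAL_SPLIT value is inferred from the "eval" split used in prepare.py):

# nemo_skills/dataset/compute-eval/__init__.py (reconstructed from the review snippets)
EVAL_SPLIT = "eval"
DATASET_GROUP = "code"
METRICS_TYPE = "compute-eval"
GENERATION_MODULE = "nemo_skills.inference.eval.compute_eval"
GENERATION_ARGS = "++eval_type=compute-eval"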

nemo_skills/inference/eval/compute_eval.py (2)

1-34: LGTM!

The imports and module-level setup are well-structured. The TypeAdapter with a discriminated union for problem validation is a clean approach.


86-102: LGTM!

The module exports and main entry point are well-structured. The help message handling and logging setup follow good practices.

@gwarmstrong (Collaborator) left a comment

Looks pretty good, with a couple high level questions/concerns


# noinspection PyBroadException
try:
api = HfApi()
Collaborator:

what is the purpose of this?

Collaborator (Author):

This allows a user to access the gated HuggingFace data (you have to be allow-listed to have access to the ComputeEval data today) either via an environment variable (HF_TOKEN) or by logging in with the HuggingFace CLI.

Collaborator (Author):

Actually, after looking closer we don't need to access the API at all for this to work. I'll just clean this up.

Collaborator:

yeah, we should just need a load_dataset call; it will handle the token if it's already in your env variables. For example, this one is gated: https://github.com/blahblahasdf/Skills/blob/cb7829094799f9b5967e260d30e39c2535f67bab/nemo_skills/dataset/flores200/prepare.py#L49
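
A minimal sketch of what that looks like for this dataset (mirroring the load_dataset call quoted later in this PR; passing a None token simply falls back to cached CLI credentials):

import os

from datasets import load_dataset

# For a gated dataset, either `huggingface-cli login` or an HF_TOKEN env var is enough;
# load_dataset forwards the token, or uses cached credentials when the env var is unset.
dataset = load_dataset("nvidia/compute-eval", "2025-1", token=os.getenv("HF_TOKEN"))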

generate_model_completions,
system_prompt=self._system_prompt,
problem=problem,
model=self._model,
Collaborator:

Looks like this is applying this function:
https://github.com/NVIDIA/compute-eval/blob/main/compute_eval/generate_completions.py#L31-L84

Which appears to use the ComputeEval clients?
What I would recommend we do instead, is write a version of that function that uses NeMo-Skills' process_single_datapoint--so this probably looks something like

    result = super().process_single_datapoint(data_point, data)
    ...

    completion = ...
    return FileSolution(
        task_id=problem.task_id,
        files=_parse_solution(completion),
        **debug_info,
    )

this will make sure the prompt logic is consistent with typical NeMo-Skills and that the eval prompt logic is the same as normal generations.

Collaborator:

Then the methods above that are overridden would just use the default implementations rather than overriding them to return nothing.

Collaborator (Author):

Thanks for the quick feedback! Your suggestion is to call process_single_datapoint and then use the ComputeEval post-processing to convert the model's response to a FileSolution? The "ComputeEval clients" are just a pass-through to the OpenAI client. I think, as is, if you were to point it at a local model (or a model hosted somewhere else) it would just work, but I agree it doesn't fit super cleanly with the NeMo-Skills abstractions.

Honestly, the generation side of the ComputeEval repo is generally intended to be a reference implementation for how other groups/companies would set up a generator scaffold to solve the ComputeEval problems. The reference implementation's prompt and post-processing logic are pretty coupled.

@gwarmstrong (Collaborator) commented Dec 19, 2025:

The logic for calling the openai client is here, right? https://github.com/NVIDIA/compute-eval/blob/main/compute_eval/models/model_interface.py#L100-L154
Seems pretty straightforward and is pretty much a subset of the functionality our clients have. And if we do it the way it's currently being done, you lose a lot of things we support in NeMo-Skills (like tool-calling and pretty much all sampling parameters not directly specified in the compute-eval function), and the supported server types would be quite limited.

Collaborator (Author):

Thanks for the pointers and happy New Year! I see where you're coming from on fitting this in with the NeMo-Skills generation more cleanly. What is the expectation for documenting this GenerationTask's content format? I was able to get this working locally, but there is definitely some coupling between the expected data format and the system and user prompts required in a ++prompt_config. I would guess that is normal and just gets documented?

Collaborator:

Yeah that's pretty typical. We don't really have it documented on a case-by-case basis--it's pretty easy to infer from the prompt config itself, and we have a generic section on understanding the interaction between data and prompt_config here: https://nvidia-nemo.github.io/Skills/basics/prompt-format/#prompt-api

@greptile-apps bot (Contributor) commented Jan 5, 2026

Greptile Summary

This PR integrates ComputeEval, a CUDA code generation benchmark, into the NeMo-Skills framework. The implementation adds dataset preparation from HuggingFace, a generation task for producing CUDA solutions, an evaluator that compiles and tests generated code, and metrics computation including pass@k scores.

Key changes:

  • Dataset preparation downloads problems from nvidia/compute-eval on HuggingFace with optional release version support
  • Generation task extends GenerationTask and parses LLM-generated CUDA code into structured solutions
  • Evaluator automatically detects installed CUDA Toolkit version and runs async evaluation using the compute-eval library
  • Metrics implementation follows existing code benchmark patterns (similar to HumanEvalInfillingMetrics)
  • All integration points properly registered in the framework's evaluator and metrics maps

Note: The compute-eval dependency version was updated from 991b47c to 4221fc2 in the final commit.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk, pending awareness of the protected member usage
  • The implementation follows existing patterns in the codebase consistently, integrates cleanly with the framework's architecture, and has proper error handling. Score reduced by one point due to the use of protected member _parse_solution from the compute-eval library, which was already flagged in previous review threads but represents a potential maintenance risk if the external library changes its internal API
  • Pay attention to nemo_skills/inference/eval/compute_eval.py due to its dependency on a protected member from the external library

Important Files Changed

  • nemo_skills/dataset/compute-eval/prepare.py: Implements the dataset preparation script to download and format compute-eval problems from HuggingFace
  • nemo_skills/evaluation/evaluator/compute_eval.py: Implements the CUDA code evaluator with automatic CTK version detection and async execution
  • nemo_skills/inference/eval/compute_eval.py: Implements the generation task that parses CUDA solutions using the compute-eval library (uses a protected member)
  • requirements/main.txt: Adds the compute-eval dependency pinned to commit 4221fc2

Sequence Diagram

sequenceDiagram
    participant User
    participant Prepare as prepare.py
    participant HF as HuggingFace
    participant Gen as ComputeEvalGenerationTask
    participant LLM as Language Model
    participant Eval as ComputeEvalEvaluator
    participant CE as compute-eval library
    participant Metrics as ComputeEvalMetrics

    User->>Prepare: Run dataset preparation
    Prepare->>HF: load_dataset("nvidia/compute-eval", release)
    HF-->>Prepare: Return dataset with problems
    Prepare->>Prepare: Format context_files_block
    Prepare->>Prepare: Write eval.jsonl with problem data
    
    User->>Gen: Generate solutions
    Gen->>Gen: Read eval.jsonl
    loop For each problem
        Gen->>LLM: Send problem prompt
        LLM-->>Gen: Return generated code
        Gen->>CE: _parse_solution(generation)
        CE-->>Gen: Parsed file solutions
        Gen->>Gen: Create FileSolution object
        Gen->>Gen: Write solution to output
    end
    
    User->>Eval: Evaluate solutions
    Eval->>Eval: get_nvcc_version() and parse_semver()
    loop For each data_point
        Eval->>Eval: Validate problem and solution
        Eval->>CE: evaluate_solution(problem, solution, ctk_version)
        CE->>CE: Compile and test CUDA code
        CE-->>Eval: Return graded result (passed, skipped, etc.)
        Eval->>Eval: Return evaluation dict
    end
    
    User->>Metrics: Compute metrics
    Metrics->>Metrics: Aggregate accuracy scores
    Metrics->>Metrics: Compute pass@k metrics
    Metrics-->>User: Return final metrics

@greptile-apps bot (Contributor) left a comment

Additional Comments (1)

  1. nemo_skills/inference/eval/compute_eval.py, line 21 (link)

    style: Using protected member _parse_solution. Consider requesting public API when compute-eval stabilizes.


8 files reviewed, 1 comment


@greptile-apps bot (Contributor) left a comment

8 files reviewed, 1 comment


from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
Contributor (bot):

style: accessing protected member _parse_solution from external library

If _parse_solution becomes unavailable in future versions of compute-eval, this will break. Consider requesting a public API from the compute-eval library.


@greptile-apps bot (Contributor) commented Jan 7, 2026

Greptile found no issues!


@greptile-apps bot (Contributor) left a comment

Greptile Overview

Greptile Summary

This PR integrates ComputeEval benchmark for evaluating CUDA code generation into NeMo-Skills.

Major Changes

  • Added dataset preparation script that downloads 406 CUDA problems from gated HuggingFace dataset
  • Implemented ComputeEvalGenerationTask that generates CUDA solutions and parses them into FileSolution format
  • Implemented ComputeEvalEvaluator that compiles and tests CUDA code using NVCC with automatic toolkit version detection
  • Added ComputeEvalMetrics class for computing accuracy and pass@k performance metrics
  • Registered compute-eval across all integration points (evaluators, metrics, dataset configs)

Issues Found

  • Dependency version mismatch: PR description specifies commit 991b47c but requirements/main.txt references @main branch, which may introduce breaking changes
  • Fragile external API dependency: Uses protected member _parse_solution from compute-eval library which could break if the library refactors its internal APIs

Architecture Assessment

The implementation follows NeMo-Skills patterns consistently - uses BaseEvaluator async evaluation, BaseMetrics for metrics computation, and proper registration in all required maps. The integration is clean and well-structured.

Confidence Score: 4/5

  • Safe to merge after fixing the dependency version mismatch in requirements.txt
  • Score reflects well-structured implementation following framework patterns, but critical dependency version mismatch needs resolution. The code is functional and integrates cleanly, but the @main reference instead of pinned commit could cause instability.
  • Pay attention to requirements/main.txt - must fix version mismatch before merging

Important Files Changed

File Analysis

  • requirements/main.txt (3/5): Added compute-eval dependency but references @main instead of the specific commit @991b47c mentioned in the PR description
  • nemo_skills/evaluation/evaluator/compute_eval.py (5/5): Implements an async evaluator that validates problems/solutions and runs CUDA compilation and tests with NVCC version checking
  • nemo_skills/inference/eval/compute_eval.py (4/5): Generation task that parses LLM output into FileSolution format; relies on the protected _parse_solution API from the external library

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HF as HuggingFace
    participant GenerationTask as ComputeEvalGenerationTask
    participant LLM as Language Model
    participant Evaluator as ComputeEvalEvaluator
    participant ComputeEval as compute-eval Library
    participant NVCC as CUDA Toolkit
    participant Metrics as ComputeEvalMetrics

    Note over User,HF: 1. Dataset Preparation Phase
    User->>PrepareScript: Run prepare.py --release 2025-1
    PrepareScript->>HF: load_dataset("nvidia/compute-eval")
    HF-->>PrepareScript: Return 406 CUDA problems
    PrepareScript->>PrepareScript: Format context files
    PrepareScript->>PrepareScript: Write eval.jsonl

    Note over User,Metrics: 2. Generation Phase
    User->>GenerationTask: Run compute_eval inference
    GenerationTask->>GenerationTask: Load eval.jsonl
    loop For each problem
        GenerationTask->>LLM: Send prompt with context
        LLM-->>GenerationTask: Return CUDA code generation
        GenerationTask->>ComputeEval: _parse_solution(generation)
        ComputeEval-->>GenerationTask: FileSolution object
        GenerationTask->>GenerationTask: Save solution + generation
    end

    Note over User,Metrics: 3. Evaluation Phase
    User->>Evaluator: Run evaluation
    Evaluator->>NVCC: get_nvcc_version()
    NVCC-->>Evaluator: CTK version (e.g., 12.3)
    loop For each solution (async)
        Evaluator->>ComputeEval: evaluate_solution(problem, solution)
        ComputeEval->>NVCC: Compile CUDA code
        NVCC-->>ComputeEval: Build result
        ComputeEval->>ComputeEval: Run tests
        ComputeEval-->>Evaluator: Graded result (passed/failed)
        Evaluator->>Evaluator: Update data with pass/fail
    end

    Note over User,Metrics: 4. Metrics Computation
    User->>Metrics: Compute metrics
    Metrics->>Metrics: Calculate accuracy
    Metrics->>Metrics: Compute pass@k scores
    Metrics-->>User: Final metrics report


bs4
click < 8.2.0 # https://github.com/ai-dynamo/dynamo/issues/1039
compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@main
Contributor (bot):

PR description specifies commit 991b47c but requirements file references @main branch

Suggested change
compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@main
compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@991b47c

from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
Contributor (bot):

accessing protected member _parse_solution from external library - fragile dependency that may break if compute-eval refactors internal APIs

@greptile-apps bot (Contributor) left a comment

Greptile Overview

Greptile Summary

This PR successfully integrates ComputeEval, a CUDA code generation benchmark, into NeMo-Skills following established patterns in the codebase.

Major changes:

  • Added dataset preparation script to download and format problems from HuggingFace's gated nvidia/compute-eval dataset
  • Implemented ComputeEvalEvaluator that compiles and executes CUDA code with automatic CTK version detection
  • Created ComputeEvalGenerationTask to generate CUDA solutions using the LLM inference pipeline
  • Added ComputeEvalMetrics class to compute pass@k performance metrics
  • Registered compute-eval across all integration points (evaluator map, metrics map, dataset config)

Integration quality:
The implementation follows NeMo-Skills conventions consistently - the evaluator extends BaseEvaluator, metrics extend BaseMetrics with _compute_pass_at_k, and the generation task extends GenerationTask. The dataset configuration uses standard constants (EVAL_SPLIT, DATASET_GROUP, METRICS_TYPE).

Minor concerns:

  • Uses a protected API (_parse_solution) from compute-eval that could break in future versions
  • Broad exception handling in evaluator could mask debugging issues
  • Dependency commit hash differs from PR description (appears intentional from git history)

Confidence Score: 4/5

  • This PR is safe to merge with minor style improvements recommended
  • The implementation is solid and follows established patterns throughout the codebase. The core logic correctly integrates compute-eval's evaluation pipeline with async support. The main concerns are style-related: reliance on a protected API and broad exception handling. These won't cause runtime failures but could impact maintainability. NVCC detection at initialization is appropriate.
  • Pay attention to nemo_skills/inference/eval/compute_eval.py due to protected API usage, and verify the compute-eval dependency version matches expectations

Important Files Changed

File Analysis

  • nemo_skills/dataset/compute-eval/prepare.py (5/5): Dataset preparation script to download and transform ComputeEval problems from HuggingFace
  • nemo_skills/evaluation/evaluator/compute_eval.py (4/5): Evaluator implementation that validates solutions against CUDA problems with CTK version checks
  • nemo_skills/inference/eval/compute_eval.py (3/5): Generation task that parses generated CUDA code and creates solutions, but uses a protected API
  • requirements/main.txt (4/5): Added compute-eval dependency from GitHub at commit 2d14770 (differs from the PR description)

Sequence Diagram

sequenceDiagram
    participant User
    participant prepare.py
    participant HuggingFace
    participant GenerationTask
    participant LLM
    participant Evaluator
    participant compute_eval
    participant Metrics

    User->>prepare.py: Run dataset preparation
    prepare.py->>HuggingFace: load_dataset("nvidia/compute-eval", release)
    HuggingFace-->>prepare.py: Return problems with task_id, prompt, context_files
    prepare.py->>prepare.py: Format context_files with language fences
    prepare.py-->>User: Save to eval.jsonl

    User->>GenerationTask: Run inference (ComputeEvalGenerationTask)
    GenerationTask->>GenerationTask: Load eval.jsonl
    loop For each problem
        GenerationTask->>LLM: Generate CUDA solution
        LLM-->>GenerationTask: Generated code
        GenerationTask->>GenerationTask: Parse solution with _parse_solution()
        GenerationTask->>GenerationTask: Create FileSolution object
    end
    GenerationTask-->>User: Save generations with solutions

    User->>Evaluator: Run evaluation (ComputeEvalEvaluator)
    Evaluator->>Evaluator: Detect NVCC version and CTK
    loop For each generated solution
        Evaluator->>Evaluator: Validate problem and solution schemas
        Evaluator->>compute_eval: evaluate_solution(problem, solution, ctk_version)
        compute_eval->>compute_eval: Compile and run CUDA code
        compute_eval-->>Evaluator: Return passed, skipped, elapsed_time, outputs
    end
    Evaluator-->>User: Save evaluation results

    User->>Metrics: Compute metrics (ComputeEvalMetrics)
    Metrics->>Metrics: Calculate accuracy from passed field
    Metrics->>Metrics: Compute pass@k scores
    Metrics-->>User: Display results

from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
Contributor (bot):

Uses protected member _parse_solution from compute_eval.generate_completions. Protected members (prefixed with _) are not part of the public API and may change without notice. Consider requesting a public API from the compute-eval package or copying the implementation.
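
If the "copy the implementation" route were taken, a local helper might look something like this (a hypothetical sketch of a fenced-code-block parser, not compute-eval's actual _parse_solution; the list-of-strings return shape is an assumption about what FileSolution.files expects):

import re

# Matches fenced code blocks; the language tag after the opening fence is optional.
_CODE_BLOCK_RE = re.compile(r"```[a-zA-Z0-9_+-]*\n(.*?)```", re.DOTALL)


def parse_solution_locally(generation: str) -> list[str]:
    """Hypothetical stand-in for compute_eval.generate_completions._parse_solution."""
    blocks = [match.group(1).strip() for match in _CODE_BLOCK_RE.finditer(generation)]
    # Fall back to the raw completion when the model did not use code fences.
    return blocks if blocks else [generation.strip()]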


Comment on lines +46 to +47
# noinspection PyBroadException
try:
Contributor (bot):

Using except Exception catches all exceptions broadly. While the error is logged and returned in the result, this could mask unexpected issues during development (e.g., KeyError from missing data_point["problem"]). Consider catching more specific exceptions or at least logging the traceback for debugging.

blahblahasdf and others added 20 commits January 8, 2026 12:52
Signed-off-by: George Zelenfroind <[email protected]>
Signed-off-by: George <[email protected]>
Signed-off-by: dlord <[email protected]>
Signed-off-by: SeanNaren <[email protected]>
Co-authored-by: Mehrzad Samadi <[email protected]>
Signed-off-by: dlord <[email protected]>
…A-NeMo#1119)

Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Signed-off-by: George Armstrong <[email protected]>
Co-authored-by: George Armstrong <[email protected]>
Signed-off-by: dlord <[email protected]>
… move `ComputeEvalGenerationTask` to be more on the rails of the `process_single_datapoint` approach.

Signed-off-by: dlord <[email protected]>
Signed-off-by: Wei Du <[email protected]>
Signed-off-by: dlord <[email protected]>
Signed-off-by: George Armstrong <[email protected]>
Signed-off-by: dlord <[email protected]>
)

Signed-off-by: Arkadiusz Nowaczynski <[email protected]>
Co-authored-by: George Armstrong <[email protected]>
Signed-off-by: dlord <[email protected]>
wasiahmad and others added 11 commits January 8, 2026 12:52
Signed-off-by: wasiahmad <[email protected]>
Signed-off-by: Nikolai Ludwig <[email protected]>
Signed-off-by: George Armstrong <[email protected]>
Signed-off-by: i-vainn <[email protected]>
Signed-off-by: Grigor Nalbandyan <[email protected]>
Signed-off-by: bzantium <[email protected]>
Signed-off-by: Stephen Ge <[email protected]>
Signed-off-by: Jiacheng Xu <[email protected]>
Signed-off-by: George Zelenfroind <[email protected]>
Co-authored-by: Nick Ludwig <[email protected]>
Co-authored-by: George Armstrong <[email protected]>
Co-authored-by: Ivan <[email protected]>
Co-authored-by: Wojciech Prazuch <[email protected]>
Co-authored-by: gnalbandyan <[email protected]>
Co-authored-by: Minho Ryu <[email protected]>
Co-authored-by: Stephen Ge <[email protected]>
Co-authored-by: Jiacheng Xu <[email protected]>
Co-authored-by: Jiacheng Xu <[email protected]>
Co-authored-by: George <[email protected]>
Co-authored-by: Sanyam Kapoor <[email protected]>
Signed-off-by: dlord <[email protected]>
Signed-off-by: fzyzcjy <[email protected]>
Signed-off-by: Igor Gitman <[email protected]>
Co-authored-by: fzyzcjy <[email protected]>
Signed-off-by: dlord <[email protected]>
…eMo#1136)

Signed-off-by: Valentin Mendelev <[email protected]>
Co-authored-by: Nikolay Karpov <[email protected]>
Signed-off-by: dlord <[email protected]>
Signed-off-by: George Armstrong <[email protected]>
Signed-off-by: dlord <[email protected]>
Signed-off-by: Jocelyn Huang <[email protected]>
Signed-off-by: dlord <[email protected]>

# Conflicts:
#	nemo_skills/evaluation/evaluator/__init__.py
Signed-off-by: bzantium <[email protected]>
Co-authored-by: Igor Gitman <[email protected]>
Signed-off-by: dlord <[email protected]>
@greptile-apps bot (Contributor) left a comment

Greptile Overview

Greptile Summary

This PR integrates ComputeEval, a CUDA code generation benchmark, into NeMo-Skills. The implementation adds dataset preparation, generation task, evaluator, and metrics following the framework's patterns.

Major Issues Found:

  • Missing ++prompt_config in GENERATION_ARGS will cause validation failures during generation
  • Dataset preparation lacks error handling for authentication failures and schema validation
  • Evaluator catches all exceptions indiscriminately and returns inconsistent error response fields
  • Generation task uses protected API (_parse_solution) from upstream library without error handling

Architecture:
The integration correctly follows NeMo-Skills patterns by:

  • Registering the evaluator in EVALUATOR_CLASS_MAP
  • Implementing BaseEvaluator with async eval_single method
  • Using TypeAdapter for Pydantic validation of compute-eval data models
  • Computing pass@k metrics consistent with other code benchmarks

Dependencies:
Adds compute-eval package pinned to commit 2d14770. Note this is a public but not yet PyPI-published dependency.

Confidence Score: 3/5

  • This PR has several critical issues that will cause runtime failures in common scenarios
  • Score reflects multiple logic errors that will cause failures: missing prompt_config will fail validation, missing error handling will cause unclear crashes, inconsistent error responses could break downstream processing, and use of protected APIs creates fragility
  • Pay close attention to nemo_skills/dataset/compute-eval/__init__.py (missing prompt_config), prepare.py (no error handling), and both evaluator files (error handling issues)

Important Files Changed

File Analysis

  • nemo_skills/dataset/compute-eval/__init__.py (3/5): Dataset configuration file missing required prompt_config in GENERATION_ARGS, which will cause a validation failure during generation
  • nemo_skills/dataset/compute-eval/prepare.py (3/5): Dataset preparation script lacks error handling for authentication failures and dataset schema validation
  • nemo_skills/evaluation/evaluator/compute_eval.py (3/5): Evaluator catches all exceptions indiscriminately and returns inconsistent error fields
  • nemo_skills/inference/eval/compute_eval.py (3/5): Generation task uses a protected API from the upstream library and lacks error handling in data processing

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HF as HuggingFace
    participant GenTask as ComputeEvalGenerationTask
    participant LLM as Language Model
    participant Evaluator as ComputeEvalEvaluator
    participant ComputeEval as compute-eval Library
    participant Metrics as ComputeEvalMetrics

    User->>PrepareScript: Run dataset preparation
    PrepareScript->>HF: load_dataset("nvidia/compute-eval", release)
    HF-->>PrepareScript: Return dataset with problems
    PrepareScript->>PrepareScript: Format context_files_block
    PrepareScript->>PrepareScript: Write eval.jsonl with task_id, prompt, etc.

    User->>GenTask: Generate solutions
    GenTask->>GenTask: Read eval.jsonl data
    loop For each problem
        GenTask->>LLM: Send prompt with problem context
        LLM-->>GenTask: Return CUDA code generation
        GenTask->>ComputeEval: _parse_solution(generation)
        ComputeEval-->>GenTask: Parsed FileSolution
        GenTask->>GenTask: Create solution record with model_dump()
    end

    User->>Evaluator: Evaluate solutions
    loop For each solution (async)
        Evaluator->>Evaluator: validate_python(problem & solution)
        Evaluator->>ComputeEval: evaluate_solution(problem, solution, ctk_version)
        ComputeEval->>ComputeEval: Compile CUDA code
        ComputeEval->>ComputeEval: Execute tests
        ComputeEval-->>Evaluator: Return graded result (passed, skipped, etc.)
        Evaluator->>Evaluator: Write evaluation to file
    end

    User->>Metrics: Compute metrics
    Metrics->>Metrics: Calculate accuracy from "passed" field
    Metrics->>Metrics: Compute pass@k statistics
    Metrics-->>User: Return final metrics

Comment on lines 66 to 71
except Exception as e:
    _LOG.error(f"Error during evaluation: {e}")
    return {
        "passed": False,
        "error": str(e),
    }
Contributor (bot):

catching all exceptions masks critical errors and returns inconsistent fields - successful evaluation returns skipped, elapsed_time, build_output, test_output but error case only returns passed and error

Suggested change

-except Exception as e:
-    _LOG.error(f"Error during evaluation: {e}")
-    return {
-        "passed": False,
-        "error": str(e),
-    }
+except KeyError as e:
+    _LOG.error(f"Missing required field in data_point: {e}")
+    return {
+        "passed": False,
+        "skipped": False,
+        "elapsed_time": 0.0,
+        "build_output": "",
+        "test_output": "",
+        "error": f"Missing required field: {e}",
+    }
+except Exception as e:
+    _LOG.error(f"Error during evaluation: {e}")
+    return {
+        "passed": False,
+        "skipped": False,
+        "elapsed_time": 0.0,
+        "build_output": "",
+        "test_output": "",
+        "error": str(e),
+    }

Comment on lines +20 to +21
# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
Contributor (bot):

accessing protected member _parse_solution from external library - upstream can break this without warning

Comment on lines 38 to 48
async def process_single_datapoint(self, data_point, data):
    res = await super().process_single_datapoint(data_point, data)

    solution = FileSolution(
        task_id=data_point["task_id"],
        files=_parse_solution(res["generation"]),
    )
    return {
        "solution": solution.model_dump(),
        "generation": res["generation"],
    }
Contributor (bot):

missing error handling - _parse_solution can fail, data_point might not have task_id, res might not have generation

Suggested change

-async def process_single_datapoint(self, data_point, data):
-    res = await super().process_single_datapoint(data_point, data)
-    solution = FileSolution(
-        task_id=data_point["task_id"],
-        files=_parse_solution(res["generation"]),
-    )
-    return {
-        "solution": solution.model_dump(),
-        "generation": res["generation"],
-    }
+async def process_single_datapoint(self, data_point, data):
+    res = await super().process_single_datapoint(data_point, data)
+    try:
+        solution = FileSolution(
+            task_id=data_point["task_id"],
+            files=_parse_solution(res["generation"]),
+        )
+        return {
+            "solution": solution.model_dump(),
+            "generation": res["generation"],
+        }
+    except KeyError as e:
+        _LOG.error(f"Missing required field: {e}")
+        raise
+    except Exception as e:
+        _LOG.error(f"Failed to parse solution: {e}")
+        raise

DATASET_GROUP = "code"
METRICS_TYPE = "compute-eval"
GENERATION_MODULE = "nemo_skills.inference.eval.compute_eval"
GENERATION_ARGS = "++eval_type=compute-eval"
Contributor (bot):

missing ++prompt_config in GENERATION_ARGS - will fail validation unless user specifies ++prompt_format=openai (see nemo_skills/inference/generate.py line 242 which requires prompt_config when prompt_format=='ns')

Suggested change
GENERATION_ARGS = "++eval_type=compute-eval"
GENERATION_ARGS = "++prompt_config=generic/codegen ++eval_type=compute-eval"


args = parser.parse_args()

dataset = load_dataset("nvidia/compute-eval", args.release, token=os.getenv("HF_TOKEN", None))
Contributor (bot):

missing error handling for gated dataset authentication and network failures - will crash with unclear error if authentication fails or network is down

"task_id": item["task_id"],
"problem_prompt": item["prompt"],
"build_command": item["build_command"],
"context_files_block": _format_context_files_block(item["context_files"]),
Contributor (bot):

no validation that context_files list items have required path and content keys - will raise KeyError if dataset schema changes
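
A hedged sketch of the validation this comment is asking for (the helper name comes from the quoted code above; the path/content key names and the fence formatting are assumptions about the dataset schema):

_FENCE = "```"


def _format_context_files_block(context_files: list[dict]) -> str:
    """Render context files as fenced blocks, failing loudly if the dataset schema changes."""
    blocks = []
    for i, context_file in enumerate(context_files):
        missing = {"path", "content"} - set(context_file)
        if missing:
            raise ValueError(f"context_files[{i}] is missing required keys: {sorted(missing)}")
        blocks.append(f"// {context_file['path']}\n{_FENCE}\n{context_file['content']}\n{_FENCE}")
    return "\n\n".join(blocks)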

@greptile-apps bot (Contributor) left a comment

Greptile Overview

Greptile Summary

This PR integrates ComputeEval, a CUDA code generation benchmark with 406+ problems, into NeMo-Skills. The implementation follows established patterns and includes dataset preparation, custom generation task, evaluator, and metrics computation.

Key Changes:

  • Added dataset preparation script to download from gated HuggingFace dataset
  • Implemented ComputeEvalGenerationTask using compute-eval's solution parsing
  • Created ComputeEvalEvaluator with NVCC version detection and async evaluation
  • Added ComputeEvalMetrics for pass@k computation
  • Registered evaluator in EVALUATOR_CLASS_MAP and metrics in METRICS_MAP
  • Pinned compute-eval dependency to specific commit

Issues Found:

  • Missing documentation for compute-eval benchmark (critical for users)
  • Missing test coverage - compute-eval not added to DATASETS list in tests/test_datasets.py
  • Using protected member _parse_solution from compute-eval package could break without warning
  • No validation for required HF_TOKEN environment variable before attempting dataset download
  • Overly broad exception handling in evaluator loses important error context
  • Missing explicit validation for required data_point keys

Architecture Notes:
The implementation correctly uses a custom generation module (nemo_skills.inference.eval.compute_eval) specified in the dataset __init__.py, which differs from standard generation. The evaluator properly checks for NVCC installation at initialization and uses TypeAdapters for discriminated union validation of problem/solution types.

Confidence Score: 3/5

  • Safe to merge with documentation and test coverage improvements needed
  • The core implementation is solid and follows established patterns in the codebase. The evaluator, metrics, and generation task are well-structured. However, the missing documentation is a significant gap that will impact usability, and the lack of test coverage means the new dataset won't be validated by CI. The use of a protected API from the external package and overly broad exception handling are code quality concerns but not blockers. No critical bugs or security issues were found.
  • Pay close attention to docs/evaluation/code.md (needs compute-eval documentation), tests/test_datasets.py (needs dataset entry), and nemo_skills/dataset/compute-eval/prepare.py (needs HF_TOKEN validation)

Important Files Changed

File Analysis

  • tests/test_datasets.py (3/5): Missing compute-eval dataset entry in the DATASETS list for test coverage
  • nemo_skills/dataset/compute-eval/prepare.py (4/5): Dataset preparation script added for downloading ComputeEval from HuggingFace
  • nemo_skills/evaluation/evaluator/compute_eval.py (4/5): Evaluator implementation added with NVCC version detection and async evaluation
  • nemo_skills/inference/eval/compute_eval.py (3/5): Generation task implementation uses the protected member _parse_solution from the compute_eval package
  • requirements/main.txt (4/5): Added compute-eval dependency pinned to a specific commit from GitHub
  • docs/evaluation/code.md (3/5): No documentation added for the compute-eval benchmark despite adding IOI and BIRD docs

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HF as HuggingFace
    participant DataDir as Data Directory
    participant GenTask as ComputeEvalGenerationTask
    participant LLM as Language Model
    participant Evaluator as ComputeEvalEvaluator
    participant NVCC as CUDA Toolkit
    participant Metrics as ComputeEvalMetrics

    Note over User,DataDir: Data Preparation Phase
    User->>PrepareScript: ns prepare_data compute-eval --release=2025-1
    PrepareScript->>HF: load_dataset with auth token
    HF-->>PrepareScript: Return dataset["eval"]
    PrepareScript->>PrepareScript: Format context_files_block for each problem
    PrepareScript->>DataDir: Write eval.jsonl with problem, task_id, prompt, etc.

    Note over User,Metrics: Generation & Evaluation Phase
    User->>GenTask: ns eval --benchmarks=compute-eval
    GenTask->>GenTask: Load eval.jsonl
    GenTask->>LLM: Generate CUDA code solution
    LLM-->>GenTask: Return generation text
    GenTask->>GenTask: Parse solution via compute_eval._parse_solution()
    GenTask->>GenTask: Create FileSolution with task_id and parsed files
    GenTask->>DataDir: Save generation with solution field

    GenTask->>Evaluator: eval_single(data_point)
    Evaluator->>NVCC: get_nvcc_version() (on init)
    NVCC-->>Evaluator: Return CTK version
    Evaluator->>Evaluator: validate_python(problem) using TypeAdapter
    Evaluator->>Evaluator: validate_python(solution) using TypeAdapter
    Evaluator->>NVCC: evaluate_solution(problem, solution, CTK version)
    NVCC-->>Evaluator: Return graded result (passed, skipped, elapsed_time, outputs)
    Evaluator-->>GenTask: Return evaluation results

    Note over Metrics: Metrics Computation Phase
    GenTask->>Metrics: Compute metrics from results
    Metrics->>Metrics: Extract prediction["passed"] for each result
    Metrics->>Metrics: Compute pass@k statistics
    Metrics-->>User: Display accuracy and pass@k metrics

from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
Contributor (bot):

using protected member _parse_solution from external package

accessing _parse_solution with # noinspection PyProtectedMember bypasses the protection. if this is an internal API in compute-eval, it could break without warning. consider:

  1. requesting compute-eval to expose this as public API
  2. or implement solution parsing directly in nemo-skills


bs4
click < 8.2.0 # https://github.com/ai-dynamo/dynamo/issues/1039
compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@2d14770
Contributor (bot):

dependency pinned to specific commit without version tag

compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@2d14770 is pinned to a commit hash. consider:

  1. using a tagged release if available
  2. documenting why this specific commit is required
  3. monitoring for when compute-eval is published to PyPI

this makes dependency management harder and doesn't benefit from semantic versioning



args = parser.parse_args()

dataset = load_dataset("nvidia/compute-eval", args.release, token=os.getenv("HF_TOKEN", None))
Contributor (bot):

potential failure if HF_TOKEN not set for gated dataset

load_dataset("nvidia/compute-eval", ...) will fail if the dataset is gated and no token is provided. add validation to check if the token is set and provide a helpful error message directing users to obtain a HuggingFace token

Comment on lines 66 to 71
except Exception as e:
    _LOG.error(f"Error during evaluation: {e}")
    return {
        "passed": False,
        "error": str(e),
    }
Contributor (bot):

overly broad exception handling loses error context

catching all exceptions with except Exception and only logging masks validation errors, type errors, and other important failures. the metrics will treat this as a simple failure (passed: False) without distinguishing between:

  • invalid problem/solution format (validation error)
  • missing CUDA dependencies (runtime error)
  • actual test execution failure

consider catching specific exception types or at least including exception type in the error field
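
A minimal version of the second suggestion, keeping the broad catch but preserving the exception type and traceback (a sketch against the except block quoted above; _LOG is the module logger from the diff):

    except Exception as e:
        # _LOG.exception records the full traceback; the error field keeps the exception type.
        _LOG.exception("Error during evaluation")
        return {
            "passed": False,
            "error": f"{type(e).__name__}: {e}",
        }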

Comment on lines +48 to +49
problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])
Contributor (bot):

missing error handling for missing keys

if data_point["problem"] or data_point["solution"] keys don't exist, this raises KeyError which gets caught by the broad exception handler. add explicit validation with helpful error messages

"build_command": item["build_command"],
"context_files_block": _format_context_files_block(item["context_files"]),
}
f.write(json.dumps(record, default=str) + "\n")
Contributor (bot):

using default=str in json.dumps without clear reason

json.dumps(record, default=str) will convert any non-serializable objects to strings. this could hide data issues. since all fields should already be JSON-serializable (strings, dicts, lists), consider removing default=str or documenting why it's needed


@greptile-apps bot (Contributor) commented Jan 8, 2026

Additional Comments (2)

tests/test_datasets.py
missing compute-eval dataset entry

this list should include ("compute-eval", ["eval"]) to ensure the new dataset is tested via test_dataset_init_defaults()


docs/evaluation/code.md
missing documentation for compute-eval benchmark

this PR adds compute-eval support but doesn't document it. users need to know:

  • what compute-eval is
  • how to prepare data (ns prepare_data compute-eval --release=2025-1)
  • that it requires HF_TOKEN env var for gated dataset access
  • how to run evaluation
  • hardware requirements (needs CUDA toolkit/nvcc installed)

see other benchmarks in this file for documentation structure


@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR integrates ComputeEval, a CUDA code generation benchmark, into NeMo-Skills by implementing dataset preparation, generation tasks, evaluation logic, and metrics computation.

Key changes:

  • Dataset preparation downloads problems from HuggingFace gated dataset and formats them with context files
  • Generation task uses compute-eval's _parse_solution to parse LLM-generated CUDA code into structured FileSolution objects
  • Evaluator runs async compilation and test execution with automatic NVCC version detection
  • Metrics compute pass@k scores using the existing framework patterns (see the pass@k sketch after this list)
  • Comprehensive documentation added with security warnings about executing machine-generated CUDA code
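
For reference, pass@k is usually computed with the unbiased estimator from the HumanEval paper; the sketch below shows that formula and is not necessarily the exact code inside ComputeEvalMetrics:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n generations,
    of which c passed, is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))  # 0.3 when 3 of 10 generations pass
```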

Technical implementation:

  • Follows existing NeMo-Skills patterns for dataset integration (similar to human-eval, scicode)
  • Proper error handling throughout with try-catch blocks and graceful degradation
  • Async evaluation with configurable parallelism (default 10 concurrent requests)
  • Uses Pydantic TypeAdapters for robust deserialization of problem and solution objects
  • All integration points properly registered in __init__.py files and METRICS_MAP/EVALUATOR_CLASS_MAP

Confidence Score: 4/5

  • This PR is safe to merge with minor considerations about the protected API dependency
  • The implementation is solid and follows established patterns in the codebase. All components (dataset prep, generation, evaluation, metrics) are properly integrated and include appropriate error handling. The main concern is the dependency on a protected member (_parse_solution) from the external compute-eval package, which could break if that library changes its internal API. However, since compute-eval is also maintained by NVIDIA and pinned to a specific commit, this risk is manageable. The code is well-documented, includes security warnings for executing generated CUDA code, and properly handles edge cases.
  • nemo_skills/inference/eval/compute_eval.py relies on protected API from compute-eval

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| requirements/main.txt | 5/5 | Added compute-eval dependency pinned to specific commit |
| nemo_skills/dataset/compute-eval/prepare.py | 5/5 | Implements dataset download from HuggingFace with proper error handling and formatting |
| nemo_skills/inference/eval/compute_eval.py | 4/5 | Generation task with protected member access; includes proper error handling |
| nemo_skills/evaluation/evaluator/compute_eval.py | 5/5 | Evaluator with proper async execution, NVCC version detection, and comprehensive error handling |
| nemo_skills/evaluation/metrics/code_metrics.py | 5/5 | Added ComputeEvalMetrics class following existing patterns for pass@k metrics |

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HuggingFace
    participant GenerationTask as ComputeEvalGenerationTask
    participant LLM
    participant ParseSolution as _parse_solution
    participant Evaluator as ComputeEvalEvaluator
    participant ComputeEval as compute-eval Library
    participant NVCC
    participant Metrics as ComputeEvalMetrics

    Note over User,HuggingFace: Data Preparation Phase
    User->>PrepareScript: ns prepare_data compute-eval
    PrepareScript->>HuggingFace: load_dataset("nvidia/compute-eval")
    HuggingFace-->>PrepareScript: Dataset with CUDA problems
    PrepareScript->>PrepareScript: Format context_files_block
    PrepareScript->>PrepareScript: Write eval.jsonl

    Note over User,LLM: Generation Phase
    User->>GenerationTask: ns eval (with inference config)
    GenerationTask->>GenerationTask: Load eval.jsonl
    loop For each problem
        GenerationTask->>LLM: Generate CUDA solution (with prompt)
        LLM-->>GenerationTask: Raw code generation
        GenerationTask->>ParseSolution: _parse_solution(generation)
        ParseSolution-->>GenerationTask: Parsed files structure
        GenerationTask->>GenerationTask: Create FileSolution object
    end
    GenerationTask->>GenerationTask: Write output.jsonl

    Note over User,NVCC: Evaluation Phase
    User->>Evaluator: Evaluate generated solutions
    Evaluator->>NVCC: get_nvcc_version()
    NVCC-->>Evaluator: CTK version info
    loop For each solution (async)
        Evaluator->>ComputeEval: evaluate_solution(problem, solution, ctk_version)
        ComputeEval->>ComputeEval: Compile CUDA code
        ComputeEval->>ComputeEval: Execute tests
        ComputeEval-->>Evaluator: GradedResult (passed, skipped, outputs)
        Evaluator->>Evaluator: Write evaluation results
    end

    Note over User,Metrics: Metrics Computation Phase
    User->>Metrics: Compute final metrics
    Metrics->>Metrics: Calculate accuracy (passed rate)
    Metrics->>Metrics: Compute pass@k metrics
    Metrics-->>User: Final metrics.json

from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution

accessing the protected member _parse_solution from an external package; if compute-eval changes this internal API, the code could break
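
One way to contain that risk until a public API exists is to fail loudly at import time; a sketch (the error message and structure are illustrative):

```python
try:
    # noinspection PyProtectedMember
    from compute_eval.generate_completions import _parse_solution
except ImportError as err:
    raise ImportError(
        "The pinned compute-eval commit no longer exposes _parse_solution; "
        "update the pin or switch to a public parsing API once one is published."
    ) from err
```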

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR successfully integrates the ComputeEval benchmark for evaluating LLMs on CUDA code generation into NeMo-Skills.

Major changes:

  • Added dataset preparation script that downloads problems from HuggingFace with proper authentication and formats context files
  • Implemented ComputeEvalGenerationTask to generate CUDA solutions using the LLM
  • Implemented ComputeEvalEvaluator that compiles and executes generated CUDA code against test suites
  • Added ComputeEvalMetrics to compute accuracy and pass@k performance metrics
  • Registered compute-eval across all framework maps (evaluators, metrics, datasets)
  • Added comprehensive documentation with setup instructions and security warnings
  • Pinned compute-eval dependency to specific GitHub commit (2d14770)

Minor issue:

  • The generation task imports a protected member (_parse_solution) from the compute-eval library, which is not ideal but acceptable given the library hasn't published a stable API yet

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk
  • The implementation follows existing patterns in the codebase consistently, includes comprehensive error handling, and provides thorough documentation. The only concern is the use of a protected member from the external library, but this is acknowledged with a suppression comment and is a pragmatic choice given the library's pre-release state. All integration points are properly registered.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| nemo_skills/dataset/compute-eval/prepare.py | 5/5 | adds dataset download and preparation logic with proper formatting for context files |
| nemo_skills/evaluation/evaluator/compute_eval.py | 4/5 | implements evaluator with async support, includes comprehensive error handling |
| nemo_skills/inference/eval/compute_eval.py | 4/5 | implements generation task with solution parsing, uses protected member from external lib |
| nemo_skills/evaluation/metrics/code_metrics.py | 5/5 | adds ComputeEvalMetrics class following existing patterns for pass@k metrics |
| requirements/main.txt | 5/5 | adds compute-eval dependency from GitHub at specific commit |

Sequence Diagram

sequenceDiagram
    participant User
    participant ns_prepare_data as ns prepare_data
    participant HuggingFace as HuggingFace Dataset
    participant prepare_py as prepare.py
    participant ns_eval as ns eval
    participant GenerationTask as ComputeEvalGenerationTask
    participant LLM as LLM Server
    participant Evaluator as ComputeEvalEvaluator
    participant compute_eval as compute-eval Library
    participant NVCC as CUDA Toolkit
    participant Metrics as ComputeEvalMetrics

    User->>ns_prepare_data: ns prepare_data compute-eval --release 2025-1
    ns_prepare_data->>prepare_py: Execute prepare.py
    prepare_py->>HuggingFace: load_dataset with auth token
    HuggingFace-->>prepare_py: Return dataset with CUDA problems
    prepare_py->>prepare_py: Format context files with code fences
    prepare_py->>prepare_py: Write eval.jsonl with problem data
    prepare_py-->>User: Dataset prepared

    User->>ns_eval: ns eval --benchmarks=compute-eval
    ns_eval->>GenerationTask: Initialize ComputeEvalGenerationTask
    ns_eval->>GenerationTask: Load eval.jsonl data
    
    loop For each problem
        GenerationTask->>LLM: Send prompt with problem description
        LLM-->>GenerationTask: Return generated CUDA code
        GenerationTask->>compute_eval: _parse_solution(generation)
        compute_eval-->>GenerationTask: Parsed FileSolution object
        GenerationTask->>GenerationTask: Store solution in output
    end

    ns_eval->>Evaluator: Initialize ComputeEvalEvaluator
    Evaluator->>NVCC: get_nvcc_version()
    NVCC-->>Evaluator: Return CUDA Toolkit version
    Evaluator->>Evaluator: Store CTK major/minor version
    
    loop For each solution (async)
        Evaluator->>compute_eval: evaluate_solution(problem, solution, ctk_version)
        compute_eval->>compute_eval: Compile CUDA code
        compute_eval->>compute_eval: Execute test suite
        compute_eval-->>Evaluator: Return GradedSolution (passed, build_output, test_output)
        Evaluator->>Evaluator: Write evaluation results to file
    end

    ns_eval->>Metrics: Compute metrics
    Metrics->>Metrics: Calculate accuracy and pass@k
    Metrics-->>User: Display metrics.json with results

from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution

importing the private function _parse_solution from an external library. Consider requesting that the compute-eval package export this as a public API, since it's needed for the integration.


@blahblahasdf
Collaborator Author

Closing this PR in favor of one directly from a branch on NeMo Skills.

@gwarmstrong gwarmstrong mentioned this pull request Jan 10, 2026