Add ComputeEval Dataset Support #1124

Conversation
📝 Walkthrough

This PR adds support for NVIDIA's compute-eval dataset and evaluation framework to NeMo Skills. It introduces a new compute-eval evaluator type, corresponding metrics, a dataset preparation script, a generation task for model inference, and registers these components in the appropriate registries.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 6
🧹 Nitpick comments (4)
nemo_skills/dataset/compute-eval/prepare.py (2)
29-34: Consider narrowing the exception type or adding a brief comment.

The bare `Exception` catch works but obscures intent. A comment explaining that any failure during `whoami()` indicates the user isn't logged in would improve clarity.

```diff
 # noinspection PyBroadException
 try:
     api = HfApi()
     api.whoami()
-    return None
+    return None  # User is logged in; HF will use cached credentials
-except Exception:
+except Exception:  # Any failure means user is not authenticated
     return None
```
48-54: Add validation for the expected dataset split.

If the dataset doesn't contain an "eval" split (e.g., due to an invalid release), the script will raise an unclear `KeyError`. Consider validating the split exists.

```diff
 dataset = load_dataset("nvidia/compute-eval", args.release, token=_get_hf_token())
+if "eval" not in dataset:
+    raise ValueError(f"Dataset does not contain 'eval' split. Available splits: {list(dataset.keys())}")
 data_dir = Path(__file__).absolute().parent
```

nemo_skills/inference/eval/compute_eval.py (2)
54-67: Consider using `pass` for no-op methods and prefixing unused parameters.

The no-op lifecycle methods use `return` statements, but `pass` is more idiomatic for intentionally empty methods. Additionally, the `data` parameter in `log_example_prompt` is flagged as unused by static analysis - prefixing it with an underscore would signal it's intentionally unused.

Apply this diff:

```diff
-    def log_example_prompt(self, data):
-        return
+    def log_example_prompt(self, _data):
+        pass

     def setup_prompt(self):
-        return
+        pass

     def setup_llm(self):
-        return
+        pass

     def setup_litellm_cache(self):
-        return
+        pass

     def cleanup_litellm_cache(self):
-        return
+        pass
```
69-69: Prefix unused parameter with underscore.

The `data` parameter is flagged as unused by static analysis. If it's part of the interface but not needed in this implementation, prefix it with an underscore to signal it's intentionally unused.

Apply this diff:

```diff
-    async def process_single_datapoint(self, data_point, data):
+    async def process_single_datapoint(self, data_point, _data):
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- nemo_skills/dataset/compute-eval/__init__.py (1 hunks)
- nemo_skills/dataset/compute-eval/prepare.py (1 hunks)
- nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
- nemo_skills/evaluation/evaluator/compute_eval.py (1 hunks)
- nemo_skills/evaluation/metrics/code_metrics.py (1 hunks)
- nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
- nemo_skills/inference/eval/compute_eval.py (1 hunks)
- requirements/main.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-23T17:56:57.556Z
Learnt from: Glorf
Repo: NVIDIA-NeMo/Skills PR: 908
File: requirements/main.txt:16-16
Timestamp: 2025-11-23T17:56:57.556Z
Learning: faiss-cpu must be explicitly listed in requirements/main.txt for BFCLv4 memory evaluations (memory_kv, memory_vector, memory_rec_sum) as it is an optional dependency of sentence_transformers that is required for vector similarity search functionality in the memory backends.
Applied to files:
requirements/main.txt
🧬 Code graph analysis (4)

nemo_skills/evaluation/evaluator/__init__.py (1)
- nemo_skills/evaluation/evaluator/compute_eval.py (1): ComputeEvalEvaluator (31-63)

nemo_skills/evaluation/metrics/map_metrics.py (1)
- nemo_skills/evaluation/metrics/code_metrics.py (1): ComputeEvalMetrics (126-135)

nemo_skills/evaluation/evaluator/compute_eval.py (2)
- nemo_skills/evaluation/evaluator/base.py (1): BaseEvaluator (34-91)
- nemo_skills/utils.py (1): get_logger_name (39-43)

nemo_skills/evaluation/metrics/code_metrics.py (7)
- nemo_skills/evaluation/metrics/base.py (5): BaseMetrics (23-434), _get_score_dict (124-143), get_incorrect_sample (200-206), update (145-189), _compute_pass_at_k (352-423)
- nemo_skills/evaluation/metrics/if_metrics.py (3): _get_score_dict (24-28), get_incorrect_sample (30-33), update (35-48)
- nemo_skills/evaluation/metrics/answer_judgement_metrics.py (3): _get_score_dict (35-39), get_incorrect_sample (41-47), update (121-132)
- nemo_skills/evaluation/metrics/lean4_metrics.py (3): _get_score_dict (23-24), get_incorrect_sample (26-29), update (46-48)
- nemo_skills/evaluation/metrics/math_metrics.py (3): _get_score_dict (67-79), get_incorrect_sample (81-88), update (90-120)
- nemo_skills/evaluation/metrics/ruler_metrics.py (3): _get_score_dict (19-20), get_incorrect_sample (26-29), update (22-24)
- nemo_skills/evaluation/metrics/arena_metrics.py (2): get_incorrect_sample (36-40), update (42-84)
🪛 Ruff (0.14.8)
nemo_skills/inference/eval/compute_eval.py
54-54: Unused method argument: data
(ARG002)
69-69: Unused method argument: data
(ARG002)
nemo_skills/dataset/compute-eval/prepare.py
32-32: Consider moving this statement to an else block
(TRY300)
33-33: Do not catch blind exception: Exception
(BLE001)
nemo_skills/evaluation/metrics/code_metrics.py
130-130: Unused method argument: prediction
(ARG002)
🔇 Additional comments (10)

nemo_skills/evaluation/metrics/code_metrics.py (1)

126-135: LGTM - Implementation follows established patterns.

The `ComputeEvalMetrics` class correctly mirrors the structure of `HumanEvalInfillingMetrics` (lines 114-123). The unused `prediction` parameter in `get_incorrect_sample` is required by the base class interface, so the static analysis warning is a false positive.

nemo_skills/evaluation/metrics/map_metrics.py (2)

26-26: LGTM - Import correctly added.

72-72: LGTM - Registry entry properly added.

The key `"compute-eval"` correctly matches `METRICS_TYPE` defined in `nemo_skills/dataset/compute-eval/__init__.py`.

nemo_skills/evaluation/evaluator/__init__.py (2)

29-29: LGTM - Import correctly placed.

71-71: LGTM - Evaluator correctly registered in class map only.

The `ComputeEvalEvaluator` is correctly placed in `EVALUATOR_CLASS_MAP` (not `EVALUATOR_MAP`) since it implements `eval_single` for async single-point evaluation.

nemo_skills/evaluation/evaluator/compute_eval.py (1)
48-62: LGTM - Correct use of asyncio.to_thread for blocking evaluation.

Running `evaluate_solution` in a thread pool correctly avoids blocking the event loop during CUDA compilation and test execution.
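For context, a minimal self-contained sketch of the `asyncio.to_thread` pattern this comment refers to; `run_blocking_eval` is a stand-in for compute-eval's blocking compile-and-test call, not the library's actual API:

```python
import asyncio
import time


def run_blocking_eval(problem: str, solution: str) -> dict:
    """Stand-in for a blocking compile-and-test step (e.g. compute-eval's evaluate_solution)."""
    time.sleep(1)  # simulate compilation and test execution
    return {"passed": True, "elapsed_time": 1.0}


async def eval_single(problem: str, solution: str) -> dict:
    # Offload the blocking call to a worker thread so the asyncio event loop
    # stays free to schedule other in-flight evaluations.
    return await asyncio.to_thread(run_blocking_eval, problem, solution)


async def main():
    # Three evaluations run concurrently even though each one blocks a thread.
    results = await asyncio.gather(*(eval_single("p", "s") for _ in range(3)))
    print(results)


asyncio.run(main())
```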
requirements/main.txt (1)

17-17: LGTM - Git commit pinning ensures reproducibility.

The dependency is properly pinned to a specific commit (991b47c, currently the main branch HEAD) and correctly sorted alphabetically. The short hash is valid and unambiguous for this repository. Using the full 40-character SHA would provide additional robustness against theoretical hash collisions, but the current approach is acceptable for practical purposes.
nemo_skills/dataset/compute-eval/__init__.py (1)

15-19: LGTM - Dataset constants properly defined.

Constants are internally consistent: `METRICS_TYPE` matches the registry key in `map_metrics.py` (line 72), and `EVAL_SPLIT` matches the split accessed in `prepare.py` (lines 52-53). The `GENERATION_MODULE` path correctly points to the existing `nemo_skills.inference.eval.compute_eval` module.

nemo_skills/inference/eval/compute_eval.py (2)
1-34: LGTM!

The imports and module-level setup are well-structured. The TypeAdapter with a discriminated union for problem validation is a clean approach.
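As an illustration of that pattern, a hedged sketch of a Pydantic `TypeAdapter` over a discriminated union; the model names and fields below are hypothetical, not compute-eval's actual data model:

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


class SingleFileProblem(BaseModel):
    kind: Literal["single_file"] = "single_file"
    task_id: str
    prompt: str


class MultiFileProblem(BaseModel):
    kind: Literal["multi_file"] = "multi_file"
    task_id: str
    context_files: dict[str, str]


# The discriminator field lets validate_python pick the right model variant.
Problem = Annotated[Union[SingleFileProblem, MultiFileProblem], Field(discriminator="kind")]
PROBLEM_ADAPTER = TypeAdapter(Problem)

problem = PROBLEM_ADAPTER.validate_python(
    {"kind": "single_file", "task_id": "CUDA/0", "prompt": "Write a vector-add kernel."}
)
print(type(problem).__name__)  # SingleFileProblem
```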
86-102: LGTM!

The module exports and main entry point are well-structured. The help message handling and logging setup follow good practices.
gwarmstrong left a comment
Looks pretty good, with a couple high level questions/concerns
```python
# noinspection PyBroadException
try:
    api = HfApi()
```
what is the purpose of this?
This allows a user to access the gated HuggingFace data (you have to be allow-listed to have access to the ComputeEval data today) either via an environment variable (HF_TOKEN) or by logging in with the HuggingFace CLI.
Actually, after looking closer we don't need to access the API at all for this to work. I'll just clean this up.
yeah, we should just need a load_dataset call and it will handle the token if it's already in your env variables, like this one is gated: https://github.com/blahblahasdf/Skills/blob/cb7829094799f9b5967e260d30e39c2535f67bab/nemo_skills/dataset/flores200/prepare.py#L49
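A minimal sketch of that approach (the "2025-1" release name is taken from the prepare command used later in this PR; the token argument is optional when you're already logged in via the CLI):

```python
import os

from datasets import load_dataset

# datasets/huggingface_hub pick up HF_TOKEN or cached `huggingface-cli login`
# credentials automatically, so the explicit token argument is just a convenience.
dataset = load_dataset(
    "nvidia/compute-eval",        # gated repo: requires allow-listed access
    "2025-1",                     # dataset release/config name
    token=os.getenv("HF_TOKEN"),  # falls back to cached credentials when None
)
print(dataset)
```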
```python
    generate_model_completions,
    system_prompt=self._system_prompt,
    problem=problem,
    model=self._model,
```
Looks like this is applying this function:
https://github.com/NVIDIA/compute-eval/blob/main/compute_eval/generate_completions.py#L31-L84
Which appears to use the ComputeEval clients?
What I would recommend we do instead is write a version of that function that uses NeMo-Skills' `process_single_datapoint` -- so this probably looks something like

```python
result = await super().process_single_datapoint(data_point, data)
...
completion = ...
return FileSolution(
    task_id=problem.task_id,
    files=_parse_solution(completion),
    **debug_info,
)
```

this will make sure the prompt logic is consistent with typical NeMo-Skills and that the eval prompt logic is the same as normal generations.
Then the above methods that are overridden would just use the default methods rather than override to return nothing.
Thanks for the quick feedback! Your suggestion is to call process_single_datapoint and then use the ComputeEval post-processing to convert the model's response to a FileSolution? The "ComputeEval clients" are just a pass-through to the OpenAI client. I think as-is, if you were to configure a local model (or a model hosted somewhere else), it would just work, but I agree it doesn't fit super cleanly with the NeMo-Skills abstractions.
Honestly, the generation side of the ComputeEval repo is generally intended to be a reference implementation for how other groups/companies would set up a generator scaffold to solve the ComputeEval problems. The reference implementation's prompt and post processing logic are pretty coupled.
The logic for calling the openai client is here, right? https://github.com/NVIDIA/compute-eval/blob/main/compute_eval/models/model_interface.py#L100-L154
Seems pretty straightforward and is pretty much a subset of the functionality our clients have. And if we do it the way it's currently being done, you lose a lot of things we support in NeMo-Skills (like tool-calling and pretty much all sampling parameters not directly specified in the compute-eval function), and the supported server types would be quite limited.
Thanks for the pointers and happy New Year! I see where you're coming from on fitting this in with the NeMo-Skills generation more cleanly. What is the expectation on how to document this GenerationTask's content format? I was able to get this working locally, but there is definitely some coupling between the expected data format and the system and user prompts required in a ++prompt_config. I would guess that is normal and just needs to be documented?
Yeah that's pretty typical. We don't really have it documented on a case-by-case basis--it's pretty easy to infer from the prompt config itself, and we have a generic section on understanding the interaction between data and prompt_config here: https://nvidia-nemo.github.io/Skills/basics/prompt-format/#prompt-api
Greptile Summary

This PR integrates ComputeEval, a CUDA code generation benchmark, into the NeMo-Skills framework. The implementation adds dataset preparation from HuggingFace, a generation task for producing CUDA solutions, an evaluator that compiles and tests generated code, and metrics computation including pass@k scores.

Key changes:

Note: The compute-eval dependency version was updated from

Confidence Score: 4/5

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
participant User
participant Prepare as prepare.py
participant HF as HuggingFace
participant Gen as ComputeEvalGenerationTask
participant LLM as Language Model
participant Eval as ComputeEvalEvaluator
participant CE as compute-eval library
participant Metrics as ComputeEvalMetrics
User->>Prepare: Run dataset preparation
Prepare->>HF: load_dataset("nvidia/compute-eval", release)
HF-->>Prepare: Return dataset with problems
Prepare->>Prepare: Format context_files_block
Prepare->>Prepare: Write eval.jsonl with problem data
User->>Gen: Generate solutions
Gen->>Gen: Read eval.jsonl
loop For each problem
Gen->>LLM: Send problem prompt
LLM-->>Gen: Return generated code
Gen->>CE: _parse_solution(generation)
CE-->>Gen: Parsed file solutions
Gen->>Gen: Create FileSolution object
Gen->>Gen: Write solution to output
end
User->>Eval: Evaluate solutions
Eval->>Eval: get_nvcc_version() and parse_semver()
loop For each data_point
Eval->>Eval: Validate problem and solution
Eval->>CE: evaluate_solution(problem, solution, ctk_version)
CE->>CE: Compile and test CUDA code
CE-->>Eval: Return graded result (passed, skipped, etc.)
Eval->>Eval: Return evaluation dict
end
User->>Metrics: Compute metrics
Metrics->>Metrics: Aggregate accuracy scores
Metrics->>Metrics: Compute pass@k metrics
Metrics-->>User: Return final metrics
```
Additional Comments (1)

- nemo_skills/inference/eval/compute_eval.py, line 21 (link)

  style: Using protected member `_parse_solution`. Consider requesting public API when compute-eval stabilizes.
8 files reviewed, 1 comment
```python
from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
```
style: accessing protected member _parse_solution from external library
If _parse_solution becomes unavailable in future versions of compute-eval, this will break. Consider requesting a public API from the compute-eval library.
Greptile found no issues! From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".
Greptile Overview
Greptile Summary
This PR integrates ComputeEval benchmark for evaluating CUDA code generation into NeMo-Skills.
Major Changes
- Added dataset preparation script that downloads 406 CUDA problems from gated HuggingFace dataset
- Implemented `ComputeEvalGenerationTask` that generates CUDA solutions and parses them into `FileSolution` format
- Implemented `ComputeEvalEvaluator` that compiles and tests CUDA code using NVCC with automatic toolkit version detection
- Added `ComputeEvalMetrics` class for computing accuracy and pass@k performance metrics
- Registered compute-eval across all integration points (evaluators, metrics, dataset configs)
Issues Found
- Dependency version mismatch: PR description specifies commit 991b47c but requirements/main.txt references the @main branch, which may introduce breaking changes
- Fragile external API dependency: uses protected member `_parse_solution` from the compute-eval library, which could break if the library refactors its internal APIs
Architecture Assessment
The implementation follows NeMo-Skills patterns consistently - uses BaseEvaluator async evaluation, BaseMetrics for metrics computation, and proper registration in all required maps. The integration is clean and well-structured.
Confidence Score: 4/5
- Safe to merge after fixing the dependency version mismatch in requirements.txt
- Score reflects well-structured implementation following framework patterns, but critical dependency version mismatch needs resolution. The code is functional and integrates cleanly, but the @main reference instead of a pinned commit could cause instability.
- Pay attention to requirements/main.txt - must fix version mismatch before merging

Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| requirements/main.txt | 3/5 | Added compute-eval dependency but references @main instead of specific commit @991b47c mentioned in PR description |
| nemo_skills/evaluation/evaluator/compute_eval.py | 5/5 | Implements async evaluator that validates problems/solutions and runs CUDA compilation and tests with NVCC version checking |
| nemo_skills/inference/eval/compute_eval.py | 4/5 | Generation task that parses LLM output into FileSolution format - relies on protected _parse_solution API from external library |
Sequence Diagram
```mermaid
sequenceDiagram
participant User
participant PrepareScript as prepare.py
participant HF as HuggingFace
participant GenerationTask as ComputeEvalGenerationTask
participant LLM as Language Model
participant Evaluator as ComputeEvalEvaluator
participant ComputeEval as compute-eval Library
participant NVCC as CUDA Toolkit
participant Metrics as ComputeEvalMetrics
Note over User,HF: 1. Dataset Preparation Phase
User->>PrepareScript: Run prepare.py --release 2025-1
PrepareScript->>HF: load_dataset("nvidia/compute-eval")
HF-->>PrepareScript: Return 406 CUDA problems
PrepareScript->>PrepareScript: Format context files
PrepareScript->>PrepareScript: Write eval.jsonl
Note over User,Metrics: 2. Generation Phase
User->>GenerationTask: Run compute_eval inference
GenerationTask->>GenerationTask: Load eval.jsonl
loop For each problem
GenerationTask->>LLM: Send prompt with context
LLM-->>GenerationTask: Return CUDA code generation
GenerationTask->>ComputeEval: _parse_solution(generation)
ComputeEval-->>GenerationTask: FileSolution object
GenerationTask->>GenerationTask: Save solution + generation
end
Note over User,Metrics: 3. Evaluation Phase
User->>Evaluator: Run evaluation
Evaluator->>NVCC: get_nvcc_version()
NVCC-->>Evaluator: CTK version (e.g., 12.3)
loop For each solution (async)
Evaluator->>ComputeEval: evaluate_solution(problem, solution)
ComputeEval->>NVCC: Compile CUDA code
NVCC-->>ComputeEval: Build result
ComputeEval->>ComputeEval: Run tests
ComputeEval-->>Evaluator: Graded result (passed/failed)
Evaluator->>Evaluator: Update data with pass/fail
end
Note over User,Metrics: 4. Metrics Computation
User->>Metrics: Compute metrics
Metrics->>Metrics: Calculate accuracy
Metrics->>Metrics: Compute pass@k scores
Metrics-->>User: Final metrics report
```
requirements/main.txt (Outdated)

```text
bs4
click < 8.2.0 # https://github.com/ai-dynamo/dynamo/issues/1039
compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@main
```
PR description specifies commit 991b47c but requirements file references @main branch
```diff
-compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@main
+compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@991b47c
```
```python
from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
```
accessing protected member _parse_solution from external library - fragile dependency that may break if compute-eval refactors internal APIs
Greptile Overview
Greptile Summary
This PR successfully integrates ComputeEval, a CUDA code generation benchmark, into NeMo-Skills following established patterns in the codebase.
Major changes:
- Added dataset preparation script to download and format problems from HuggingFace's gated nvidia/compute-eval dataset
- Implemented `ComputeEvalEvaluator` that compiles and executes CUDA code with automatic CTK version detection
- Created `ComputeEvalGenerationTask` to generate CUDA solutions using the LLM inference pipeline
- Added `ComputeEvalMetrics` class to compute pass@k performance metrics
- Registered compute-eval across all integration points (evaluator map, metrics map, dataset config)
Integration quality:
The implementation follows NeMo-Skills conventions consistently - the evaluator extends BaseEvaluator, metrics extend BaseMetrics with _compute_pass_at_k, and the generation task extends GenerationTask. The dataset configuration uses standard constants (EVAL_SPLIT, DATASET_GROUP, METRICS_TYPE).
Minor concerns:
- Uses a protected API (`_parse_solution`) from compute-eval that could break in future versions
- Broad exception handling in evaluator could mask debugging issues
- Dependency commit hash differs from PR description (appears intentional from git history)
Confidence Score: 4/5
- This PR is safe to merge with minor style improvements recommended
- The implementation is solid and follows established patterns throughout the codebase. The core logic correctly integrates compute-eval's evaluation pipeline with async support. The main concerns are style-related: reliance on a protected API and broad exception handling. These won't cause runtime failures but could impact maintainability. NVCC detection at initialization is appropriate.
- Pay attention to nemo_skills/inference/eval/compute_eval.py due to protected API usage, and verify the compute-eval dependency version matches expectations

Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/dataset/compute-eval/prepare.py | 5/5 | Dataset preparation script to download and transform ComputeEval problems from HuggingFace |
| nemo_skills/evaluation/evaluator/compute_eval.py | 4/5 | Evaluator implementation that validates solutions against CUDA problems with CTK version checks |
| nemo_skills/inference/eval/compute_eval.py | 3/5 | Generation task that parses generated CUDA code and creates solutions, but uses protected API |
| requirements/main.txt | 4/5 | Added compute-eval dependency from GitHub at commit 2d14770 (differs from PR description) |
Sequence Diagram
```mermaid
sequenceDiagram
participant User
participant prepare.py
participant HuggingFace
participant GenerationTask
participant LLM
participant Evaluator
participant compute_eval
participant Metrics
User->>prepare.py: Run dataset preparation
prepare.py->>HuggingFace: load_dataset("nvidia/compute-eval", release)
HuggingFace-->>prepare.py: Return problems with task_id, prompt, context_files
prepare.py->>prepare.py: Format context_files with language fences
prepare.py-->>User: Save to eval.jsonl
User->>GenerationTask: Run inference (ComputeEvalGenerationTask)
GenerationTask->>GenerationTask: Load eval.jsonl
loop For each problem
GenerationTask->>LLM: Generate CUDA solution
LLM-->>GenerationTask: Generated code
GenerationTask->>GenerationTask: Parse solution with _parse_solution()
GenerationTask->>GenerationTask: Create FileSolution object
end
GenerationTask-->>User: Save generations with solutions
User->>Evaluator: Run evaluation (ComputeEvalEvaluator)
Evaluator->>Evaluator: Detect NVCC version and CTK
loop For each generated solution
Evaluator->>Evaluator: Validate problem and solution schemas
Evaluator->>compute_eval: evaluate_solution(problem, solution, ctk_version)
compute_eval->>compute_eval: Compile and run CUDA code
compute_eval-->>Evaluator: Return passed, skipped, elapsed_time, outputs
end
Evaluator-->>User: Save evaluation results
User->>Metrics: Compute metrics (ComputeEvalMetrics)
Metrics->>Metrics: Calculate accuracy from passed field
Metrics->>Metrics: Compute pass@k scores
Metrics-->>User: Display results
```
```python
from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
```
Uses protected member _parse_solution from compute_eval.generate_completions. Protected members (prefixed with _) are not part of the public API and may change without notice. Consider requesting a public API from the compute-eval package or copying the implementation.
```python
# noinspection PyBroadException
try:
```
Using except Exception catches all exceptions broadly. While the error is logged and returned in the result, this could mask unexpected issues during development (e.g., KeyError from missing data_point["problem"]). Consider catching more specific exceptions or at least logging the traceback for debugging.
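A small illustrative sketch (not the PR's code) of keeping the broad catch while preserving the traceback; `run_tests` is a stand-in for the real compile-and-test step:

```python
import logging

LOG = logging.getLogger(__name__)


def run_tests(problem: str, solution: str) -> bool:
    """Stand-in for the blocking compile-and-test step."""
    raise RuntimeError("nvcc not found")  # simulate an environment failure


def evaluate_with_logging(data_point: dict) -> dict:
    try:
        return {"passed": run_tests(data_point["problem"], data_point["solution"])}
    except Exception:
        # LOG.exception records the full traceback at ERROR level, so a KeyError
        # from missing fields is distinguishable from a compilation failure.
        LOG.exception("Evaluation failed for task %s", data_point.get("task_id"))
        return {"passed": False, "error": "see traceback in logs"}


print(evaluate_with_logging({"problem": "p", "solution": "s", "task_id": "CUDA/0"}))
```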
Greptile Overview
Greptile Summary
This PR integrates ComputeEval, a CUDA code generation benchmark, into NeMo-Skills. The implementation adds dataset preparation, generation task, evaluator, and metrics following the framework's patterns.
Major Issues Found:
- Missing `++prompt_config` in GENERATION_ARGS will cause validation failures during generation
- Dataset preparation lacks error handling for authentication failures and schema validation
- Evaluator catches all exceptions indiscriminately and returns inconsistent error response fields
- Generation task uses protected API (`_parse_solution`) from upstream library without error handling
Architecture:
The integration correctly follows NeMo-Skills patterns by:
- Registering the evaluator in EVALUATOR_CLASS_MAP
- Implementing BaseEvaluator with async eval_single method
- Using TypeAdapter for Pydantic validation of compute-eval data models
- Computing pass@k metrics consistent with other code benchmarks
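For reference, the standard unbiased pass@k estimator that such metrics helpers typically compute (a generic sketch, not NeMo-Skills' `_compute_pass_at_k` implementation itself): given n generations per problem with c of them passing, pass@k = 1 - C(n-c, k) / C(n, k).

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed."""
    if n - c < k:
        return 1.0  # any k-subset must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 generations per problem, 3 passed the compute-eval tests.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```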
Dependencies:
Adds compute-eval package pinned to commit 2d14770. Note this is a public but not yet PyPI-published dependency.
Confidence Score: 3/5
- This PR has several critical issues that will cause runtime failures in common scenarios
- Score reflects multiple logic errors that will cause failures: missing prompt_config will fail validation, missing error handling will cause unclear crashes, inconsistent error responses could break downstream processing, and use of protected APIs creates fragility
- Pay close attention to nemo_skills/dataset/compute-eval/__init__.py (missing prompt_config), prepare.py (no error handling), and both evaluator files (error handling issues)

Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/dataset/compute-eval/__init__.py | 3/5 | Dataset configuration file missing required prompt_config in GENERATION_ARGS, which will cause validation failure during generation |
| nemo_skills/dataset/compute-eval/prepare.py | 3/5 | Dataset preparation script lacks error handling for authentication failures and dataset schema validation |
| nemo_skills/evaluation/evaluator/compute_eval.py | 3/5 | Evaluator catches all exceptions indiscriminately and returns inconsistent error fields |
| nemo_skills/inference/eval/compute_eval.py | 3/5 | Generation task uses protected API from upstream library and lacks error handling in data processing |
Sequence Diagram
```mermaid
sequenceDiagram
participant User
participant PrepareScript as prepare.py
participant HF as HuggingFace
participant GenTask as ComputeEvalGenerationTask
participant LLM as Language Model
participant Evaluator as ComputeEvalEvaluator
participant ComputeEval as compute-eval Library
participant Metrics as ComputeEvalMetrics
User->>PrepareScript: Run dataset preparation
PrepareScript->>HF: load_dataset("nvidia/compute-eval", release)
HF-->>PrepareScript: Return dataset with problems
PrepareScript->>PrepareScript: Format context_files_block
PrepareScript->>PrepareScript: Write eval.jsonl with task_id, prompt, etc.
User->>GenTask: Generate solutions
GenTask->>GenTask: Read eval.jsonl data
loop For each problem
GenTask->>LLM: Send prompt with problem context
LLM-->>GenTask: Return CUDA code generation
GenTask->>ComputeEval: _parse_solution(generation)
ComputeEval-->>GenTask: Parsed FileSolution
GenTask->>GenTask: Create solution record with model_dump()
end
User->>Evaluator: Evaluate solutions
loop For each solution (async)
Evaluator->>Evaluator: validate_python(problem & solution)
Evaluator->>ComputeEval: evaluate_solution(problem, solution, ctk_version)
ComputeEval->>ComputeEval: Compile CUDA code
ComputeEval->>ComputeEval: Execute tests
ComputeEval-->>Evaluator: Return graded result (passed, skipped, etc.)
Evaluator->>Evaluator: Write evaluation to file
end
User->>Metrics: Compute metrics
Metrics->>Metrics: Calculate accuracy from "passed" field
Metrics->>Metrics: Compute pass@k statistics
Metrics-->>User: Return final metrics
```
```python
except Exception as e:
    _LOG.error(f"Error during evaluation: {e}")
    return {
        "passed": False,
        "error": str(e),
    }
```
catching all exceptions masks critical errors and returns inconsistent fields - successful evaluation returns skipped, elapsed_time, build_output, test_output but error case only returns passed and error
Suggested change:

```diff
-except Exception as e:
-    _LOG.error(f"Error during evaluation: {e}")
-    return {
-        "passed": False,
-        "error": str(e),
-    }
+except KeyError as e:
+    _LOG.error(f"Missing required field in data_point: {e}")
+    return {
+        "passed": False,
+        "skipped": False,
+        "elapsed_time": 0.0,
+        "build_output": "",
+        "test_output": "",
+        "error": f"Missing required field: {e}",
+    }
+except Exception as e:
+    _LOG.error(f"Error during evaluation: {e}")
+    return {
+        "passed": False,
+        "skipped": False,
+        "elapsed_time": 0.0,
+        "build_output": "",
+        "test_output": "",
+        "error": str(e),
+    }
```
```python
# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
```
accessing protected member _parse_solution from external library - upstream can break this without warning
```python
async def process_single_datapoint(self, data_point, data):
    res = await super().process_single_datapoint(data_point, data)

    solution = FileSolution(
        task_id=data_point["task_id"],
        files=_parse_solution(res["generation"]),
    )
    return {
        "solution": solution.model_dump(),
        "generation": res["generation"],
    }
```
missing error handling - _parse_solution can fail, data_point might not have task_id, res might not have generation
Suggested change:

```diff
 async def process_single_datapoint(self, data_point, data):
     res = await super().process_single_datapoint(data_point, data)
-    solution = FileSolution(
-        task_id=data_point["task_id"],
-        files=_parse_solution(res["generation"]),
-    )
-    return {
-        "solution": solution.model_dump(),
-        "generation": res["generation"],
-    }
+    try:
+        solution = FileSolution(
+            task_id=data_point["task_id"],
+            files=_parse_solution(res["generation"]),
+        )
+        return {
+            "solution": solution.model_dump(),
+            "generation": res["generation"],
+        }
+    except KeyError as e:
+        _LOG.error(f"Missing required field: {e}")
+        raise
+    except Exception as e:
+        _LOG.error(f"Failed to parse solution: {e}")
+        raise
```
```python
DATASET_GROUP = "code"
METRICS_TYPE = "compute-eval"
GENERATION_MODULE = "nemo_skills.inference.eval.compute_eval"
GENERATION_ARGS = "++eval_type=compute-eval"
```
missing ++prompt_config in GENERATION_ARGS - will fail validation unless user specifies ++prompt_format=openai (see nemo_skills/inference/generate.py line 242 which requires prompt_config when prompt_format=='ns')
Suggested change:

```diff
-GENERATION_ARGS = "++eval_type=compute-eval"
+GENERATION_ARGS = "++prompt_config=generic/codegen ++eval_type=compute-eval"
```
```python
args = parser.parse_args()

dataset = load_dataset("nvidia/compute-eval", args.release, token=os.getenv("HF_TOKEN", None))
```
missing error handling for gated dataset authentication and network failures - will crash with unclear error if authentication fails or network is down
```python
"task_id": item["task_id"],
"problem_prompt": item["prompt"],
"build_command": item["build_command"],
"context_files_block": _format_context_files_block(item["context_files"]),
```
no validation that context_files list items have required path and content keys - will raise KeyError if dataset schema changes
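A hedged sketch of what that validation could look like; the exact output layout of the real `_format_context_files_block` may differ, and only the `path`/`content` keys named above are assumed:

```python
def format_context_files_block(context_files: list[dict]) -> str:
    """Defensive variant: check the expected keys before formatting."""
    blocks = []
    for i, cf in enumerate(context_files):
        missing = {"path", "content"} - cf.keys()
        if missing:
            raise ValueError(f"context_files[{i}] is missing keys: {sorted(missing)}")
        blocks.append(f"// File: {cf['path']}\n{cf['content']}")
    return "\n\n".join(blocks)


print(format_context_files_block([{"path": "kernel.cuh", "content": "__global__ void add();"}]))
```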
Greptile Overview
Greptile Summary
This PR integrates ComputeEval, a CUDA code generation benchmark with 406+ problems, into NeMo-Skills. The implementation follows established patterns and includes dataset preparation, custom generation task, evaluator, and metrics computation.
Key Changes:
- Added dataset preparation script to download from gated HuggingFace dataset
- Implemented `ComputeEvalGenerationTask` using compute-eval's solution parsing
- Created `ComputeEvalEvaluator` with NVCC version detection and async evaluation
- Added `ComputeEvalMetrics` for pass@k computation
- Registered evaluator in `EVALUATOR_CLASS_MAP` and metrics in `METRICS_MAP`
- Pinned compute-eval dependency to specific commit
Issues Found:
- Missing documentation for compute-eval benchmark (critical for users)
- Missing test coverage - compute-eval not added to DATASETS list in tests/test_datasets.py
- Using protected member `_parse_solution` from compute-eval package could break without warning
- No validation for required HF_TOKEN environment variable before attempting dataset download
- Overly broad exception handling in evaluator loses important error context
- Missing explicit validation for required data_point keys
Architecture Notes:
The implementation correctly uses a custom generation module (nemo_skills.inference.eval.compute_eval) specified in the dataset __init__.py, which differs from standard generation. The evaluator properly checks for NVCC installation at initialization and uses TypeAdapters for discriminated union validation of problem/solution types.
Confidence Score: 3/5
- Safe to merge with documentation and test coverage improvements needed
- The core implementation is solid and follows established patterns in the codebase. The evaluator, metrics, and generation task are well-structured. However, the missing documentation is a significant gap that will impact usability, and the lack of test coverage means the new dataset won't be validated by CI. The use of a protected API from the external package and overly broad exception handling are code quality concerns but not blockers. No critical bugs or security issues were found.
- Pay close attention to docs/evaluation/code.md (needs compute-eval documentation), tests/test_datasets.py (needs dataset entry), and nemo_skills/dataset/compute-eval/prepare.py (needs HF_TOKEN validation)
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| tests/test_datasets.py | 3/5 | Missing compute-eval dataset entry in DATASETS list for test coverage |
| nemo_skills/dataset/compute-eval/prepare.py | 4/5 | Dataset preparation script added for downloading ComputeEval from HuggingFace |
| nemo_skills/evaluation/evaluator/compute_eval.py | 4/5 | Evaluator implementation added with NVCC version detection and async evaluation |
| nemo_skills/inference/eval/compute_eval.py | 3/5 | Generation task implementation uses protected member _parse_solution from compute_eval package |
| requirements/main.txt | 4/5 | Added compute-eval dependency pinned to specific commit from GitHub |
| docs/evaluation/code.md | 3/5 | No documentation added for compute-eval benchmark despite adding IOI and BIRD docs |
Sequence Diagram
```mermaid
sequenceDiagram
participant User
participant PrepareScript as prepare.py
participant HF as HuggingFace
participant DataDir as Data Directory
participant GenTask as ComputeEvalGenerationTask
participant LLM as Language Model
participant Evaluator as ComputeEvalEvaluator
participant NVCC as CUDA Toolkit
participant Metrics as ComputeEvalMetrics
Note over User,DataDir: Data Preparation Phase
User->>PrepareScript: ns prepare_data compute-eval --release=2025-1
PrepareScript->>HF: load_dataset with auth token
HF-->>PrepareScript: Return dataset["eval"]
PrepareScript->>PrepareScript: Format context_files_block for each problem
PrepareScript->>DataDir: Write eval.jsonl with problem, task_id, prompt, etc.
Note over User,Metrics: Generation & Evaluation Phase
User->>GenTask: ns eval --benchmarks=compute-eval
GenTask->>GenTask: Load eval.jsonl
GenTask->>LLM: Generate CUDA code solution
LLM-->>GenTask: Return generation text
GenTask->>GenTask: Parse solution via compute_eval._parse_solution()
GenTask->>GenTask: Create FileSolution with task_id and parsed files
GenTask->>DataDir: Save generation with solution field
GenTask->>Evaluator: eval_single(data_point)
Evaluator->>NVCC: get_nvcc_version() (on init)
NVCC-->>Evaluator: Return CTK version
Evaluator->>Evaluator: validate_python(problem) using TypeAdapter
Evaluator->>Evaluator: validate_python(solution) using TypeAdapter
Evaluator->>NVCC: evaluate_solution(problem, solution, CTK version)
NVCC-->>Evaluator: Return graded result (passed, skipped, elapsed_time, outputs)
Evaluator-->>GenTask: Return evaluation results
Note over Metrics: Metrics Computation Phase
GenTask->>Metrics: Compute metrics from results
Metrics->>Metrics: Extract prediction["passed"] for each result
Metrics->>Metrics: Compute pass@k statistics
Metrics-->>User: Display accuracy and pass@k metrics
```
```python
from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
```
using protected member _parse_solution from external package
accessing _parse_solution with # noinspection PyProtectedMember bypasses the protection. if this is an internal API in compute-eval, it could break without warning. consider:
- requesting compute-eval to expose this as public API
- or implement solution parsing directly in nemo-skills
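For illustration, a sketch of what an in-repo parser might look like. The file-marker convention below is hypothetical and is not compute-eval's actual response format; it only shows that the parsing step is small enough to own locally:

```python
import re

# Hypothetical convention: each file is announced as "// FILE: <path>" followed by
# a fenced code block containing the file body.
FILE_RE = re.compile(
    r"//\s*FILE:\s*(?P<path>\S+)\s*\n```[a-zA-Z+]*\n(?P<body>.*?)```",
    re.DOTALL,
)


def parse_solution(generation: str) -> dict[str, str]:
    """Map file path -> file contents parsed from a model response."""
    return {m.group("path"): m.group("body") for m in FILE_RE.finditer(generation)}


example = "// FILE: kernel.cu\n```cuda\n__global__ void add() {}\n```"
print(parse_solution(example))  # {'kernel.cu': '__global__ void add() {}\n'}
```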
```text
bs4
click < 8.2.0 # https://github.com/ai-dynamo/dynamo/issues/1039
compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@2d14770
```
dependency pinned to specific commit without version tag
compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@2d14770 is pinned to a commit hash. consider:
- using a tagged release if available
- documenting why this specific commit is required
- monitoring for when compute-eval is published to PyPI
this makes dependency management harder and doesn't benefit from semantic versioning
```python
args = parser.parse_args()

dataset = load_dataset("nvidia/compute-eval", args.release, token=os.getenv("HF_TOKEN", None))
```
potential failure if HF_TOKEN not set for gated dataset
load_dataset("nvidia/compute-eval", ...) will fail if the dataset is gated and no token is provided. add validation to check if the token is set and provide a helpful error message directing users to obtain a HuggingFace token
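A small sketch of such a guard (illustrative only, not the PR's code); it warns rather than exits because, as noted earlier in the thread, cached `huggingface-cli login` credentials also work without HF_TOKEN:

```python
import os
import warnings

from datasets import load_dataset

if not os.getenv("HF_TOKEN"):
    warnings.warn(
        "HF_TOKEN is not set. nvidia/compute-eval is a gated dataset, so this will "
        "only work if you have already authenticated via `huggingface-cli login` "
        "and been granted access to the repository."
    )

dataset = load_dataset("nvidia/compute-eval", "2025-1", token=os.getenv("HF_TOKEN"))
```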
```python
except Exception as e:
    _LOG.error(f"Error during evaluation: {e}")
    return {
        "passed": False,
        "error": str(e),
    }
```
overly broad exception handling loses error context
catching all exceptions with except Exception and only logging masks validation errors, type errors, and other important failures. the metrics will treat this as a simple failure (passed: False) without distinguishing between:
- invalid problem/solution format (validation error)
- missing CUDA dependencies (runtime error)
- actual test execution failure
consider catching specific exception types or at least including exception type in the error field
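One way to do that, sketched under the assumption that the success path returns the keys listed above (illustrative, not the PR's code):

```python
def error_result(exc: Exception) -> dict:
    """Return the same keys as a successful evaluation so downstream metrics
    code never has to special-case the failure shape."""
    return {
        "passed": False,
        "skipped": False,
        "elapsed_time": 0.0,
        "build_output": "",
        "test_output": "",
        # The exception type distinguishes validation errors, missing CUDA
        # dependencies, and genuine test failures in the saved results.
        "error": f"{type(exc).__name__}: {exc}",
    }


print(error_result(KeyError("problem")))
```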
```python
problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])
```
missing error handling for missing keys
if data_point["problem"] or data_point["solution"] keys don't exist, this raises KeyError which gets caught by the broad exception handler. add explicit validation with helpful error messages
```python
"build_command": item["build_command"],
"context_files_block": _format_context_files_block(item["context_files"]),
}
f.write(json.dumps(record, default=str) + "\n")
```
using default=str in json.dumps without clear reason
json.dumps(record, default=str) will convert any non-serializable objects to strings. this could hide data issues. since all fields should already be JSON-serializable (strings, dicts, lists), consider removing default=str or documenting why it's needed
Additional Comments (2)

this list should include

this PR adds compute-eval support but doesn't document it. users need to know:

see other benchmarks in this file for documentation structure
Greptile Overview
Greptile Summary
This PR integrates ComputeEval, a CUDA code generation benchmark, into NeMo-Skills by implementing dataset preparation, generation tasks, evaluation logic, and metrics computation.
Key changes:
- Dataset preparation downloads problems from HuggingFace gated dataset and formats them with context files
- Generation task uses compute-eval's `_parse_solution` to parse LLM-generated CUDA code into structured `FileSolution` objects
- Evaluator runs async compilation and test execution with automatic NVCC version detection
- Metrics compute pass@k scores using the existing framework patterns
- Comprehensive documentation added with security warnings about executing machine-generated CUDA code
Technical implementation:
- Follows existing NeMo-Skills patterns for dataset integration (similar to human-eval, scicode)
- Proper error handling throughout with try-catch blocks and graceful degradation
- Async evaluation with configurable parallelism (default 10 concurrent requests)
- Uses Pydantic TypeAdapters for robust deserialization of problem and solution objects
- All integration points properly registered in `__init__.py` files and METRICS_MAP / EVALUATOR_CLASS_MAP
Confidence Score: 4/5
- This PR is safe to merge with minor considerations about the protected API dependency
- The implementation is solid and follows established patterns in the codebase. All components (dataset prep, generation, evaluation, metrics) are properly integrated and include appropriate error handling. The main concern is the dependency on a protected member (
_parse_solution) from the external compute-eval package, which could break if that library changes its internal API. However, since compute-eval is also maintained by NVIDIA and pinned to a specific commit, this risk is manageable. The code is well-documented, includes security warnings for executing generated CUDA code, and properly handles edge cases. nemo_skills/inference/eval/compute_eval.pyrelies on protected API from compute-eval
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| requirements/main.txt | 5/5 | Added compute-eval dependency pinned to specific commit |
| nemo_skills/dataset/compute-eval/prepare.py | 5/5 | Implements dataset download from HuggingFace with proper error handling and formatting |
| nemo_skills/inference/eval/compute_eval.py | 4/5 | Generation task with protected member access; includes proper error handling |
| nemo_skills/evaluation/evaluator/compute_eval.py | 5/5 | Evaluator with proper async execution, NVCC version detection, and comprehensive error handling |
| nemo_skills/evaluation/metrics/code_metrics.py | 5/5 | Added ComputeEvalMetrics class following existing patterns for pass@k metrics |
Sequence Diagram
```mermaid
sequenceDiagram
participant User
participant PrepareScript as prepare.py
participant HuggingFace
participant GenerationTask as ComputeEvalGenerationTask
participant LLM
participant ParseSolution as _parse_solution
participant Evaluator as ComputeEvalEvaluator
participant ComputeEval as compute-eval Library
participant NVCC
participant Metrics as ComputeEvalMetrics
Note over User,HuggingFace: Data Preparation Phase
User->>PrepareScript: ns prepare_data compute-eval
PrepareScript->>HuggingFace: load_dataset("nvidia/compute-eval")
HuggingFace-->>PrepareScript: Dataset with CUDA problems
PrepareScript->>PrepareScript: Format context_files_block
PrepareScript->>PrepareScript: Write eval.jsonl
Note over User,LLM: Generation Phase
User->>GenerationTask: ns eval (with inference config)
GenerationTask->>GenerationTask: Load eval.jsonl
loop For each problem
GenerationTask->>LLM: Generate CUDA solution (with prompt)
LLM-->>GenerationTask: Raw code generation
GenerationTask->>ParseSolution: _parse_solution(generation)
ParseSolution-->>GenerationTask: Parsed files structure
GenerationTask->>GenerationTask: Create FileSolution object
end
GenerationTask->>GenerationTask: Write output.jsonl
Note over User,NVCC: Evaluation Phase
User->>Evaluator: Evaluate generated solutions
Evaluator->>NVCC: get_nvcc_version()
NVCC-->>Evaluator: CTK version info
loop For each solution (async)
Evaluator->>ComputeEval: evaluate_solution(problem, solution, ctk_version)
ComputeEval->>ComputeEval: Compile CUDA code
ComputeEval->>ComputeEval: Execute tests
ComputeEval-->>Evaluator: GradedResult (passed, skipped, outputs)
Evaluator->>Evaluator: Write evaluation results
end
Note over User,Metrics: Metrics Computation Phase
User->>Metrics: Compute final metrics
Metrics->>Metrics: Calculate accuracy (passed rate)
Metrics->>Metrics: Compute pass@k metrics
Metrics-->>User: Final metrics.json
```
```python
from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
```
accessing protected member _parse_solution from external package - if compute-eval changes this internal API, the code could break
Greptile Overview
Greptile Summary
This PR successfully integrates the ComputeEval benchmark for evaluating LLMs on CUDA code generation into NeMo-Skills.
Major changes:
- Added dataset preparation script that downloads problems from HuggingFace with proper authentication and formats context files
- Implemented `ComputeEvalGenerationTask` to generate CUDA solutions using the LLM
- Implemented `ComputeEvalEvaluator` that compiles and executes generated CUDA code against test suites
- Added `ComputeEvalMetrics` to compute accuracy and pass@k performance metrics
- Registered compute-eval across all framework maps (evaluators, metrics, datasets)
- Added comprehensive documentation with setup instructions and security warnings
- Pinned compute-eval dependency to specific GitHub commit (2d14770)
Minor issue:
- The generation task imports a protected member (`_parse_solution`) from the compute-eval library, which is not ideal but acceptable given the library hasn't published a stable API yet
Confidence Score: 4/5
- This PR is safe to merge with minimal risk
- The implementation follows existing patterns in the codebase consistently, includes comprehensive error handling, and provides thorough documentation. The only concern is the use of a protected member from the external library, but this is acknowledged with a suppression comment and is a pragmatic choice given the library's pre-release state. All integration points are properly registered.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/dataset/compute-eval/prepare.py | 5/5 | adds dataset download and preparation logic with proper formatting for context files |
| nemo_skills/evaluation/evaluator/compute_eval.py | 4/5 | implements evaluator with async support, includes comprehensive error handling |
| nemo_skills/inference/eval/compute_eval.py | 4/5 | implements generation task with solution parsing, uses protected member from external lib |
| nemo_skills/evaluation/metrics/code_metrics.py | 5/5 | adds ComputeEvalMetrics class following existing patterns for pass@k metrics |
| requirements/main.txt | 5/5 | adds compute-eval dependency from GitHub at specific commit |
Sequence Diagram
```mermaid
sequenceDiagram
participant User
participant ns_prepare_data as ns prepare_data
participant HuggingFace as HuggingFace Dataset
participant prepare_py as prepare.py
participant ns_eval as ns eval
participant GenerationTask as ComputeEvalGenerationTask
participant LLM as LLM Server
participant Evaluator as ComputeEvalEvaluator
participant compute_eval as compute-eval Library
participant NVCC as CUDA Toolkit
participant Metrics as ComputeEvalMetrics
User->>ns_prepare_data: ns prepare_data compute-eval --release 2025-1
ns_prepare_data->>prepare_py: Execute prepare.py
prepare_py->>HuggingFace: load_dataset with auth token
HuggingFace-->>prepare_py: Return dataset with CUDA problems
prepare_py->>prepare_py: Format context files with code fences
prepare_py->>prepare_py: Write eval.jsonl with problem data
prepare_py-->>User: Dataset prepared
User->>ns_eval: ns eval --benchmarks=compute-eval
ns_eval->>GenerationTask: Initialize ComputeEvalGenerationTask
ns_eval->>GenerationTask: Load eval.jsonl data
loop For each problem
GenerationTask->>LLM: Send prompt with problem description
LLM-->>GenerationTask: Return generated CUDA code
GenerationTask->>compute_eval: _parse_solution(generation)
compute_eval-->>GenerationTask: Parsed FileSolution object
GenerationTask->>GenerationTask: Store solution in output
end
ns_eval->>Evaluator: Initialize ComputeEvalEvaluator
Evaluator->>NVCC: get_nvcc_version()
NVCC-->>Evaluator: Return CUDA Toolkit version
Evaluator->>Evaluator: Store CTK major/minor version
loop For each solution (async)
Evaluator->>compute_eval: evaluate_solution(problem, solution, ctk_version)
compute_eval->>compute_eval: Compile CUDA code
compute_eval->>compute_eval: Execute test suite
compute_eval-->>Evaluator: Return GradedSolution (passed, build_output, test_output)
Evaluator->>Evaluator: Write evaluation results to file
end
ns_eval->>Metrics: Compute metrics
Metrics->>Metrics: Calculate accuracy and pass@k
Metrics-->>User: Display metrics.json with results
```
```python
from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
```
importing private function _parse_solution from external library. Consider requesting the compute-eval package to export this as a public API, since it's needed for integration.
Closing this PR in favor of one directly from a branch on NeMo Skills.
Overview
This PR integrates ComputeEval - a benchmark for evaluating LLMs on CUDA code generation - into NeMo-Skills.
ComputeEval features:
Changes
- `GenerationTask` to generate CUDA solutions using ComputeEval's reference generation logic
- `BaseEvaluator` using ComputeEval's evaluation logic

Dependencies

Adds compute-eval package (`compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@2d14770`)

Note: compute-eval is a public repository but has not been published to PyPI yet
Summary by CodeRabbit
Release Notes