Commit 696997a

Add compute eval documentation.
Parent: 2ce1eeb

docs/evaluation/code.md

Lines changed: 59 additions & 0 deletions
@@ -178,6 +178,65 @@

all you need to do is replace `openhands` with `swe_agent` in the command above.
!!! note
For evaluation, we use a [custom fork](https://github.com/Kipok/SWE-bench) of the SWE-bench repository that supports running evaluation inside of an existing container. It may not always have the latest updates from the upstream repo.

### compute-eval
- Benchmark is defined in [`nemo_skills/dataset/compute-eval/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/compute-eval/__init__.py)
- Original benchmark source is [here](https://github.com/NVIDIA/compute-eval).
ComputeEval is a benchmark for evaluating large language models on CUDA code generation tasks. It features handcrafted CUDA programming challenges that test an LLM's ability to write reliable CUDA code. Functional correctness is evaluated by compiling the generated code and executing it against held-out test suites.
**Prerequisites:** An NVIDIA GPU with CUDA Toolkit 12 or later installed, and `nvcc` available in your `PATH`.
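
A quick way to confirm that the environment meets these requirements (assuming a standard CUDA installation) is:

```bash
# CUDA compiler must be on PATH and report version 12 or newer
nvcc --version

# GPU must be visible to the driver
nvidia-smi
```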
#### Data Preparation
First, prepare the dataset by running the `ns prepare_data` command. You can optionally specify a release version:
```bash
ns prepare_data compute-eval --release 2025-1
```
If no release is specified, the default release will be downloaded. This will generate an `eval.jsonl` file in the `nemo_skills/dataset/compute-eval/` directory.
**Note:** You need to set the `HF_TOKEN` environment variable because the dataset requires authentication.
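
Putting it together, a typical preparation run could look like the following; the token value is a placeholder and the inspection commands at the end are just an optional sanity check:

```bash
# Authenticate with Hugging Face (the dataset requires it)
export HF_TOKEN=<YOUR_HF_TOKEN>

# Download and prepare the default release of the benchmark
ns prepare_data compute-eval

# Optional: count entries and pretty-print the first record
wc -l nemo_skills/dataset/compute-eval/eval.jsonl
head -n 1 nemo_skills/dataset/compute-eval/eval.jsonl | python -m json.tool
```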
#### Running the Evaluation
Once the data is prepared, you can run the evaluation. Replace `<...>` placeholders with your cluster and directory paths.
This command runs an evaluation of [OpenReasoning-Nemotron-32B](https://huggingface.co/nvidia/OpenReasoning-Nemotron-32B) on a Slurm cluster:
```bash
ns eval \
    --cluster=<CLUSTER_NAME> \
    --model=nvidia/OpenReasoning-Nemotron-32B \
    --server_type=vllm \
    --server_args="--async-scheduling" \
    --server_nodes=1 \
    --server_gpus=8 \
    --benchmarks=compute-eval \
    --data_dir=<DATA_DIR> \
    --output_dir=<OUTPUT_DIR> \
    ++inference.temperature=0.6 \
    ++inference.top_p=0.95 \
    ++inference.tokens_to_generate=16384
```
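
The `++inference.*` arguments override the generation settings: temperature 0.6, top-p 0.95, and a 16384-token generation budget, which are reasonable defaults for long-form reasoning models; adjust them to match the recommended settings of the model you are evaluating.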
**Security Note:** ComputeEval executes machine-generated CUDA code. While the benchmark is designed for evaluation purposes, we strongly recommend running evaluations in a sandboxed environment (e.g., a Docker container or virtual machine) to minimize security risks.
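
If you are experimenting outside of a managed cluster setup, one illustrative way to isolate execution (not part of the `ns eval` workflow; assumes Docker and the NVIDIA Container Toolkit are installed, and the image tag is only an example) is to work inside a disposable CUDA development container:

```bash
# Start a throwaway CUDA container with GPU access for running untrusted builds
docker run --rm -it --gpus all nvidia/cuda:12.4.1-devel-ubuntu22.04 bash
```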
#### Verifying Results
After all jobs are complete, you can check the results in `<OUTPUT_DIR>/eval-results/compute-eval/metrics.json`. You can also review the summary in `<OUTPUT_DIR>/eval-results/compute-eval/summarized-results/main_*`, which should look something like this:
```
---------------------------- compute-eval -----------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
pass@1          | 50          | 8432       | 1245        | 64.00%
```
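
To inspect the aggregated numbers programmatically, you can pretty-print the metrics file mentioned above:

```bash
# Pretty-print the aggregated compute-eval metrics
python -m json.tool <OUTPUT_DIR>/eval-results/compute-eval/metrics.json
```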
The benchmark reports:

- **accuracy**: Percentage of problems where generated code compiled and passed all tests
- **pass@1**: Same as accuracy for single-solution generation
- **pass@k**: Success rate when generating k solutions per problem (if configured); see the estimator sketch below
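
When multiple solutions per problem are generated, pass@k is typically computed with the standard unbiased estimator used by HumanEval-style benchmarks (the exact aggregation is defined by compute-eval itself, so treat this as a sketch), where $n$ is the number of generated solutions for a problem and $c$ the number that pass:

$$
\text{pass@}k = \mathbb{E}_{\text{problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
$$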

### IOI
