
Commit 98bafd1

sephmard and marta-sd authored

docs(quickstart): add nel-assistant launcher configuration guide (#819)

Adds launcher-configuration.md documenting the nel-assistant agent skill for conversational, YAML-free evaluation setup. Updates index with toctree entry and grid card.

Signed-off-by: sephmard <smard@nvidia.com>
Signed-off-by: Seph Mard <seph.mard@gmail.com>
Co-authored-by: Marta Stepniewska-Dziubinska <marta-sd@users.noreply.github.com>

1 parent 765c89f commit 98bafd1

File tree

2 files changed: +239 -0 lines changed

docs/get-started/quickstart/index.md

Lines changed: 9 additions & 0 deletions
@@ -38,6 +38,14 @@ Select the approach that best matches your workflow and technical requirements:
Unified CLI experience with automated container management, built-in orchestration, and result export capabilities.
:::

:::{grid-item-card} {octicon}`comment-discussion;1.5em;sd-mr-1` Launcher Configuration with nel-assistant
:link: gs-quickstart-launcher-configuration
:link-type: ref

**For conversational config**

Natural language evaluation setup via the nel-assistant agent skill. No manual YAML authoring required.
:::

:::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` NeMo Evaluator Core
:link: gs-quickstart-core
:link-type: ref
@@ -275,6 +283,7 @@ nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/l
:hidden:

NeMo Evaluator Launcher <launcher>
Launcher Configuration with nel-assistant <launcher-configuration>
NeMo Evaluator Core <core>
NeMo Framework Container <nemo-fw>
Container Direct <container>
docs/get-started/quickstart/launcher-configuration.md

Lines changed: 230 additions & 0 deletions
@@ -0,0 +1,230 @@
(gs-quickstart-launcher-configuration)=
# Conversational LLM Evaluations in Minutes with NVIDIA NeMo Evaluator Agent Skills

Running LLM evaluations should not require manually drafting long, complex YAML files. For developers, configuration overhead often becomes the bottleneck. The new **nel-assistant** agent skill enables natural-language configuration of production-ready evaluations.

Built on the [NVIDIA NeMo Evaluator library](https://github.com/NVIDIA-NeMo/Evaluator), it lets developers configure, run, and monitor evaluations directly within Cursor or any other agentic development tool, entirely through conversation with the agent rather than hand-written YAML files and shell commands.
---

## The Problem: Configuration Overhead

Running a single LLM evaluation means making dozens of interconnected decisions:

- **Execution:** Local Docker or SLURM cluster?
- **Deployment:** vLLM, SGLang, NVIDIA NIM, NVIDIA TensorRT-LLM, or an external endpoint? How many nodes?
- **Model:** What temperature? What context length? Does it use reasoning tokens?
- **Benchmarks:** Tau2-Bench, MTEB, GSM8K, AIME, GPQA, LiveCodeBench, RULER, or more? All of the above?
- **Export:** Local files, CSV, Weights & Biases, or MLflow?

Each choice spawns sub-choices. Using vLLM? Configure tensor parallelism. Running reasoning models? Parse thinking tokens. Multi-node SLURM? Set up HAProxy load balancing. The result is a pile of complex YAML configs that are easy to get wrong and hard to debug.
```yaml
# A typical evaluation config - 200+ lines, easy to misconfigure
execution:
  backend: slurm
  params:
    account: ???        # What's my account name again?
    partition: ???
    time_limit: "24:00:00"

deployment:
  backend: vllm
  params:
    tensor_parallel_size: ???  # 2? 4? 8?
    max_model_len: ???         # Model card says... 32K? 128K?

model:
  sampling_params:
    temperature: ???           # Was it 0.6 or 0.8?
    top_p: ???

# ... 150 more lines ...
```

This configuration burden turns a simple evaluation into unnecessary complexity.
---

## The Solution: Agent Skills for LLM Evaluation

Agent skills have become the standard way to give agents deep domain expertise. A "skill" packages instructions, scripts, and resources into portable folders that any compatible agent can discover and use.

In the coding-assistant space, the skill ecosystem has exploded, covering everything from Git workflows to API integrations. The **nel-assistant** agent skill turns Cursor, Claude Code, Codex, and other agentic developer tools into an evaluation expert. Instead of authoring YAML by hand, you have a conversation. The skill researches model cards to find the recommended temperature, top_p, and context length, and identifies the optimal tensor parallelism for your specific GPU setup.

**One conversation. Zero manual YAML debugging. Production-ready config.**
```
You: "Evaluate NVIDIA Nemotron-3-Nano-30B-A3B on standard benchmarks
      using vLLM locally. Export to Weights & Biases."

Agent: I'll help you set that up. First, let me check your environment...
  ✓ NeMo Evaluator 26.01 detected

  I'll need 5 quick answers to build your config:
  1. Execution: local (detected from your request)
  2. Deployment: vllm (confirmed)
  3. Export: wandb (confirmed)
  4. Model type: chat or base? [chat]
  5. Benchmark category: standard (confirmed)

  Now let me research the model card...
  ✓ Found: temperature=0.6, top_p=0.95, context=128K
  ✓ Optimal TP=8 for your 2x H100 setup

  Generated config: Nemotron-3-Nano-30B-A3B.yaml
  Ready to run!
```
---

## How It Works: Three Phases

### Phase 1: Configure

The skill starts by asking five targeted questions to establish context:

1. **Execution environment:** Local or SLURM?
2. **Deployment backend:** vLLM, SGLang, NVIDIA NIM, NVIDIA TensorRT-LLM, or external?
3. **Export destination:** None, MLflow, or Weights & Biases?
4. **Model type:** Base, chat, or reasoning?
5. **Benchmark categories:** Standard, code, math, safety, or multilingual?

From these answers, it calls:
```bash
nel skills build-config \
  --execution local \
  --deployment vllm \
  --model-type chat \
  --benchmarks standard
```

This deep-merges tested, schema-compliant template fragments into a structurally valid config. Because the skill keeps the agent from generating free-form YAML, syntax errors are all but eliminated.
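The deep-merge idea behind `build-config` can be sketched in a few lines. This is an illustrative sketch only: `deep_merge` and the fragment dicts are hypothetical stand-ins, not the launcher's actual implementation.

```python
# Hypothetical sketch of deep-merging template fragments into one config;
# the real `nel skills build-config` implementation may differ.
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`, returning a new dict."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # descend into nested maps
        else:
            merged[key] = deepcopy(value)  # scalars and lists are replaced
    return merged


# Fragments standing in for execution/local.yaml and deployment/vllm.yaml.
execution_fragment = {"execution": {"backend": "local"}}
deployment_fragment = {
    "deployment": {"backend": "vllm", "params": {"tensor_parallel_size": 8}}
}

config = deep_merge(execution_fragment, deployment_fragment)
print(config["deployment"]["params"]["tensor_parallel_size"])  # 8
```

Because each fragment is a valid mapping on its own, the merged result is structurally valid by construction, which is the point of composing templates instead of generating YAML text.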
Next, the agent automatically analyzes the model card and applies the recommended configuration parameters.

Give the agent a HuggingFace handle such as `NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`, or a checkpoint path, and it uses WebSearch to extract:

- **Sampling params:** Temperature, top_p
- **Hardware logic:** Optimal TP/DP settings based on your GPU count
- **Reasoning config:** System prompts, payload modifiers (e.g., `enable_thinking` for o1-style models)
- **Context length:** Max model length for vLLM `--max-model-len`

Developers no longer need to dig through model cards to find the right settings: the agent reads the model details and applies the correct parameters automatically. Without the skill, this usually means jumping between Hugging Face, blog posts, and documentation, which takes time and breaks focus. With the skill, setup happens in seconds.
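The "hardware logic" bullet above can be illustrated with a rough heuristic: pick the smallest tensor-parallel size whose per-GPU weight shard leaves room for the KV cache. The function, thresholds, and memory model below are assumptions for illustration, not nel-assistant's actual logic.

```python
# Hypothetical sketch of choosing a tensor-parallel (TP) size from model
# size and per-GPU memory; real nel-assistant logic may use different rules.


def choose_tp(model_params_b: float, gpu_mem_gib: int, num_gpus: int,
              bytes_per_param: int = 2) -> int:
    """Smallest power-of-two TP (up to num_gpus) whose per-GPU weight shard
    fits in roughly half of GPU memory, leaving headroom for KV cache."""
    # BF16/FP16 weights: ~2 GiB per billion parameters.
    weights_gib = model_params_b * bytes_per_param
    tp = 1
    while tp < num_gpus and weights_gib / tp > gpu_mem_gib / 2:
        tp *= 2
    return tp


# A 30B BF16 model (~60 GiB of weights) on 80 GiB H100s:
print(choose_tp(30, 80, 8))  # 2
```

Remaining GPUs can then be used for data parallelism (DP), which is how multi-replica scaling for large benchmark suites is typically arranged.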
### Phase 2: Validate and Refine

The skill identifies the remaining `???` values in the YAML:

- **SLURM details:** Account names, partition names, time limits
- **Export URIs:** WandB project names, MLflow tracking URIs
- **API keys:** Environment variables for deployments

You can interactively:

- **Add/remove tasks:** Browse `nel ls tasks` and pick exactly what you want
- **Override per-task settings:** "Use temperature=0 for HumanEval but 0.7 for MMLU"
- **Configure advanced scaling:** For >120B models, set up data-parallel multi-node with HAProxy load balancing
- **Add reasoning interceptors:** Strip `<think>` tokens, cache reasoning traces
### Phase 3: Run and Monitor

The agent proposes a three-tier staged rollout: **Dry run**, **Smoke test**, and **Full run**.

```bash
# 1. Dry run - validate without execution
nel run --config nemotron-3-nano.yaml --dry-run

# 2. Smoke test - 10 samples per task
nel run --config nemotron-3-nano.yaml \
  -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10

# 3. Full run
nel run --config nemotron-3-nano.yaml
```

Once submitted, progress can be monitored directly in Cursor using commands for status, detailed metrics, and live logs. You never leave your coding environment!
```
> Please check the evaluation progress.

# Agent runs: nel status -nemotron-3-nano-20260212-143022 && nel info ...

Status: RUNNING
Progress: 3/8 tasks completed
- ✓ mmlu: 65.2% accuracy (5 hours)
- ✓ hellaswag: 78.4% accuracy (2 hours)
- ✓ arc_challenge: 53.8% accuracy (1 hour)
- ⏳ truthfulqa_mc2: 45% complete...
- ⏳ winogrande: In queue
- ⏳ gsm8k: In queue
- ⏳ humaneval: In queue
- ⏳ mbpp: In queue
```
---

## Technical Details

### Template-Based Generation

Instead of generating YAML from scratch, nel-assistant merges modular templates for execution, deployment, benchmarks, and exports. This deep merge ensures structural validity.

### Model Card Extraction Pipeline

1. Cursor or your agentic IDE fetches the HuggingFace model card via web search.
2. Extraction via regex identifies parameters and chat templates.
3. Hardware logic calculates optimal TP/DP based on model size and available GPU memory.
4. Reasoning detection checks for keywords like "reasoning" or "chain-of-thought."
5. Values are injected directly into the config YAML.
Generic LLMs hallucinate YAML syntax. They mix incompatible backends. They invent flags that don't exist. `nel skills build-config` avoids all of this by composing modular templates instead of generating YAML from scratch:

```
templates/
├── execution/
│   ├── local.yaml          # Docker execution
│   └── slurm.yaml          # SLURM execution
├── deployment/
│   ├── vllm.yaml           # vLLM backend
│   ├── sglang.yaml         # SGLang backend
│   └── nim.yaml            # NVIDIA NIM
├── benchmarks/
│   ├── reasoning.yaml      # GPQA-D, HellaSwag, SciCode, MATH, AIME
│   ├── agentic.yaml        # TerminalBench, SWE-Bench
│   ├── longcontext.yaml    # AA-LCR, RULER
│   ├── instruction.yaml    # IFBench, ArenaHard
│   └── multi-lingual.yaml  # MMLU-ProX, WMT24++
└── export/
    ├── wandb.yaml          # W&B integration
    └── mlflow.yaml         # MLflow integration
```

**Deep merge = structural validity.** You can't produce invalid YAML when you're composing pre-validated fragments. Because nel-assistant uses `build-config` to merge tested templates, every config is structurally valid by construction. The agent composes YAML like a type-safe compiler, not a text generator.
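One way to picture how the five answers map onto this template tree is a simple selector. The function below is a hypothetical sketch; the paths mirror the tree layout, but the actual mapping inside `build-config` is not shown in this guide.

```python
# Hypothetical sketch: map user answers to the template fragments that
# would be deep-merged; the mapping logic is assumed, not nel-assistant's.
from typing import List, Optional


def select_templates(execution: str, deployment: str,
                     benchmarks: str, export: Optional[str]) -> List[str]:
    """Return the template fragment paths selected by the user's answers."""
    parts = [
        f"execution/{execution}.yaml",
        f"deployment/{deployment}.yaml",
        f"benchmarks/{benchmarks}.yaml",
    ]
    if export:  # export is optional; "none" selects no export fragment
        parts.append(f"export/{export}.yaml")
    return [f"templates/{p}" for p in parts]


print(select_templates("local", "vllm", "reasoning", "wandb"))
```

Each selected fragment is pre-validated, so whatever combination the conversation produces, the merged result is a structurally sound config.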
---

## Configuration Should Not Be a Bottleneck

LLM evaluation already involves important decisions: selecting benchmarks, interpreting results, and comparing models. Configuration should support that process, not slow it down.

The nel-assistant skill makes configuration invisible. You describe what you want in natural language, and the agent handles the rest: researching model cards, generating configs, validating setups, staging rollouts, and monitoring progress.

**No more hand-crafted, sprawling YAML files. No more hunting through documentation. No more syntax errors.**

Just: *"Evaluate this model on these benchmarks."*
---

## Resources

- **GitHub:** [NVIDIA NeMo Evaluator Library](https://github.com/NVIDIA-NeMo/Evaluator)
- **Skill:** [nel-assistant](https://github.com/NVIDIA-NeMo/Evaluator/blob/main/packages/nemo-evaluator-launcher/.claude/skills/nel-assistant/SKILL.md)

The nel-assistant skill is open source and ships with NVIDIA NeMo Evaluator 26.01+. Contributions welcome on GitHub!
