
Commit bb8e740

Merge pull request #266 from hud-evals/l/docs-updates-5

platform publishing

2 parents 6225282 + 73e35d8

7 files changed: +151, −5 lines

docs/docs.json

Lines changed: 7 additions & 1 deletion

@@ -203,12 +203,18 @@
       ]
     },
     {
-      "group": "Platform Guide",
+      "group": "Concepts",
       "pages": [
        "platform/models",
        "platform/environments",
        "platform/tasksets"
       ]
+    },
+    {
+      "group": "Guides",
+      "pages": [
+        "platform/publishing-leaderboards"
+      ]
     }
   ]
 }
docs/platform/publishing-leaderboards.mdx (new file)

Lines changed: 140 additions & 0 deletions

@@ -0,0 +1,140 @@
---
title: "Publishing Leaderboards"
description: "Create a taskset, run evaluations, validate results, and publish to the community."
icon: "trophy"
---

A leaderboard is a published taskset with public evaluation results. This guide walks through the complete workflow—from empty taskset to public benchmark.

<Note>
**Prerequisites**: You need an environment with at least one scenario. See [Environments](/platform/environments) if you haven't deployed one yet.
</Note>

## Create a Taskset

Go to [hud.ai/evalsets](https://hud.ai/evalsets) → **New Taskset**. Name it something descriptive—this becomes your leaderboard title once published.

<Frame>
<img src="/src/images/platform-leaderboards-taskset.png" alt="Empty taskset page" />
</Frame>

## Add Tasks

Tasks are what agents get evaluated on. Each task references a scenario from your environment with specific arguments. Click **Upload Tasks** (cloud icon) to bulk add tasks via JSON.

See [Tasksets → Adding Tasks](/platform/tasksets#adding-tasks) for the full upload format and options.
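
For illustration only, a rough sketch of the idea (the field names `scenario` and `args` and the scenario name `checkout_flow` are hypothetical; the actual upload schema is whatever [Tasksets → Adding Tasks](/platform/tasksets#adding-tasks) specifies): each entry in the uploaded JSON points at one scenario and supplies the arguments to run it with.

```json
[
  { "scenario": "checkout_flow", "args": { "items": 1 } },
  { "scenario": "checkout_flow", "args": { "items": 5, "coupon": "SAVE10" } }
]
```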

<Tip>
Aim for 20–50 tasks. Fewer tasks lead to high variance; more give better signal but take longer to run.
</Tip>

## Run Evaluations

Click **Run Taskset** in the header. The run modal lets you configure:

- **Models** — Select one or more models to evaluate. Multi-select runs the same tasks across all selected models.
- **Group Size** — How many times to run each task per model (more runs = higher confidence; see the note after this list)
- **Max Steps** — Limit agent actions per task
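
As a general statistical rule of thumb (not HUD-specific guidance): if an agent solves a task with probability p, the success rate observed over a group of n runs has a standard error of roughly sqrt(p(1 − p)/n), so quadrupling the group size halves the per-task noise, at the cost of proportionally more agent runs.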

<Frame>
<img src="/src/images/platform-leaderboards-run.png" alt="Run taskset modal" />
</Frame>

<Warning>
**Run at least 3 different models** before publishing. A single-entry leaderboard isn't useful for comparison.
</Warning>

Jobs appear in the **Jobs** tab as they run. Click a job to see individual trace results.

<Frame>
<img src="/src/images/platform-leaderboards-jobs.png" alt="Jobs tab showing evaluation jobs" />
</Frame>

## Review and Validate

Before publishing, check your results.

### Leaderboard Tab

Shows aggregated rankings—agent scores, task-by-task breakdown, result distributions.

<Frame>
<img src="/src/images/platform-leaderboards-leaderboard.png" alt="Leaderboard tab showing agent rankings" />
</Frame>

Look for:

- **Reasonable scores** — 0% or 100% everywhere signals something's wrong
- **Variance** — Good benchmarks have range
- **Outliers** — Unexpectedly high or low scores worth investigating

### Traces

Click into jobs to review individual runs. Check that grading reflects actual agent performance. Look for environment issues or grading bugs.

### Invalidate Bad Runs

Found issues? Select affected jobs in the **Jobs** tab and click **Invalidate**.

Invalidated jobs:
- Are excluded from leaderboard calculations
- Show with a striped background
- Cannot be published
- Remain visible for reference

Common reasons to invalidate: environment bugs, incorrect grading logic, external service outages, test runs with wrong configuration.

<Note>
Invalidation is permanent. To get fresh results, re-run the evaluation.
</Note>

## Publish

Click **Publish** in the taskset header.

The modal shows:

1. **Evalset Status** — Whether the taskset itself is already public
2. **Jobs to Include** — Select which jobs to make public (invalidated jobs don't appear)
3. **Already Public** — Previously published jobs are checked and disabled

<Warning>
**Publishing is permanent.** Once published, jobs and traces are publicly accessible. This cannot be undone.
</Warning>

### What Gets Published

| Item | Visibility |
|------|------------|
| Taskset name | Public |
| Task configurations | Public |
| Selected job results | Public |
| Trace details | Public |
| Your team name | Public |
| Non-selected jobs | Private |
| Invalidated jobs | Never published |

### Adding More Later

After initial publication, run new models and return to **Publish** to add them. Previously published jobs stay public.

## Best Practices

Before publishing:

- **Verify grading** — Manually check 5–10 traces. Look for false positives and false negatives.
- **Test stability** — Flaky environments produce inconsistent results that undermine leaderboard validity.
- **Include baselines** — Always include well-known models (GPT-4o, Claude) as reference points.
- **Document clearly** — Add a description explaining what skills are tested and expected difficulty.

A quality leaderboard has diverse tasks, multiple agents (3–5 minimum), reasonable difficulty (20–80% average success), and fair, consistent grading.

<CardGroup cols={2}>
<Card title="Tasksets" icon="list-check" href="/platform/tasksets">
Detailed taskset management
</Card>

<Card title="Environments" icon="cube" href="/platform/environments">
Create environments with scenarios
</Card>
</CardGroup>

docs/platform/tasksets.mdx

Lines changed: 4 additions & 4 deletions

@@ -144,11 +144,11 @@ Tasks are defined with:
 ## Next Steps

 <CardGroup cols={2}>
-  <Card title="Environments" icon="cube" href="/platform/environments">
-    Create environments with scenarios
+  <Card title="Publishing Leaderboards" icon="trophy" href="/platform/publishing-leaderboards">
+    Run evaluations and publish public benchmarks
   </Card>

-  <Card title="A/B Testing" icon="flask-vial" href="/quick-links/ab-testing">
-    Compare models with variants
+  <Card title="Environments" icon="cube" href="/platform/environments">
+    Create environments with scenarios
   </Card>
 </CardGroup>
Plus 4 image files added (173 KB, 129 KB, 80.3 KB, 22.8 KB).
