---
title: "Publishing Leaderboards"
description: "Create a taskset, run evaluations, validate results, and publish to the community."
icon: "trophy"
---

A leaderboard is a published taskset with public evaluation results. This guide walks through the complete workflow—from empty taskset to public benchmark.

<Note>
  **Prerequisites**: You need an environment with at least one scenario. See [Environments](/platform/environments) if you haven't deployed one yet.
</Note>

## Create a Taskset

Go to [hud.ai/evalsets](https://hud.ai/evalsets) → **New Taskset**. Name it something descriptive—this becomes your leaderboard title once published.

<Frame>
  <img src="/src/images/platform-leaderboards-taskset.png" alt="Empty taskset page" />
</Frame>

## Add Tasks

Tasks are what agents get evaluated on. Each task references a scenario from your environment with specific arguments. Click **Upload Tasks** (cloud icon) to bulk add tasks via JSON.

See [Tasksets → Adding Tasks](/platform/tasksets#adding-tasks) for the full upload format and options.
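
To picture what an upload looks like, here is a minimal sketch that writes a tasks file. The field names (`prompt`, `scenario`, `args`) are placeholders for illustration, not the platform's actual schema; use the format documented in the link above.

```python
import json

# Hypothetical task entries: each task points at a scenario in your
# environment and supplies the arguments that make it a concrete case.
# Field names are illustrative only; see Tasksets → Adding Tasks for
# the real upload schema.
tasks = [
    {
        "prompt": "Book a one-way flight from SFO to JFK for next Friday",
        "scenario": "flight_booking",            # scenario defined in your environment
        "args": {"origin": "SFO", "dest": "JFK"},
    },
    {
        "prompt": "Find the cheapest refundable hotel near the venue",
        "scenario": "hotel_search",
        "args": {"max_price": 250, "refundable": True},
    },
]

with open("tasks.json", "w") as f:
    json.dump(tasks, f, indent=2)

print(f"Wrote {len(tasks)} tasks to tasks.json")
```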

<Tip>
  Aim for 20–50 tasks. Fewer tasks lead to high variance; more give better signal but take longer to run.
</Tip>

## Run Evaluations

Click **Run Taskset** in the header. The run modal lets you configure:

- **Models** — Select one or more models to evaluate. Multi-select runs the same tasks across all selected models.
- **Group Size** — How many times to run each task per model (more runs = higher confidence; see the sketch after this list)
- **Max Steps** — Limit agent actions per task
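
As a rough illustration of why group size matters (this is generic statistics, not something the platform computes for you), the sketch below shows how the confidence interval around a task's success rate narrows as runs per task increase:

```python
import math

def success_ci(successes: int, runs: int, z: float = 1.96) -> tuple[float, float, float]:
    """Mean success rate and a ~95% normal-approximation confidence interval."""
    p = successes / runs
    half_width = z * math.sqrt(p * (1 - p) / runs)  # crude for very small samples
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Same underlying pass rate, different group sizes: the interval tightens as runs grow.
for runs in (3, 5, 10, 25):
    successes = round(0.6 * runs)  # pretend ~60% of runs pass
    p, lo, hi = success_ci(successes, runs)
    print(f"group size {runs:>2}: {p:.0%} success, CI ≈ [{lo:.0%}, {hi:.0%}]")
```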

<Frame>
  <img src="/src/images/platform-leaderboards-run.png" alt="Run taskset modal" />
</Frame>

<Warning>
  **Run at least 3 different models** before publishing. A single-entry leaderboard isn't useful for comparison.
</Warning>

Jobs appear in the **Jobs** tab as they run. Click a job to see individual trace results.

<Frame>
  <img src="/src/images/platform-leaderboards-jobs.png" alt="Jobs tab showing evaluation jobs" />
</Frame>

## Review and Validate

Before publishing, check your results.

### Leaderboard Tab

Shows aggregated rankings—agent scores, task-by-task breakdown, result distributions.

<Frame>
  <img src="/src/images/platform-leaderboards-leaderboard.png" alt="Leaderboard tab showing agent rankings" />
</Frame>

Look for the following (a quick spot-check sketch appears after the list):

- **Reasonable scores** — 0% or 100% everywhere signals something's wrong
- **Variance** — Good benchmarks have range
- **Outliers** — Unexpectedly high or low scores worth investigating
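
As a quick way to spot-check these, you can drop the per-task success rates from the Leaderboard tab into a short script. The numbers and thresholds below are illustrative only:

```python
from statistics import mean, pstdev

# Per-task average success rates for one model, keyed by task name.
# Illustrative numbers; substitute the values from your Leaderboard tab.
task_scores = {
    "task-01": 0.8, "task-02": 0.0, "task-03": 1.0, "task-04": 0.6,
    "task-05": 0.2, "task-06": 1.0, "task-07": 0.4, "task-08": 0.95,
}

scores = list(task_scores.values())
avg, spread = mean(scores), pstdev(scores)
print(f"average success {avg:.0%}, std dev {spread:.2f}")

# Degenerate tasks: everyone always fails or always succeeds.
for name, s in task_scores.items():
    if s in (0.0, 1.0):
        print(f"  check {name}: score is {s:.0%} (possible grading or environment issue)")

# Outliers: more than two standard deviations from the mean.
for name, s in task_scores.items():
    if spread and abs(s - avg) > 2 * spread:
        print(f"  outlier {name}: {s:.0%}")
```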

### Traces

Click into jobs to review individual runs. Check that grading reflects actual agent performance. Look for environment issues or grading bugs.

### Invalidate Bad Runs

Found issues? Select affected jobs in the **Jobs** tab and click **Invalidate**.

Invalidated jobs:
- Are excluded from leaderboard calculations
- Show with a striped background
- Cannot be published
- Remain visible for reference

Common reasons to invalidate: environment bugs, incorrect grading logic, external service outages, or test runs with the wrong configuration.

<Note>
  Invalidation is permanent. To get fresh results, re-run the evaluation.
</Note>

## Publish

Click **Publish** in the taskset header.

The modal shows:

1. **Evalset Status** — Whether the taskset itself is already public
2. **Jobs to Include** — Select which jobs to make public (invalidated jobs don't appear)
3. **Already Public** — Previously published jobs are checked and disabled

<Warning>
  **Publishing is permanent.** Once published, jobs and traces are publicly accessible. This cannot be undone.
</Warning>

### What Gets Published

| Item | Visibility |
|------|------------|
| Taskset name | Public |
| Task configurations | Public |
| Selected job results | Public |
| Trace details | Public |
| Your team name | Public |
| Non-selected jobs | Private |
| Invalidated jobs | Never published |

### Adding More Later

After initial publication, run new models and return to **Publish** to add them. Previously published jobs stay public.

## Best Practices

Before publishing:

- **Verify grading** — Manually check 5–10 traces. Look for false positives and false negatives.
- **Test stability** — Flaky environments produce inconsistent results that undermine leaderboard validity.
- **Include baselines** — Always include well-known models (GPT-4o, Claude) as reference points.
- **Document clearly** — Add a description explaining what skills are tested and expected difficulty.

A quality leaderboard has diverse tasks, multiple agents (3–5 minimum), reasonable difficulty (20–80% average success), and fair, consistent grading.
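
If you want a quick numeric sanity check against these rules of thumb before publishing, a throwaway script along these lines works; the thresholds simply restate the guidance above and the numbers are placeholders:

```python
from statistics import mean

# Rough pre-publish sanity check mirroring the rules of thumb above.
# Replace these numbers with your own results from the Leaderboard tab.
model_scores = {"model-a": 0.72, "model-b": 0.55, "model-c": 0.31}  # avg success per model
task_count = 32

checks = {
    "at least 3 models evaluated": len(model_scores) >= 3,
    "20-50 tasks": 20 <= task_count <= 50,
    "average success in the 20-80% band": 0.2 <= mean(model_scores.values()) <= 0.8,
    "scores show some spread": len(set(model_scores.values())) > 1,
}

for name, ok in checks.items():
    print(f"{('PASS' if ok else 'CHECK'):>5}  {name}")
```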

<CardGroup cols={2}>
  <Card title="Tasksets" icon="list-check" href="/platform/tasksets">
    Detailed taskset management
  </Card>

  <Card title="Environments" icon="cube" href="/platform/environments">
    Create environments with scenarios
  </Card>
</CardGroup>