Currently, our benchmark ranking system uses simple sorting: higher score = better rank. This results in misleadingly precise rankings when score differences are negligible (e.g. 99.92 vs 99.93 being treated as different ranks, even though the difference is statistically insignificant or irrelevant in practice).
🧩 Problem
The current behavior:
- Ranks are assigned based on the strict ordering of scores.
- Very close scores (e.g., due to noise, floating-point jitter, or micro-optimizations) are given separate ranks.
- This creates noise in the leaderboard and overemphasizes tiny performance differences.
✅ Desired behavior
- Group almost-equal scores into the same rank, so that only meaningful differences are reflected.
- Use either:
  - Tolerance-based ranking: if scores differ by less than `epsilon`, they are considered equal.
  - Upper-bound bucket ranking: scores are grouped into pre-defined thresholds or quantized steps.
🛠️ Proposed solution

Option A: Epsilon-tolerant rank

```ts
type BenchmarkResult = { name: string; score: number };

// Assign the same rank to scores that are within `epsilon` of the previous
// score. Ranks are competition-style: after a tied group, the next distinct
// score gets rank = (number of entries already ranked) + 1, e.g. 1, 1, 3.
function tolerantRank(data: Array<BenchmarkResult>, epsilon = 0.01) {
  // Sort descending by score.
  const sorted = [...data].sort((a, b) => b.score - a.score);
  const ranked: Array<{ rank: number; name: string; score: number }> = [];
  let currentRank = 1;

  for (let i = 0; i < sorted.length; i++) {
    // Start a new rank group only when the gap to the previous entry exceeds epsilon.
    if (i > 0 && Math.abs(sorted[i].score - sorted[i - 1].score) > epsilon) {
      currentRank = ranked.length + 1;
    }
    ranked.push({
      rank: currentRank,
      name: sorted[i].name,
      score: sorted[i].score,
    });
  }
  return ranked;
}
```
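For illustration, a quick usage sketch (the model names and scores below are invented for the example):

```ts
const results: Array<BenchmarkResult> = [
  { name: "model-a", score: 99.93 },
  { name: "model-b", score: 99.926 },
  { name: "model-c", score: 97.1 },
];

// With the default epsilon of 0.01, model-a and model-b tie at rank 1
// (their gap is ~0.004), while model-c starts a new group at rank 3.
console.log(tolerantRank(results));
// [
//   { rank: 1, name: "model-a", score: 99.93 },
//   { rank: 1, name: "model-b", score: 99.926 },
//   { rank: 3, name: "model-c", score: 97.1 }
// ]
```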
Option B: Upper-bound bucket rank

```ts
// Pre-defined upper bounds: each score is mapped to the smallest bucket
// that is still greater than or equal to it.
const buckets = [95, 98, 99.5, 99.9, 100];

function upperBoundRank(score: number) {
  for (const bucket of buckets) {
    if (score <= bucket) return bucket;
  }
  // Fallback for scores above the last bucket.
  return 100;
}
```
Apply this during score preprocessing, and then rank the buckets instead of raw scores.
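As a minimal sketch of that preprocessing step (reusing `BenchmarkResult` and `upperBoundRank` from the options above; the helper name `rankByBucket` is made up, not an existing API):

```ts
// Hypothetical preprocessing: replace each raw score with its bucket, then
// rank on the bucket so that everything in the same bucket ties.
function rankByBucket(data: Array<BenchmarkResult>) {
  const bucketed = data
    .map((r) => ({ ...r, bucket: upperBoundRank(r.score) }))
    .sort((a, b) => b.bucket - a.bucket);

  let currentRank = 1;
  return bucketed.map((r, i) => {
    // Competition-style ranks: a new group starts whenever the bucket changes.
    if (i > 0 && r.bucket !== bucketed[i - 1].bucket) currentRank = i + 1;
    return { rank: currentRank, name: r.name, score: r.score, bucket: r.bucket };
  });
}
```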
π¬ Open questions
- Should the grouping tolerance (
epsilon
) be static or adaptive based on the score range? - Do we prefer visually clear bucket labels (e.g.
β€ 99.5%
) or a fuzzy equality? - Should we show confidence intervals or standard deviations in the future?
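On the first question, one possible direction for an adaptive tolerance (purely a sketch, not a concrete proposal; the 1% fraction is an arbitrary placeholder):

```ts
// Derive epsilon from the spread of the scores instead of hard-coding it.
function adaptiveEpsilon(data: Array<BenchmarkResult>, fraction = 0.01) {
  const scores = data.map((r) => r.score);
  const range = Math.max(...scores) - Math.min(...scores);
  // Fall back to a tiny constant when all scores are identical.
  return range > 0 ? range * fraction : 1e-6;
}

// Usage: tolerantRank(results, adaptiveEpsilon(results));
```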
🎯 Goal
- Make benchmark ranking more trustworthy, human-readable, and resilient to noise.
We could also add anomaly detection on top of this. The inspiration is this leaderboard: https://web.lmarena.ai/leaderboard