Improve benchmark ranking system to group similar scores using tolerant ranking algorithm #2073

@ImBIOS

Description

Currently, our benchmark ranking system uses simple sorting: higher score = better rank. This produces misleadingly precise rankings when score differences are negligible (e.g. 99.92 vs 99.93 are treated as different ranks, even though the gap is statistically insignificant or irrelevant in practice).

🧩 Problem

The current behavior:

  • Ranks are assigned based on the strict ordering of scores.
  • Very close scores (e.g., due to noise, floating-point jitter, or micro-optimizations) are given separate ranks.
  • This creates noise in the leaderboard and overemphasizes tiny performance differences.

✅ Desired behavior

  • Group almost-equal scores into the same rank, to reflect meaningful differences only.

  • Use either:

    • Tolerance-based ranking: scores that differ by less than epsilon are considered equal.
    • Upper-bound bucket ranking: scores are grouped into pre-defined thresholds or quantized steps.

🛠️ Proposed solution

  1. Option A: Epsilon-tolerant rank

    type BenchmarkResult = { name: string; score: number };
    
    // Competition-style ranking where scores within `epsilon` of their
    // neighbor share a rank. Note: the comparison is pairwise, so a long
    // chain of near-equal scores collapses into one rank even if its
    // extremes differ by more than epsilon.
    function tolerantRank(data: Array<BenchmarkResult>, epsilon = 0.01) {
      const sorted = [...data].sort((a, b) => b.score - a.score);
    
      const ranked: Array<{ rank: number; name: string; score: number }> = [];
      let currentRank = 1;
    
      for (let i = 0; i < sorted.length; i++) {
        // Open a new rank only when the gap to the previous score exceeds
        // the tolerance; ties skip ranks, as in standard competition ranking.
        if (
          i > 0 &&
          Math.abs(sorted[i].score - sorted[i - 1].score) > epsilon
        ) {
          currentRank = ranked.length + 1;
        }
    
        ranked.push({
          rank: currentRank,
          name: sorted[i].name,
          score: sorted[i].score,
        });
      }
    
      return ranked;
    }
  2. Option B: Upper-bound bucket rank

    const buckets = [95, 98, 99.5, 99.9, 100];
    
    // Map a raw score to the smallest bucket upper bound that contains it.
    // Buckets must be sorted ascending; anything above the last bucket is
    // clamped to it.
    function upperBoundRank(score: number) {
      for (const bucket of buckets) {
        if (score <= bucket) return bucket;
      }
      return buckets[buckets.length - 1];
    }

    Apply this during score preprocessing, and then rank the buckets instead of raw scores (see the usage sketch below).
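
A small usage sketch tying the two options together. The sample results are made up, and reusing tolerantRank with epsilon = 0 to rank the bucketed scores is just one way to do it:

    const results: Array<BenchmarkResult> = [
      { name: "model-a", score: 99.93 },
      { name: "model-b", score: 99.92 },
      { name: "model-c", score: 97.4 },
    ];
    
    // Option A: fuzzy equality on raw scores. model-a and model-b share
    // rank 1; model-c gets rank 3 (competition ranking skips rank 2).
    tolerantRank(results, 0.05);
    
    // Option B: quantize scores to bucket upper bounds first, then rank
    // the buckets with exact equality (epsilon = 0). The buckets here are
    // [100, 100, 98], which yields the same 1, 1, 3 ranking.
    const bucketed = results.map((r) => ({ ...r, score: upperBoundRank(r.score) }));
    tolerantRank(bucketed, 0);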


💬 Open questions

  • Should the grouping tolerance (epsilon) be static, or adaptive based on the score range? (See the sketch after this list.)
  • Do we prefer visually clear bucket labels (e.g. ≤ 99.5%) or fuzzy equality?
  • Should we show confidence intervals or standard deviations in the future?
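
On the first question, a minimal sketch of an adaptive epsilon, assuming we derive the tolerance from the observed score spread. The function name adaptiveEpsilon and the 1% fraction are illustrative assumptions, not part of the proposal:

    // Scale the tolerance to the observed score range so that tightly
    // clustered leaderboards use a smaller epsilon than spread-out ones.
    // The default fraction (1% of the range) is an arbitrary placeholder.
    function adaptiveEpsilon(data: Array<BenchmarkResult>, fraction = 0.01): number {
      const scores = data.map((d) => d.score);
      const range = Math.max(...scores) - Math.min(...scores);
      return range * fraction;
    }
    
    // Usage: tolerantRank(data, adaptiveEpsilon(data));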

🏁 Goal

  • Make benchmark ranking more trustworthy, human-readable, and resilient to noise.

We could also add anomaly detection on top of this. The inspiration is this leaderboard: https://web.lmarena.ai/leaderboard
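
If we go that route, a rough sketch of what the anomaly detection could look like, assuming a simple z-score test. The function name flagAnomalies and the threshold k = 3 are illustrative assumptions:

    // Flag scores that sit more than k standard deviations from the mean
    // as anomalies before ranking.
    function flagAnomalies(data: Array<BenchmarkResult>, k = 3) {
      const scores = data.map((d) => d.score);
      const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
      const variance =
        scores.reduce((acc, s) => acc + (s - mean) ** 2, 0) / scores.length;
      const std = Math.sqrt(variance);
      return data.map((d) => ({
        ...d,
        anomaly: std > 0 && Math.abs(d.score - mean) > k * std,
      }));
    }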

Related:
