
feat: add aggregation metrics to evaluation score card#1523

Open
IgnazioDS wants to merge 1 commit into lmnr-ai:main from IgnazioDS:feat/eval-aggregation-metrics

Conversation


@IgnazioDS IgnazioDS commented Mar 26, 2026

Summary

Adds additional aggregation metrics to the evaluation score card, addressing #637.

Currently the evaluation results page only shows the average of numeric scores. This PR adds:

  • Median — robust central tendency, less sensitive to outliers
  • Standard Deviation — measures score consistency/spread
  • Min / Max — range boundaries
  • Count — number of data points

These metrics are displayed in a compact grid below the existing average display, preserving the current UI hierarchy.

Design Decisions

  • All metrics are universally applicable — no need to distinguish between error metrics, quality scores, or classifications. This avoids the complexity around RMSE semantics discussed in #637 (Support more aggregations of numeric eval output besides average and histogram).
  • Population std deviation (not sample) — since we're computing over the full evaluation run, not a sample.
  • Zero-cost for empty results — all fields default to 0 when no valid scores exist.
  • Comparison mode preserved — the primary average + comparison arrow UI is untouched; secondary metrics appear below.
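The decisions above can be sketched in a minimal form like this (the function and field names mirror the PR description, but the actual signature in `lib/actions/evaluation/utils.ts` may differ):

```typescript
interface EvaluationScoreStatistics {
  averageValue: number;
  medianValue: number;
  stdDeviation: number;
  minValue: number;
  maxValue: number;
  count: number;
}

// Sketch of the extended aggregation; all fields default to 0 when
// there are no valid scores, matching the "zero-cost for empty results" decision.
function calculateScoreStatistics(scores: number[]): EvaluationScoreStatistics {
  const valid = scores.filter((s) => Number.isFinite(s));
  const count = valid.length;
  if (count === 0) {
    return { averageValue: 0, medianValue: 0, stdDeviation: 0, minValue: 0, maxValue: 0, count: 0 };
  }
  const averageValue = valid.reduce((a, b) => a + b, 0) / count;
  // Population variance: divide by N, since the full evaluation run is the population.
  const variance = valid.reduce((a, b) => a + (b - averageValue) ** 2, 0) / count;
  const sorted = [...valid].sort((a, b) => a - b);
  const mid = Math.floor(count / 2);
  const medianValue = count % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return {
    averageValue,
    medianValue,
    stdDeviation: Math.sqrt(variance),
    minValue: sorted[0],
    maxValue: sorted[count - 1],
    count,
  };
}
```

Because the population formula divides by N rather than N−1, the std deviation of a single-score run is 0 rather than undefined, which keeps the UI simple.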

Files Changed

| File | Change |
| --- | --- |
| `lib/evaluation/types.ts` | Added `minValue`, `maxValue`, `stdDeviation`, `medianValue`, `count` to `EvaluationScoreStatistics` |
| `lib/actions/evaluation/utils.ts` | Extended `calculateScoreStatistics()` to compute all metrics |
| `components/evaluation/score-card.tsx` | Added secondary metrics grid below average display |
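For the score-card change, the hide-when-empty behavior can be sketched as a small helper that derives the grid rows from the statistics (the helper name and formatting are illustrative, not taken from the diff):

```typescript
interface EvaluationScoreStatistics {
  averageValue: number;
  medianValue: number;
  stdDeviation: number;
  minValue: number;
  maxValue: number;
  count: number;
}

// Label/value pairs for the secondary metrics grid; an empty list means
// the grid is hidden (no valid scores), leaving only the primary average display.
function secondaryMetrics(stats: EvaluationScoreStatistics): [string, string][] {
  if (stats.count === 0) return [];
  const fmt = (v: number) => v.toFixed(2);
  return [
    ["Median", fmt(stats.medianValue)],
    ["Std Dev", fmt(stats.stdDeviation)],
    ["Min", fmt(stats.minValue)],
    ["Max", fmt(stats.maxValue)],
    ["Count", String(stats.count)],
  ];
}
```

Keeping this derivation separate from the comparison-arrow rendering is what lets the existing primary average + comparison UI stay untouched.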

Test Plan

  • Evaluation with numeric scores shows all 6 metrics (avg, median, std dev, min, max, count)
  • Evaluation with no scores shows avg = 0 and no secondary metrics grid
  • Comparison mode (two evaluations) still renders correctly with arrow + percentage change
  • Metrics are accurate: verify against manual calculation on a small dataset
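For the manual-calculation check, a small dataset like `[1, 2, 3, 4]` works as a reference (values below computed by hand, not taken from the PR):

```typescript
// Manual reference computation for scores [1, 2, 3, 4]:
const avg = (1 + 2 + 3 + 4) / 4;                 // 2.5
const median = (2 + 3) / 2;                      // 2.5 (even count: mean of middle two)
// Population variance: mean of squared deviations from the average.
const variance =
  ((1 - avg) ** 2 + (2 - avg) ** 2 + (3 - avg) ** 2 + (4 - avg) ** 2) / 4; // 1.25
const stdDev = Math.sqrt(variance);              // ≈ 1.118
console.log({ avg, median, stdDev, min: 1, max: 4, count: 4 });
```

The score card should display these same six values for that dataset.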

Addresses #637


Note

Low Risk
Low-risk UI/data-display change that extends the computed score statistics; the main risk is downstream code that expects EvaluationScoreStatistics to contain only averageValue.

Overview
Extends EvaluationScoreStatistics to include medianValue, stdDeviation, minValue, maxValue, and count, and updates calculateScoreStatistics() to compute these aggregates (including population std dev and median).

Updates the evaluation score card to render a compact secondary-metrics grid (median/std dev/min/max + count) beneath the existing average/comparison display, hiding the grid when there are no valid scores.

Written by Cursor Bugbot for commit d6b1215.

Extend evaluation statistics with additional aggregation metrics beyond
the existing average. All metrics are universally applicable regardless
of whether the score represents an error, a classification result, or
a quality score.

Changes:
- types.ts: Expand EvaluationScoreStatistics with new fields
- utils.ts: Compute median, std deviation, min, max, count
- score-card.tsx: Display secondary metrics in a compact grid below
  the primary average display

Addresses lmnr-ai#637
@IgnazioDS

@Rainhunter13 @skull8888888 This adds median, std dev, min, max, and count to the evaluation score card — addresses #637. All metrics are universally applicable (no semantic metadata needed). Happy to adjust the UI layout or add/remove metrics based on your feedback!
