[feature] Improve evaluation runs page(s) / table(s) #3016

Merged: ardaerzin merged 272 commits into release/v0.66.0 from frontend-feat/new-evaluations-pages on Dec 8, 2025
Conversation

@ardaerzin (Contributor)

tba...

Copilot AI review requested due to automatic review settings November 19, 2025 14:28
@vercel bot commented Nov 19, 2025

The latest updates on your projects (Vercel for GitHub):

Project: agenta-documentation | Status: Ready | Preview: Ready | Comments: Preview Comment | Updated (UTC): Dec 5, 2025 3:53pm

@CLAassistant commented Nov 19, 2025

CLA assistant check: All committers have signed the CLA.

Copilot AI left a comment (Contributor)

Pull Request Overview

This PR introduces significant improvements to the evaluation runs page(s) and table(s) in the frontend, implementing a comprehensive redesign of the evaluation run details interface. The changes add new views (Overview, Scenarios, Configuration), enhanced comparison capabilities, and improved data visualization components.

Key Changes

  • Added Overview view with metric comparisons, spider charts, and temporal metrics visualization
  • Implemented Configuration view with detailed run metadata, evaluator settings, and variant information
  • Enhanced table functionality with focus drawer for detailed scenario inspection
  • Added run comparison features with support for multiple run comparisons
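
To make the spider-chart comparison above concrete, here is a minimal hand-written sketch, not code from this PR: the function name `normalizeForSpiderChart` and the data shape are assumptions. The idea is to scale each run's metric values to a shared [0, 1] range per metric axis so several runs can be overlaid on one radar chart.

```typescript
// Hypothetical sketch (names and shapes assumed, not taken from the PR):
// normalize per-run metric values so each metric axis spans [0, 1].
function normalizeForSpiderChart(
    runs: Record<string, Record<string, number>>,
): Record<string, Record<string, number>> {
    // Find the per-metric maximum across all runs.
    const max: Record<string, number> = {}
    for (const metrics of Object.values(runs)) {
        for (const [name, value] of Object.entries(metrics)) {
            max[name] = Math.max(max[name] ?? 0, value)
        }
    }
    // Divide each value by its metric's maximum (a 0 maximum maps to 0).
    const out: Record<string, Record<string, number>> = {}
    for (const [runId, metrics] of Object.entries(runs)) {
        out[runId] = Object.fromEntries(
            Object.entries(metrics).map(([name, v]) => [name, max[name] ? v / max[name] : 0]),
        )
    }
    return out
}
```

Normalizing per axis (rather than globally) keeps metrics with different units, such as cost and accuracy, comparable on one chart.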

Reviewed Changes

Copilot reviewed 117 out of 324 changed files in this pull request and generated 5 comments.

Summary per file:

  • OverviewPlaceholders.tsx: Loading and empty state placeholders with animated radar chart mock
  • OverviewMetricComparison.tsx: Metric comparison logic aggregating data across runs
  • MetricComparisonCard.tsx: Chart component for displaying metric distributions across runs
  • MetadataSummaryTable.tsx: Comprehensive metadata table showing run details and metrics
  • EvaluatorTemporalMetricsChart.tsx: Time-series chart for evaluator metrics with area/line visualization
  • BaseRunMetricsSection.tsx: Section displaying base run metrics with temporal and static views
  • AggregatedOverviewSection.tsx: Aggregated overview combining metadata table and spider chart
  • OverviewView.tsx: Main overview view component orchestrating run comparisons
  • ConfigurationView/utils.ts: Utility functions for parsing run configuration data
  • ConfigurationView/index.tsx: Main configuration view with synchronized scrolling columns
  • TestsetSection.tsx: Testset configuration display component
  • QuerySection.tsx: Query configuration with filters and sampling rate display
  • PromptConfigCard.tsx: Prompt configuration card with message normalization
  • InvocationSection.tsx: Invocation configuration with variant details
  • GeneralSection.tsx: General run information with editable name/description
  • EvaluatorSection.tsx: Evaluator configuration display with JSON toggle
  • Reference components: Reusable components for displaying application/variant/testset references
  • TableHeaders/StepGroupHeader.tsx: Dynamic table headers with reference resolution
  • TableCells: Cell renderers for metrics, invocations, inputs, and actions
  • FocusDrawer components: Drawer for detailed scenario inspection with navigation
  • EvaluatorMetricsSpiderChart: Spider/radar chart for multi-dimensional metric visualization
  • EvaluatorMetricsChart: Chart components for evaluator metric distributions
  • CompareRunsMenu.tsx: UI for selecting and managing run comparisons
  • Page.tsx: Main page component with tab navigation
  • Various atoms/state: State management for comparison, focus drawer, and table data
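
The comparison state mentioned in the last entry could be modeled roughly as below. This is a hand-written sketch with assumed names (`ComparisonState`, `toggleComparedRun`); the PR's actual atoms and state library are not shown here.

```typescript
// Hypothetical sketch of run-comparison state (names assumed, not from the PR):
// one base run plus a toggleable set of compared runs.
type RunId = string

interface ComparisonState {
    baseRunId: RunId | null
    comparedRunIds: RunId[]
}

// Pure update function: add the run if absent, remove it if present.
function toggleComparedRun(state: ComparisonState, id: RunId): ComparisonState {
    const comparedRunIds = state.comparedRunIds.includes(id)
        ? state.comparedRunIds.filter((r) => r !== id)
        : [...state.comparedRunIds, id]
    return {...state, comparedRunIds}
}
```

Keeping the update pure makes it easy to wrap in whatever state container the app uses (an atom, a reducer, or plain React state).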


Copilot AI review requested due to automatic review settings November 19, 2025 15:13
Copilot AI left a comment (Contributor)

Pull Request Overview

Copilot reviewed 117 out of 326 changed files in this pull request and generated no new comments.



Copilot AI review requested due to automatic review settings November 20, 2025 15:27
Copilot AI left a comment (Contributor)

Pull Request Overview

Copilot reviewed 120 out of 336 changed files in this pull request and generated 1 comment.



@junaway changed the title from "[Frontend / Feat] Improve evaluation runs page(s) / table(s)" to "[feature] Improve evaluation runs page(s) / table(s)" on Nov 20, 2025
Copilot AI review requested due to automatic review settings November 21, 2025 11:12
Copilot AI left a comment (Contributor)

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment (Contributor)

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

ardaerzin and others added 30 commits December 4, 2025 18:26
…ogram bins, improve duration formatting

- Only use invocation step (type="invocation") for shared analytics keys (duration, tokens, costs) to avoid showing evaluator execution metrics instead of LLM metrics
- Add invocationStepKeys parameter to flattenRunLevelMetricData to identify correct step
- Aggregate histogram bins when count exceeds MAX_DISPLAY_BINS (6) for clearer visualization
- Fix duration formatting in metric popover by
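
The bin-aggregation bullet above can be sketched as follows. Only the MAX_DISPLAY_BINS = 6 threshold comes from the commit message; the `Bin` shape and `aggregateBins` helper are assumptions for illustration.

```typescript
// Hypothetical sketch (names assumed, not from the PR): merge adjacent
// histogram bins into at most MAX_DISPLAY_BINS groups, summing counts.
interface Bin {
    start: number
    end: number
    count: number
}

const MAX_DISPLAY_BINS = 6

function aggregateBins(bins: Bin[]): Bin[] {
    if (bins.length <= MAX_DISPLAY_BINS) return bins
    // Group consecutive bins so the output has at most MAX_DISPLAY_BINS entries.
    const groupSize = Math.ceil(bins.length / MAX_DISPLAY_BINS)
    const out: Bin[] = []
    for (let i = 0; i < bins.length; i += groupSize) {
        const group = bins.slice(i, i + groupSize)
        out.push({
            start: group[0].start,
            end: group[group.length - 1].end,
            count: group.reduce((sum, b) => sum + b.count, 0),
        })
    }
    return out
}
```

Merging adjacent bins preserves the total count while capping the number of bars the chart has to render.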
…and refactor utility functions

- Remove commented debug logs and unused code blocks across multiple components
- Add error handling and user feedback for testset name fetching failures
- Add input validation for scenario and run IDs to prevent SSRF attacks
- Improve prompt key resolution with helper function in PromptConfigCard
- Add clarifying comments for fallback logic and depth limits
- Refactor sample rate formatting with
- Remove word splitting, filtering, and humanization logic from humanizeEvaluatorName
- Return evaluator label unchanged instead of processing it
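
The SSRF-hardening bullet above can be illustrated with a minimal sketch. The names, the endpoint shape, and the assumption that IDs are UUIDs are all mine, not from the PR: the point is simply to reject any scenario/run ID that is not a plain identifier before interpolating it into a request path.

```typescript
// Hypothetical sketch: validate IDs before building request URLs so that
// path-traversal or host-injection payloads never reach the fetch layer.
// Assumes IDs are UUIDs; the PR's actual validation may differ.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i

function isSafeId(id: string): boolean {
    return UUID_RE.test(id)
}

function scenarioUrl(runId: string, scenarioId: string): string {
    // Hypothetical endpoint path, for illustration only.
    if (!isSafeId(runId) || !isSafeId(scenarioId)) {
        throw new Error("Invalid run or scenario id")
    }
    return `/api/evaluations/runs/${runId}/scenarios/${scenarioId}`
}
```

Validating with a strict allowlist pattern (rather than escaping) means strings like `../admin` or `evil.example/%2e%2e` are rejected outright.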

Labels

Evaluation, size:XXL (this PR changes 1000+ lines, ignoring generated files)

Projects

None yet


6 participants