Add more agent evals STG-653 #961
Conversation
🦋 Changeset detected. Latest commit: 72a8bdf. The changes in this PR will be included in the next version bump.
Greptile Summary
This PR adds 17 new agent evaluation tasks to the Stagehand evaluation suite as part of STG-653. The new evaluations test the AI agent's capabilities across diverse real-world scenarios including e-commerce (Amazon shoes, Google Shopping, UberEats), entertainment platforms (Steam Games, Apple TV), research tools (arXiv, Hugging Face, WolframAlpha), and various web services (GitHub, Google Maps, NBA trades on ESPN, hotel booking).
All new evaluation files follow the established pattern from the existing agent evaluation framework:
- Navigate to the target website using `stagehand.page.goto()`
- Create an agent with dynamic provider selection based on model name (Claude models use "anthropic", others use "openai")
- Execute specific instructions with defined step limits (typically 14-30 steps)
- Evaluate success based on the `agentResult.success` property
- Include proper error handling, logging, and resource cleanup
The tasks are added to `evals.config.json` under the 'agent' category, integrating them into the existing evaluation pipeline. These evaluations expand test coverage to validate agent performance across complex multi-step workflows like checkout processes, search filtering, information extraction, and form filling on production websites.
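As a minimal sketch, a task following this pattern might look like the code below, assuming Stagehand's `stagehand.agent()` / `agent.execute()` API. The URL, instruction text, and step limit are invented for illustration and are not taken from the PR:

```typescript
import { Stagehand } from "@browserbasehq/stagehand";

// Hypothetical eval task illustrating the pattern described above.
// The URL, instruction, and maxSteps value are placeholders.
export async function exampleAgentEval(stagehand: Stagehand, modelName: string) {
  try {
    // 1. Navigate to the target website.
    await stagehand.page.goto("https://example.com");

    // 2. Dynamic provider selection based on model name.
    const provider = modelName.startsWith("claude") ? "anthropic" : "openai";
    const agent = stagehand.agent({ provider, model: modelName });

    // 3. Execute the instruction with a defined step limit.
    const agentResult = await agent.execute({
      instruction: "Find the pricing page and report the cheapest plan.",
      maxSteps: 20, // the PR's tasks use limits between 14 and 30
    });

    // 4. Success is judged from the agent's own result flag.
    return { _success: agentResult.success };
  } catch (error) {
    console.error("Eval failed:", error);
    return { _success: false };
  } finally {
    // 5. Resource cleanup.
    await stagehand.close();
  }
}
```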
Confidence score: 3/5
- This PR requires careful review due to several evaluation quality issues and potential risks from using production websites
- Score lowered due to lack of proper result validation in most tasks, reliance on production sites that may change, and some logical flaws in evaluation criteria
- Pay close attention to `evals/tasks/agent/kith.ts` for payment form risks, `evals/tasks/agent/hotel_booking.ts` for validation gaps, and the formatting issue in `evals.config.json`
17 files reviewed, 6 comments
Greptile Summary
This review covers only the changes made since the last review (commit e71810e), not the entire PR.
The most recent changes complete a major refactoring of the agent evaluation system by centralizing agent initialization logic. The key changes include:
- **Agent initialization centralization**: All agent evaluation functions have been updated to receive a pre-configured `agent` parameter instead of creating their own agent instances. This eliminates the duplicate model selection and provider mapping logic that was scattered across individual evaluation files.
- **Type system updates**: The `StagehandInitResult` type in `types/evals.ts` now includes an `agent` property typed as `ReturnType<Stagehand["agent"]>`, enabling evaluation functions to access agent functionality through dependency injection.
- **Centralized configuration**: `initStagehand.ts` now includes Computer Use Agent (CUA) model detection logic that automatically determines whether a model supports computer use capabilities (checking for 'computer-use-preview' or models starting with 'claude') and creates appropriate agent configurations with proper provider mapping (see the sketch after this list).
- **Standardized evaluation pattern**: All ~20 agent evaluation files now follow a consistent pattern: they receive a pre-initialized agent, execute instructions using `agent.execute()`, and validate results based on `agentResult.success`. This creates uniformity across the evaluation suite.
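As a rough sketch of the centralized pieces described above: the helper name and exact type shape below are assumptions, but the `agent` property type, the detection rules, and the provider mapping come directly from the summary.

```typescript
import { Stagehand } from "@browserbasehq/stagehand";

// Sketch of the type extension: StagehandInitResult gains an `agent`
// property typed via ReturnType<Stagehand["agent"]>. Other fields the
// real type carries are omitted here.
type StagehandInitResult = {
  stagehand: Stagehand;
  agent: ReturnType<Stagehand["agent"]>;
};

// Hypothetical CUA detection helper; the function name and placement are
// assumptions, but the detection rules ('computer-use-preview' or models
// starting with 'claude') and provider mapping come from the summary.
function createAgentForModel(
  stagehand: Stagehand,
  modelName: string,
): ReturnType<Stagehand["agent"]> {
  const isCUAModel =
    modelName === "computer-use-preview" || modelName.startsWith("claude");
  const provider = modelName.startsWith("claude") ? "anthropic" : "openai";
  // CUA-capable models get an explicit provider/model configuration;
  // other models fall back to the default agent.
  return isCUAModel
    ? stagehand.agent({ provider, model: modelName })
    : stagehand.agent();
}
```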
The refactoring moves from a decentralized approach, where each evaluation file handled its own agent setup, to a centralized dependency injection pattern. This architectural change reduces code duplication, ensures consistent agent configuration across all evaluations, and makes global changes to agent behavior easier to maintain. The changes integrate with the existing evaluation framework by extending the `StagehandInitResult` interface and updating the initialization flow to provide agent functionality to evaluation tasks.
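Under this pattern, an individual task reduces to something like the sketch below. The `EvalFunction` type, the import path, and the simplified return shape are assumptions based on the framework description; the URL and instruction are placeholders:

```typescript
import type { EvalFunction } from "@/types/evals";

// Hypothetical refactored task: the agent arrives pre-configured through
// dependency injection, so the task only navigates, executes, and reports.
export const exampleAgentEval: EvalFunction = async ({ stagehand, agent }) => {
  try {
    await stagehand.page.goto("https://example.com");
    const agentResult = await agent.execute({
      instruction: "Find the pricing page and report the cheapest plan.",
      maxSteps: 20,
    });
    return { _success: agentResult.success };
  } finally {
    await stagehand.close();
  }
};
```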
Confidence score: 4/5
- This PR is safe to merge with good architectural improvements and consistent patterns
- Score reflects clean refactoring with proper type safety, though some evaluations lack robust result validation
- Pay close attention to files with time-dependent instructions or weak validation logic
29 files reviewed, 4 comments
part of STG-653
why
Adds more evals to the agent category
what changed
Added ~15 new evals
test plan