Add more agent evals STG-653 #961


Open · wants to merge 8 commits into main

Conversation

tkattkat (Collaborator) commented

part of STG-653

why

Adds more evals for the agent

what changed

Added ~15 new evals

test plan

  • tested locally
  • tested on browserbase

changeset-bot commented on Aug 12, 2025

🦋 Changeset detected

Latest commit: 72a8bdf

The changes in this PR will be included in the next version bump.


greptile-apps bot (Contributor) left a comment

Greptile Summary

This PR adds 17 new agent evaluation tasks to the Stagehand evaluation suite as part of STG-653. The new evaluations test the AI agent's capabilities across diverse real-world scenarios including e-commerce (Amazon shoes, Google Shopping, UberEats), entertainment platforms (Steam Games, Apple TV), research tools (arXiv, Hugging Face, WolframAlpha), and various web services (GitHub, Google Maps, NBA trades on ESPN, hotel booking).

All new evaluation files follow the established pattern from the existing agent evaluation framework (a sketch follows this list):

  • Navigate to the target website using stagehand.page.goto()
  • Create an agent with dynamic provider selection based on the model name (Claude models use "anthropic", others use "openai")
  • Execute specific instructions with defined step limits (typically 14-30 steps)
  • Evaluate success based on the agentResult.success property
  • Include proper error handling, logging, and resource cleanup
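
For concreteness, here is a minimal sketch of one such task. The EvalFunction signature, the import path, the return fields (_success, debugUrl, sessionUrl, logs), the task name, and the instruction text are all assumptions for illustration, not code from this PR:

```typescript
// Hypothetical eval task following the pattern above; shapes and names
// are assumed from the review summary, not copied from the PR.
import { EvalFunction } from "../../../types/evals"; // import path assumed

export const amazon_shoes: EvalFunction = async ({
  stagehand,
  modelName,
  logger,
  debugUrl,
  sessionUrl,
}) => {
  try {
    // Navigate to the target website
    await stagehand.page.goto("https://www.amazon.com");

    // Dynamic provider selection: Claude models map to "anthropic",
    // everything else to "openai"
    const agent = stagehand.agent({
      provider: modelName.startsWith("claude") ? "anthropic" : "openai",
      model: modelName,
    });

    // Execute the task instruction under a bounded step budget
    const agentResult = await agent.execute({
      instruction:
        "Search for running shoes and add a pair under $100 to the cart.",
      maxSteps: 20,
    });

    // Success is judged from the agent's self-reported outcome
    return {
      _success: agentResult.success,
      debugUrl,
      sessionUrl,
      logs: logger.getLogs(),
    };
  } catch (error) {
    return {
      _success: false,
      error: JSON.stringify(error, null, 2),
      debugUrl,
      sessionUrl,
      logs: logger.getLogs(),
    };
  } finally {
    // Resource cleanup
    await stagehand.close();
  }
};
```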

The tasks are added to evals.config.json under the 'agent' category, integrating them into the existing evaluation pipeline. These evaluations expand test coverage to validate agent performance across complex multi-step workflows like checkout processes, search filtering, information extraction, and form filling on production websites.
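
The registration in evals.config.json presumably looks something like the snippet below; the schema (a tasks array with name and categories fields) is an assumption, and amazon_shoes is an illustrative name, while hotel_booking is taken from the files flagged later in this review:

```json
{
  "tasks": [
    { "name": "amazon_shoes", "categories": ["agent"] },
    { "name": "hotel_booking", "categories": ["agent"] }
  ]
}
```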

Confidence score: 3/5

  • This PR requires careful review due to several evaluation quality issues and potential risks from using production websites
  • Score lowered due to lack of proper result validation in most tasks, reliance on production sites that may change, and some logical flaws in evaluation criteria
  • Pay close attention to evals/tasks/agent/kith.ts for payment form risks, evals/tasks/agent/hotel_booking.ts for validation gaps, and the formatting issue in evals.config.json

17 files reviewed, 6 comments


tkattkat changed the title from "More evals" to "Add more ore agent evals STG-653" on Aug 12, 2025
tkattkat requested a review from seanmcguire12 on Aug 12, 2025, 21:37
tkattkat marked this pull request as a draft on Aug 13, 2025, 00:10
tkattkat marked this pull request as ready for review on Aug 13, 2025, 00:24
greptile-apps bot (Contributor) left a comment

Greptile Summary

This review covers only the changes made since the last review (commit e71810e), not the entire PR.

The most recent changes complete a major refactoring of the agent evaluation system by centralizing agent initialization logic. The key changes include:

  1. Agent initialization centralization: All agent evaluation functions have been updated to receive a pre-configured agent parameter instead of creating their own agent instances. This eliminates the duplicate model selection and provider mapping logic that was scattered across individual evaluation files.

  2. Type system updates: The StagehandInitResult type in types/evals.ts now includes an agent property using ReturnType<Stagehand["agent"]>, enabling evaluation functions to access agent functionality through dependency injection.

  3. Centralized configuration: The initStagehand.ts file now includes Computer Use Agent (CUA) model detection logic that automatically determines if a model supports computer use capabilities (checking for 'computer-use-preview' or models starting with 'claude') and creates appropriate agent configurations with proper provider mapping.

  4. Standardized evaluation pattern: All ~20 agent evaluation files now follow a consistent pattern where they receive a pre-initialized agent, execute instructions using agent.execute(), and validate results based on agentResult.success. This creates uniformity across the evaluation suite.

The refactoring moves from a decentralized approach where each evaluation file handled its own agent setup to a centralized dependency injection pattern. This architectural change reduces code duplication, ensures consistent agent configuration across all evaluations, and provides better maintainability for global agent behavior modifications. The changes integrate with the existing evaluation framework by extending the StagehandInitResult interface and updating the initialization flow to provide agent functionality to evaluation tasks.
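
Under those descriptions, the centralized setup might look roughly like the sketch below. The type expression and the model checks are quoted from the summary; the createAgent helper name, the file split, and the fallback for non-CUA models are assumptions:

```typescript
// Sketch of the centralized agent initialization described above, as it
// might appear in initStagehand.ts. Names beyond those quoted in the
// summary are assumptions.
import { Stagehand } from "@browserbasehq/stagehand";

// types/evals.ts: the init result now carries a pre-configured agent
export type StagehandInitResult = {
  stagehand: Stagehand;
  agent: ReturnType<Stagehand["agent"]>;
  // ...existing fields (logger, debugUrl, sessionUrl, ...)
};

// initStagehand.ts: CUA detection and provider mapping live in one place
function createAgent(
  stagehand: Stagehand,
  modelName: string,
): ReturnType<Stagehand["agent"]> {
  // A model supports computer use if it is OpenAI's computer-use-preview
  // or any Claude model
  const isCUA =
    modelName.includes("computer-use-preview") ||
    modelName.startsWith("claude");

  if (isCUA) {
    return stagehand.agent({
      provider: modelName.startsWith("claude") ? "anthropic" : "openai",
      model: modelName,
    });
  }

  // Non-CUA models fall back to the default agent (assumption)
  return stagehand.agent();
}
```

With the agent injected this way, each eval task reduces to navigation, a single agent.execute() call, and a success check, which is what removes the duplicated provider-mapping logic from the individual files.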

Confidence score: 4/5

  • This PR is safe to merge with good architectural improvements and consistent patterns
  • Score reflects clean refactoring with proper type safety, though some evaluations lack robust result validation
  • Pay close attention to files with time-dependent instructions or weak validation logic

29 files reviewed, 4 comments


tkattkat changed the title from "Add more ore agent evals STG-653" to "Add more agent evals STG-653" on Aug 13, 2025