Add more agent evals STG-653 #961
Conversation
🦋 Changeset detected. Latest commit: 72a8bdf. The changes in this PR will be included in the next version bump.
Greptile Summary
This PR adds 17 new agent evaluation tasks to the Stagehand evaluation suite as part of STG-653. The new evaluations test the AI agent's capabilities across diverse real-world scenarios including e-commerce (Amazon shoes, Google Shopping, UberEats), entertainment platforms (Steam Games, Apple TV), research tools (arXiv, Hugging Face, WolframAlpha), and various web services (GitHub, Google Maps, NBA trades on ESPN, hotel booking).
All new evaluation files follow the established pattern from the existing agent evaluation framework:
- Navigate to the target website using `stagehand.page.goto()`
- Create an agent with dynamic provider selection based on model name (Claude models use "anthropic", others use "openai")
- Execute specific instructions with defined step limits (typically 14-30 steps)
- Evaluate success based on the `agentResult.success` property
- Include proper error handling, logging, and resource cleanup
The tasks are added to `evals.config.json` under the 'agent' category, integrating them into the existing evaluation pipeline. These evaluations expand test coverage to validate agent performance across complex multi-step workflows like checkout processes, search filtering, information extraction, and form filling on production websites.
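As a minimal sketch, a task following this pattern might look like the code below, assuming Stagehand's `stagehand.agent()` / `agent.execute()` API. The URL, instruction text, and step limit are invented for illustration and are not taken from the PR:

```typescript
import { Stagehand } from "@browserbasehq/stagehand";

// Hypothetical eval task illustrating the pattern described above.
// The URL, instruction, and maxSteps value are placeholders.
export async function exampleAgentEval(stagehand: Stagehand, modelName: string) {
  try {
    // 1. Navigate to the target website.
    await stagehand.page.goto("https://example.com");

    // 2. Dynamic provider selection based on model name.
    const provider = modelName.startsWith("claude") ? "anthropic" : "openai";
    const agent = stagehand.agent({ provider, model: modelName });

    // 3. Execute the instruction with a defined step limit.
    const agentResult = await agent.execute({
      instruction: "Find the pricing page and report the cheapest plan.",
      maxSteps: 20, // the PR's tasks use limits between 14 and 30
    });

    // 4. Success is judged from the agent's own result flag.
    return { _success: agentResult.success };
  } catch (error) {
    console.error("Eval failed:", error);
    return { _success: false };
  } finally {
    // 5. Resource cleanup.
    await stagehand.close();
  }
}
```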
Confidence score: 3/5
- This PR requires careful review due to several evaluation quality issues and potential risks from using production websites
- Score lowered due to lack of proper result validation in most tasks, reliance on production sites that may change, and some logical flaws in evaluation criteria
- Pay close attention to `evals/tasks/agent/kith.ts` for payment form risks, `evals/tasks/agent/hotel_booking.ts` for validation gaps, and the formatting issue in `evals.config.json`
17 files reviewed, 6 comments
Greptile Summary
This review covers only the changes made since the last review (commit e71810e), not the entire PR.
The most recent changes complete a major refactoring of the agent evaluation system by centralizing agent initialization logic. The key changes include:
- **Agent initialization centralization**: All agent evaluation functions have been updated to receive a pre-configured `agent` parameter instead of creating their own agent instances. This eliminates the duplicate model selection and provider mapping logic that was scattered across individual evaluation files.
- **Type system updates**: The `StagehandInitResult` type in `types/evals.ts` now includes an `agent` property typed as `ReturnType<Stagehand["agent"]>`, enabling evaluation functions to access agent functionality through dependency injection.
- **Centralized configuration**: `initStagehand.ts` now includes Computer Use Agent (CUA) model detection logic that automatically determines whether a model supports computer use capabilities (checking for 'computer-use-preview' or models starting with 'claude') and creates appropriate agent configurations with proper provider mapping (see the sketch after this list).
- **Standardized evaluation pattern**: All ~20 agent evaluation files now follow a consistent pattern: they receive a pre-initialized agent, execute instructions using `agent.execute()`, and validate results based on `agentResult.success`. This creates uniformity across the evaluation suite.
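As a rough sketch of the centralized pieces described above: the helper name and exact type shape below are assumptions, but the `agent` property type, the detection rules, and the provider mapping come directly from the summary.

```typescript
import { Stagehand } from "@browserbasehq/stagehand";

// Sketch of the type extension: StagehandInitResult gains an `agent`
// property typed via ReturnType<Stagehand["agent"]>. Other fields the
// real type carries are omitted here.
type StagehandInitResult = {
  stagehand: Stagehand;
  agent: ReturnType<Stagehand["agent"]>;
};

// Hypothetical CUA detection helper; the function name and placement are
// assumptions, but the detection rules ('computer-use-preview' or models
// starting with 'claude') and provider mapping come from the summary.
function createAgentForModel(
  stagehand: Stagehand,
  modelName: string,
): ReturnType<Stagehand["agent"]> {
  const isCUAModel =
    modelName === "computer-use-preview" || modelName.startsWith("claude");
  const provider = modelName.startsWith("claude") ? "anthropic" : "openai";
  // CUA-capable models get an explicit provider/model configuration;
  // other models fall back to the default agent.
  return isCUAModel
    ? stagehand.agent({ provider, model: modelName })
    : stagehand.agent();
}
```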
The refactoring moves from a decentralized approach, where each evaluation file handled its own agent setup, to a centralized dependency injection pattern. This architectural change reduces code duplication, ensures consistent agent configuration across all evaluations, and makes global changes to agent behavior easier to maintain. The changes integrate with the existing evaluation framework by extending the `StagehandInitResult` interface and updating the initialization flow to provide agent functionality to evaluation tasks.
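Under this pattern, an individual task reduces to something like the sketch below. The `EvalFunction` type, the import path, and the simplified return shape are assumptions based on the framework description; the URL and instruction are placeholders:

```typescript
import type { EvalFunction } from "@/types/evals";

// Hypothetical refactored task: the agent arrives pre-configured through
// dependency injection, so the task only navigates, executes, and reports.
export const exampleAgentEval: EvalFunction = async ({ stagehand, agent }) => {
  try {
    await stagehand.page.goto("https://example.com");
    const agentResult = await agent.execute({
      instruction: "Find the pricing page and report the cheapest plan.",
      maxSteps: 20,
    });
    return { _success: agentResult.success };
  } finally {
    await stagehand.close();
  }
};
```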
Confidence score: 4/5
- This PR is safe to merge with good architectural improvements and consistent patterns
- Score reflects clean refactoring with proper type safety, though some evaluations lack robust result validation
- Pay close attention to files with time-dependent instructions or weak validation logic
29 files reviewed, 4 comments
part of STG-653
why
Adds more evals to the agent category
what changed
Added ~15 new evals
test plan