work on adding voyager to evals #959
Merged
26 commits
2e3e151 work on adding voyager to evals (filip-michalsky)
d272670 add gaia evals (filip-michalsky)
1e27e07 Merge branch 'main' into fm/stg-661-add-web-voyager (filip-michalsky)
71cfcb0 refactor (filip-michalsky)
16e3a52 add sampling of suites (filip-michalsky)
0461322 updating evals (filip-michalsky)
4aaca65 linting (filip-michalsky)
1922e38 Merge branch 'main' into fm/stg-661-add-web-voyager (filip-michalsky)
b9b1702 remove logs, small updates (filip-michalsky)
22c9fe7 remove logs (filip-michalsky)
dcaeb83 revert unwanted change (filip-michalsky)
df880ac more revert (filip-michalsky)
087f8cd load env at root (filip-michalsky)
7c1d5a0 add changeset (filip-michalsky)
bd7352b update (filip-michalsky)
4be8864 update ci (filip-michalsky)
b78d824 ci update (filip-michalsky)
70a7b7c Merge main branch (filip-michalsky)
9a7c057 Merge main branch (filip-michalsky)
f9817c6 Merge fm/stg-670-add-agent-to-ci into fm/stg-661-add-web-voyager (filip-michalsky)
c04a226 Update .github/workflows/ci.yml (filip-michalsky)
bc85c12 lint (filip-michalsky)
5081d3b exclude gaia and voyager from agent ci (filip-michalsky)
ce6abd4 update stagehandInitType to send eval inputs in EvalInput (filip-michalsky)
f5abf91 add external agent benchmarks as a category to CI (filip-michalsky)
51246f6 Update .github/workflows/ci.yml (filip-michalsky)
@@ -0,0 +1,5 @@
---
"@browserbasehq/stagehand": patch
---

add webvoyager evals
@@ -0,0 +1,69 @@
import fs from "fs";
import { tasksByName } from "../taskConfig";
import type { SummaryResult } from "@/types/evals";

export const generateSummary = async (
  results: SummaryResult[],
  experimentName: string,
) => {
  const passed = results
    .filter((r) => r.output._success)
    .map((r) => ({
      eval: r.input.name,
      model: r.input.modelName,
      categories: tasksByName[r.input.name].categories,
    }));

  const failed = results
    .filter((r) => !r.output._success)
    .map((r) => ({
      eval: r.input.name,
      model: r.input.modelName,
      categories: tasksByName[r.input.name].categories,
    }));

  const categorySuccessCounts: Record<
    string,
    { total: number; success: number }
  > = {};
  for (const taskName of Object.keys(tasksByName)) {
    const taskCategories = tasksByName[taskName].categories;
    const taskResults = results.filter((r) => r.input.name === taskName);
    const successCount = taskResults.filter((r) => r.output._success).length;

    for (const cat of taskCategories) {
      if (!categorySuccessCounts[cat]) {
        categorySuccessCounts[cat] = { total: 0, success: 0 };
      }
      categorySuccessCounts[cat].total += taskResults.length;
      categorySuccessCounts[cat].success += successCount;
    }
  }

  const categories: Record<string, number> = {};
  for (const [cat, counts] of Object.entries(categorySuccessCounts)) {
    categories[cat] = Math.round((counts.success / counts.total) * 100);
  }

  const models: Record<string, number> = {};
  const allModels = [...new Set(results.map((r) => r.input.modelName))];
  for (const model of allModels) {
    const modelResults = results.filter((r) => r.input.modelName === model);
    const successCount = modelResults.filter((r) => r.output._success).length;
    models[model] = Math.round((successCount / modelResults.length) * 100);
  }

  const formattedSummary = {
    experimentName,
    passed,
    failed,
    categories,
    models,
  };

  fs.writeFileSync(
    "eval-summary.json",
    JSON.stringify(formattedSummary, null, 2),
  );
  console.log("Evaluation summary written to eval-summary.json");
};
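
One edge case in the per-category percentages of `generateSummary` may be worth a note: the loop walks every task in `tasksByName`, so a category whose tasks produced no results ends up with `total: 0`, and `0 / 0` yields `NaN`, which `JSON.stringify` writes as `null`. The sketch below is not part of this PR; it shows one possible guard under simplified assumptions. `MiniResult`, `TaskTable`, `passRate`, `categoryPassRates`, and the category names `cat_a` and `cat_b` are all hypothetical stand-ins for illustration, not names from the repository.

```typescript
// Hypothetical, simplified stand-in for SummaryResult from "@/types/evals".
interface MiniResult {
  input: { name: string; modelName: string };
  output: { _success: boolean };
}

// Hypothetical stand-in for the shape of tasksByName from "../taskConfig".
type TaskTable = Record<string, { categories: string[] }>;

// Rounded success percentage; returns 0 instead of NaN when nothing ran.
function passRate(success: number, total: number): number {
  return total === 0 ? 0 : Math.round((success / total) * 100);
}

// Same aggregation order as generateSummary: iterate tasks, then categories.
function categoryPassRates(
  results: MiniResult[],
  tasks: TaskTable,
): Record<string, number> {
  const counts: Record<string, { total: number; success: number }> = {};
  for (const taskName of Object.keys(tasks)) {
    const taskResults = results.filter((r) => r.input.name === taskName);
    const successCount = taskResults.filter((r) => r.output._success).length;
    for (const cat of tasks[taskName].categories) {
      if (!counts[cat]) {
        counts[cat] = { total: 0, success: 0 };
      }
      counts[cat].total += taskResults.length;
      counts[cat].success += successCount;
    }
  }

  const rates: Record<string, number> = {};
  for (const [cat, c] of Object.entries(counts)) {
    rates[cat] = passRate(c.success, c.total);
  }
  return rates;
}

// Illustrative data only: task_b is configured but has no results.
const exampleTasks: TaskTable = {
  task_a: { categories: ["cat_a"] },
  task_b: { categories: ["cat_b"] },
};
const exampleResults: MiniResult[] = [
  { input: { name: "task_a", modelName: "model_x" }, output: { _success: true } },
  { input: { name: "task_a", modelName: "model_x" }, output: { _success: false } },
];
const exampleRates = categoryPassRates(exampleResults, exampleTasks);
console.log(JSON.stringify(exampleRates));
```

Reporting 0 for a category that never ran is a design choice; omitting such categories from the summary would be an equally reasonable alternative, since a 0% score can be misread as every run failing.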