Merged
🧙 Wizard CI
Run the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands:
Test all apps:
Test all apps in a directory:
Test an individual app:
Results will be posted here when complete.
edwinyjlim
approved these changes
Feb 20, 2026
Contributor
Author
:kek: I probably lost a few screws cherry-picking this out of my messy branches. Lemme fix the issues.
Hmmm, actually, Sonnet reads the prompt files more than once. I need to rework the detection logic a bit. Weird.
I converted this to a draft. Moving the code from my branch onto main doesn't seem to just work lol.
There's also some different output from the API: we used to rely on the input tokens, which ends up being
Fixed.

Here's a simplified version of the benchmarking tools I used for my experiments.
Screen.Recording.2026-02-20.at.2.20.25.PM.mov
Benchmarking middleware
Benchmarking tools all conform to a middleware shape that implements `onMessage`, `onPhaseTransition`, and `onFinalize`. Basically, these are hooks into the wizard agent life cycle. These methods are called at the appropriate points in the middleware pipeline.
We can implement whatever tracking we want like this.
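As an illustration of the three hooks, here is a minimal sketch of a tracker that counts messages per phase. The `AgentMessage` type, the hook parameter lists, and the `MessageCountTracker` class are assumptions made for this example; the actual types in the wizard may look different.

```typescript
// Hypothetical message shape; the real wizard message type is not shown here.
interface AgentMessage {
  role: string;
  text: string;
}

// Assumed middleware shape built around the three hooks named above.
interface BenchmarkMiddleware {
  name: string;
  onMessage?(message: AgentMessage): void;
  onPhaseTransition?(from: string, to: string): void;
  onFinalize?(): Record<string, unknown>;
}

// Example tracker: counts how many messages arrive in each phase.
class MessageCountTracker implements BenchmarkMiddleware {
  name = 'message-count';
  private phase = 'setup'; // assumed initial phase name
  private counts: Record<string, number> = {};

  onMessage(_message: AgentMessage): void {
    this.counts[this.phase] = (this.counts[this.phase] ?? 0) + 1;
  }

  onPhaseTransition(_from: string, to: string): void {
    this.phase = to;
  }

  // Returns a copy of the collected counts so callers cannot mutate internal state.
  onFinalize(): Record<string, unknown> {
    return { messagesPerPhase: { ...this.counts } };
  }
}
```

Each tracker keeps its own state and only reports it in `onFinalize`, so adding a new measurement should not require touching the agent loop itself.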
Registering new middleware
The middleware code path is entirely controlled by the `options.benchmark` variable. When it's `false`, we don't add the observable and none of the code runs. This keeps the benchmarking code from polluting the rest of the codebase.
Config files
We read a `.benchmark-config.json` config that we can optionally override during the run. In `export function createBenchmarkPipeline(`, we optionally enable different "plugins" for the benchmark middleware.
It's all pretty modular; check https://github.com/PostHog/wizard-workbench/pull/409 for how it's meant to be run.
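To make the gating and plugin selection concrete, here is a rough sketch of how a pipeline could be assembled. Everything in it is an assumption for illustration: the `plugins` layout of the config, the `configOverride` option, the plugin names `tokens` and `timing`, and the function `buildPipelineSketch`, which is deliberately not named `createBenchmarkPipeline` because that function's real signature is not shown here.

```typescript
// Hypothetical middleware shape, reduced to what this sketch needs.
interface Middleware {
  name: string;
}

// Assumed layout for .benchmark-config.json: plugin name -> enabled flag.
interface BenchmarkConfig {
  plugins: Record<string, boolean>;
}

// Assumed run options; only `benchmark` is mentioned in the description above.
interface RunOptions {
  benchmark: boolean;
  configOverride?: Partial<BenchmarkConfig>;
}

// Illustrative plugin registry; these plugin names are made up.
const availablePlugins: Record<string, () => Middleware> = {
  tokens: () => ({ name: 'tokens' }),
  timing: () => ({ name: 'timing' }),
};

function buildPipelineSketch(
  options: RunOptions,
  fileConfig: BenchmarkConfig,
): Middleware[] {
  // When benchmarking is off, nothing is registered and no benchmark code runs.
  if (!options.benchmark) return [];

  // Per-run overrides win over values read from the config file.
  const plugins = {
    ...fileConfig.plugins,
    ...(options.configOverride?.plugins ?? {}),
  };

  const pipeline: Middleware[] = [];
  for (const name of Object.keys(availablePlugins)) {
    const create = availablePlugins[name];
    if (create && plugins[name] === true) pipeline.push(create());
  }
  return pipeline;
}
```

For example, with a file config of `{ plugins: { tokens: true, timing: false } }`, a run with `benchmark: true` would get only the `tokens` plugin, and passing `configOverride: { plugins: { timing: true } }` would add `timing` for that run.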