feat: benchmark tools by gewenyu99 · Pull Request #280 · PostHog/wizard

gewenyu99 · 2026-02-20T19:34:05Z

Here's a simplified version of the benchmarking tools I used for my experiments.

Screen.Recording.2026-02-20.at.2.20.25.PM.mov

Benchmarking middleware

All benchmarking tools conform to a middleware shape that can implement onMessage, onPhaseTransition, and onFinalize.

Basically, these are hooks into the wizard agent life cycle. The methods are called at the appropriate points of the agent run.

We can implement whatever tracking we want like this.

export interface Middleware {
  /** Unique name for this middleware (used in config and store keys) */
  readonly name: string;
  /** Called once when the pipeline initializes */
  onInit?(ctx: MiddlewareContext): void;
  /** Called for every SDK message */
  onMessage?(
    message: SDKMessage,
    ctx: MiddlewareContext,
    store: MiddlewareStore,
  ): void;
  /** Called when a phase transition is detected */
  onPhaseTransition?(
    fromPhase: string,
    toPhase: string,
    ctx: MiddlewareContext,
    store: MiddlewareStore,
  ): void;
  /** Called at the end of the agent run. Return value from last middleware is used. */
  onFinalize?(
    resultMessage: any,
    totalDurationMs: number,
    ctx: MiddlewareContext,
    store: MiddlewareStore,
  ): any;
}
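
As a rough illustration of the shape above, here is a minimal sketch of a middleware that counts assistant messages per phase. The `SDKMessage`, `MiddlewareContext`, and `MiddlewareStore` types below are simplified stand-ins (assumptions, not the real definitions from this PR), and `turnCounter` and `currentPhase` are hypothetical names.

```typescript
// Simplified stand-ins: the real shapes of these types are not shown in this PR description.
type SDKMessage = { type: string };
type MiddlewareContext = { currentPhase: string };
type MiddlewareStore = Record<string, unknown>;

// Trimmed copy of the interface above, using the stand-in types.
interface Middleware {
  readonly name: string;
  onMessage?(message: SDKMessage, ctx: MiddlewareContext, store: MiddlewareStore): void;
  onFinalize?(
    resultMessage: unknown,
    totalDurationMs: number,
    ctx: MiddlewareContext,
    store: MiddlewareStore,
  ): unknown;
}

// Hypothetical middleware: counts assistant messages ("turns") per phase.
const turnCounter: Middleware = {
  name: 'turn-counter',
  onMessage(message, ctx, store) {
    if (message.type !== 'assistant') return;
    // State lives in the shared store, keyed by the middleware name.
    const counts = (store['turn-counter'] as Record<string, number> | undefined) ?? {};
    counts[ctx.currentPhase] = (counts[ctx.currentPhase] ?? 0) + 1;
    store['turn-counter'] = counts;
  },
  onFinalize(_resultMessage, totalDurationMs, _ctx, store) {
    // Whatever is returned here becomes the result if this is the last middleware.
    return { totalDurationMs, turnsByPhase: store['turn-counter'] ?? {} };
  },
};
```

Keeping state in the store rather than in the middleware object means a fresh store gives a fresh run, which is one plausible reason the hooks receive it as an argument.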

Registering new middleware

The middleware code path is entirely controlled by the options.benchmark variable. When it's false, we don't add the middleware and none of the benchmarking code runs. This keeps benchmarking code from polluting the rest of the code.

  const middleware = options.benchmark
    ? createBenchmarkPipeline(spinner, options)
    : undefined;

  const agentResult = await runAgent(
    agent,
    integrationPrompt,
    // ... (diff hunk boundary inside runAgentWizard; intervening lines not shown) ...
      successMessage: config.ui.successMessage,
      errorMessage: 'Integration failed',
    },
    middleware,
  );
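
The body of createBenchmarkPipeline is not shown here, so the following is only a sketch of how a pipeline could fan each hook out to its registered middleware while following the "return value from last middleware is used" rule. `SketchPipeline`, `SketchMiddleware`, and `buildPipeline` are hypothetical names, and the hook signatures are trimmed for brevity.

```typescript
type Store = Record<string, unknown>;

// Trimmed, hypothetical version of the middleware shape.
interface SketchMiddleware {
  readonly name: string;
  onMessage?(message: unknown, store: Store): void;
  onFinalize?(resultMessage: unknown, totalDurationMs: number, store: Store): unknown;
}

class SketchPipeline {
  private readonly store: Store = {};
  private readonly middlewares: SketchMiddleware[];

  constructor(middlewares: SketchMiddleware[]) {
    this.middlewares = middlewares;
  }

  // Forward every message to each middleware that implements onMessage.
  onMessage(message: unknown): void {
    for (const m of this.middlewares) m.onMessage?.(message, this.store);
  }

  // Call every onFinalize in order; the last one's return value wins.
  finalize(resultMessage: unknown, totalDurationMs: number): unknown {
    let result: unknown = undefined;
    for (const m of this.middlewares) {
      if (m.onFinalize) result = m.onFinalize(resultMessage, totalDurationMs, this.store);
    }
    return result;
  }
}

// Mirrors the gating above: no pipeline object exists unless benchmarking is on.
function buildPipeline(
  benchmark: boolean,
  middlewares: SketchMiddleware[],
): SketchPipeline | undefined {
  return benchmark ? new SketchPipeline(middlewares) : undefined;
}
```

With this shape the agent runner only needs optional calls such as `pipeline?.onMessage(msg)`, so the non-benchmark path does no extra work.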

Config files

We read a .benchmark-config.json config file, which we can optionally override during the run. In createBenchmarkPipeline(), we optionally enable different "plugins" for the benchmark middleware.

It's all pretty modular. Check https://github.com/PostHog/wizard-workbench/pull/409 for how it's meant to be run.
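
The keys of .benchmark-config.json are not listed in this description, so the sketch below assumes a hypothetical shape (`outputPath` plus a `plugins` map of on/off flags) purely to illustrate the "file config, then run-time override" idea. `resolveBenchmarkConfig` and `enabledPlugins` are made-up helper names, and reading the file from disk is left to the caller.

```typescript
// Hypothetical config shape; the real keys are not shown in this PR description.
interface BenchmarkConfig {
  outputPath: string;
  plugins: Record<string, boolean>;
}

const DEFAULT_CONFIG: BenchmarkConfig = {
  // Default path taken from the benchmark log output in this PR.
  outputPath: '/tmp/posthog-wizard-benchmark.json',
  plugins: {},
};

// Merge order: defaults, then the file contents, then run-time overrides.
function resolveBenchmarkConfig(
  rawJson: string | undefined,
  overrides: Partial<BenchmarkConfig> = {},
): BenchmarkConfig {
  const fromFile: Partial<BenchmarkConfig> = rawJson ? JSON.parse(rawJson) : {};
  return {
    outputPath: overrides.outputPath ?? fromFile.outputPath ?? DEFAULT_CONFIG.outputPath,
    plugins: { ...DEFAULT_CONFIG.plugins, ...fromFile.plugins, ...overrides.plugins },
  };
}

// Names of the plugins whose flag is true after merging.
function enabledPlugins(config: BenchmarkConfig): string[] {
  return Object.keys(config.plugins).filter((name) => config.plugins[name]);
}
```

A pipeline factory could then instantiate only the middleware whose names appear in `enabledPlugins(config)`.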

github-actions · 2026-02-20T19:34:17Z

🧙 Wizard CI

Run the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands:

Test all apps:

/wizard-ci all

Test all apps in a directory:

/wizard-ci android
/wizard-ci angular
/wizard-ci astro
/wizard-ci django
/wizard-ci fastapi
/wizard-ci flask
/wizard-ci laravel
/wizard-ci next-js
/wizard-ci nuxt
/wizard-ci react-native
/wizard-ci react-router
/wizard-ci sveltekit
/wizard-ci swift
/wizard-ci tanstack-router
/wizard-ci tanstack-start
/wizard-ci vue

Test an individual app:

/wizard-ci android/Jetchat
/wizard-ci angular/angular-saas
/wizard-ci astro/astro-hybrid-marketing

/wizard-ci astro/astro-ssr-docs
/wizard-ci astro/astro-static-marketing
/wizard-ci astro/astro-view-transitions-marketing
/wizard-ci django/django3-saas
/wizard-ci fastapi/fastapi3-ai-saas
/wizard-ci flask/flask3-social-media
/wizard-ci laravel/laravel12-saas
/wizard-ci next-js/15-app-router-saas
/wizard-ci next-js/15-app-router-todo
/wizard-ci next-js/15-pages-router-saas
/wizard-ci next-js/15-pages-router-todo
/wizard-ci nuxt/movies-nuxt-3-6
/wizard-ci nuxt/movies-nuxt-4
/wizard-ci react-native/expo-react-native-hacker-news
/wizard-ci react-native/react-native-saas
/wizard-ci react-router/react-router-v7-project
/wizard-ci react-router/rrv7-starter
/wizard-ci react-router/saas-template
/wizard-ci react-router/shopper
/wizard-ci sveltekit/CMSaasStarter
/wizard-ci swift/hackers-ios
/wizard-ci tanstack-router/tanstack-router-code-based-saas
/wizard-ci tanstack-router/tanstack-router-file-based-saas
/wizard-ci tanstack-start/tanstack-start-saas
/wizard-ci vue/movies

Results will be posted here when complete.

edwinyjlim

so sick

I'm seeing a weird quirk where each phase/stage is showing $0.00. Idk if that's expected.

src/lib/agent-interface.ts

gewenyu99 · 2026-02-20T23:46:35Z

:kek: Probably lost a few screws cherry picking this out of my messy branches. Lemme fix the issues

gewenyu99 · 2026-02-21T02:07:06Z

Hmmm actually, Sonnet reads the prompt files more than once. I need to rework the detection logic a bit. Weird.
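
The actual detection logic is not shown in this thread. As a sketch only, and assuming a phase is signalled each time the agent reads that phase's prompt file, one way to tolerate repeated reads is to report a transition only the first time a phase is seen. `PhaseDetector` and `observe` are hypothetical names; the phase names come from the benchmark log in this PR.

```typescript
// Hypothetical detector: ignores repeat sightings of a phase that was already entered.
class PhaseDetector {
  private current = 'setup';
  private readonly seen: Record<string, boolean> = { setup: true };

  // Returns the transition on first sight of a phase, otherwise undefined.
  observe(phase: string): { from: string; to: string } | undefined {
    if (this.seen[phase]) return undefined; // re-read of a known phase's prompt
    const from = this.current;
    this.seen[phase] = true;
    this.current = phase;
    return { from, to: phase };
  }
}
```

This assumes phases never legitimately repeat within a run; if they can, the check would need to compare against the current phase only.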

gewenyu99 · 2026-02-21T03:27:40Z

I drafted this. Moving the code from my branch onto main doesn't seem to just work lol.

gewenyu99 · 2026-02-21T03:30:45Z

There's also some different output from the API:

    "input_tokens": 2,
    "cache_creation_input_tokens": 40243,

We used to rely on input_tokens, which ends up being 2 here because everything is now under cache_creation_input_tokens...
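
A minimal sketch of counting prompt-side tokens without relying on input_tokens alone. The field names follow the usage object of the Anthropic Messages API; `Usage` and `totalInputTokens` are hypothetical names, and this is not necessarily how the fix in this PR computes its totals.

```typescript
// Subset of the usage object; cache fields may be absent or null.
interface Usage {
  input_tokens?: number;
  output_tokens?: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// Prompt-side total: uncached input + tokens written to cache + tokens read from cache.
function totalInputTokens(usage: Usage): number {
  return (
    (usage.input_tokens ?? 0) +
    (usage.cache_creation_input_tokens ?? 0) +
    (usage.cache_read_input_tokens ?? 0)
  );
}
```

For the sample above, this gives 2 + 40243 = 40245 instead of 2. Cost tracking would still need the three counts kept separate, since cache writes and reads are billed at different rates than uncached input.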

gewenyu99 · 2026-02-21T04:02:56Z

┌  Welcome to the PostHog setup wizard ✨
│
●  Running in CI mode
│
◆  Detected integration: Vue

┌   PostHog Vue wizard (agent-powered) 
│
●  [BETA] The Vue wizard is in beta. Questions or feedback? Email wizard@posthog.com
│
●  We're about to read your project using our LLM gateway.
│  
│  .env* file contents will not leave your machine.
│  
│  Other files will be read and edited to provide a fully-custom PostHog integration.
│
●  CI mode: continuing with uncommitted/untracked files in repo
│
●  Using provided API key (CI mode - OAuth bypassed)
│
◇  Initializing Claude agent...
│
◇  Verbose logs: /tmp/posthog-wizard.log
│
◆  Agent initialized. Let's get cooking!
│
●  [BENCHMARK] Verbose logs: /tmp/posthog-wizard.log
│
●  [BENCHMARK] Benchmark data will be written to: /tmp/posthog-wizard-benchmark.json
│
◇  This whole process should take about 5 minutes including error checking and fixes.
│  
│  Grab some coffee!
│
◇  Checking project structure.
│
◇  [BENCHMARK] setup: 23s, 6 turns, cost: $0.25
  in: 7, out: 316, cache_read: 219.9K, cache_5m: 46.8K, cache_1h: 0
  ctx_out: 46.8K
│
●  [BENCHMARK] Starting phase: 1.0-begin
│
◇  Verifying PostHog dependencies.
│
◇  Inserting PostHog capture code.
│
◇  [BENCHMARK] 1.0-begin: 32s, 8 turns, cost: $0.23
  in: 8, out: 338, cache_read: 446.0K, cache_5m: 23.5K, cache_1h: 0
  ctx_out: 70.4K
│
●  [BENCHMARK] Starting phase: 1.1-edit
│
◇  Planning edits for `src/composables/useAuth.ts` — Add `posthog.identify()` on login, `posthog.capture('user_logged_in')`, `posthog.capture('user_logged_out')`, and `posthog.reset()` on logout.
│
◇  Planning edits for `src/views/LoginView.vue` — Capture `login_failed` event when login throws an error.
│
◇  Planning edits for `src/views/MediaDetailView.vue` — Add `media_detail_viewed`, `trailer_played`, `trailer_closed`, and `media_load_failed` events.
│
◇  Planning edits for `src/views/SearchView.vue` — Add `search_performed` event with query and result count, and `search_failed` on errors.
│
◇  Edited `src/main.js`, `src/composables/useAuth.ts`, `src/views/LoginView.vue`, `src/views/MediaDetailView.vue`, `src/views/SearchView.vue`.
│
◇  Finding and correcting errors.
│
◇  [BENCHMARK] 1.1-edit: 58s, 13 turns, cost: $0.39
  in: 13, out: 3.4K, cache_read: 980.2K, cache_5m: 12.3K, cache_1h: 0
  ctx_out: 82.7K
│
●  [BENCHMARK] Starting phase: 1.2-revise
│
◇  Linting, building and prettying — Build ✓ (no errors).
│
◇  Configured dashboard: https://us.posthog.com/project/198052/dashboard/1296746
│
◇  [BENCHMARK] 1.2-revise: 2m 14s, 13 turns, cost: $0.63
  in: 6.0K, out: 1.7K, cache_read: 1.3M, cache_5m: 50.0K, cache_1h: 0
  ctx_out: 132.7K
│
●  [BENCHMARK] Starting phase: 1.3-conclude
│
◑  Integrating PostHog (1.3-conclude)...│
│
●  ◇ [BENCHMARK] 5 phases in 4m 17s, cost: $1.73
│
●    total in: 3.4M, out: 5.7K, cache_read: 3.3M, cache_5m: 0, cache_1h: 0
│
│
●  ● [BENCHMARK] Summary by phase:
│
●  setup: 23s, 6 turns, cost: $0.27
│    in: 7, out: 316, cache_read: 219.9K, cache_5m: 46.8K, cache_1h: 0
│    ctx_out: 46.8K
│
●  1.0-begin: 32s, 8 turns, cost: $0.25
│    in: 8, out: 338, cache_read: 446.0K, cache_5m: 23.5K, cache_1h: 0
│    ctx_out: 70.4K
│
●  1.1-edit: 58s, 13 turns, cost: $0.43
│    in: 13, out: 3.4K, cache_read: 980.2K, cache_5m: 12.3K, cache_1h: 0
│    ctx_out: 82.7K
│
●  1.2-revise: 2m 14s, 13 turns, cost: $0.69
│    in: 6.0K, out: 1.7K, cache_read: 1.3M, cache_5m: 50.0K, cache_1h: 0
│    ctx_out: 132.7K
│
●  1.3-conclude: 10s, 2 turns, cost: $0.09
│    in: 4, out: 2, cache_read: 265.6K, cache_5m: 1.3K, cache_1h: 0
│    ctx_out: 134.0K
│
│
●  ● [BENCHMARK] Results written to /tmp/posthog-wizard-benchmark.json
◇  PostHog integration complete
│
●  Skipping MCP installation (CI mode)
│
└  
Successfully installed PostHog!

What the agent did:
• Analyzed your Vue project structure
• Created and configured PostHog initializers
• Integrated PostHog into your application
• Added environment variables to .env file

Next steps:
• Start your development server to see PostHog in action
• Visit your PostHog dashboard to see incoming events
• Upload your Project API key to your hosting provider

Learn more: https://posthog.com/docs/libraries/vue

Note: This wizard uses an LLM agent to analyze and modify your 

How did this work for you? Drop us a line: wizard@posthog.com


Benchmark completed in 263.7s
Results: /tmp/posthog-wizard-benchmark.json

Fixed

benchmark tools

8654f50

gewenyu99 changed the title ~~benchmark tools~~ feat: benchmark tools Feb 20, 2026

gewenyu99 requested a review from a team February 20, 2026 19:37

edwinyjlim approved these changes Feb 20, 2026

gewenyu99 marked this pull request as draft February 21, 2026 03:23

gewenyu99 marked this pull request as ready for review February 21, 2026 04:03

gewenyu99 and others added 5 commits February 20, 2026 23:03

Anthropic, stop changing

1a412d0

Merge branch 'main' into benchmarking

8bc19f1

Add some error handling

7f7cfdc

Merge branch 'benchmarking' of https://github.com/PostHog/wizard into benchmarking

ad9ec4c

Merge branch 'main' into benchmarking

f0077fd

gewenyu99 merged commit 0f79a24 into main Feb 24, 2026
14 checks passed

gewenyu99 deleted the benchmarking branch February 24, 2026 17:13

releaser-wizard bot mentioned this pull request Feb 24, 2026

chore(main): release 1.35.0 #286

Open

feat: benchmark tools #280
gewenyu99 merged 6 commits into main from benchmarking
