feat: benchmark tools #280

Merged
gewenyu99 merged 6 commits into main from benchmarking
Feb 24, 2026

@gewenyu99

Here's a simplified version of the benchmarking tools I used for my experiments.

[Attached screen recording: Screen.Recording.2026-02-20.at.2.20.25.PM.mov]

Benchmarking middleware

Benchmarking tools all conform to a middleware shape that implements onMessage, onPhaseTransition, and onFinalize.

Basically, these are hooks into the wizard agent lifecycle. The methods are called at the corresponding points of the agent run.

We can implement whatever tracking we want like this:

export interface Middleware {
  /** Unique name for this middleware (used in config and store keys) */
  readonly name: string;
  /** Called once when the pipeline initializes */
  onInit?(ctx: MiddlewareContext): void;
  /** Called for every SDK message */
  onMessage?(
    message: SDKMessage,
    ctx: MiddlewareContext,
    store: MiddlewareStore,
  ): void;
  /** Called when a phase transition is detected */
  onPhaseTransition?(
    fromPhase: string,
    toPhase: string,
    ctx: MiddlewareContext,
    store: MiddlewareStore,
  ): void;
  /** Called at the end of the agent run. Return value from last middleware is used. */
  onFinalize?(
    resultMessage: any,
    totalDurationMs: number,
    ctx: MiddlewareContext,
    store: MiddlewareStore,
  ): any;
}
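
As an illustration of the interface above, here is a minimal sketch of a middleware that counts assistant turns per phase. The `SDKMessage`, `MiddlewareContext`, and `MiddlewareStore` types below are simplified stand-ins (their real definitions are not shown in this description), and `TurnCounter` with its `currentPhase` field is a hypothetical name, not something from this PR.

```typescript
// Hypothetical stand-ins for the PR's real types, which are not shown here.
type SDKMessage = { type: string };
type MiddlewareContext = { currentPhase: string };
type MiddlewareStore = Map<string, unknown>;

// Trimmed copy of the interface above so the sketch is self-contained.
interface Middleware {
  readonly name: string;
  onMessage?(message: SDKMessage, ctx: MiddlewareContext, store: MiddlewareStore): void;
  onPhaseTransition?(
    fromPhase: string,
    toPhase: string,
    ctx: MiddlewareContext,
    store: MiddlewareStore,
  ): void;
  onFinalize?(
    resultMessage: unknown,
    totalDurationMs: number,
    ctx: MiddlewareContext,
    store: MiddlewareStore,
  ): unknown;
}

interface TurnReport {
  totalDurationMs: number;
  turnsByPhase: Record<string, number>;
}

// Counts assistant messages ("turns") and files them under the phase they ran in.
class TurnCounter implements Middleware {
  readonly name = 'turn-counter';
  private turns = 0;
  private readonly perPhase: Record<string, number> = {};

  onMessage(message: SDKMessage): void {
    if (message.type === 'assistant') this.turns += 1;
  }

  // Close out the phase we are leaving and start counting from zero again.
  onPhaseTransition(fromPhase: string): void {
    this.perPhase[fromPhase] = this.turns;
    this.turns = 0;
  }

  // Record the final phase and return a small report.
  onFinalize(_result: unknown, totalDurationMs: number, ctx: MiddlewareContext): TurnReport {
    this.perPhase[ctx.currentPhase] = this.turns;
    return { totalDurationMs, turnsByPhase: { ...this.perPhase } };
  }
}
```

Because every hook is optional, a middleware only needs to implement the ones it cares about, and it can ignore trailing parameters such as `store` when it keeps its own state.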

Registering new middleware

The middleware code path is entirely controlled by the options.benchmark variable. When it's false, we don't add the observable, so none of the benchmarking code runs. This keeps the benchmarking code from polluting the rest of the codebase.

  const middleware = options.benchmark
    ? createBenchmarkPipeline(spinner, options)
    : undefined;

  const agentResult = await runAgent(
    agent,
    integrationPrompt,
@@ -206,6 +210,7 @@ export async function runAgentWizard(
      successMessage: config.ui.successMessage,
      errorMessage: 'Integration failed',
    },
    middleware,
  );
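
The gating described here can be sketched roughly as follows. This is an assumption-laden illustration rather than the wizard's actual code: `Pipeline`, `createPipeline`, and `runAgentSketch` are hypothetical stand-ins for `createBenchmarkPipeline` and `runAgent`, whose real signatures are not shown in this description.

```typescript
type SDKMessage = { type: string };

// Hypothetical pipeline shape; the real benchmark pipeline is not shown here.
interface Pipeline {
  onMessage(message: SDKMessage): void;
}

// Stand-in for createBenchmarkPipeline: records each message type it sees.
function createPipeline(log: string[]): Pipeline {
  return {
    onMessage: (message) => {
      log.push(message.type);
    },
  };
}

// Stand-in for runAgent: forwards messages only when a pipeline was passed in.
function runAgentSketch(messages: SDKMessage[], middleware?: Pipeline): number {
  for (const message of messages) {
    middleware?.onMessage(message); // no-op when middleware is undefined
  }
  return messages.length;
}

// Mirrors the ternary above: no flag, no pipeline, no benchmarking work.
function run(benchmark: boolean, log: string[]): number {
  const middleware = benchmark ? createPipeline(log) : undefined;
  return runAgentSketch([{ type: 'assistant' }, { type: 'result' }], middleware);
}
```

With `benchmark` set to `false`, the pipeline is never constructed and the optional call is skipped, so the non-benchmark path carries no extra state.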

Config files

We read a .benchmark-config.json config file, whose values can optionally be overridden during the run. In createBenchmarkPipeline, we can optionally enable different "plugins" for the benchmark middleware.

It's all pretty modular. Check https://github.com/PostHog/wizard-workbench/pull/409 for how it's meant to be run.
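
To make the config-plus-override idea concrete, here is a rough sketch of how file values and run-time overrides might be merged. The schema is an assumption: `BenchmarkConfig`, the plugin names, `resolveConfig`, and `enabledPlugins` are all hypothetical, since the real .benchmark-config.json format is not shown in this description. Only the default output path is taken from the run log further down.

```typescript
// Hypothetical config shape; the real schema is not shown in this PR description.
interface BenchmarkConfig {
  outputPath: string;
  plugins: Record<string, boolean>;
}

// Assumed defaults. The plugin names are made up for illustration.
const DEFAULTS: BenchmarkConfig = {
  outputPath: '/tmp/posthog-wizard-benchmark.json',
  plugins: { tokens: true, cost: true, duration: true },
};

// Precedence: run-time overrides, then file contents, then defaults.
function resolveConfig(
  fileJson: string | undefined,
  overrides: Partial<BenchmarkConfig> = {},
): BenchmarkConfig {
  const fromFile: Partial<BenchmarkConfig> = fileJson ? JSON.parse(fileJson) : {};
  return {
    outputPath: overrides.outputPath ?? fromFile.outputPath ?? DEFAULTS.outputPath,
    plugins: { ...DEFAULTS.plugins, ...fromFile.plugins, ...overrides.plugins },
  };
}

// Names of the plugins a pipeline factory would switch on.
function enabledPlugins(config: BenchmarkConfig): string[] {
  return Object.keys(config.plugins).filter((name) => config.plugins[name]);
}
```

Taking the file contents as a string keeps the merge logic free of file-system access, so it can be exercised without a real config file on disk.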

@github-actions

🧙 Wizard CI

Run the Wizard CI and test your changes against wizard-workbench example apps by replying with a GitHub comment using one of the following commands:

Test all apps:

  • /wizard-ci all

Test all apps in a directory:

  • /wizard-ci android
  • /wizard-ci angular
  • /wizard-ci astro
  • /wizard-ci django
  • /wizard-ci fastapi
  • /wizard-ci flask
  • /wizard-ci laravel
  • /wizard-ci next-js
  • /wizard-ci nuxt
  • /wizard-ci react-native
  • /wizard-ci react-router
  • /wizard-ci sveltekit
  • /wizard-ci swift
  • /wizard-ci tanstack-router
  • /wizard-ci tanstack-start
  • /wizard-ci vue

Test an individual app:

  • /wizard-ci android/Jetchat
  • /wizard-ci angular/angular-saas
  • /wizard-ci astro/astro-hybrid-marketing
  • /wizard-ci astro/astro-ssr-docs
  • /wizard-ci astro/astro-static-marketing
  • /wizard-ci astro/astro-view-transitions-marketing
  • /wizard-ci django/django3-saas
  • /wizard-ci fastapi/fastapi3-ai-saas
  • /wizard-ci flask/flask3-social-media
  • /wizard-ci laravel/laravel12-saas
  • /wizard-ci next-js/15-app-router-saas
  • /wizard-ci next-js/15-app-router-todo
  • /wizard-ci next-js/15-pages-router-saas
  • /wizard-ci next-js/15-pages-router-todo
  • /wizard-ci nuxt/movies-nuxt-3-6
  • /wizard-ci nuxt/movies-nuxt-4
  • /wizard-ci react-native/expo-react-native-hacker-news
  • /wizard-ci react-native/react-native-saas
  • /wizard-ci react-router/react-router-v7-project
  • /wizard-ci react-router/rrv7-starter
  • /wizard-ci react-router/saas-template
  • /wizard-ci react-router/shopper
  • /wizard-ci sveltekit/CMSaasStarter
  • /wizard-ci swift/hackers-ios
  • /wizard-ci tanstack-router/tanstack-router-code-based-saas
  • /wizard-ci tanstack-router/tanstack-router-file-based-saas
  • /wizard-ci tanstack-start/tanstack-start-saas
  • /wizard-ci vue/movies

Results will be posted here when complete.

@gewenyu99 gewenyu99 changed the title benchmark tools feat: benchmark tools Feb 20, 2026
@gewenyu99 gewenyu99 requested a review from a team February 20, 2026 19:37

@edwinyjlim edwinyjlim left a comment

so sick

I'm seeing a weird quirk where each phase/stage is showing $0.00. Idk if that's expected

Image

@gewenyu99

:kek: Probably lost a few screws cherry picking this out of my messy branches. Lemme fix the issues

@gewenyu99

Hmmm, actually, Sonnet reads the prompt files more than once. I need to rework the detection logic a bit. Weird.
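
One way to make phase detection tolerant of repeated prompt-file reads is to report a transition only the first time a phase is seen. The sketch below is a guess at the general idea, not the fix that was actually shipped: `PhaseDetector` and `observeRead` are hypothetical names, and it assumes the phase name has already been derived from whichever prompt file was read.

```typescript
interface Transition {
  from: string;
  to: string;
}

// Hypothetical detector: assumes each prompt-file read maps to one phase name.
class PhaseDetector {
  private current = 'setup'; // the run log below starts in a "setup" phase
  private readonly seen = new Set<string>(['setup']);

  // Returns the transition to report, or null when the read is a repeat.
  observeRead(phase: string): Transition | null {
    if (this.seen.has(phase)) return null; // re-read of a known prompt file
    const from = this.current;
    this.seen.add(phase);
    this.current = phase;
    return { from, to: phase };
  }
}
```

The trade-off is that a phase can never be re-entered; if the agent legitimately loops back to an earlier phase, this approach would miss it.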

@gewenyu99 gewenyu99 marked this pull request as draft February 21, 2026 03:23
@gewenyu99

I've moved this back to draft. Moving the code from my branch onto main doesn't seem to just work lol.

@gewenyu99

There's also some different output from the API:

    "input_tokens": 2,
    "cache_creation_input_tokens": 40243,

We used to rely on input_tokens, which ends up being just 2 here because almost everything is now reported under cache_creation_input_tokens...
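
A minimal sketch of accounting that stays correct when the input is mostly cached: sum the uncached, cache-write, and cache-read fields instead of reading input_tokens alone. The field names follow the usage payload quoted above plus cache_read_input_tokens from the same API; the `Usage` type and `totalInputTokens` helper are hypothetical and not taken from this PR.

```typescript
// Subset of the usage payload; every field is optional to tolerate older responses.
interface Usage {
  input_tokens?: number;
  output_tokens?: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

// Total tokens the model consumed as input, regardless of how they were billed.
function totalInputTokens(usage: Usage): number {
  return (
    (usage.input_tokens ?? 0) +
    (usage.cache_creation_input_tokens ?? 0) +
    (usage.cache_read_input_tokens ?? 0)
  );
}
```

Cost calculations would still need to price each bucket separately, since cache writes and cache reads are billed at different rates than uncached input.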

@gewenyu99

┌  Welcome to the PostHog setup wizard ✨
│
●  Running in CI mode
│
◆  Detected integration: Vue

┌   PostHog Vue wizard (agent-powered) 
│
●  [BETA] The Vue wizard is in beta. Questions or feedback? Email wizard@posthog.com
│
●  We're about to read your project using our LLM gateway.
│  
│  .env* file contents will not leave your machine.
│  
│  Other files will be read and edited to provide a fully-custom PostHog integration.
│
●  CI mode: continuing with uncommitted/untracked files in repo
│
●  Using provided API key (CI mode - OAuth bypassed)
│
◇  Initializing Claude agent...
│
◇  Verbose logs: /tmp/posthog-wizard.log
│
◆  Agent initialized. Let's get cooking!
│
●  [BENCHMARK] Verbose logs: /tmp/posthog-wizard.log
│
●  [BENCHMARK] Benchmark data will be written to: /tmp/posthog-wizard-benchmark.json
│
◇  This whole process should take about 5 minutes including error checking and fixes.
│  
│  Grab some coffee!
│
◇  Checking project structure.
│
◇  [BENCHMARK] setup: 23s, 6 turns, cost: $0.25
  in: 7, out: 316, cache_read: 219.9K, cache_5m: 46.8K, cache_1h: 0
  ctx_out: 46.8K
│
●  [BENCHMARK] Starting phase: 1.0-begin
│
◇  Verifying PostHog dependencies.
│
◇  Inserting PostHog capture code.
│
◇  [BENCHMARK] 1.0-begin: 32s, 8 turns, cost: $0.23
  in: 8, out: 338, cache_read: 446.0K, cache_5m: 23.5K, cache_1h: 0
  ctx_out: 70.4K
│
●  [BENCHMARK] Starting phase: 1.1-edit
│
◇  Planning edits for `src/composables/useAuth.ts` — Add `posthog.identify()` on login, `posthog.capture('user_logged_in')`, `posthog.capture('user_logged_out')`, and `posthog.reset()` on logout.
│
◇  Planning edits for `src/views/LoginView.vue` — Capture `login_failed` event when login throws an error.
│
◇  Planning edits for `src/views/MediaDetailView.vue` — Add `media_detail_viewed`, `trailer_played`, `trailer_closed`, and `media_load_failed` events.
│
◇  Planning edits for `src/views/SearchView.vue` — Add `search_performed` event with query and result count, and `search_failed` on errors.
│
◇  Edited `src/main.js`, `src/composables/useAuth.ts`, `src/views/LoginView.vue`, `src/views/MediaDetailView.vue`, `src/views/SearchView.vue`.
│
◇  Finding and correcting errors.
│
◇  [BENCHMARK] 1.1-edit: 58s, 13 turns, cost: $0.39
  in: 13, out: 3.4K, cache_read: 980.2K, cache_5m: 12.3K, cache_1h: 0
  ctx_out: 82.7K
│
●  [BENCHMARK] Starting phase: 1.2-revise
│
◇  Linting, building and prettying — Build ✓ (no errors).
│
◇  Configured dashboard: https://us.posthog.com/project/198052/dashboard/1296746
│
◇  [BENCHMARK] 1.2-revise: 2m 14s, 13 turns, cost: $0.63
  in: 6.0K, out: 1.7K, cache_read: 1.3M, cache_5m: 50.0K, cache_1h: 0
  ctx_out: 132.7K
│
●  [BENCHMARK] Starting phase: 1.3-conclude
│
◑  Integrating PostHog (1.3-conclude)...│
│
●  ◇ [BENCHMARK] 5 phases in 4m 17s, cost: $1.73
│
●    total in: 3.4M, out: 5.7K, cache_read: 3.3M, cache_5m: 0, cache_1h: 0
│
│
●  ● [BENCHMARK] Summary by phase:
│
●  setup: 23s, 6 turns, cost: $0.27
│    in: 7, out: 316, cache_read: 219.9K, cache_5m: 46.8K, cache_1h: 0
│    ctx_out: 46.8K
│
●  1.0-begin: 32s, 8 turns, cost: $0.25
│    in: 8, out: 338, cache_read: 446.0K, cache_5m: 23.5K, cache_1h: 0
│    ctx_out: 70.4K
│
●  1.1-edit: 58s, 13 turns, cost: $0.43
│    in: 13, out: 3.4K, cache_read: 980.2K, cache_5m: 12.3K, cache_1h: 0
│    ctx_out: 82.7K
│
●  1.2-revise: 2m 14s, 13 turns, cost: $0.69
│    in: 6.0K, out: 1.7K, cache_read: 1.3M, cache_5m: 50.0K, cache_1h: 0
│    ctx_out: 132.7K
│
●  1.3-conclude: 10s, 2 turns, cost: $0.09
│    in: 4, out: 2, cache_read: 265.6K, cache_5m: 1.3K, cache_1h: 0
│    ctx_out: 134.0K
│
│
●  ● [BENCHMARK] Results written to /tmp/posthog-wizard-benchmark.json
◇  PostHog integration complete
│
●  Skipping MCP installation (CI mode)
│
└  
Successfully installed PostHog!

What the agent did:
• Analyzed your Vue project structure
• Created and configured PostHog initializers
• Integrated PostHog into your application
• Added environment variables to .env file

Next steps:
• Start your development server to see PostHog in action
• Visit your PostHog dashboard to see incoming events
• Upload your Project API key to your hosting provider

Learn more: https://posthog.com/docs/libraries/vue

Note: This wizard uses an LLM agent to analyze and modify your 

How did this work for you? Drop us a line: wizard@posthog.com


Benchmark completed in 263.7s
Results: /tmp/posthog-wizard-benchmark.json

Fixed

@gewenyu99 gewenyu99 marked this pull request as ready for review February 21, 2026 04:03
@gewenyu99 gewenyu99 merged commit 0f79a24 into main Feb 24, 2026
14 checks passed
@gewenyu99 gewenyu99 deleted the benchmarking branch February 24, 2026 17:13

2 participants