Conversation

Contributor

@lerouxb lerouxb commented Aug 20, 2025

COMPASS-9609

For those less used to working on compass who might want to test this or work on the prompts:

You need a newish version of node (probably 22) and npm (11-ish); see nvm if you don't have them yet. Clone this repo, switch to this branch (chat-playground), and run npm run bootstrap, which will do npm install followed by a compile (probably not strictly needed, but it should make vscode happier).

Running this locally

You'll need a braintrust API key (from the mongodb-ai-education organisation), then set it with:

export BRAINTRUST_API_KEY=blah

This key is used for braintrust, but also by the braintrust proxy so that we can use other LLMs (gpt-4.1 at this point) to score these results. The proxy functionality is only used by Factuality at the moment.

Then in packages/compass-assistant you can run the following:

npx braintrust eval test/assistant.eval.ts --verbose

Then your results should end up here as a new entry, streaming in while the eval runs. With the temperature set to 0 (see the CHAT_TEMPERATURE and SCORER_TEMPERATURE env vars), the braintrust proxy might even cache some things for us.
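
For example, to pin both temperatures to 0 for a (mostly) deterministic, cacheable run:

export CHAT_TEMPERATURE=0
export SCORER_TEMPERATURE=0
npx braintrust eval test/assistant.eval.ts --verbose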

Overview

The only scorers are Factuality for judging the text and binaryNdcgAtK (totally stolen from the chatbot project) for judging the sources/links. See the autoevals repo for more possibilities.
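
For orientation, a minimal sketch of how these scorers plug into a braintrust Eval(). The string-typed cases and the askAssistant helper are assumptions for illustration; the real assistant.eval.ts is more involved:

import { Eval } from 'braintrust';
import { Factuality } from 'autoevals';
// local scorer copied from the chatbot project; exact call shape assumed here
import { binaryNdcgAtK } from './binaryndcgatk';

Eval('compass-assistant', {
  // each case pairs a prompt with the response we expect
  data: () => [
    {
      input: 'How do I create an index?',
      expected: 'Use db.collection.createIndex({ field: 1 }) ...',
    },
  ],
  // send the prompt to the assistant and return its response text
  task: async (input) => askAssistant(input), // askAssistant is hypothetical
  // Factuality uses an LLM (via the proxy) to judge the text;
  // binaryNdcgAtK compares the returned links against the expected ones
  scores: [Factuality, binaryNdcgAtK],
});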

You'll see that there are broadly two kinds of eval cases: entrypoint prompts and user eval cases. The only entrypoint one I've added so far is the explain plan one. Since generating that prompt takes parameters, I've come up with a way to manage that; open to suggestions for how it should work.
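
To illustrate the difference (case shapes here are illustrative, not the actual code): a user case is a literal question, while an entrypoint case calls a prompt builder:

// a user eval case: a plain question with an expected answer
const userCase = {
  input: 'What is an aggregation pipeline?',
  expected: 'A sequence of stages that documents pass through ...',
};

// an entrypoint eval case: the input is generated from parameters by the
// same code path Compass uses, so the case calls the builder instead of
// hardcoding the text
const explainPlanCase = {
  input: buildPrompt(),
  expected: buildExpected(), // paired builder for the default parameters
};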

Adding more eval cases

Add a file in packages/compass-assistant/test/eval-cases (see the others for inspiration) and then import/register it in the index.ts in that folder. We'll probably add automation around this over time, and I'm still trying to come up with the nicest, most ergonomic layout. Let me know how it goes!
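
A rough sketch of what that looks like (the file name and fields below are assumed; copy the shape from the existing cases):

// test/eval-cases/create-index.ts (hypothetical new case)
export const createIndex = {
  input: 'How do I create an index on a single field?',
  expected:
    'Use db.collection.createIndex({ field: 1 }). MongoDB builds the index ...',
  // links we expect the assistant to cite, scored by binaryNdcgAtK
  expectedSources: ['https://www.mongodb.com/docs/manual/indexes/'],
};

// test/eval-cases/index.ts
export { createIndex } from './create-index';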

Some open questions, probably for later

  • I haven't linked this up with CI yet. I don't think we want to fail anything if the scores drop just yet, and for now our experiments will probably all land in the same pool, so if this runs on every PR it will get cluttered pretty quickly. Still iterating on that.
  • Do we want a way to separate these experiments easily? As mentioned, everything lands in the same pool, and maybe we want to run different ones. Maybe a bunch of things should be more configurable, probably through env vars.
  • I'm still trying to decide on the types for input, output and expected. The simplest they could be (and what they are by default) is plain strings. The chatbot has them as more complicated objects than we have here, but most of that is overkill for us. I have it somewhere in the middle: each contains a messages array, and each message is an object containing a text string, to keep it slightly future-proof (see the sketch after this list). The moment you get any more complicated than a plain string, the braintrust UI's table gets a bit ugly, but I think that's unavoidable because we probably at least want to also pull out the links used so we can check those. This is probably something we'll keep iterating on.
  • I'm using the braintrust CLI for now. We can also run the evals ourselves programmatically, but then I think we're responsible for uploading the results. That would give us more flexibility but requires more work up front; we can also just do that refactor later.
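
A sketch of that middle-ground shape (property names are illustrative, not the actual types):

// each message carries just enough structure to stay future-proof
type EvalMessage = {
  text: string;
};

// input, output and expected all share this shape
type EvalValue = {
  messages: EvalMessage[];
  // the links used in the answer, pulled out so we can score them
  sources?: string[];
};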

@lerouxb lerouxb changed the title WIP: compass assistant eval cases chore(compass-assistant): compass assistant eval cases COMPASS-9609 Aug 20, 2025
@lerouxb lerouxb marked this pull request as ready for review August 20, 2025 15:55
Copilot AI review requested due to automatic review settings August 20, 2025 15:55
@lerouxb lerouxb requested a review from a team as a code owner August 20, 2025 15:55
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds evaluation capabilities for the compass assistant using the Braintrust platform. The evaluation framework allows testing assistant responses against expected outputs with automated scoring.

  • Introduces a complete evaluation framework with test cases for the MongoDB compass assistant
  • Implements custom scoring functions for factuality and source link matching
  • Sets up evaluation test cases covering MongoDB topics like data modeling, aggregation pipelines, and search filtering

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Summary per file:

  • packages/compass-assistant/test/assistant.eval.ts: Main evaluation framework setup with Braintrust integration and scoring functions
  • packages/compass-assistant/test/fuzzylinkmatch.ts: Utility for fuzzy URL matching copied from the chatbot project
  • packages/compass-assistant/test/binaryndcgatk.ts: Binary NDCG@K scoring implementation for evaluating source link relevance
  • packages/compass-assistant/test/eval-cases/*.ts: Test case definitions for various MongoDB topics
  • packages/compass-assistant/test/eval-cases/index.ts: Central export for all evaluation test cases
  • packages/compass-assistant/package.json: Adds dependencies for the autoevals and braintrust packages


@lerouxb lerouxb changed the title chore(compass-assistant): compass assistant eval cases COMPASS-9609 chore(compass-assistant): automated evaluation tests for prompts COMPASS-9609 Aug 20, 2025
"depcheck": "^1.4.1",
"mocha": "^10.2.0",
"nyc": "^15.1.0",
"openai": "^4.104.0",
Contributor Author

There is a 5 already, but its types are different from what autoevals' init() expects, and that's the only place we use this at the moment.

apiKey: process.env.BRAINTRUST_API_KEY,
});

init({ client });
Contributor Author

@lerouxb lerouxb Aug 21, 2025

You can also not use init() and instead pass extra options to the relevant scorers. Not sure which is best; we can easily change it later if we have to. I think there are probably a bunch of ways of configuring or using LLMs as a judge.

},
};

export function buildPrompt(): string {
Contributor Author

I figured over time these functions could grow to take parameters, and then these hardcoded values would just be the defaults, i.e. we could test multiple different explain plans.

The problem is there's a difference between passing in an explain plan and passing in the expected output, given that each plan's output should be quite different. That would probably have to be specified in place of calling buildExpected() and buildExpectedSources() 🤔
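
A sketch of how that could look (option names and defaults here are invented for illustration):

// today's hardcoded values become the defaults
export function buildPrompt({
  explainPlan = DEFAULT_EXPLAIN_PLAN, // hypothetical constant
}: { explainPlan?: string } = {}): string {
  return `Explain this query plan:\n${explainPlan}`;
}

// a case with a custom plan has to supply its own expected output inline,
// because buildExpected()/buildExpectedSources() only match the default plan
const customCase = {
  input: buildPrompt({ explainPlan: someOtherPlan }),
  expected: 'This query does a collection scan because ...',
};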

Contributor

sgtm

Contributor

@gagik gagik left a comment

Looks reasonable to me; if package-lock is intended to be like this then we can go ahead.

  ...(Array.from(bars.values()).map((bar) =>
    // eslint-disable-next-line @typescript-eslint/no-explicit-any
-   bar ? (bar as any).payload.msg.trim().length : 0
+   bar ? ((bar as any).payload.msg || '').trim().length : 0
Contributor

what is this for?

Contributor Author

The package-lock.json changes caused some webpack plugin to update, and now payload.msg is sometimes not a string (undefined or null? can't remember). This ONLY happens in CI, causing every package-compass task to fail. I couldn't reproduce it locally, so I just worked around it; webpack build progress still seems to work. 🤷

"node": ">=6.0.0"
}
},
"node_modules/@asteasolutions/zod-to-openapi": {
Contributor

is this right? 2000 lines of additions?

Contributor Author

Yeah, transitive dependencies. Sergey and I both checked it, and I re-did the package-lock.json changes with three different versions of npm. Seems to be correct 🤷

@lerouxb lerouxb merged commit 6a784a2 into main Aug 22, 2025
82 of 88 checks passed
@lerouxb lerouxb deleted the chat-playground branch August 22, 2025 14:43