Conversation

Contributor

@lerouxb lerouxb commented Aug 20, 2025

COMPASS-9609

For those less used to working on compass who might want to test this or work on the prompts:

You need a newish version of node (probably 22) and npm (11-ish); see nvm if you don't have them yet. Clone this repo, switch to this branch (chat-playground), and run npm run bootstrap, which will do npm install followed by a compile (probably not strictly needed, but it should make vscode happier).

Running this locally

You'll need a braintrust API key (from the mongodb-ai-education organisation), then set it with:

export BRAINTRUST_API_KEY=blah

This key is used for braintrust, but also by the braintrust proxy so that we can use other LLMs (gpt-4.1 at this point) to score these results. The proxy functionality is only used by Factuality at the moment.

Then in packages/compass-assistant you can run the following:

npx braintrust eval test/assistant.eval.ts --verbose

Then your results should end up here as a new entry, streaming in while the eval runs. With the temperature set to 0 (see the CHAT_TEMPERATURE and SCORER_TEMPERATURE env vars), the braintrust proxy might even cache some things for us.
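
For example, to pin both temperatures to 0 for a (mostly) deterministic, cacheable run:

export CHAT_TEMPERATURE=0
export SCORER_TEMPERATURE=0
npx braintrust eval test/assistant.eval.ts --verbose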

Overview

The only scorers are Factuality for judging the text and binaryNdcgAtK (totally stolen from the chatbot project) for judging the sources/links. See the autoevals repo for more possibilities.
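
For orientation, a minimal sketch of how these scorers plug into a braintrust Eval(). The string-typed cases and the askAssistant helper are assumptions for illustration; the real assistant.eval.ts is more involved:

import { Eval } from 'braintrust';
import { Factuality } from 'autoevals';
// local scorer copied from the chatbot project; exact call shape assumed here
import { binaryNdcgAtK } from './binaryndcgatk';

Eval('compass-assistant', {
  // each case pairs a prompt with the response we expect
  data: () => [
    {
      input: 'How do I create an index?',
      expected: 'Use db.collection.createIndex({ field: 1 }) ...',
    },
  ],
  // send the prompt to the assistant and return its response text
  task: async (input) => askAssistant(input), // askAssistant is hypothetical
  // Factuality uses an LLM (via the proxy) to judge the text;
  // binaryNdcgAtK compares the returned links against the expected ones
  scores: [Factuality, binaryNdcgAtK],
});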

You'll see that there are broadly two kinds of eval cases: entrypoint prompts and user eval cases. The only entrypoint one I've added so far is the explain plan one. Since generating that prompt takes parameters, I've come up with a way to manage that; open to suggestions for how it should work.
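
To illustrate the difference (case shapes here are illustrative, not the actual code): a user case is a literal question, while an entrypoint case calls a prompt builder:

// a user eval case: a plain question with an expected answer
const userCase = {
  input: 'What is an aggregation pipeline?',
  expected: 'A sequence of stages that documents pass through ...',
};

// an entrypoint eval case: the input is generated from parameters by the
// same code path Compass uses, so the case calls the builder instead of
// hardcoding the text
const explainPlanCase = {
  input: buildPrompt(),
  expected: buildExpected(), // paired builder for the default parameters
};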

Adding more eval cases

Add a file in packages/compass-assistant/test/eval-cases (see the others for inspiration) and then import/register it in the index.ts in that folder. We'll probably add automation around this over time, and I'm still trying to come up with the nicest, most ergonomic layout. Let me know how it goes!
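
A rough sketch of what that looks like (the file name and fields below are assumed; copy the shape from the existing cases):

// test/eval-cases/create-index.ts (hypothetical new case)
export const createIndex = {
  input: 'How do I create an index on a single field?',
  expected:
    'Use db.collection.createIndex({ field: 1 }). MongoDB builds the index ...',
  // links we expect the assistant to cite, scored by binaryNdcgAtK
  expectedSources: ['https://www.mongodb.com/docs/manual/indexes/'],
};

// test/eval-cases/index.ts
export { createIndex } from './create-index';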

Some open questions, probably for later

  • I haven't linked this up with CI yet. I don't think we want to fail anything if the scores drop just yet, and for now our experiments will probably all land in the same pool, so if this runs on every PR it will get cluttered pretty quickly. Still iterating on that.
  • Do we want a way to separate these experiments easily? As mentioned, everything lands in the same pool, and maybe we want to run different ones. Maybe a bunch of things should be more configurable, probably through env vars.
  • I'm still trying to decide on the types for input, output and expected. The simplest they could be (and what they are by default) is plain strings. The chatbot has them as more complicated objects than we have here, but most of that is overkill for us. I have it somewhere in the middle: each contains a messages array, and each message is an object containing a text string, to keep it slightly future-proof (see the sketch after this list). The moment you get any more complicated than a plain string, the braintrust UI's table gets a bit ugly, but I think that's unavoidable because we probably at least want to also pull out the links used so we can check those. This is probably something we'll keep iterating on.
  • I'm using the braintrust CLI for now. We can also run the evals ourselves programmatically, but then I think we're responsible for uploading the results. That would give us more flexibility but requires more work up front; we can also just do that refactor later.
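
A sketch of that middle-ground shape (property names are illustrative, not the actual types):

// each message carries just enough structure to stay future-proof
type EvalMessage = {
  text: string;
};

// input, output and expected all share this shape
type EvalValue = {
  messages: EvalMessage[];
  // the links used in the answer, pulled out so we can score them
  sources?: string[];
};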

@lerouxb lerouxb changed the title WIP: compass assistant eval cases chore(compass-assistant): compass assistant eval cases COMPASS-9609 Aug 20, 2025
@lerouxb lerouxb marked this pull request as ready for review August 20, 2025 15:55
Copilot AI review requested due to automatic review settings August 20, 2025 15:55
@lerouxb lerouxb requested a review from a team as a code owner August 20, 2025 15:55
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds evaluation capabilities for the compass assistant using the Braintrust platform. The evaluation framework allows testing assistant responses against expected outputs with automated scoring.

  • Introduces a complete evaluation framework with test cases for the MongoDB compass assistant
  • Implements custom scoring functions for factuality and source link matching
  • Sets up evaluation test cases covering MongoDB topics like data modeling, aggregation pipelines, and search filtering

Reviewed Changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Summary per file:

  • packages/compass-assistant/test/assistant.eval.ts: Main evaluation framework setup with Braintrust integration and scoring functions
  • packages/compass-assistant/test/fuzzylinkmatch.ts: Utility for fuzzy URL matching copied from the chatbot project
  • packages/compass-assistant/test/binaryndcgatk.ts: Binary NDCG@K scoring implementation for evaluating source link relevance
  • packages/compass-assistant/test/eval-cases/*.ts: Test case definitions for various MongoDB topics
  • packages/compass-assistant/test/eval-cases/index.ts: Central export for all evaluation test cases
  • packages/compass-assistant/package.json: Adds dependencies for the autoevals and braintrust packages


@lerouxb lerouxb changed the title chore(compass-assistant): compass assistant eval cases COMPASS-9609 chore(compass-assistant): automated evaluation tests for prompts COMPASS-9609 Aug 20, 2025
"depcheck": "^1.4.1",
"mocha": "^10.2.0",
"nyc": "^15.1.0",
"openai": "^4.104.0",
Contributor Author

There is a 5 already, but its types are different from what autoevals' init() expects, and that's the only place we use this at the moment.

apiKey: process.env.BRAINTRUST_API_KEY,
});

init({ client });
Contributor Author

@lerouxb lerouxb Aug 21, 2025

You can also not use init() and instead pass extra options to the relevant scorers. Not sure which is best; we can easily change it later if we have to. I think there are probably a bunch of ways of configuring or using LLMs as a judge.

},
};

export function buildPrompt(): string {
Contributor Author

I figured over time these functions could grow to take parameters, and then these hardcoded values would just be the defaults, i.e. we could test multiple different explain plans.

The problem is there's a difference between passing in an explain plan and passing in the expected output, given that each plan's output should be quite different. That would probably have to be specified in place of calling buildExpected() and buildExpectedSources() 🤔
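
A sketch of how that could look (option names and defaults here are invented for illustration):

// today's hardcoded values become the defaults
export function buildPrompt({
  explainPlan = DEFAULT_EXPLAIN_PLAN, // hypothetical constant
}: { explainPlan?: string } = {}): string {
  return `Explain this query plan:\n${explainPlan}`;
}

// a case with a custom plan has to supply its own expected output inline,
// because buildExpected()/buildExpectedSources() only match the default plan
const customCase = {
  input: buildPrompt({ explainPlan: someOtherPlan }),
  expected: 'This query does a collection scan because ...',
};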

Contributor

sgtm

Contributor

@gagik gagik left a comment

Looks reasonable to me; if package-lock is intended to be like this then we can go ahead.

  ...(Array.from(bars.values()).map((bar) =>
    // eslint-disable-next-line @typescript-eslint/no-explicit-any
-   bar ? (bar as any).payload.msg.trim().length : 0
+   bar ? ((bar as any).payload.msg || '').trim().length : 0
Contributor

what is this for?

Contributor Author

The package-lock.json changes caused some webpack plugin to update, and now payload.msg is sometimes not a string (undefined or null? can't remember). This ONLY happens in CI, causing every package-compass task to fail. I couldn't reproduce it locally, so I just worked around it; webpack build progress still seems to work. 🤷

"node": ">=6.0.0"
}
},
"node_modules/@asteasolutions/zod-to-openapi": {
Contributor

is this right? 2000 lines of additions?

Contributor Author

Yeah, transitive dependencies. Sergey and I both checked it, and I re-did the package-lock.json changes with three different versions of npm. Seems to be correct 🤷

@lerouxb lerouxb merged commit 6a784a2 into main Aug 22, 2025
82 of 88 checks passed
@lerouxb lerouxb deleted the chat-playground branch August 22, 2025 14:43