chore(compass-assistant): automated evaluation tests for prompts COMPASS-9609 #7216
Conversation
Pull Request Overview
This PR adds evaluation capabilities for the compass assistant using the Braintrust platform. The evaluation framework allows testing assistant responses against expected outputs with automated scoring.
- Introduces a complete evaluation framework with test cases for the MongoDB compass assistant
- Implements custom scoring functions for factuality and source link matching
- Sets up evaluation test cases covering MongoDB topics like data modeling, aggregation pipelines, and search filtering
Reviewed Changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/compass-assistant/test/assistant.eval.ts | Main evaluation framework setup with Braintrust integration and scoring functions |
| packages/compass-assistant/test/fuzzylinkmatch.ts | Utility for fuzzy URL matching copied from the chatbot project |
| packages/compass-assistant/test/binaryndcgatk.ts | Binary NDCG@K scoring implementation for evaluating source link relevance |
| packages/compass-assistant/test/eval-cases/*.ts | Test case definitions for various MongoDB topics |
| packages/compass-assistant/test/eval-cases/index.ts | Central export for all evaluation test cases |
| packages/compass-assistant/package.json | Adds dependencies for the autoevals and braintrust packages |
| "depcheck": "^1.4.1", | ||
| "mocha": "^10.2.0", | ||
| "nyc": "^15.1.0", | ||
| "openai": "^4.104.0", |
There is a v5 already, but the types are different from what autoevals' init() expects, and that's the only place we use this at the moment.
```ts
  apiKey: process.env.BRAINTRUST_API_KEY,
});

init({ client });
```
You can also skip init() entirely and pass extra options to the relevant scorers instead. Not sure which is best; we can easily change it later if we have to. There are probably a bunch of ways of configuring or using LLMs as a judge.
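If we went that route, a minimal sketch might look like the following, assuming the scorers accept a per-call `client` option (which is what `init({ client })` otherwise sets globally; the proxy URL is Braintrust's documented one):

```ts
import { Factuality } from 'autoevals';
import { OpenAI } from 'openai';

// The Braintrust proxy speaks the OpenAI API, so an OpenAI client pointed
// at it can back the LLM-as-a-judge scorers.
const client = new OpenAI({
  baseURL: 'https://api.braintrust.dev/v1/proxy',
  apiKey: process.env.BRAINTRUST_API_KEY,
});

async function scoreFactuality(input: string, output: string, expected: string) {
  // Assumption: a per-call `client` option is accepted in place of init().
  return Factuality({ input, output, expected, client });
}
```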
```ts
  },
};

export function buildPrompt(): string {
```
I figured over time these functions could grow to take parameters, and then these hardcoded values would just be the defaults, i.e. we could test multiple different explain plans.
The problem is that there's a difference between passing in an explain plan and passing in the expected output, given that each plan's output should be quite different. That would probably have to be specified in place of calling buildExpected() and buildExpectedSources() 🤔
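A hedged sketch of how that might look (all names here are illustrative, not the PR's actual API):

```ts
// Hypothetical: each explain plan travels with its own expected output and
// sources, replacing separate buildExpected()/buildExpectedSources() calls.
interface ExplainPlanEvalCase {
  plan: Record<string, unknown>;
  expected: string;
  expectedSources: string[];
}

const DEFAULT_CASE: ExplainPlanEvalCase = {
  plan: { queryPlanner: { winningPlan: { stage: 'COLLSCAN' } } },
  expected:
    'The query performs a full collection scan; consider adding an index.',
  expectedSources: [
    'https://www.mongodb.com/docs/manual/reference/explain-results/',
  ],
};

export function buildPrompt(
  { plan }: ExplainPlanEvalCase = DEFAULT_CASE
): string {
  return `Interpret this explain plan:\n${JSON.stringify(plan, null, 2)}`;
}
```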
sgtm
gagik left a comment
Looks reasonable to me; if package-lock is intended to be like this then we can go ahead.
```diff
 ...(Array.from(bars.values()).map((bar) =>
   // eslint-disable-next-line @typescript-eslint/no-explicit-any
-  bar ? (bar as any).payload.msg.trim().length : 0
+  bar ? ((bar as any).payload.msg || '').trim().length : 0
```
what is this for?
The package-lock.json changes caused some webpack plugin to update, and now payload.msg is sometimes not a string (undefined or null? can't remember). This ONLY happens in CI, causing every package-compass task to fail. I couldn't reproduce it locally, so I just worked around it. webpack build progress still seems to work. 🤷
| "node": ">=6.0.0" | ||
| } | ||
| }, | ||
| "node_modules/@asteasolutions/zod-to-openapi": { |
is this right? 2000 lines of additions?
yeah.. transitive dependencies. Sergey and I both checked it; I re-did the package-lock.json changes with three different versions of npm. Seems to be correct 🤷
COMPASS-9609
For those less used to working on compass who might want to test this or work on the prompts

You need a newish version of node (probably 22) and npm (11-ish). See nvm if you don't have it yet. Clone this repo, switch to this branch (chat-playground), then run `npm run bootstrap`, which will do `npm install` followed by a compile (probably not strictly needed, but should make vscode happier).

Running this locally
You'll need a braintrust API key (here somewhere, mongodb-ai-education organisation), then set it with something like:
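```shell
# The eval reads the key from this env var, matching
# `process.env.BRAINTRUST_API_KEY` in assistant.eval.ts.
export BRAINTRUST_API_KEY=<your-api-key>
```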
This key is used for braintrust, but also by the braintrust proxy so that we can use other LLMs (gpt-4.1 at this point) to score these results. The proxy functionality is only used by Factuality at the moment.
Then in `packages/compass-assistant` you can run the eval script (the script name below is a guess; check that package's package.json):
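```shell
# "eval" as a script name is an assumption; the temperature env vars are
# optional (see the note below about CHAT_TEMPERATURE/SCORER_TEMPERATURE).
CHAT_TEMPERATURE=0 SCORER_TEMPERATURE=0 npm run eval
```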
Then your results should end up here as a new entry. They should stream in while it runs. With the temperature set to 0 (see the CHAT_TEMPERATURE and SCORER_TEMPERATURE env vars) the braintrust proxy might even cache some things for us.
Overview
The only scorers are Factuality for judging the text and binaryNdcgAtK (totally stolen from the chatbot project) for judging the sources/links. See the autoevals repo for more possibilities.
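For reference, binary NDCG@K treats each returned link as relevant (1) if it fuzzy-matches any expected link and irrelevant (0) otherwise, discounts hits by rank, and normalizes by the ideal ordering. A minimal sketch, assuming a boolean `fuzzyLinkMatch(expected, actual)` helper like the one in fuzzylinkmatch.ts (the real implementation in binaryndcgatk.ts may differ):

```ts
import { fuzzyLinkMatch } from './fuzzylinkmatch';

export function binaryNdcgAtK(
  expectedLinks: string[],
  actualLinks: string[],
  k: number
): number {
  // DCG over the top-k returned links; rank positions are 1-based,
  // so position i contributes 1 / log2(i + 2) when relevant.
  const dcg = actualLinks.slice(0, k).reduce((sum, link, i) => {
    const relevant = expectedLinks.some((e) => fuzzyLinkMatch(e, link));
    return sum + (relevant ? 1 / Math.log2(i + 2) : 0);
  }, 0);

  // Ideal DCG: every expected link (up to k of them) ranked at the top.
  let idcg = 0;
  for (let i = 0; i < Math.min(expectedLinks.length, k); i++) {
    idcg += 1 / Math.log2(i + 2);
  }
  return idcg === 0 ? 0 : dcg / idcg;
}
```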
You'll see that there are broadly two kinds of eval cases: entrypoint prompts and user eval cases. The only entrypoint one I've added so far is the explain plan one. Since generating that prompt takes parameters I've come up with a way to manage that. Open to suggestions for how that should work.
Adding more eval cases
Add a file in `packages/compass-assistant/test/eval-cases` (see others for inspiration) and then import/register it in the index.ts in that folder. A sketch of what a case file might look like (field names are hypothetical; check assistant.eval.ts and the existing cases for the actual shape):
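```ts
// packages/compass-assistant/test/eval-cases/my-topic.ts
// Hypothetical field names for illustration only.
export const myTopicCases = [
  {
    name: 'group documents in an aggregation',
    input: 'How do I group documents by a field in an aggregation pipeline?',
    expected:
      'Use the $group stage with an _id expression, e.g. { $group: { _id: "$status", count: { $sum: 1 } } }.',
    expectedSources: [
      'https://www.mongodb.com/docs/manual/reference/operator/aggregation/group/',
    ],
  },
];

// packages/compass-assistant/test/eval-cases/index.ts
export { myTopicCases } from './my-topic';
```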
We'll probably add automation around this over time and I'm still trying to come up with the nicest, most ergonomic layout. Let me know how it goes!

Some open questions, probably for later