chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 #341
Merged
Changes from 84 commits
91 commits
f63e48a chore: LangChain based accuracy tests (himanshusinghs)
7efe7be chore: use vercel AI SDK instead of langchain (himanshusinghs)
6f7b99a chore: integrate capturing accuracy snapshots (himanshusinghs)
add4204 chore: correct env names (himanshusinghs)
f0c1d38 chore: more consolidated prompt tests (himanshusinghs)
8fe4942 chore: add a few more tests and some more models (himanshusinghs)
d220f22 chore: add AzureOpenAI model in the model list (himanshusinghs)
1c58427 chore: use ListDatabasesTool response creator for tests (himanshusinghs)
5ce954e chore: use ListCollectionsTool response creators in tests (himanshusinghs)
cfce256 chore: tests for collection-indexes tool (himanshusinghs)
c3a0a72 modify prompt for list-collections prompt and log tools provided (himanshusinghs)
c71ac44 chore: have mock generators return Promise of ToolResult as well (himanshusinghs)
f6a8fcd chore: tests for collection-schema tool (himanshusinghs)
ed0a6da chore: do not fail tests on dropped accuracy (himanshusinghs)
c6da0b5 chore: added tests for find tool (himanshusinghs)
774640b chore: tests for insert-many tool (himanshusinghs)
6e894bc chore: tests for delete-many tool (himanshusinghs)
942bfc0 chore: add oepnai provider (himanshusinghs)
34bd4c2 chore: fixes accuracy scorer for position independent matching (himanshusinghs)
537fe2a chore: replace mock mcp client with real (mockable) mcp client (himanshusinghs)
0bd9167 chore: moved all existing tests to vercel mcp client (himanshusinghs)
efefd9d chore: adds tests for the rest of the tools (himanshusinghs)
06422a7 chore: adds missed out tests for tools (himanshusinghs)
6039b1d chore: MongoDB based snapshot storage for accuracy runs (himanshusinghs)
8b39a1c chore: remove file based snapshot (himanshusinghs)
ca49d40 wip: snapshot summary generator (himanshusinghs)
92413df chore: single entry point for running accuracy tests with different c… (himanshusinghs)
8c50ecf chore: reformat (himanshusinghs)
8c8a25b chore: lint fixes (himanshusinghs)
ebe14d5 chore: simplified toolCallingAccuracy calculation (himanshusinghs)
ad316f7 chore: account for types moved around (himanshusinghs)
b34f6bc chore: adds accuracyRunStatus to snapshot entries (himanshusinghs)
815952d chore: add disk based accuracy storage for local runs (himanshusinghs)
5c99f85 chore: revert changes done to any of the src files (himanshusinghs)
0d6938a chore: handle test failures and appropriately mark them as failed in … (himanshusinghs)
cbb137a chore: make snapshot storage independent of accuracyRunId and commitSHA (himanshusinghs)
9321563 chore: bail on first failure and add some explanation for update-accu… (himanshusinghs)
f636c3f chore: refactor to make tests writing simpler and other QOL improveme… (himanshusinghs)
ebcc19d chore: generate accuracy test summary post test (himanshusinghs)
b1bf731 chore: add Github workflow to trigger test runs (himanshusinghs)
2e08208 chore: fix permissions issue (himanshusinghs)
509a23c chore: bring back packages post merge (himanshusinghs)
be957b5 chore: update report generation to include comparison with baseline a… (himanshusinghs)
bad3012 Update .github/workflows/accuracy-tests.yml (himanshusinghs)
bc6e755 Update .github/workflows/accuracy-tests.yml (himanshusinghs)
3e094fa Update .github/workflows/accuracy-tests.yml (himanshusinghs)
dca7217 Update .github/workflows/accuracy-tests.yml (himanshusinghs)
05c81c0 chore: secrets as per conventions (himanshusinghs)
e47922f chore: updated how we store accuracy result (himanshusinghs)
fe47c61 chore: move accuracy scripts inside accuracy (himanshusinghs)
727be10 chore: addresses more PR feedback (himanshusinghs)
a0b9802 chore: use @ai-sdk/google (himanshusinghs)
f4ddec2 chore: use npm script in ci (himanshusinghs)
ea25ac5 chore: shift only when arguments are passed to the script (himanshusinghs)
d50824d chore: azure url is on vars (himanshusinghs)
772a0a3 chore: use env vars for mongo namespace (himanshusinghs)
1c2295a chore: ensure the generated asset directory is present (himanshusinghs)
a3ba9e0 chore: generate a markdown brief for PR comments (himanshusinghs)
bf0e696 chore: use lockfile for updating local test results (himanshusinghs)
e845e1a chore: make expectedToolCalls part of PromptResult (himanshusinghs)
4f41af5 chore: make omitted fields a const (himanshusinghs)
e421125 chore: update formatRunStatus as per feedback (himanshusinghs)
2c2c428 chore: move saveModelResponseForPromptAtomic to atomic update pipeline (himanshusinghs)
34214ad chore: prefer exclusive reads for public interface (himanshusinghs)
508f906 chore: minor refactor of disk-storage (#370) (nirinchev)
d3f1f73 chore: simplify getAccuracyResult (himanshusinghs)
ea127bf chore: simplified the update pipeline and added tool call serialization (himanshusinghs)
acba3b4 chore: use $literal instead of serializing the tool calls (himanshusinghs)
f0d9c79 chore: don't import what is not used (himanshusinghs)
7798eb1 chore: should use $literal also for expectedToolCalls (himanshusinghs)
f303bb4 chore: should recreate comment and hide previous one (himanshusinghs)
eb24505 chore: rebase fixes and move to vitest (himanshusinghs)
8db0e6f chore: run unit and integration for test script (himanshusinghs)
83157d3 chore: PR feedback (himanshusinghs)
6c57c38 chore: add return type annotation for accuracy testing client (himanshusinghs)
ba37196 chore: update test file names per naming convention (himanshusinghs)
c2a51fd chore: update sdk file names per naming convention (himanshusinghs)
a66553b chore: update accuracy file name per convention (himanshusinghs)
ab99613 chore: move test config out of functions (himanshusinghs)
093ebcf chore: move left out test config out of functions (himanshusinghs)
8496b03 chore: remove unused func (himanshusinghs)
4bbcba1 chore: remove orphan checks (himanshusinghs)
7c3061d chore: update the test prompt (himanshusinghs)
ec52ee5 chore: allow adding custom parameter scorers (himanshusinghs)
743cbfa chore: ts fixes (himanshusinghs)
3491a3b fix: tweak the arg shapes to improve tool accuracy (#381) (nirinchev)
2909e8a Replace the matcher framework (nirinchev)
49bfac4 remove microdiff (nirinchev)
356512b fix tests (nirinchev)
8a5a9d2 don't omit fields for MongoDB storage (nirinchev)
2d4e750 fix test coverage (nirinchev)
@@ -0,0 +1,55 @@
name: Accuracy Tests

on:
  workflow_dispatch:
  push:
    branches:
      - main
  pull_request:
    types:
      - labeled

jobs:
  run-accuracy-tests:
    name: Run Accuracy Tests
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    if: |
      github.event_name == 'workflow_dispatch' ||
      (github.event_name == 'pull_request' && github.event.label.name == 'accuracy-tests')
    env:
      MDB_OPEN_AI_API_KEY: ${{ secrets.ACCURACY_OPEN_AI_API_KEY }}
      MDB_GEMINI_API_KEY: ${{ secrets.ACCURACY_GEMINI_API_KEY }}
      MDB_AZURE_OPEN_AI_API_KEY: ${{ secrets.ACCURACY_AZURE_OPEN_AI_API_KEY }}
      MDB_AZURE_OPEN_AI_API_URL: ${{ vars.ACCURACY_AZURE_OPEN_AI_API_URL }}
      MDB_ACCURACY_MDB_URL: ${{ secrets.ACCURACY_MDB_CONNECTION_STRING }}
      MDB_ACCURACY_MDB_DB: ${{ vars.ACCURACY_MDB_DB }}
      MDB_ACCURACY_MDB_COLLECTION: ${{ vars.ACCURACY_MDB_COLLECTION }}
      MDB_ACCURACY_BASELINE_COMMIT: ${{ github.event.pull_request.base.sha || '' }}
    steps:
      - uses: GitHubSecurityLab/actions-permissions/monitor@v1
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version-file: package.json
          cache: "npm"
      - name: Install dependencies
        run: npm ci
      - name: Run accuracy tests
        run: npm run test:accuracy
      - name: Upload accuracy test summary
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: accuracy-test-summary
          path: .accuracy/test-summary.html
      - name: Comment summary on PR
        if: github.event_name == 'pull_request' && github.event.label.name == 'accuracy-tests'
        uses: marocchino/sticky-pull-request-comment@d2ad0de260ae8b0235ce059e63f2949ba9e05943 # v2
        with:
          # Hides the previous comment and add a comment at the end
          hide_and_recreate: true
          hide_classify: "OUTDATED"
          path: .accuracy/test-brief.md
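The job-level `if:` expression in the workflow above is easier to reason about when written out as a plain predicate. The sketch below is only a model of that expression, not code from this PR: `TriggerEvent`, `shouldRunAccuracyTests`, and `shouldCommentOnPr` are hypothetical names, and the event fields stand in for `github.event_name` and `github.event.label.name`.

```typescript
// Hypothetical model of the workflow's `if:` conditions. None of these names
// exist in the PR; they only mirror the expressions shown in the YAML.
type EventName = "workflow_dispatch" | "push" | "pull_request";

interface TriggerEvent {
  eventName: EventName; // stands in for github.event_name
  labelName?: string; // stands in for github.event.label.name
}

// Mirrors the job-level condition: manual dispatch, or a pull request event
// whose label is `accuracy-tests`.
function shouldRunAccuracyTests(event: TriggerEvent): boolean {
  return (
    event.eventName === "workflow_dispatch" ||
    (event.eventName === "pull_request" && event.labelName === "accuracy-tests")
  );
}

// Mirrors the condition on the "Comment summary on PR" step.
function shouldCommentOnPr(event: TriggerEvent): boolean {
  return event.eventName === "pull_request" && event.labelName === "accuracy-tests";
}
```

Read this way, the expression as written has no branch for `push`. A push to `main` would start the workflow, but the job condition would evaluate to false unless the expression is extended.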
@@ -11,3 +11,5 @@ state.json

tests/tmp
coverage
# Generated assets by accuracy runs
.accuracy
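The commit titles mention both MongoDB-based storage and disk-based storage for local runs, and the workflow passes the MongoDB namespace through `MDB_ACCURACY_MDB_*` variables. The sketch below shows one way such a choice could be made from the environment. It is an assumption, not the PR's implementation: `resolveStorageConfig`, `StorageConfig`, and the rule "use MongoDB only when all three variables are set, otherwise fall back to `.accuracy`" are hypothetical.

```typescript
// Hypothetical helper, not taken from the PR. It only illustrates how the
// MDB_ACCURACY_MDB_* variables from the workflow could select a storage backend.
type Env = Record<string, string | undefined>;

type StorageConfig =
  | { kind: "mongodb"; url: string; db: string; collection: string }
  | { kind: "disk"; directory: string };

function resolveStorageConfig(env: Env): StorageConfig {
  const url = env.MDB_ACCURACY_MDB_URL;
  const db = env.MDB_ACCURACY_MDB_DB;
  const collection = env.MDB_ACCURACY_MDB_COLLECTION;

  // Assumed rule: MongoDB storage needs the connection string and the full namespace.
  if (url && db && collection) {
    return { kind: "mongodb", url, db, collection };
  }

  // Assumed fallback: local runs write under the git-ignored `.accuracy` directory.
  return { kind: "disk", directory: ".accuracy" };
}
```

Taking the environment as a parameter, rather than reading `process.env` inside the function, keeps the sketch easy to exercise in tests; a caller in Node.js could pass `process.env` directly.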