chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 #341

Merged
merged 91 commits into from
Jul 18, 2025
Changes from 84 commits
Commits
f63e48a
chore: LangChain based accuracy tests
himanshusinghs Jun 28, 2025
7efe7be
chore: use vercel AI SDK instead of langchain
himanshusinghs Jun 30, 2025
6f7b99a
chore: integrate capturing accuracy snapshots
himanshusinghs Jun 30, 2025
add4204
chore: correct env names
himanshusinghs Jun 30, 2025
f0c1d38
chore: more consolidated prompt tests
himanshusinghs Jun 30, 2025
8fe4942
chore: add a few more tests and some more models
himanshusinghs Jun 30, 2025
d220f22
chore: add AzureOpenAI model in the model list
himanshusinghs Jul 1, 2025
1c58427
chore: use ListDatabasesTool response creator for tests
himanshusinghs Jul 1, 2025
5ce954e
chore: use ListCollectionsTool response creators in tests
himanshusinghs Jul 1, 2025
cfce256
chore: tests for collection-indexes tool
himanshusinghs Jul 1, 2025
c3a0a72
modify prompt for list-collections prompt and log tools provided
himanshusinghs Jul 1, 2025
c71ac44
chore: have mock generators return Promise of ToolResult as well
himanshusinghs Jul 1, 2025
f6a8fcd
chore: tests for collection-schema tool
himanshusinghs Jul 1, 2025
ed0a6da
chore: do not fail tests on dropped accuracy
himanshusinghs Jul 1, 2025
c6da0b5
chore: added tests for find tool
himanshusinghs Jul 1, 2025
774640b
chore: tests for insert-many tool
himanshusinghs Jul 3, 2025
6e894bc
chore: tests for delete-many tool
himanshusinghs Jul 3, 2025
942bfc0
chore: add oepnai provider
himanshusinghs Jul 3, 2025
34bd4c2
chore: fixes accuracy scorer for position independent matching
himanshusinghs Jul 4, 2025
537fe2a
chore: replace mock mcp client with real (mockable) mcp client
himanshusinghs Jul 4, 2025
0bd9167
chore: moved all existing tests to vercel mcp client
himanshusinghs Jul 6, 2025
efefd9d
chore: adds tests for the rest of the tools
himanshusinghs Jul 7, 2025
06422a7
chore: adds missed out tests for tools
himanshusinghs Jul 7, 2025
6039b1d
chore: MongoDB based snapshot storage for accuracy runs
himanshusinghs Jul 8, 2025
8b39a1c
chore: remove file based snapshot
himanshusinghs Jul 8, 2025
ca49d40
wip: snapshot summary generator
himanshusinghs Jul 8, 2025
92413df
chore: single entry point for running accuracy tests with different c…
himanshusinghs Jul 8, 2025
8c50ecf
chore: reformat
himanshusinghs Jul 8, 2025
8c8a25b
chore: lint fixes
himanshusinghs Jul 8, 2025
ebe14d5
chore: simplified toolCallingAccuracy calculation
himanshusinghs Jul 8, 2025
ad316f7
chore: account for types moved around
himanshusinghs Jul 8, 2025
b34f6bc
chore: adds accuracyRunStatus to snapshot entries
himanshusinghs Jul 8, 2025
815952d
chore: add disk based accuracy storage for local runs
himanshusinghs Jul 8, 2025
5c99f85
chore: revert changes done to any of the src files
himanshusinghs Jul 8, 2025
0d6938a
chore: handle test failures and appropriately mark them as failed in …
himanshusinghs Jul 8, 2025
cbb137a
chore: make snapshot storage independent of accuracyRunId and commitSHA
himanshusinghs Jul 9, 2025
9321563
chore: bail on first failure and add some explanation for update-accu…
himanshusinghs Jul 9, 2025
f636c3f
chore: refactor to make tests writing simpler and other QOL improveme…
himanshusinghs Jul 9, 2025
ebcc19d
chore: generate accuracy test summary post test
himanshusinghs Jul 10, 2025
b1bf731
chore: add Github workflow to trigger test runs
himanshusinghs Jul 10, 2025
2e08208
chore: fix permissions issue
himanshusinghs Jul 10, 2025
509a23c
chore: bring back packages post merge
himanshusinghs Jul 10, 2025
be957b5
chore: update report generation to include comparison with baseline a…
himanshusinghs Jul 10, 2025
bad3012
Update .github/workflows/accuracy-tests.yml
himanshusinghs Jul 11, 2025
bc6e755
Update .github/workflows/accuracy-tests.yml
himanshusinghs Jul 11, 2025
3e094fa
Update .github/workflows/accuracy-tests.yml
himanshusinghs Jul 11, 2025
dca7217
Update .github/workflows/accuracy-tests.yml
himanshusinghs Jul 11, 2025
05c81c0
chore: secrets as per conventions
himanshusinghs Jul 11, 2025
e47922f
chore: updated how we store accuracy result
himanshusinghs Jul 13, 2025
fe47c61
chore: move accuracy scripts inside accuracy
himanshusinghs Jul 13, 2025
727be10
chore: addresses more PR feedback
himanshusinghs Jul 13, 2025
a0b9802
chore: use @ai-sdk/google
himanshusinghs Jul 13, 2025
f4ddec2
chore: use npm script in ci
himanshusinghs Jul 14, 2025
ea25ac5
chore: shift only when arguments are passed to the script
himanshusinghs Jul 14, 2025
d50824d
chore: azure url is on vars
himanshusinghs Jul 14, 2025
772a0a3
chore: use env vars for mongo namespace
himanshusinghs Jul 14, 2025
1c2295a
chore: ensure the generated asset directory is present
himanshusinghs Jul 14, 2025
a3ba9e0
chore: generate a markdown brief for PR comments
himanshusinghs Jul 14, 2025
bf0e696
chore: use lockfile for updating local test results
himanshusinghs Jul 14, 2025
e845e1a
chore: make expectedToolCalls part of PromptResult
himanshusinghs Jul 14, 2025
4f41af5
chore: make omitted fields a const
himanshusinghs Jul 14, 2025
e421125
chore: update formatRunStatus as per feedback
himanshusinghs Jul 14, 2025
2c2c428
chore: move saveModelResponseForPromptAtomic to atomic update pipeline
himanshusinghs Jul 14, 2025
34214ad
chore: prefer exclusive reads for public interface
himanshusinghs Jul 15, 2025
508f906
chore: minor refactor of disk-storage (#370)
nirinchev Jul 15, 2025
d3f1f73
chore: simplify getAccuracyResult
himanshusinghs Jul 15, 2025
ea127bf
chore: simplified the update pipeline and added tool call serialization
himanshusinghs Jul 15, 2025
acba3b4
chore: use $literal instead of serializing the tool calls
himanshusinghs Jul 15, 2025
f0d9c79
chore: don't import what is not used
himanshusinghs Jul 15, 2025
7798eb1
chore: should use $literal also for expectedToolCalls
himanshusinghs Jul 15, 2025
f303bb4
chore: should recreate comment and hide previous one
himanshusinghs Jul 15, 2025
eb24505
chore: rebase fixes and move to vitest
himanshusinghs Jul 16, 2025
8db0e6f
chore: run unit and integration for test script
himanshusinghs Jul 16, 2025
83157d3
chore: PR feedback
himanshusinghs Jul 16, 2025
6c57c38
chore: add return type annotation for accuracy testing client
himanshusinghs Jul 17, 2025
ba37196
chore: update test file names per naming convention
himanshusinghs Jul 17, 2025
c2a51fd
chore: update sdk file names per naming convention
himanshusinghs Jul 17, 2025
a66553b
chore: update accuracy file name per convention
himanshusinghs Jul 17, 2025
ab99613
chore: move test config out of functions
himanshusinghs Jul 17, 2025
093ebcf
chore: move left out test config out of functions
himanshusinghs Jul 17, 2025
8496b03
chore: remove unused func
himanshusinghs Jul 17, 2025
4bbcba1
chore: remove orphan checks
himanshusinghs Jul 17, 2025
7c3061d
chore: update the test prompt
himanshusinghs Jul 17, 2025
ec52ee5
chore: allow adding custom parameter scorers
himanshusinghs Jul 18, 2025
743cbfa
chore: ts fixes
himanshusinghs Jul 18, 2025
3491a3b
fix: tweak the arg shapes to improve tool accuracy (#381)
nirinchev Jul 18, 2025
2909e8a
Replace the matcher framework
nirinchev Jul 18, 2025
49bfac4
remove microdiff
nirinchev Jul 18, 2025
356512b
fix tests
nirinchev Jul 18, 2025
8a5a9d2
don't omit fields for MongoDB storage
nirinchev Jul 18, 2025
2d4e750
fix test coverage
nirinchev Jul 18, 2025
55 changes: 55 additions & 0 deletions .github/workflows/accuracy-tests.yml
@@ -0,0 +1,55 @@
name: Accuracy Tests

on:
  workflow_dispatch:
  push:
    branches:
      - main
  pull_request:
    types:
      - labeled

jobs:
  run-accuracy-tests:
    name: Run Accuracy Tests
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    if: |
      github.event_name == 'workflow_dispatch' ||
      (github.event_name == 'pull_request' && github.event.label.name == 'accuracy-tests')
    env:
      MDB_OPEN_AI_API_KEY: ${{ secrets.ACCURACY_OPEN_AI_API_KEY }}
      MDB_GEMINI_API_KEY: ${{ secrets.ACCURACY_GEMINI_API_KEY }}
      MDB_AZURE_OPEN_AI_API_KEY: ${{ secrets.ACCURACY_AZURE_OPEN_AI_API_KEY }}
      MDB_AZURE_OPEN_AI_API_URL: ${{ vars.ACCURACY_AZURE_OPEN_AI_API_URL }}
      MDB_ACCURACY_MDB_URL: ${{ secrets.ACCURACY_MDB_CONNECTION_STRING }}
      MDB_ACCURACY_MDB_DB: ${{ vars.ACCURACY_MDB_DB }}
      MDB_ACCURACY_MDB_COLLECTION: ${{ vars.ACCURACY_MDB_COLLECTION }}
      MDB_ACCURACY_BASELINE_COMMIT: ${{ github.event.pull_request.base.sha || '' }}
    steps:
      - uses: GitHubSecurityLab/actions-permissions/monitor@v1
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version-file: package.json
          cache: "npm"
      - name: Install dependencies
        run: npm ci
      - name: Run accuracy tests
        run: npm run test:accuracy
      - name: Upload accuracy test summary
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: accuracy-test-summary
          path: .accuracy/test-summary.html
      - name: Comment summary on PR
        if: github.event_name == 'pull_request' && github.event.label.name == 'accuracy-tests'
        uses: marocchino/sticky-pull-request-comment@d2ad0de260ae8b0235ce059e63f2949ba9e05943 # v2
        with:
          # Hides the previous comment and add a comment at the end
          hide_and_recreate: true
          hide_classify: "OUTDATED"
          path: .accuracy/test-brief.md
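
The job-level `if` in the workflow above decides which triggering events actually run the tests. The following is a minimal TypeScript sketch of that condition, not code from the PR: the `WorkflowEvent` type and the `shouldRunAccuracyTests` function are hypothetical names introduced only for illustration. Read literally, a `push` to `main` starts the workflow, but the condition as written at this commit does not match `push`, so the job would be skipped for that event.

```typescript
// Hypothetical model of the job-level `if` expression in accuracy-tests.yml.
// The type and function names are illustrative and do not exist in the PR.
type WorkflowEvent = {
  eventName: "workflow_dispatch" | "push" | "pull_request";
  // Only meaningful for `pull_request` events of type `labeled`.
  labelName?: string;
};

function shouldRunAccuracyTests(event: WorkflowEvent): boolean {
  return (
    event.eventName === "workflow_dispatch" ||
    (event.eventName === "pull_request" && event.labelName === "accuracy-tests")
  );
}

// Manual trigger: the job runs.
console.log(shouldRunAccuracyTests({ eventName: "workflow_dispatch" })); // true

// PR labeled `accuracy-tests`: the job runs.
console.log(
  shouldRunAccuracyTests({ eventName: "pull_request", labelName: "accuracy-tests" })
); // true

// PR labeled with anything else: the job is skipped.
console.log(
  shouldRunAccuracyTests({ eventName: "pull_request", labelName: "bug" })
); // false

// Push to main: the workflow is triggered, but this condition is false.
console.log(shouldRunAccuracyTests({ eventName: "push" })); // false
```

The "Comment summary on PR" step repeats the pull-request half of the condition, so a manual `workflow_dispatch` run uploads the HTML summary artifact but posts no comment.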
2 changes: 2 additions & 0 deletions .gitignore
@@ -11,3 +11,5 @@ state.json

tests/tmp
coverage
# Generated assets by accuracy runs
.accuracy
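
The workflow uploads `.accuracy/test-summary.html` and comments from `.accuracy/test-brief.md`, and commit 1c2295a is titled "ensure the generated asset directory is present". How the PR does this is not visible in this diff. As a rough sketch only, a Node.js script could create the directory before writing a generated file. The `writeGeneratedAsset` helper below is hypothetical, and the example writes into a temporary directory rather than the repository.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical helper (not from the PR): writes a generated asset and
// creates its parent directory first, so a fresh checkout without a
// `.accuracy` directory does not fail with ENOENT.
function writeGeneratedAsset(filePath: string, contents: string): void {
  // `recursive: true` makes this a no-op when the directory already exists.
  mkdirSync(dirname(filePath), { recursive: true });
  writeFileSync(filePath, contents, "utf8");
}

// Usage: a temporary directory stands in for the repository root.
const root = mkdtempSync(join(tmpdir(), "accuracy-"));
const briefPath = join(root, ".accuracy", "test-brief.md");
writeGeneratedAsset(briefPath, "# Accuracy test brief\n");
console.log(readFileSync(briefPath, "utf8").trim()); // prints "# Accuracy test brief"
```

Because the generated files live under `.accuracy`, the `.gitignore` entry above keeps them out of commits.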