Is there an existing issue for this?
- I have searched the existing issues
Problem statement
The Model Context Protocol (MCP) has rapidly become the de‑facto USB‑C‑style connector between LLM clients and tool servers. Databricks customers evaluating an MCP server may ask two questions:
1. “Will my agent actually pick the right tool and call it correctly?”
2. “How stable is this server as we add more metadata, indexes, or Genie Spaces?”
A lightweight, reproducible evaluation suite shipped with the server lets teams answer both questions quantitatively, the same way OpenAI ships evals and the OSS community maintains ToolBench, LangChain's Tool‑Usage Benchmarks, and τ‑Bench for agents.
Proposed Solution
2 | Scope
Ship three components inside the mcp‑server repo:
| Component | Purpose | Deliverables |
|---|---|---|
| Test Client | Naïve reference agent that only enumerates the server's /tools endpoint, calls an LLM API for tool selection, and produces the final assistant response | Small, configurable client that integrates with the eval harness/builder |
| Evaluation Harness | Exhaustively runs (model × metadata‑level × tool‑count) matrix and scores each run | • evaluate.py with judge config • results in Databricks agent observability |
| Eval Set + Builder | Curated YAML describing expected tool + args for each prompt (plus script to regenerate from your schema) | • evals/ folder with tier‑0/1/2 sets • scripts/make_eval_set.py |
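One possible layout for these deliverables inside the repo (file names beyond those already listed in the table, such as test_client.py, are illustrative):

```text
mcp-server/
├── evals/                     # tier-0/1/2 eval sets (curated YAML)
├── scripts/
│   └── make_eval_set.py       # regenerate eval sets from a fixture's schemas
├── test_client.py             # naive reference agent (name illustrative)
└── evaluate.py                # harness entry point with judge config
```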
3 | Design Details
3.1 | Test Client
A single, simple reference implementation keeps the surface area tiny while mirroring best‑practice "function‑calling" prompts. It:
1. Calls GET /tools → builds the JSON schema block.
2. Injects schema + user prompt into the LLM.
3. Parses the LLM’s function‑call response and executes MCP tools.
4. Passes the tool execution results back to the LLM to produce the final assistant response.
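A minimal sketch of this client, assuming an OpenAI-compatible chat-completions SDK and a plain HTTP /tools endpoint on the MCP server. The base URL, endpoint paths, and response shapes here are assumptions for illustration, not the server's actual API:

```python
"""Naive reference client: list tools, let the LLM pick one, execute it, answer."""
import json
import requests
from openai import OpenAI  # any chat-completions-compatible client would work

MCP_BASE = "http://localhost:8000"  # hypothetical fixture-server address
llm = OpenAI()

def list_tools() -> list[dict]:
    # 1. Fetch the server's tool schemas (endpoint path and shape are assumptions;
    #    each entry is assumed to carry name / description / parameters).
    return requests.get(f"{MCP_BASE}/tools", timeout=10).json()["tools"]

def call_tool(name: str, args: dict) -> dict:
    # 3b. Execute the selected MCP tool (again, the path is an assumption).
    return requests.post(f"{MCP_BASE}/tools/{name}/call", json=args, timeout=30).json()

def run(prompt: str, model: str = "gpt-4o-mini") -> str:
    # 2. Inject the JSON schema block + user prompt into the LLM as function-calling tools.
    tools = [{"type": "function", "function": t} for t in list_tools()]
    messages = [{"role": "user", "content": prompt}]
    first = llm.chat.completions.create(model=model, messages=messages, tools=tools)
    msg = first.choices[0].message
    if not msg.tool_calls:  # model answered without calling a tool
        return msg.content
    messages.append(msg)
    for tc in msg.tool_calls:
        # 3. Parse the function-call response and execute the MCP tool.
        result = call_tool(tc.function.name, json.loads(tc.function.arguments))
        messages.append({"role": "tool", "tool_call_id": tc.id,
                         "content": json.dumps(result)})
    # 4. Pass tool results back to the LLM for the final assistant response.
    final = llm.chat.completions.create(model=model, messages=messages, tools=tools)
    return final.choices[0].message.content
```

Since the eval set scores the expected tool + args, the real client should also expose the intermediate tool call (name and parsed arguments), not just the final answer.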
3.2 | Evaluation Matrix
For every combination of metadata_depth × entity_count × model, the harness records two primary metrics: whether the agent selected the expected tool and whether it supplied the expected arguments.
| Dimension | Values |
|---|---|
| Metadata Depth | 0 (types only) • 1 (+descriptions) • 2 (+lineage, delta log, etc.) |
| Entity Count | tools {1, 5, 20, 100} × indexes {0, 2, 5} × GenieSpaces {0, 2} |
| Model | gpt‑4o‑mini, claude‑3‑sonnet, ... |
Each harness sweep therefore runs N = |Metadata Depth| × |Entity Count| × |Model| evaluation cells, as enumerated in the sketch below.
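A sketch of how the harness could enumerate the sweep (dimension values are taken from the table above; fixture provisioning and scoring are elided):

```python
"""Enumerate the (metadata_depth x entity_count x model) evaluation matrix."""
from itertools import product

METADATA_DEPTHS = [0, 1, 2]
ENTITY_COUNTS = [
    {"tools": t, "indexes": i, "genie_spaces": g}
    for t in (1, 5, 20, 100) for i in (0, 2, 5) for g in (0, 2)
]
MODELS = ["gpt-4o-mini", "claude-3-sonnet"]

# N = |Metadata Depth| x |Entity Count| x |Model| cells per sweep
cells = list(product(METADATA_DEPTHS, ENTITY_COUNTS, MODELS))
print(f"{len(cells)} evaluation cells")  # 3 * 24 * 2 = 144 with the values above

for depth, entities, model in cells:
    ...  # provision a fixture server at this depth/entity mix, run the eval set with `model`, score
```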
3.3 | Eval Set
The general methodology I propose follows ToolBench, which generates realistic user queries from tool schemas. I see two options for creating eval sets:
- Fixed approach: define a fixed set of tools as a "test fixture" and run eval generation against it.
- Introspection-based approach: instead of a fixed set of tools, introspect a running server and generate cases from whatever it exposes.
I think the introspection-based approach is ultimately preferable, since it will eventually let users generate eval sets against their own servers. Practically speaking, it makes sense to configure the test fixture server(s) via YAML (specifying tools, indexes, Genie Spaces, metadata, ...). A builder sketch follows below.
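A sketch of what scripts/make_eval_set.py could look like under the introspection-based approach, following the ToolBench-style idea of synthesizing queries from tool schemas. The /tools endpoint, helper names, and the YAML case layout (prompt / expected_tool / expected_args) are assumptions:

```python
"""Introspection-based eval-set builder (sketch of scripts/make_eval_set.py)."""
import json
import requests
import yaml
from openai import OpenAI

MCP_BASE = "http://localhost:8000"  # hypothetical fixture server
llm = OpenAI()

def synthesize_cases(tool: dict, k: int = 3) -> list[dict]:
    """ToolBench-style generation: ask an LLM for k realistic requests for this tool."""
    prompt = (
        f"Tool schema:\n{json.dumps(tool)}\n\n"
        f'Return a JSON object {{"cases": [...]}} with {k} entries of the form '
        '{"prompt": <realistic user request>, "expected_args": <valid arguments>}.'
    )
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["cases"]

def main(out_path: str = "evals/tier0.yaml") -> None:
    # Introspect the server rather than relying on a hard-coded tool list.
    tools = requests.get(f"{MCP_BASE}/tools", timeout=10).json()["tools"]
    cases = [
        {"prompt": c["prompt"], "expected_tool": t["name"], "expected_args": c["expected_args"]}
        for t in tools
        for c in synthesize_cases(t)
    ]
    with open(out_path, "w") as f:
        yaml.safe_dump({"cases": cases}, f, sort_keys=False)

if __name__ == "__main__":
    main()
```

The generated YAML can be committed to version control for the public tier-0/1/2 sets, while users point the same script at their own server to build private sets.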
4 | Implementation
- Add fixture servers + setup/teardown. These could be real or mocked, configured by something like YAML.
- Create the test client: fetch tools, build prompts, parse responses.
- Generate eval sets: for each fixture, pull schemas and synthesize k requests (expected tool + args); these can also be persisted in version control for the public version of the eval sets.
- Build evaluation harness scripts using MLflow + Databricks evals (see the sketch after this list).
- [Stretch] Run in CI and gate PRs on metric checks?
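A sketch of the MLflow side of the harness, assuming the test client and eval-set format above. `client_fn` is a hypothetical wrapper around the reference client that returns the tool call it made; Databricks agent evaluation / LLM-judge scoring would layer on top of this basic exact-match version:

```python
"""Sketch of evaluate.py: score one matrix cell against the eval set and log to MLflow."""
import mlflow
import yaml

def load_cases(path: str = "evals/tier0.yaml") -> list[dict]:
    with open(path) as f:
        return yaml.safe_load(f)["cases"]

def score_case(case: dict, tool_call: dict) -> dict:
    # Two primary metrics: correct tool picked, correct arguments supplied.
    return {
        "tool_match": float(tool_call["name"] == case["expected_tool"]),
        "args_match": float(tool_call["args"] == case["expected_args"]),
    }

def evaluate_cell(model: str, depth: int, entities: dict, cases: list[dict], client_fn) -> None:
    """client_fn(prompt, model) -> {"name": ..., "args": ...}, i.e. the tool call the client made."""
    with mlflow.start_run(run_name=f"{model}-depth{depth}-tools{entities['tools']}"):
        mlflow.log_params({"model": model, "metadata_depth": depth, **entities})
        scores = [score_case(case, client_fn(case["prompt"], model)) for case in cases]
        for metric in ("tool_match", "args_match"):
            mlflow.log_metric(metric, sum(s[metric] for s in scores) / len(scores))
```

Logging each cell as its own MLflow run keeps the sweep browsable in Databricks agent observability and makes per-PR metric checks a matter of comparing runs.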
Additional Context
No response