
[FEATURE]: Evaluation harness + Dataset #24

@FMurray

Description


Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

The Model Context Protocol (MCP) has rapidly become the de‑facto USB‑C‑style connector between LLM clients and tool servers. Databricks customers evaluating an MCP server may ask two questions:
1. “Will my agent actually pick the right tool and call it correctly?”
2. “How stable is this server as we add more metadata, indexes, or Genie Spaces?”

A lightweight, reproducible evaluation suite shipped with the server lets teams answer both questions quantitatively, the same way OpenAI ships evals and the OSS community maintains ToolBench, LangChain’s Tool‑Usage Benchmarks, and τ‑Bench for agents.

Proposed Solution

2 | Scope

Ship three components inside the mcp‑server repo:

| Component | Purpose | Deliverables |
| --- | --- | --- |
| Test Client | Naïve reference agent that only enumerates the server’s `/tools` endpoint, then calls an LLM API for tool selection and the final assistant response | Small configurable client that integrates with the eval harness/builder |
| Evaluation Harness | Exhaustively runs the (model × metadata‑level × tool‑count) matrix and scores each run | `evaluate.py` with judge config; results surfaced in Databricks agent observability |
| Eval Set + Builder | Curated YAML describing the expected tool + args for each prompt, plus a script to regenerate it from your schema (example record below) | `evals/` folder with tier‑0/1/2 sets; `scripts/make_eval_set.py` |
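For concreteness, a single eval‑set record could look like the sketch below. The field names, the example tool, and the loose argument matcher are illustrative assumptions, not a final schema.

```python
# Illustrative shape of one eval-set record; the builder would serialize a
# list of these to something like evals/tier_1.yaml. All field names and the
# example tool ("execute_sql") are hypothetical.
import yaml

record = {
    "prompt": "How many orders were placed last week?",
    "expected": {
        "tool": "execute_sql",                   # hypothetical tool name
        "args": {"query_contains": ["orders"]},  # loose argument matcher
    },
    "metadata_depth": 1,
    "tier": 1,
}

print(yaml.safe_dump([record], sort_keys=False))
```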

3 | Design Details

A single, simple reference implementation keeps the surface area tiny while mirroring best‑practice “function‑calling” prompts. It:
1. Calls GET /tools → builds the JSON‑schema block.
2. Injects the schemas + user prompt into the LLM.
3. Parses the LLM’s function‑call response and executes the MCP tool(s).
4. Passes the tool execution results back to the LLM to produce the final assistant response (a minimal sketch follows below).
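The sketch below assumes an HTTP‑style MCP server exposing GET /tools and a POST /call_tool endpoint, plus an OpenAI‑compatible chat API with tool calling; the endpoint paths and payload shapes are illustrative, not the server’s actual API.

```python
# Minimal sketch of the naive reference client (steps 1-4 above).
import json
import requests
from openai import OpenAI

MCP_URL = "http://localhost:8000"   # hypothetical fixture server
client = OpenAI()


def run_once(prompt: str, model: str = "gpt-4o-mini") -> dict:
    # 1. Enumerate tools and build the JSON-schema block.
    tools = requests.get(f"{MCP_URL}/tools", timeout=10).json()
    tool_schemas = [
        {"type": "function",
         "function": {"name": t["name"],
                      "description": t.get("description", ""),
                      "parameters": t.get("inputSchema", {})}}
        for t in tools
    ]

    # 2. Inject schemas + user prompt into the LLM.
    messages = [{"role": "user", "content": prompt}]
    first = client.chat.completions.create(
        model=model, messages=messages, tools=tool_schemas)
    # (A robust client would handle the case where no tool call is returned.)
    call = first.choices[0].message.tool_calls[0]

    # 3. Parse the function-call response and execute it against the MCP server.
    args = json.loads(call.function.arguments)
    result = requests.post(
        f"{MCP_URL}/call_tool",
        json={"name": call.function.name, "arguments": args},
        timeout=30,
    ).json()

    # 4. Pass the tool result back to the LLM for the final assistant response.
    messages += [first.choices[0].message,
                 {"role": "tool", "tool_call_id": call.id,
                  "content": json.dumps(result)}]
    final = client.chat.completions.create(model=model, messages=messages)

    return {"tool": call.function.name, "args": args,
            "answer": final.choices[0].message.content}
```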

3.2 | Evaluation Matrix

For every combination of metadata_depth × entity_count × model, the harness records two primary metrics per run. The sweep dimensions are:

| Dimension | Values |
| --- | --- |
| Metadata depth | 0 (types only) • 1 (+ descriptions) • 2 (+ lineage, delta log, etc.) |
| Entity count | tools {1, 5, 20, 100} × indexes {0, 2, 5} × Genie Spaces {0, 2} |
| Model | gpt‑4o‑mini, claude‑3‑sonnet, ... |

For each harness sweep, the total number of runs is N = |metadata depths| × |entity counts| × |models| (see the sketch below).
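A small sketch of how the harness could enumerate that matrix; the dimension values are copied from the table above, and the two‑model list is just an example.

```python
# Sketch of the harness sweep: one run per cell of the
# metadata_depth x entity_count x model matrix.
from itertools import product

METADATA_DEPTHS = [0, 1, 2]
ENTITY_COUNTS = [
    {"tools": t, "indexes": i, "genie_spaces": g}
    for t, i, g in product([1, 5, 20, 100], [0, 2, 5], [0, 2])
]
MODELS = ["gpt-4o-mini", "claude-3-sonnet"]

matrix = list(product(METADATA_DEPTHS, ENTITY_COUNTS, MODELS))
# N = |metadata depths| * |entity counts| * |models| = 3 * 24 * 2 = 144 runs
print(len(matrix))
```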

3.3 | Eval Set

The general methodology I propose follows ToolBench, which generates realistic user queries from the tool schemas. I see two options for creating eval sets:

1. Fixed approach: define a fixed set of tools as a "test fixture" and run the eval generation against it.
2. Introspection-based approach: instead of a fixed set of tools, introspect the running server and generate evals from whatever it exposes.

I think option 2 is ultimately preferable, as it will eventually let users generate their own eval sets. Practically speaking, I think it makes sense to configure the test fixture server(s) using YAML (specifying tools, indexes, Genie Spaces, metadata, ...). A sketch of the introspection-based builder follows.
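The sketch below shows what scripts/make_eval_set.py could look like under the introspection-based approach. It reuses the same hypothetical GET /tools endpoint as the reference client, and the LLM-based query synthesis is stubbed out with a template for brevity.

```python
# Sketch of scripts/make_eval_set.py (introspection-based, option 2).
import requests
import yaml

MCP_URL = "http://localhost:8000"   # hypothetical fixture server
K = 3                               # queries to synthesize per tool


def synthesize_queries(tool: dict, k: int) -> list[str]:
    # Placeholder: in practice this would prompt an LLM with the tool's
    # name, description, and input schema (ToolBench-style).
    return [f"User request #{i} that should resolve to {tool['name']}"
            for i in range(k)]


def main() -> None:
    tools = requests.get(f"{MCP_URL}/tools", timeout=10).json()
    records = []
    for tool in tools:
        for prompt in synthesize_queries(tool, K):
            records.append({
                "prompt": prompt,
                "expected": {"tool": tool["name"], "args": {}},
            })
    with open("evals/tier_0.yaml", "w") as f:
        yaml.safe_dump(records, f, sort_keys=False)


if __name__ == "__main__":
    main()
```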

4 | Implementation

  1. Add fixture servers + setup/teardown. These could be real servers or mocks, configured by something like YAML.
  2. Create the test client: fetches tools, builds prompts, parses responses.
  3. Generate eval sets: for each fixture, pull the schemas and synthesize k requests (expected tool + args); these can also be persisted in version control for the public version of the eval sets.
  4. Build evaluation harness scripts using MLflow + Databricks evals (a minimal sketch follows this list).
  5. [Stretch] Run in CI and check metrics against PRs?
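A minimal sketch of the harness loop from step 4, reusing the `matrix` and `run_once` sketches above and logging results to MLflow so they appear in Databricks experiment tracking; this is illustrative, not the final evaluate.py interface.

```python
# Sketch of the evaluation harness loop: run the 3.2 sweep, score tool
# selection against the eval set, and log one MLflow run per matrix cell.
import mlflow
import yaml

with open("evals/tier_0.yaml") as f:
    eval_set = yaml.safe_load(f)

for depth, entities, model in matrix:           # `matrix` from the 3.2 sketch
    with mlflow.start_run(run_name=f"depth{depth}-{model}"):
        mlflow.log_params({"metadata_depth": depth, "model": model, **entities})
        correct = 0
        for case in eval_set:
            out = run_once(case["prompt"], model=model)   # reference client
            correct += int(out["tool"] == case["expected"]["tool"])
        mlflow.log_metric("tool_selection_accuracy", correct / len(eval_set))
```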

Additional Context

No response
