
[FEATURE]: Evaluation harness + Dataset #24

@FMurray

Description


Is there an existing issue for this?

  • I have searched the existing issues

Problem statement

The Model Context Protocol (MCP) has rapidly become the de‑facto USB‑C‑style connector between LLM clients and tool servers. Databricks customers evaluating an MCP server may ask two questions:
1. “Will my agent actually pick the right tool and call it correctly?”
2. “How stable is this server as we add more metadata, indexes, or Genie Spaces?”

A lightweight, reproducible evaluation suite shipped with the server lets teams answer both questions quantitatively, the same way OpenAI ships evals and the OSS community maintains ToolBench, LangChain’s Tool‑Usage Benchmarks, and τ‑Bench for agents.

Proposed Solution

2 | Scope

Ship three components inside the mcp‑server repo:

| Component | Purpose | Deliverables |
| --- | --- | --- |
| Test Client | Naïve reference agent that only enumerates the server’s `/tools` endpoint, then calls an LLM API for tool selection and the final assistant response | Small configurable client that integrates with the eval harness/builder |
| Evaluation Harness | Exhaustively runs the (model × metadata‑level × tool‑count) matrix and scores each run | `evaluate.py` with judge config; results surfaced in Databricks agent observability |
| Eval Set + Builder | Curated YAML describing the expected tool + args for each prompt, plus a script to regenerate it from your schema (example record below) | `evals/` folder with tier‑0/1/2 sets; `scripts/make_eval_set.py` |
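For concreteness, a single eval‑set record could look like the sketch below. The field names, the example tool, and the loose argument matcher are illustrative assumptions, not a final schema.

```python
# Illustrative shape of one eval-set record; the builder would serialize a
# list of these to something like evals/tier_1.yaml. All field names and the
# example tool ("execute_sql") are hypothetical.
import yaml

record = {
    "prompt": "How many orders were placed last week?",
    "expected": {
        "tool": "execute_sql",                   # hypothetical tool name
        "args": {"query_contains": ["orders"]},  # loose argument matcher
    },
    "metadata_depth": 1,
    "tier": 1,
}

print(yaml.safe_dump([record], sort_keys=False))
```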

3 | Design Details

A single, simple reference implementation keeps the surface area tiny while mirroring best‑practice “function‑calling” prompts. It:
1. Calls GET /tools → builds the JSON‑schema block.
2. Injects the schemas + user prompt into the LLM.
3. Parses the LLM’s function‑call response and executes the MCP tool(s).
4. Passes the tool execution results back to the LLM to produce the final assistant response (a minimal sketch follows below).
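The sketch below assumes an HTTP‑style MCP server exposing GET /tools and a POST /call_tool endpoint, plus an OpenAI‑compatible chat API with tool calling; the endpoint paths and payload shapes are illustrative, not the server’s actual API.

```python
# Minimal sketch of the naive reference client (steps 1-4 above).
import json
import requests
from openai import OpenAI

MCP_URL = "http://localhost:8000"   # hypothetical fixture server
client = OpenAI()


def run_once(prompt: str, model: str = "gpt-4o-mini") -> dict:
    # 1. Enumerate tools and build the JSON-schema block.
    tools = requests.get(f"{MCP_URL}/tools", timeout=10).json()
    tool_schemas = [
        {"type": "function",
         "function": {"name": t["name"],
                      "description": t.get("description", ""),
                      "parameters": t.get("inputSchema", {})}}
        for t in tools
    ]

    # 2. Inject schemas + user prompt into the LLM.
    messages = [{"role": "user", "content": prompt}]
    first = client.chat.completions.create(
        model=model, messages=messages, tools=tool_schemas)
    # (A robust client would handle the case where no tool call is returned.)
    call = first.choices[0].message.tool_calls[0]

    # 3. Parse the function-call response and execute it against the MCP server.
    args = json.loads(call.function.arguments)
    result = requests.post(
        f"{MCP_URL}/call_tool",
        json={"name": call.function.name, "arguments": args},
        timeout=30,
    ).json()

    # 4. Pass the tool result back to the LLM for the final assistant response.
    messages += [first.choices[0].message,
                 {"role": "tool", "tool_call_id": call.id,
                  "content": json.dumps(result)}]
    final = client.chat.completions.create(model=model, messages=messages)

    return {"tool": call.function.name, "args": args,
            "answer": final.choices[0].message.content}
```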

3.2 | Evaluation Matrix

For every combination of metadata_depth × entity_count × model, the harness records two primary metrics per run. The sweep dimensions are:

| Dimension | Values |
| --- | --- |
| Metadata depth | 0 (types only) • 1 (+ descriptions) • 2 (+ lineage, delta log, etc.) |
| Entity count | tools {1, 5, 20, 100} × indexes {0, 2, 5} × Genie Spaces {0, 2} |
| Model | gpt‑4o‑mini, claude‑3‑sonnet, ... |

For each harness sweep, the total number of runs is N = |metadata depths| × |entity counts| × |models| (see the sketch below).
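A small sketch of how the harness could enumerate that matrix; the dimension values are copied from the table above, and the two‑model list is just an example.

```python
# Sketch of the harness sweep: one run per cell of the
# metadata_depth x entity_count x model matrix.
from itertools import product

METADATA_DEPTHS = [0, 1, 2]
ENTITY_COUNTS = [
    {"tools": t, "indexes": i, "genie_spaces": g}
    for t, i, g in product([1, 5, 20, 100], [0, 2, 5], [0, 2])
]
MODELS = ["gpt-4o-mini", "claude-3-sonnet"]

matrix = list(product(METADATA_DEPTHS, ENTITY_COUNTS, MODELS))
# N = |metadata depths| * |entity counts| * |models| = 3 * 24 * 2 = 144 runs
print(len(matrix))
```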

3.3 | Eval Set

The general methodology I propose follows ToolBench, which generates realistic user queries from the tool schemas. I see two options for creating eval sets:

1. Fixed approach: define a fixed set of tools as a "test fixture" and run the eval generation against it.
2. Introspection-based approach: instead of a fixed set of tools, introspect the running server and generate evals from whatever it exposes.

I think option 2 is ultimately preferable, as it will eventually let users generate their own eval sets. Practically speaking, I think it makes sense to configure the test fixture server(s) using YAML (specifying tools, indexes, Genie Spaces, metadata, ...). A sketch of the introspection-based builder follows.
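The sketch below shows what scripts/make_eval_set.py could look like under the introspection-based approach. It reuses the same hypothetical GET /tools endpoint as the reference client, and the LLM-based query synthesis is stubbed out with a template for brevity.

```python
# Sketch of scripts/make_eval_set.py (introspection-based, option 2).
import requests
import yaml

MCP_URL = "http://localhost:8000"   # hypothetical fixture server
K = 3                               # queries to synthesize per tool


def synthesize_queries(tool: dict, k: int) -> list[str]:
    # Placeholder: in practice this would prompt an LLM with the tool's
    # name, description, and input schema (ToolBench-style).
    return [f"User request #{i} that should resolve to {tool['name']}"
            for i in range(k)]


def main() -> None:
    tools = requests.get(f"{MCP_URL}/tools", timeout=10).json()
    records = []
    for tool in tools:
        for prompt in synthesize_queries(tool, K):
            records.append({
                "prompt": prompt,
                "expected": {"tool": tool["name"], "args": {}},
            })
    with open("evals/tier_0.yaml", "w") as f:
        yaml.safe_dump(records, f, sort_keys=False)


if __name__ == "__main__":
    main()
```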

4 | Implementation

  1. Add fixture servers + setup/teardown. These could be real servers or mocks, configured by something like YAML.
  2. Create the test client: fetches tools, builds prompts, parses responses.
  3. Generate eval sets: for each fixture, pull the schemas and synthesize k requests (expected tool + args); these can also be persisted in version control for the public version of the eval sets.
  4. Build evaluation harness scripts using MLflow + Databricks evals (a minimal sketch follows this list).
  5. [Stretch] Run in CI and check metrics against PRs?
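A minimal sketch of the harness loop from step 4, reusing the `matrix` and `run_once` sketches above and logging results to MLflow so they appear in Databricks experiment tracking; this is illustrative, not the final evaluate.py interface.

```python
# Sketch of the evaluation harness loop: run the 3.2 sweep, score tool
# selection against the eval set, and log one MLflow run per matrix cell.
import mlflow
import yaml

with open("evals/tier_0.yaml") as f:
    eval_set = yaml.safe_load(f)

for depth, entities, model in matrix:           # `matrix` from the 3.2 sketch
    with mlflow.start_run(run_name=f"depth{depth}-{model}"):
        mlflow.log_params({"metadata_depth": depth, "model": model, **entities})
        correct = 0
        for case in eval_set:
            out = run_once(case["prompt"], model=model)   # reference client
            correct += int(out["tool"] == case["expected"]["tool"])
        mlflow.log_metric("tool_selection_accuracy", correct / len(eval_set))
```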

Additional Context

No response
