This repository was archived by the owner on Feb 23, 2026. It is now read-only.

Commit 57c8eee

Merge pull request #49 from runbasehq/dev
feat: add scoring system for tool call evaluation
2 parents 0d56910 + 0fbd14c commit 57c8eee

File tree

7 files changed: +322 −93 lines changed

README.md

Lines changed: 92 additions & 12 deletions
@@ -1,4 +1,5 @@
 # mcp-check
+⚠️ It's not usable yet, but docs and a stable release are coming in the next few weeks. Stay tuned.
 
 <p align="center">
   <a href="https://x.com/fveiras_">
@@ -35,8 +36,7 @@ const mcpServer = new McpServer({
 
 // Execute a prompt with multiple AI models
 const result = await client(mcpServer, ["claude-3-haiku-20240307", "gpt-4"])
-  .prompt("What tools are available and how do they work?")
-  .execute();
+  .prompt("What tools are available and how do they work?");
 
 // Get comprehensive results
 const executionResult = result.getExecutionResult();
@@ -65,7 +65,7 @@ const mcpServer = new McpServer({
 });
 ```
 
-### client(mcpServer, models, config?)
+### client(mcpServer, models, scorers?, config?)
 
 Create a client instance to execute prompts:
 
@@ -80,13 +80,27 @@ const result = await client(mcpServer, ["claude-3-haiku-20240307", "gpt-4"], {
     onError: (data) => console.error("Error:", data.error)
   }
 })
-  .prompt("Your prompt here")
-  .execute();
+  .scorers([
+    {
+      name: "contains id",
+      tool: "list_branches",
+      scorer: ({ output }) => {
+        try {
+          const branches = JSON.parse(output[0]?.text);
+          return branches.some(branch => branch.id) ? 1 : 0;
+        } catch {
+          return 0;
+        }
+      }
+    }
+  ])
+  .prompt("Your prompt here");
 ```
 
 **Parameters:**
 - `mcpServer`: Configured MCP server instance
 - `models`: Array of AI model names to use
+- `scorers?`: Optional array of Scorer instances for tool evaluation
 - `config?`: Optional configuration including API keys, silent mode, and chunk handlers
 
 **Supported Models:**
@@ -96,17 +110,28 @@ const result = await client(mcpServer, ["claude-3-haiku-20240307", "gpt-4"], {
 ### Agent Methods
 
 #### `.prompt(text: string)`
-Set the prompt to execute against the MCP server.
+Execute the prompt against the MCP server and return results. This method automatically executes the prompt.
+
+#### `.scorers(scorers: Array<{name: string; tool: string; scorer: Function}>)`
+Configure scorers to evaluate tool call results:
+```typescript
+.scorers([
+  {
+    name: "contains_data",
+    tool: "fetch_data",
+    scorer: ({ output, input }) => {
+      return output?.data ? 1 : 0;
+    }
+  }
+])
+```
 
 #### `.allowTools(tools: string[])`
 Restrict which tools can be used by the models.
 
-#### `.execute()`
-Execute the prompt and return comprehensive results with tool usage tracking.
-
 ### Result Methods
 
-The `execute()` method returns an `AgentsResult` object with these methods:
+The `prompt()` method returns an `AgentsResult` object with these methods:
 
 #### `getResponse(model)`
 Get the response for a specific model:
@@ -155,6 +180,55 @@ Get list of models that executed successfully.
 #### getFailedAgents()
 Get list of models that failed to execute.
 
+#### getScores(model)
+Get evaluation scores for a specific model's tool calls:
+```typescript
+const scores = result.getScores("claude-3-haiku-20240307");
+scores.forEach(score => {
+  console.log(`${score.name}: ${score.score} for tool ${score.tool}`);
+});
+```
+
+## Scorer System
+
+The scorer system allows you to evaluate and validate tool call results automatically:
+
+```typescript
+const result = await client(mcpServer, ["claude-3-haiku-20240307"])
+  .scorers([
+    {
+      name: "valid_branches",
+      tool: "list_branches",
+      scorer: ({ output, input }) => {
+        try {
+          const branches = JSON.parse(output[0]?.text);
+          return branches.every(b => b.id && b.name) ? 1 : 0;
+        } catch {
+          return 0;
+        }
+      }
+    },
+    {
+      name: "has_results",
+      tool: "search_content",
+      scorer: ({ output }) => {
+        return output?.results?.length > 0 ? 1 : 0;
+      }
+    }
+  ])
+  .prompt("List all branches and search for content");
+
+// Get scores for evaluation
+const scores = result.getScores("claude-3-haiku-20240307");
+console.log("Evaluation results:", scores);
+```
+
+Scorer functions receive:
+- `output`: The tool's result/response
+- `input`: The tool's input arguments
+
+Return a number (typically 0-1) representing the evaluation score.
+
 ## Testing Example
 
 ```typescript
@@ -170,8 +244,14 @@ const mcpServer = new McpServer({
 describe("MCP Server Tests", () => {
   test("should use expected tools across multiple models", async () => {
     const result = await client(mcpServer, ["claude-3-haiku-20240307", "gpt-4"])
-      .prompt("Update the content using the available tools.")
-      .execute();
+      .scorers([
+        {
+          name: "tool_success",
+          tool: "update_blocks",
+          scorer: ({ output }) => output?.success ? 1 : 0
+        }
+      ])
+      .prompt("Update the content using the available tools.");
 
     // Verify execution summary
     const execution = result.getExecutionResult();
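The scorer functions shown above are plain functions from tool output to a 0-1 score, so they can be unit-tested in isolation without running any model. A minimal standalone sketch (the `output` shape here is assumed from the README examples, not taken from the library's type definitions):

```typescript
// Hypothetical stand-in for the tool output shape used in the README
// examples: an array of content parts, each with a `text` field.
type ToolOutput = { text: string }[];

// Mirrors the "contains id" scorer from the diff: parse the first text
// part as JSON and score 1 if at least one branch has an `id`, else 0.
const containsIdScorer = ({ output }: { output: ToolOutput }): number => {
  try {
    const branches = JSON.parse(output[0]?.text ?? "");
    return branches.some((branch: { id?: string }) => branch.id) ? 1 : 0;
  } catch {
    // Any parse failure (empty output, malformed JSON) scores 0.
    return 0;
  }
};

console.log(containsIdScorer({ output: [{ text: '[{"id":"b1","name":"main"}]' }] })); // 1
console.log(containsIdScorer({ output: [{ text: "not json" }] })); // 0
```

Because malformed or empty output falls through to the `catch`, the scorer always returns a number rather than throwing into the evaluation loop.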

packages/mcp-check/CHANGELOG.md

Lines changed: 9 additions & 0 deletions

@@ -1,5 +1,14 @@
 # mcp-testing-library
 
+## 0.4.0
+
+### Minor Changes
+
+- - Add scoring system for tool call evaluation (getScores() and .scorers() methods)
+  - Simplified API: Removed .execute() calls since prompt() now auto-executes
+  - Scorer System: Added comprehensive section explaining how to use scorers
+  - Updated Examples
+
 ## 0.3.3
 
 ### Patch Changes

packages/mcp-check/package.json

Lines changed: 1 addition & 1 deletion

@@ -1,7 +1,7 @@
 {
   "name": "mcp-check",
   "module": "dist/src/index.js",
-  "version": "0.3.3",
+  "version": "0.4.0",
   "type": "module",
   "main": "dist/src/index.js",
   "types": "dist/src/index.d.ts",

packages/mcp-check/src/chunks/types.ts

Lines changed: 29 additions & 27 deletions

@@ -7,10 +7,10 @@ export interface ToolCall {
 
 /**
  * Represents the result of a streaming operation.
- * 
+ *
  * Contains the final content generated and information about tools used
  * during the streaming process.
- * 
+ *
  * @example
  * ```typescript
  * const result: StreamResult = {
@@ -33,10 +33,10 @@ export interface StreamResult {
 
 /**
  * Represents a complete agent response with metadata.
- * 
+ *
  * This interface provides a comprehensive view of an AI agent's response,
  * including the model used, content generated, tools utilized, and metadata.
- * 
+ *
  * @example
  * ```typescript
  * const response: AgentResponse = {
@@ -76,10 +76,10 @@ export interface AgentResponse {
 
 /**
  * Represents the result of executing multiple agents.
- * 
+ *
  * This interface aggregates responses from multiple AI agents and provides
  * summary statistics about the execution.
- * 
+ *
  * @example
  * ```typescript
  * const executionResult: AgentsExecutionResult = {
@@ -117,25 +117,25 @@ export interface AgentsExecutionResult {
 
 /**
  * Typed tool call with generic arguments and result types.
- * 
+ *
  * This interface provides type safety for tool calls by allowing specification
  * of argument and result types through generics.
- * 
+ *
  * @template TArgs - Type of the tool arguments
  * @template TResult - Type of the tool result
- * 
+ *
  * @example
  * ```typescript
  * interface WeatherArgs {
  *   location: string;
  *   units: "celsius" | "fahrenheit";
  * }
- * 
+ *
  * interface WeatherResult {
  *   temperature: number;
  *   condition: string;
 * }
- * 
+ *
  * const typedCall: TypedToolCall<WeatherArgs, WeatherResult> = {
  *   args: { location: "NYC", units: "fahrenheit" },
  *   result: { temperature: 72, condition: "sunny" },
@@ -157,10 +157,10 @@ export interface TypedToolCall<TArgs = Record<string, any>, TResult = any> {
 
 /**
  * Statistics for tool call usage.
- * 
+ *
  * This interface tracks usage patterns and performance metrics for tools
  * used by AI agents.
- * 
+ *
  * @example
  * ```typescript
  * const stats: ToolCallStats = {
@@ -187,10 +187,10 @@ export interface ToolCallStats {
 
 /**
  * Union type of all possible normalized chunk types.
- * 
+ *
  * Defines the different types of streaming chunks that can be processed
  * from AI providers.
- * 
+ *
  * - `text_delta`: Incremental text content updates
 * - `tool_call_start`: Beginning of a tool call
 * - `tool_call_done`: Completion of a tool call
@@ -199,7 +199,7 @@ export interface ToolCallStats {
 * - `message_done`: Completion of a message
 * - `thinking_delta`: Incremental thinking/reasoning updates
 * - `error`: Error information
- * 
+ *
  * @example
  * ```typescript
  * const chunkType: NormalizedChunkType = "text_delta";
@@ -217,11 +217,11 @@ export type NormalizedChunkType =
 
 /**
  * Represents a normalized chunk of streaming data from AI providers.
- * 
+ *
  * This interface provides a unified format for handling chunks from different
  * AI providers (Anthropic, OpenAI) by normalizing their specific formats
  * into a common structure.
- * 
+ *
  * @example
  * ```typescript
  * const chunk: NormalizedChunk = {
@@ -261,13 +261,13 @@ export interface NormalizedChunk {
 
 /**
  * Generic callback function for handling any normalized chunk.
- * 
+ *
  * This type defines a callback that can process any type of normalized chunk.
  * The callback can be synchronous or asynchronous.
- * 
+ *
  * @param chunk - The normalized chunk to process
  * @returns Promise<void> | void
- * 
+ *
  * @example
  * ```typescript
  * const callback: ChunkCallback = async (chunk) => {
@@ -280,13 +280,13 @@ export type ChunkCallback = (chunk: NormalizedChunk) => void | Promise<void>;
 
 /**
  * Callback function for handling specific chunk types.
- * 
+ *
  * This type defines a callback that processes the data payload from a specific
  * chunk type. The callback can be synchronous or asynchronous.
- * 
+ *
  * @param data - The data payload from the chunk
  * @returns Promise<void> | void
- * 
+ *
  * @example
  * ```typescript
  * const textCallback: ChunkTypeCallback = async (data) => {
@@ -296,15 +296,17 @@ export type ChunkCallback = (chunk: NormalizedChunk) => void | Promise<void>;
 * };
 * ```
 */
-export type ChunkTypeCallback = (data: NormalizedChunk["data"]) => void | Promise<void>;
+export type ChunkTypeCallback = (
+  data: NormalizedChunk["data"],
+) => void | Promise<void>;
 
 /**
  * Configuration object for chunk handlers.
- * 
+ *
  * This interface defines all the callback functions that can be registered
  * to handle different types of chunks during processing. All callbacks are
  * optional and can be async.
- * 
+ *
  * @example
  * ```typescript
  * const handlers: ChunkHandlerConfig = {
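The only non-whitespace change in types.ts reflows the `ChunkTypeCallback` signature across multiple lines. A standalone sketch of how that callback type is used (the `NormalizedChunk` here is a hypothetical minimal stand-in; the real interface in types.ts carries more fields):

```typescript
// Minimal stand-in for illustration only; the library's NormalizedChunk
// interface defines the full set of chunk types and metadata.
interface NormalizedChunk {
  type: "text_delta" | "error";
  data: { text?: string };
}

// The reformatted callback signature from the diff: it receives only the
// chunk's data payload and may be synchronous or asynchronous.
type ChunkTypeCallback = (
  data: NormalizedChunk["data"],
) => void | Promise<void>;

const received: string[] = [];

// A sync callback that records only chunks carrying text.
const textCallback: ChunkTypeCallback = (data) => {
  if (data.text) received.push(data.text);
};

textCallback({ text: "hello" });
textCallback({}); // no text: ignored
console.log(received); // ["hello"]
```

Because the type permits `void | Promise<void>`, the same slot accepts an `async` handler without any change to the caller.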

0 commit comments