
Commit 8764c63

fix: update run-evaluation.ts
1 parent e766cab commit 8764c63

File tree

9 files changed: +989 -404 lines changed


.github/workflows/evaluations.yaml

Lines changed: 1 addition & 20 deletions
@@ -37,27 +37,8 @@ jobs:
       - name: Build project
         run: npm run build

-      - name: Export tools
-        run: npm run evals:export-tools
-
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: '3.12'
-
-      - name: Install uv
-        run: pip install uv
-
-      - name: Create Python virtual environment
-        run: uv venv
-
-      - name: Install Python dependencies
-        run: uv pip install -e .
-
       - name: Run evaluations
-        run: |
-          source .venv/bin/activate
-          npm run test:evals
+        run: npm run evals:run
         env:
           PHOENIX_API_KEY: ${{ secrets.PHOENIX_API_KEY }}
           PHOENIX_COLLECTOR_ENDPOINT: ${{ secrets.PHOENIX_COLLECTOR_ENDPOINT }}

evals/README.md

Lines changed: 56 additions & 32 deletions
@@ -1,56 +1,84 @@
 # MCP Tool Calling Evaluations

-Python-based evaluations for the Apify MCP Server using Arize Phoenix platform.
+TypeScript-based evaluations for the Apify MCP Server using Arize Phoenix platform.

-> **Note**: The TypeScript package had connection issues, so we use the Python implementation instead.
+## Objectives
+
+The MCP server tool calls evaluation has several key objectives:
+
+1. **Identify problems** in the description of the tools
+2. **Create a test suite** that can be run manually or automatically in CI
+3. **Allow for quick iteration** on tool descriptions
+
+## 1. ✍️ **Create test cases manually** (Current Implementation)
+
+- **Pros:**
+  - Straightforward approach
+  - Simple to create test cases for each tool
+  - Direct control over test scenarios
+
+- **Cons:**
+  - Complicated to create flows (several tool calls in a row)
+  - Requires maintenance when MCP server changes
+  - Manual effort for comprehensive coverage
+
+## Test case examples
+
+### Simple tool selection
+```
+"What are the best Instagram scrapers" → "search-actors"
+```
+
+### Multi-step flow
+```
+User: "Search for the weather MCP server and then add it to available tools"
+Expected sequence:
+1. search-actors (with input: {"search": "weather mcp", "limit": 5})
+2. add-actor (to add the found weather MCP server)
+```

 ## Workflow

-The evaluation process has 4 steps:
+The evaluation process has two steps:

 1. **Create dataset** (if not exists) - Upload test cases to Phoenix
-2. **Update dataset ID** in `config.py` - Point to the correct Phoenix dataset
-3. **Export tools** - Get current MCP tool definitions
-4. **Run evaluation** - Test models against ground truth
+2. **Run evaluation** - Test models against ground truth

-## Quick Start
+## Quick start

 ```bash
 # 1. Set environment variables
+export PHOENIX_BASE_URL="phoenix_url"
 export PHOENIX_API_KEY="your_key"
-export OPENAI_API_KEY="your_key"
+export OPENAI_API_KEY="your_key"
 export ANTHROPIC_API_KEY="your_key"

 # 2. Install dependencies
-uv pip install -e evals/
+npm ci

 # 3. Create dataset (one-time)
-python3 evals/create_dataset.py
-
-# 4. Update DATASET_NAME in config.py with the returned dataset ID
+npm run evals:create-dataset

-# 5. Export tools and run evaluation
-npm run evals:export-tools
-python3 evals/run_evaluation.py
+# 5. Run evaluation
+npm run evals:run
 ```

 ## Files

-- `config.py` - Configuration (models, threshold, Phoenix settings)
-- `test_cases.json` - Ground truth test cases
-- `run_evaluation.py` - Main evaluation script
-- `create_dataset.py` - Upload test cases to Phoenix
-- `export-tools.ts` - Export MCP tools to JSON
-- `evaluation_2025.ipynb` - Interactive analysis notebook
+- `config.ts` - Configuration (models, threshold, Phoenix settings)
+- `test-cases.json` - Ground truth test cases
+- `run-evaluation.ts` - Main evaluation script
+- `create-dataset.ts` - Upload test cases to Phoenix
+- `evaluation_2025.ipynb` - Interactive analysis notebook (Python-based, requires `pip install -e .`)

 ## Configuration

-Key settings in `config.py`:
+Key settings in `config.ts`:
 - `MODELS_TO_EVALUATE` - Models to test (default: `['gpt-4o-mini', 'claude-3-5-haiku-latest']`)
 - `PASS_THRESHOLD` - Accuracy threshold (default: 0.8)
 - `DATASET_NAME` - Phoenix dataset name

-## Test Cases
+## Test cases

 40+ test cases covering 7 tool categories:
 - `fetch-actor-details` - Actor information queries

@@ -70,18 +98,14 @@ Key settings in `config.py`:
 ## Troubleshooting

 ```bash
-# Missing tools.json
-npm run evals:export-tools
-
 # Missing dataset
-python3 evals/create_dataset.py
+npm run evals:create-dataset

 # Environment issues
-python3 -c "from dotenv import load_dotenv; load_dotenv()"
+# Make sure .env file exists with required API keys
 ```

-## Adding Test Cases
+## Adding test cases

-1. Edit `test_cases.json`
-2. Update version number
-3. Run `python3 evals/create_dataset.py`
+1. Edit `test-cases.json`
+3. Run `npm run evals:create-dataset`
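
To make the "Test case examples" and "Adding test cases" sections above more concrete, here is a minimal sketch of the data shape that `create-dataset.ts` expects from `test-cases.json` after this commit. The `TestCase`/`TestData` interfaces are taken from the diff further below; the concrete `version`, `id`, and question values are hypothetical.

```ts
// Shape taken from the TestCase/TestData interfaces in evals/create-dataset.ts.
// The concrete values are made up for illustration.
interface TestCase {
    id: string;
    category: string;
    question: string;
    expectedTools: string[];
}

interface TestData {
    version: string;
    testCases: TestCase[];
}

const exampleTestData: TestData = {
    version: '1.0.0', // hypothetical version string
    testCases: [
        {
            id: 'search-actors-001', // hypothetical id
            category: 'search-actors',
            question: 'What are the best Instagram scrapers',
            expectedTools: ['search-actors'],
        },
    ],
};
```

Serialized to JSON, this is the structure that `npm run evals:create-dataset` reads and uploads to Phoenix.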

evals/config.ts

Lines changed: 3 additions & 3 deletions
@@ -6,11 +6,11 @@ import { readFileSync } from 'node:fs';
 import { dirname, join } from 'node:path';
 import { fileURLToPath } from 'node:url';

-// Read version from test_cases.json
+// Read version from test-cases.json
 function getTestCasesVersion(): string {
     const currentFilename = fileURLToPath(import.meta.url);
     const currentDirname = dirname(currentFilename);
-    const testCasesPath = join(currentDirname, 'test_cases.json');
+    const testCasesPath = join(currentDirname, 'test-cases.json');
     const testCasesContent = readFileSync(testCasesPath, 'utf-8');
     const testCases = JSON.parse(testCasesContent);
     return testCases.version;

@@ -19,7 +19,7 @@ function getTestCasesVersion(): string {
 // Models to evaluate
 export const MODELS_TO_EVALUATE = [
     'gpt-4o-mini',
-    // 'claude-3-5-haiku-latest',
+    'claude-3-5-haiku-latest',
 ];

 export const PASS_THRESHOLD = 0.8;

evals/create-dataset.ts

Lines changed: 5 additions & 5 deletions
@@ -27,19 +27,19 @@ interface TestCase {
     id: string;
     category: string;
     question: string;
-    expected_tools: string[];
+    expectedTools: string[];
 }

 interface TestData {
     version: string;
-    test_cases: TestCase[];
+    testCases: TestCase[];
 }

 // eslint-disable-next-line consistent-return
 function loadTestCases(): TestData {
     const filename = fileURLToPath(import.meta.url);
     const dirname = pathDirname(filename);
-    const testCasesPath = join(dirname, 'test_cases.json');
+    const testCasesPath = join(dirname, 'test-cases.json');

     try {
         const fileContent = readFileSync(testCasesPath, 'utf-8');

@@ -60,14 +60,14 @@ async function createDatasetFromTestCases(): Promise<void> {

     // Load test cases
     const testData = loadTestCases();
-    const testCases = testData.test_cases;
+    const { testCases } = testData;

     log.info(`Loaded ${testCases.length} test cases`);

     // Convert to format expected by Phoenix
     const examples = testCases.map((testCase) => ({
         input: { question: testCase.question },
-        output: { tool_calls: testCase.expected_tools.join(', ') },
+        output: { tool_calls: testCase.expectedTools.join(', ') },
         metadata: { category: testCase.category },
     }));
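
As a sanity check on the renamed fields, here is a minimal sketch of what the `examples` mapping above produces for a single test case; the values are hypothetical, the field names mirror the diff.

```ts
// One hypothetical TestCase run through the mapping in createDatasetFromTestCases().
const testCase = {
    id: 'search-actors-001', // hypothetical
    category: 'search-actors',
    question: 'What are the best Instagram scrapers',
    expectedTools: ['search-actors', 'fetch-actor-details'],
};

// Shape uploaded to Phoenix: expected tools are flattened into a
// comma-separated string under output.tool_calls.
const example = {
    input: { question: testCase.question },
    output: { tool_calls: testCase.expectedTools.join(', ') }, // 'search-actors, fetch-actor-details'
    metadata: { category: testCase.category },
};
```

The comma-separated `tool_calls` string is what the `tools_match` evaluator in `run-evaluation.ts` parses back into a sorted list.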

evals/run-evaluation.ts

Lines changed: 33 additions & 15 deletions
@@ -26,6 +26,16 @@ dotenv.config({ path: '.env' });

 type ExampleInputOnly = { input: Record<string, unknown>, metadata?: Record<string, unknown>, output?: never };

+// Type for Phoenix evaluation run results
+interface EvaluationRun {
+    name: string;
+    result?: {
+        score?: number;
+        [key: string]: unknown;
+    };
+    [key: string]: unknown;
+}
+
 async function loadTools(): Promise<ToolBase[]> {
     const apifyClient = new ApifyClient({ token: process.env.APIFY_API_TOKEN || '' });
     const urlTools = await processParamsGetTools('', apifyClient);

@@ -55,7 +65,11 @@ function transformToolsToAnthropicFormat(tools: ToolBase[]): Anthropic.Tool[] {
 function createOpenAITask(modelName: string, tools: ToolBase[]) {
     const toolsOpenAI = transformToolsToOpenAIFormat(tools);

-    return async (example: ExampleInputOnly): Promise<{ toolCalls: string[] }> => {
+    return async (example: ExampleInputOnly): Promise<{
+        toolCalls: string[];
+        input: Record<string, unknown>,
+        metadata: Record<string, unknown>,
+    }> => {
         const client = new OpenAI();

         const response = await client.chat.completions.create({

@@ -69,14 +83,16 @@ function createOpenAITask(modelName: string, tools: ToolBase[]) {

         const toolCalls: string[] = [];
         const firstMessage = response.choices?.[0]?.message;
-        const msg = JSON.stringify(JSON.stringify(firstMessage));
-        log.debug(`${example.metadata?.category} - ${example.input?.question} - ${msg}`);
         if (firstMessage?.tool_calls?.length) {
             const toolCall = firstMessage.tool_calls[0];
             const name = toolCall?.function?.name;
             if (name) toolCalls.push(name);
         }
-        return { toolCalls };
+        return {
+            toolCalls,
+            input: example.input,
+            metadata: { content: firstMessage },
+        };
     };
 }

@@ -99,7 +115,6 @@ function createAnthropicTask(modelName: string, tools: ToolBase[]) {
         });

         const toolCalls: string[] = [];
-        log.debug(`${example.input?.question} - ${JSON.stringify(response.content)}`);
         for (const content of response.content) {
             if (content.type === 'tool_use') {
                 const toolUseContent = content as Anthropic.ToolUseBlock;

@@ -119,7 +134,7 @@ const toolsMatch = asEvaluator({
     name: 'tools_match',
     kind: 'CODE',
     evaluate: async ({ output, expected }: {
-        output: { toolCalls?: string[] } | null;
+        output: { toolCalls?: string[], input?: Record<string, unknown>, metadata?: Record<string, unknown> } | null;
         expected?: Record<string, unknown>;
     }) => {
         const toolCalls = String(expected?.tool_calls ?? '');

@@ -128,15 +143,18 @@ const toolsMatch = asEvaluator({
             .map((t) => t.trim())
             .filter(Boolean)
             .sort();
-
+        // console.log(`Output tools: ${JSON.stringify(output?.metadata)} -> ${JSON.stringify(output?.toolCalls)}`);
         const actualArr = Array.isArray(output?.toolCalls) ? output.toolCalls : [];
         const actual = [...actualArr].sort();
         const matches = JSON.stringify(expectedTools) === JSON.stringify(actual);
+        log.debug(
+            `-----------------------\n`
+            + `Query: ${String(output?.input?.question ?? '')}\n`
+            + `LLM response: ${JSON.stringify(output?.metadata?.content ?? '')}\n`
+            + `Match: ${matches}, expected tools: ${JSON.stringify(expectedTools)}, actual tools: ${JSON.stringify(actual)}`,
+        );
         return {
-            label: matches ? 'matches' : 'does not match',
             score: matches ? 1 : 0,
-            explanation: matches ? 'Output tool calls match expected' : 'Mismatch between expected and output tool calls',
-            metadata: {},
         };
     },
 });
@@ -206,14 +224,14 @@ async function main(): Promise<number> {
             evaluators: [toolsMatch],
             experimentName,
             experimentDescription,
-            dryRun: 3,
+            concurrency: 10,
         });

         const runsMap = experiment.runs ?? {};
         const evalRuns = experiment.evaluationRuns ?? [];
         totalCases = Object.keys(runsMap).length;
-        const toolMatchEvals = evalRuns.filter((er: any) => er.name === 'tools_match');
-        correctCases = toolMatchEvals.filter((er: any) => (er.result?.score ?? 0) > 0.5).length;
+        const toolMatchEvals = evalRuns.filter((er: EvaluationRun) => er.name === 'tools_match');
+        correctCases = toolMatchEvals.filter((er: EvaluationRun) => (er.result?.score ?? 0) > 0.5).length;
         accuracy = totalCases > 0 ? correctCases / totalCases : 0;
         experimentId = experiment.id;

@@ -227,7 +245,7 @@ async function main(): Promise<number> {
         results.push({ model: modelName, accuracy, correct: correctCases, total: totalCases, experiment_id: experimentId, error });
     }

-    log.info('\n📊 Results:');
+    log.info('📊 Results:');
     for (const result of results) {
         const { model, accuracy, error } = result;
         if (error) {

@@ -238,7 +256,7 @@ async function main(): Promise<number> {
     }

     const allPassed = results.filter((r) => !r.error).every((r) => r.accuracy >= PASS_THRESHOLD);
-    log.info(`\nPass threshold: ${(PASS_THRESHOLD * 100).toFixed(1)}%`);
+    log.info(`Pass threshold: ${(PASS_THRESHOLD * 100).toFixed(1)}%`);
     if (allPassed) {
         log.info('✅ All models passed the threshold');
     } else {
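
For readers skimming the hunks above, here is a standalone sketch of the order-insensitive comparison that `tools_match` performs. The `split(',')` step is not visible in the diff and is assumed from the `join(', ')` in `create-dataset.ts`; the example calls are hypothetical.

```ts
// Sketch of the tools_match scoring logic, extracted for illustration only.
function toolsMatchScore(expectedToolCalls: string, actualToolCalls: string[]): number {
    // Expected tools arrive as a comma-separated string from the Phoenix dataset.
    const expectedTools = expectedToolCalls
        .split(',')
        .map((t) => t.trim())
        .filter(Boolean)
        .sort();
    // Actual tools are the names collected from the model's tool calls.
    const actual = [...actualToolCalls].sort();
    // Order-insensitive, exact-set comparison.
    return JSON.stringify(expectedTools) === JSON.stringify(actual) ? 1 : 0;
}

toolsMatchScore('search-actors', ['search-actors']); // 1
toolsMatchScore('search-actors, add-actor', ['add-actor', 'search-actors']); // 1
toolsMatchScore('search-actors', []); // 0
```

A run counts as correct in `main()` when this score exceeds 0.5, and a model passes when its accuracy over all runs reaches `PASS_THRESHOLD` (0.8).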
