Commit 1eaec5b
feat: Add tool context to judge prompt and improve evaluation accuracy (#334)
* feat(evals): Dataset v1.4 - Add tool context to judge prompt and improve evaluation accuracy
This commit implements comprehensive improvements to the MCP tool selection evaluation system (v1.4),
focusing on adding complete tool descriptions to the judge prompt, clarifying tool intent, implementing
bidirectional tool equivalence, and fixing test case quality issues.
Comparing baseline (v1.4 experiments #1-#4) with current (v1.4 experiments #5-#8); deltas are in percentage points:
**Exact-match evaluator:**
- GPT-4o Mini: 99% → **97%** (-2%) - Minor regression
- Claude Haiku 4.5: 95% → **99%** (+4%)
- Gemini 2.5 Flash: 91% → **96%** (+5%)
- GPT-5: 91% → **99%** (+8%)
**LLM-judge evaluator:**
- GPT-4o Mini: 93% → **97%** (+4%)
- Claude Haiku 4.5: 91% → **95%** (+4%)
- Gemini 2.5 Flash: 89% → **97%** (+8%)
- GPT-5: 88% → **99%** (+11%) ← Largest improvement
**All models now significantly exceed the 70% threshold with more consistent performance.**
**Key Insight:** Adding complete tool descriptions to the judge prompt eliminated false negatives
and improved judge accuracy significantly, especially for GPT-5 (+11%) and Gemini (+8%).
**Before:** Judge prompt had NO tool descriptions at all. The judge was evaluating tool selections
without understanding what each tool does, leading to arbitrary penalization.
**After:** Added comprehensive "Important Tool Context" section with descriptions for ALL tools:
**Tool descriptions added:**
- **search-actors:** Searches Apify Store to find scraping tools/Actors (NOT celebrity actors). Emphasizes informational intent.
- **apify-slash-rag-web-browser:** Browses web to get data immediately (one-time data retrieval). Emphasizes time indicators.
- **call-actor:** Mandatory two-step workflow (step="info" then step="call"). Explains info step is CORRECT and required.
- **fetch-actor-details:** Gets Actor documentation without running it. Notes overlap with call-actor step="info".
- **search-apify-docs:** Searches Apify documentation for platform/feature info.
- **get-actor-output:** Retrieves output data from completed Actor runs using datasetId.
- **fetch-apify-docs:** Fetches full content of specific Apify docs page by URL.
**Keyword Length Guidelines** section added to prevent judge from penalizing thoughtful keyword additions.
**Impact:** The judge now understands what each tool is for and evaluates tool selections on that basis
instead of penalizing them arbitrarily. This was the PRIMARY cause of the LLM-judge improvements (+4% to +11%).
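To illustrate the idea, here is a minimal sketch of how a tool context section could be assembled and appended to a judge prompt. The `TOOL_CONTEXT` map and `buildToolContextSection()` are hypothetical names, and the descriptions are abridged from the list above; the actual prompt text in evals/config.ts may be structured differently.

```typescript
// Hypothetical map of tool name -> short description (abridged from the commit message).
const TOOL_CONTEXT: Record<string, string> = {
    'search-actors': 'Searches Apify Store to find scraping tools/Actors (NOT celebrity actors). Informational intent.',
    'apify-slash-rag-web-browser': 'Browses the web to get data immediately (one-time data retrieval).',
    'call-actor': 'Mandatory two-step workflow (step="info" then step="call"). The info step is correct and required.',
    'fetch-actor-details': 'Gets Actor documentation without running it. Overlaps with call-actor step="info".',
    'search-apify-docs': 'Searches Apify documentation for platform/feature info.',
    'get-actor-output': 'Retrieves output data from completed Actor runs using datasetId.',
    'fetch-apify-docs': 'Fetches the full content of a specific Apify docs page by URL.',
};

// Renders the map as a titled bullet list that can be appended to the judge prompt.
function buildToolContextSection(context: Record<string, string>): string {
    const lines = Object.keys(context).map((name) => `- ${name}: ${context[name]}`);
    return ['Important tool context', ...lines].join('\n');
}

const toolContextSection = buildToolContextSection(TOOL_CONTEXT);
console.log(toolContextSection);
```

Keeping the descriptions in one map means the judge sees the same wording for every test case, which is the property the change relies on.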
**Before:** No tool normalization existed - direct string comparison only.
**After:** Bidirectional normalization treats `call-actor(step="info")` and `fetch-actor-details` as equivalent.
**Why:** The `call-actor` tool has a mandatory two-step workflow:
- Step 1: `call-actor(step="info")` → Get Actor details
- Step 2: `call-actor(step="call")` → Execute Actor
Since step 1 is functionally identical to `fetch-actor-details`, both should be accepted as correct.
**Implementation:**
- Added `normalizeToolName()` - normalizes expected tools
- Added `normalizeToolCall()` - normalizes actual tool calls, checking step parameter
- Both functions map `call-actor` and `fetch-actor-details` → `fetch-actor-details` for comparison
**Impact:** Eliminates false negatives when models correctly use either equivalent tool.
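The normalization described above can be sketched roughly as follows. Only the function names `normalizeToolName()` and `normalizeToolCall()` come from this commit; the `ToolCall` shape, the `toolsMatch()` helper, and the choice to leave `call-actor(step="call")` unnormalized are assumptions made for illustration, and the real code in evals/run-evaluation.ts may differ.

```typescript
// Hypothetical shape of a recorded tool call.
interface ToolCall {
    name: string;
    arguments?: Record<string, unknown>;
}

const CALL_ACTOR = 'call-actor';
const FETCH_ACTOR_DETAILS = 'fetch-actor-details';

// Expected tools carry no arguments, so call-actor always maps to the canonical name.
function normalizeToolName(toolName: string): string {
    return toolName === CALL_ACTOR ? FETCH_ACTOR_DETAILS : toolName;
}

// Actual calls: call-actor with step="info" is treated as fetch-actor-details.
// Assumption: any other step is left as-is, because it skips the mandatory info step.
function normalizeToolCall(call: ToolCall): string {
    if (call.name === CALL_ACTOR && call.arguments?.step === 'info') {
        return FETCH_ACTOR_DETAILS;
    }
    return call.name;
}

// Hypothetical helper: compares the de-duplicated, sorted sets of normalized names.
function toolsMatch(expectedTools: string[], actualCalls: ToolCall[]): boolean {
    const unique = (names: string[]): string[] => names.filter((n, i) => names.indexOf(n) === i).sort();
    const expected = unique(expectedTools.map(normalizeToolName));
    const actual = unique(actualCalls.map(normalizeToolCall));
    return expected.length === actual.length && expected.every((n, i) => n === actual[i]);
}

// Both directions are accepted:
console.log(toolsMatch(['fetch-actor-details'], [{ name: 'call-actor', arguments: { step: 'info' } }])); // true
console.log(toolsMatch(['call-actor'], [{ name: 'fetch-actor-details' }])); // true
```

Mapping both sides onto a single canonical name keeps the comparison itself a plain equality check, so no special cases leak into the scoring logic.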
**Problem:** Models were unsure when to use `search-actors` (finding tools) and when to use `apify-slash-rag-web-browser` (getting data).
**Root Cause:**
- `search-actors` incorrectly said "Use this tool whenever user needs to scrape data" → Made it sound like it retrieves data
- `RAG_WEB_BROWSER_ADDITIONAL_DESC` said "for specific sites it is always better to search for a specific Actor" → Discouraged using rag for specific sites
**Solution - search-actors (informational intent):**
- Emphasizes: "FIND and DISCOVER what scraping tools/Actors exist"
- Makes clear: "This tool provides INFORMATION about available Actors - it does NOT retrieve actual data"
- Examples: "What tools can scrape Instagram?", "Find an Actor for Amazon products"
- Guidance: "Do NOT use when user wants immediate data retrieval - use apify-slash-rag-web-browser instead"
**Solution - rag-web-browser (data retrieval intent):**
- Emphasizes: "GET or RETRIEVE actual data immediately (one-time data retrieval)"
- Makes clear: "This tool directly fetches and returns data - it does NOT just find tools"
- Examples: "Get flight prices for tomorrow", "What's the weather today?"
- Time indicators: "today", "current", "latest", "recent", "now"
**Impact:** Models now clearly distinguish informational intent from data retrieval intent.
**Changes:**
- Fixed contradictory test cases (search-actors-1, search-actors-15)
- Removed misleading-query-2 (contradictory intent)
- Disambiguated intent-ambiguous queries by adding time indicators ("recent", "current") or "Actor" mentions
- Split search-vs-rag-7 into two clear variants (7a for immediate data, 7b for tool search)
- Updated fetch-actor-details-7 to accept both `fetch-actor-details` and `call-actor`
- Made vague queries more specific (added context to ambiguous-query-3, ambiguous-query-1)
**Example fix - search-actors-1:**
```
Before: Query "How to scrape Instagram posts" with expectedTools=[]
Reference: "Either explain OR call search-actors" ← Contradictory
After: Query "What Actors can scrape Instagram posts?"
expectedTools=["search-actors"] ← Clear intent
```
**Impact:** Test expectations are now more consistent and align with model behavior.
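A contradiction like the one in search-actors-1 could, in principle, be caught with a small heuristic check. The sketch below is not part of this commit: the `TestCase` field names other than `expectedTools`, the `KNOWN_TOOLS` list, and `findUnexpectedToolMentions()` are all hypothetical, and a plain substring match will miss many cases or flag intentional mentions.

```typescript
// Hypothetical test case shape; only expectedTools is named in the commit message.
interface TestCase {
    id: string;
    query: string;
    expectedTools: string[];
    reference?: string;
}

// Tool names taken from the tool descriptions listed in this commit.
const KNOWN_TOOLS = [
    'search-actors',
    'apify-slash-rag-web-browser',
    'call-actor',
    'fetch-actor-details',
    'search-apify-docs',
    'get-actor-output',
    'fetch-apify-docs',
];

// Returns tools that the reference text mentions but expectedTools does not list.
function findUnexpectedToolMentions(testCase: TestCase): string[] {
    const reference = testCase.reference ?? '';
    return KNOWN_TOOLS.filter(
        (tool) => reference.indexOf(tool) !== -1 && testCase.expectedTools.indexOf(tool) === -1,
    );
}

const beforeFix: TestCase = {
    id: 'search-actors-1',
    query: 'How to scrape Instagram posts',
    expectedTools: [],
    reference: 'Either explain OR call search-actors',
};
const afterFix: TestCase = {
    id: 'search-actors-1',
    query: 'What Actors can scrape Instagram posts?',
    expectedTools: ['search-actors'],
};

console.log(findUnexpectedToolMentions(beforeFix)); // [ 'search-actors' ]
console.log(findUnexpectedToolMentions(afterFix)); // []
```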
Added comprehensive v1.4 changelog documenting all improvements for future reference.
- evals/config.ts - **Added complete tool context section to judge prompt (PRIMARY CHANGE)**
- evals/run-evaluation.ts - Implemented bidirectional tool equivalence normalization
- evals/test-cases.json - Dataset v1.4 with 74 test cases (fixed contradictions, disambiguated queries)
- evals/README.md - Documented v1.4 changes
- src/tools/store_collection.ts - Clarified search-actors as informational intent
- src/const.ts - Clarified rag-web-browser as data retrieval intent
All evaluations significantly exceed the 70% threshold (Phoenix v1.4 experiments #5-#8):
- ✓ Claude Haiku 4.5: 99% exact-match, 95% judge
- ✓ Gemini 2.5 Flash: 96% exact-match, 97% judge
- ✓ GPT-4o Mini: 97% exact-match, 97% judge
- ✓ GPT-5: 99% exact-match, 99% judge
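As a small worked check of the numbers above, the following sketch lists any model that does not clear the threshold on both evaluators. The `ModelScores` type and `failingModels()` are hypothetical, and treating "exceed" as strictly greater than 70 is an assumption; the scores are copied from the list above.

```typescript
// Hypothetical result shape; scores are percentages.
interface ModelScores {
    model: string;
    exactMatch: number;
    judge: number;
}

const PASS_THRESHOLD = 70;

// Scores from v1.4 experiments #5-#8 as reported in this commit.
const RESULTS: ModelScores[] = [
    { model: 'Claude Haiku 4.5', exactMatch: 99, judge: 95 },
    { model: 'Gemini 2.5 Flash', exactMatch: 96, judge: 97 },
    { model: 'GPT-4o Mini', exactMatch: 97, judge: 97 },
    { model: 'GPT-5', exactMatch: 99, judge: 99 },
];

// A model fails if either evaluator score does not strictly exceed the threshold.
function failingModels(scores: ModelScores[], threshold: number): string[] {
    return scores
        .filter((s) => s.exactMatch <= threshold || s.judge <= threshold)
        .map((s) => s.model);
}

console.log(failingModels(RESULTS, PASS_THRESHOLD)); // []
```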
* Address PR review comments: clean up references and fix capitalization
- Fix capitalization: "Important Tool Context" -> "Important tool context"
- Remove change explanation notes from reference fields
- Remove references that only contained PR change notes without judge instructions
1 parent d9bd92e, commit 1eaec5b
6 files changed, +163 -45 lines changed