feat(evals): Dataset v1.4 - Add tool context to judge prompt and improve evaluation accuracy #334
Conversation
jirispilka left a comment:
I really like it; I believe this can improve the overlap between search-actors vs rag-web-browser.
There was one issue with the reference. The reference serves as a hint for the LLM judge on how a tool should be called.
evals/config.ts (outdated)

| Tool calls are generated by a separate agent and chosen from a provided list of tools.
| You must judge whether this agent made the correct selection.
| ## Important Tool Context
Suggested change:
- ## Important Tool Context
+ ## Important tool context
| ## Important Tool Context
| **search-actors**: Searches the Apify Store to find scraping tools/Actors (NOT celebrity actors). This finds pre-built scraping solutions.
Do we really need to have the tool list here?
Maybe it is a good idea to have a "static" description here, but we will need to maintain it.
Earlier I added the tool descriptions dynamically, but it was really fragile: changing a tool description changed the evaluation, and I was running in circles.
Agreed, static descriptions require maintenance. I considered dynamic injection but went with static for the same reason you mentioned: evaluation stability.
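To illustrate the static approach being discussed, here is a minimal sketch. The names `JUDGE_TOOL_CONTEXT` and `buildJudgePrompt` are hypothetical, not the actual exports of evals/config.ts; only the prompt wording is taken from the diff above.

```typescript
// Hypothetical sketch: keep the judge's tool context as a hand-maintained
// constant so that editing the live tool descriptions cannot silently change
// evaluation results. Names are illustrative, not the real evals/config.ts API.

const JUDGE_TOOL_CONTEXT = `## Important tool context
**search-actors**: Searches the Apify Store to find scraping tools/Actors (NOT celebrity actors).
**apify-slash-rag-web-browser**: Browses the web to GET or RETRIEVE actual data immediately (one-time data retrieval).`;

function buildJudgePrompt(userQuery: string): string {
    return [
        'Tool calls are generated by a separate agent and chosen from a provided list of tools.',
        'You must judge whether this agent made the correct selection.',
        JUDGE_TOOL_CONTEXT,
        `User query: ${userQuery}`,
    ].join('\n');
}
```

The trade-off, as noted in this thread, is that the constant must be maintained by hand, but the judge prompt stays stable when live tool descriptions change.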
Ok, thanks!
| expectedTools = [...expectedTools].sort();
| // Normalize tool names: treat call-actor with step="info" as equivalent to fetch-actor-details
| const normalizeToolName = (toolName: string): string => {
I believe this is not the best thing to do in the long term. We should revisit the tools and have only one to fetch the schema. But for now it is OK!
Agreed, having two tools (fetch-actor-details and call-actor step="info") that do the same thing isn't ideal. The normalization is a workaround to avoid penalizing models that correctly use either.
Happy to create an issue to track consolidating these tools in the future. For now, I propose keeping the normalization so evaluations stay fair without blocking this PR.
We added call-actor(step=info) because sometimes the LLMs did not call fetch-actor-details even when explicitly prompted in the tool descriptions. The only way out was to force the two-step approach, which resulted in the LLMs actually calling the info step, getting the Actor input schema, and using the correct fields instead of hallucinating them. It would be best to have only one tool, but we need to make sure it actually works.
Thanks for the context! That makes sense. The normalization ensures we don't penalize models for using either approach during the evaluation, which aligns with this design.
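As a sketch of the equivalence handling under discussion, assuming a minimal tool-call shape (the real evals/run-evaluation.ts may differ in details), both `call-actor(step="info")` and `fetch-actor-details` normalize to the same canonical name so either choice is scored as correct:

```typescript
// Minimal sketch of the bidirectional normalization; the ToolCall shape
// is assumed for illustration, not taken from the actual codebase.
interface ToolCall {
    name: string;
    arguments?: Record<string, unknown>;
}

// Normalize an expected tool name: call-actor is treated as equivalent
// to fetch-actor-details for comparison purposes.
function normalizeToolName(toolName: string): string {
    return toolName === 'call-actor' ? 'fetch-actor-details' : toolName;
}

// Normalize an actual tool call, checking the step parameter:
// only call-actor with step="info" is equivalent to fetch-actor-details;
// call-actor with step="call" (executing the Actor) is kept as-is.
function normalizeToolCall(call: ToolCall): string {
    if (call.name === 'call-actor' && call.arguments?.step === 'info') {
        return 'fetch-actor-details';
    }
    return call.name;
}
```

With both sides normalized (and sorted, as in the diff above), the exact-match comparison no longer penalizes a model for picking either of the two equivalent tools.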
| "id": "search-actors-6",
| "category": "search-actors",
| "query": "Get Facebook data",
| "query": "Find an Actor to get Facebook data",
I had it intentionally unclear... I guess it is fine this way if it was causing issues.
evals/test-cases.json (outdated)

| "expectedTools": [],
| "reference": "It should not call any tools, because the query is too general. It should suggest to be more specific about the platform or data type needed."
| "expectedTools": ["search-actors"],
| "reference": "While query is general, it explicitly asks about 'actors', so search-actors is appropriate. Changed from [] to allow tool call."
Suggested change:
- "reference": "While query is general, it explicitly asks about 'actors', so search-actors is appropriate. Changed from [] to allow tool call."
+ "reference": "While query is general, it explicitly asks about 'actors', so search-actors is appropriate"
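For readers skimming these diffs, here is a hypothetical sketch of the test-case shape implied by the fields visible above (id, category, query, expectedTools, reference); the actual evals/test-cases.json schema may contain additional fields.

```typescript
// Hypothetical test-case shape inferred from the diff snippets in this PR;
// not the authoritative schema of evals/test-cases.json.
interface TestCase {
    id: string;              // e.g. "search-actors-6"
    category: string;        // grouping, e.g. "search-actors"
    query: string;           // user query given to the agent
    expectedTools: string[]; // tool names expected to be called ([] = none)
    reference: string;       // hint for the LLM judge on how a tool should be called
}

const example: TestCase = {
    id: 'search-actors-6',
    category: 'search-actors',
    query: 'Find an Actor to get Facebook data',
    expectedTools: ['search-actors'],
    reference: "The query explicitly asks for an Actor, so search-actors is appropriate.",
};
```

Note that, per the review comments, the `reference` field should hold judge instructions only, never change notes about the PR itself.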
evals/test-cases.json (outdated)

| "expectedTools": ["search-apify-docs"]
| "query": "Show me Apify Actor documentation",
| "expectedTools": ["search-apify-docs"],
| "reference": "Made query less vague by adding 'Apify Actor' context."
Suggested change (remove this line):
- "reference": "Made query less vague by adding 'Apify Actor' context."
evals/test-cases.json (outdated)

| "expectedTools": ["apify-slash-rag-web-browser"]
| "query": "Find recent AI articles on tech blogs",
| "expectedTools": ["apify-slash-rag-web-browser"],
| "reference": "Added 'recent' to signal immediate data need."
Suggested change (remove this line):
- "reference": "Added 'recent' to signal immediate data need."
| - User needs to fetch specific content now (e.g., "Fetch news articles from CNN", "Get product info from Amazon")
| - User has time indicators like "today", "current", "latest", "recent", "now"
| This is for general web scraping and immediate data needs. For repeated/scheduled scraping of specific platforms (e-commerce, social media), consider suggesting a specialized Actor from the Store for better performance and reliability.`;
Nice!
What about linking the search-actors tool here directly?
| - It is better to omit such generic terms entirely from the search query and decide later based on the search results.
| - If a user asks about "fetching Instagram posts", use "Instagram posts" as keywords.
| - The goal is to find Actors that specifically handle the platform and data type the user mentioned.
| - Use 1-3 simple keyword terms maximum (e.g., "Instagram posts", "Twitter", "Amazon products")
Love this! I especially like your insight about omitting generic terms.
| name: HelperTools.STORE_SEARCH,
| description: `
| Search the Apify Store for Actors using keyword-based queries.
| Search the Apify Store to FIND and DISCOVER what scraping tools/Actors exist for specific platforms or use cases.
Cool!
feat(evals): Dataset v1.4 - Add tool context to judge prompt and improve evaluation accuracy

This commit implements comprehensive improvements to the MCP tool selection evaluation system (v1.4), focusing on adding complete tool descriptions to the judge prompt, clarifying tool intent, implementing bidirectional tool equivalence, and fixing test case quality issues.

**Performance improvements**

Comparing baseline (v1.4 experiments apify#1-apify#4) vs current (v1.4 experiments apify#5-apify#8):

Exact-match evaluator (baseline → current):
- GPT-4o Mini: 99% → **97%** (-2%) - minor regression
- Claude Haiku 4.5: 95% → **99%** (+4%)
- Gemini 2.5 Flash: 91% → **96%** (+5%)
- GPT-5: 91% → **99%** (+8%)

LLM-judge evaluator (baseline → current):
- GPT-4o Mini: 93% → **97%** (+4%)
- Claude Haiku 4.5: 91% → **95%** (+4%)
- Gemini 2.5 Flash: 89% → **97%** (+8%)
- GPT-5: 88% → **99%** (+11%) ← largest improvement

**All models now significantly exceed the 70% threshold with more consistent performance.**

**Key insight:** Adding complete tool descriptions to the judge prompt eliminated false negatives and improved judge accuracy significantly, especially for GPT-5 (+11%) and Gemini (+8%).

**Technical changes**

**1. Added complete tool context to judge prompt (evals/config.ts:84-115)**

**Before:** The judge prompt had NO tool descriptions at all. The judge was evaluating tool selections without understanding what each tool does, leading to arbitrary penalization.

**After:** Added a comprehensive "Important Tool Context" section with descriptions for ALL tools:

- **search-actors:** Searches the Apify Store to find scraping tools/Actors (NOT celebrity actors). Emphasizes informational intent.
- **apify-slash-rag-web-browser:** Browses the web to get data immediately (one-time data retrieval). Emphasizes time indicators.
- **call-actor:** Mandatory two-step workflow (step="info" then step="call"). Explains that the info step is CORRECT and required.
- **fetch-actor-details:** Gets Actor documentation without running it. Notes overlap with call-actor step="info".
- **search-apify-docs:** Searches Apify documentation for platform/feature info.
- **get-actor-output:** Retrieves output data from completed Actor runs using datasetId.
- **fetch-apify-docs:** Fetches the full content of a specific Apify docs page by URL.

A **Keyword Length Guidelines** section was added to prevent the judge from penalizing thoughtful keyword additions.

**Impact:** The judge now understands tool purposes and correctly evaluates tool selections instead of penalizing them arbitrarily. This was the PRIMARY cause of the LLM-judge improvements (+4% to +11%).

**2. Implemented bidirectional tool equivalence (evals/run-evaluation.ts:102-144)**

**Before:** No tool normalization existed - direct string comparison only.

**After:** Bidirectional normalization treats `call-actor(step="info")` and `fetch-actor-details` as equivalent.

**Why:** The `call-actor` tool has a mandatory two-step workflow:
- Step 1: `call-actor(step="info")` → get Actor details
- Step 2: `call-actor(step="call")` → execute the Actor

Since step 1 is functionally identical to `fetch-actor-details`, both should be accepted as correct.

**Implementation:**
- Added `normalizeToolName()` - normalizes expected tools
- Added `normalizeToolCall()` - normalizes actual tool calls, checking the step parameter
- Both functions map `call-actor` and `fetch-actor-details` → `fetch-actor-details` for comparison

**Impact:** Eliminates false negatives when models correctly use either equivalent tool.

**3. Clarified information vs data retrieval intent (src/tools/store_collection.ts:90-126, src/const.ts:51-59)**

**Problem:** Models were confused about when to use `search-actors` (finding tools) vs `apify-slash-rag-web-browser` (getting data).

**Root cause:**
- `search-actors` incorrectly said "Use this tool whenever user needs to scrape data" → made it sound like it retrieves data
- `RAG_WEB_BROWSER_ADDITIONAL_DESC` said "for specific sites it is always better to search for a specific Actor" → discouraged using rag for specific sites

**Solution - search-actors (informational intent):**
- Emphasizes: "FIND and DISCOVER what scraping tools/Actors exist"
- Makes clear: "This tool provides INFORMATION about available Actors - it does NOT retrieve actual data"
- Examples: "What tools can scrape Instagram?", "Find an Actor for Amazon products"
- Guidance: "Do NOT use when user wants immediate data retrieval - use apify-slash-rag-web-browser instead"

**Solution - rag-web-browser (data retrieval intent):**
- Emphasizes: "GET or RETRIEVE actual data immediately (one-time data retrieval)"
- Makes clear: "This tool directly fetches and returns data - it does NOT just find tools"
- Examples: "Get flight prices for tomorrow", "What's the weather today?"
- Time indicators: "today", "current", "latest", "recent", "now"

**Impact:** Models now clearly distinguish informational intent from data retrieval intent.

**4. Fixed test case quality issues (evals/test-cases.json)**

**Changes:**
- Fixed contradictory test cases (search-actors-1, search-actors-15)
- Removed misleading-query-2 (contradictory intent)
- Disambiguated intent-ambiguous queries by adding time indicators ("recent", "current") or "Actor" mentions
- Split search-vs-rag-7 into two clear variants (7a for immediate data, 7b for tool search)
- Updated fetch-actor-details-7 to accept both `fetch-actor-details` and `call-actor`
- Made vague queries more specific (added context to ambiguous-query-3, ambiguous-query-1)

**Example fix - search-actors-1:**
```
Before: Query "How to scrape Instagram posts" with expectedTools=[]
        Reference: "Either explain OR call search-actors" ← Contradictory
After:  Query "What Actors can scrape Instagram posts?"
        expectedTools=["search-actors"] ← Clear intent
```

**Impact:** More consistent test expectations that align with model behavior.

Added a comprehensive v1.4 changelog documenting all improvements for future reference.

**Files changed:**
- evals/config.ts - **Added complete tool context section to judge prompt (PRIMARY CHANGE)**
- evals/run-evaluation.ts - Implemented bidirectional tool equivalence normalization
- evals/test-cases.json - Dataset v1.4 with 74 test cases (fixed contradictions, disambiguated queries)
- evals/README.md - Documented v1.4 changes
- src/tools/store_collection.ts - Clarified search-actors as informational intent
- src/const.ts - Clarified rag-web-browser as data retrieval intent

All evaluations significantly exceed the 70% threshold (Phoenix v1.4 experiments apify#5-apify#8):
- ✓ Claude Haiku 4.5: 99% exact-match, 95% judge
- ✓ Gemini 2.5 Flash: 96% exact-match, 97% judge
- ✓ GPT-4o Mini: 97% exact-match, 97% judge
- ✓ GPT-5: 99% exact-match, 99% judge
- Fix capitalization: "Important Tool Context" -> "Important tool context"
- Remove change explanation notes from reference fields
- Remove references that only contained PR change notes without judge instructions