feat: use Pydantic models for MCP submit_evaluation tool parameters #102
tarilabs merged 3 commits into eval-hub:main
Conversation
Replace parameter types with typed Pydantic models, so that FastMCP generates a congruent JSON Schema for AI agents. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: tarilabs <matteo.mortari@gmail.com>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/test_mcp_server.py (1)
Lines 280-365: ⚠️ Potential issue | 🟠 Major — Add schema/wire-path assertions for MCP tool invocation.
Current tests invoke `submit_evaluation` directly as a Python function, bypassing FastMCP's auto-generated `inputSchema` and JSON argument deserialization. Add at least one test that verifies the tool's schema and/or invokes it through MCP's tool-call mechanism (e.g., via `mcp.call_tool()` or similar) with JSON-like arguments to validate that the typed fields (`model`, `benchmarks`, `collection`, `experiment`) deserialize correctly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_mcp_server.py` around lines 280 - 365, Add a test that exercises the MCP tool path (not only the Python function) by calling the auto-generated tool schema for submit_evaluation so JSON-like args are deserialized; use the tool invocation API (e.g., mcp.call_tool or the test harness's equivalent) to invoke the submit_evaluation tool with a JSON dict containing fields like model (as object with url/name and optional auth.secret_ref), benchmarks (list of {id,provider_id}), collection (object with id), and experiment (object with name) and assert the resulting request sent to mock_client.jobs.submit has the expected typed fields (reference symbols: submit_evaluation, inputSchema, ModelConfig, BenchmarkConfig, CollectionRef, ExperimentConfig, mock_client.jobs.submit); include at least one test case verifying that benchmarks deserialize into request.benchmarks and one verifying collection deserializes into request.collection, and keep existing ValueError behavior tests unchanged.
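As a minimal stand-in for the full MCP tool-call test the reviewer asks for, the deserialization half of the wire path can be exercised by feeding a JSON payload through the Pydantic models. The model shapes below (`ModelConfig`, `BenchmarkConfig`, `AuthConfig` and their fields) are hypothetical, inferred from the symbols named in the prompt; the real definitions live in `src/evalhub/mcp/server.py` and may differ. A test through `mcp.call_tool()` would additionally cover FastMCP's own schema validation.

```python
import json
from typing import Optional
from pydantic import BaseModel

# Hypothetical request models mirroring the symbols named in the review
# prompt; the real definitions live in src/evalhub/mcp/server.py.
class AuthConfig(BaseModel):
    secret_ref: str

class ModelConfig(BaseModel):
    url: str
    name: str
    auth: Optional[AuthConfig] = None  # optional, per the review's field list

class BenchmarkConfig(BaseModel):
    id: str
    provider_id: str

# An MCP client sends tool arguments as JSON; validating that JSON against
# the models is the deserialization step the suggested test should cover.
payload = json.loads(
    '{"model": {"url": "http://example.invalid", "name": "m"},'
    ' "benchmarks": [{"id": "b1", "provider_id": "p1"}]}'
)
model = ModelConfig.model_validate(payload["model"])
benchmarks = [BenchmarkConfig.model_validate(b) for b in payload["benchmarks"]]
print(model.name, benchmarks[0].provider_id)  # m p1
```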
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/evalhub/mcp/server.py`:
- Around line 253-255: The exclusivity check uses truthiness so benchmarks=[] is
treated as absent; change the check to detect presence vs None (use "benchmarks
is not None" instead of "bool(benchmarks)") wherever the request
validation/exclusivity is enforced (referencing the benchmarks and collection
variables in the request handling function in src/evalhub/mcp/server.py) and
ensure the validation logic rejects requests that provide both benchmarks
(including an empty list) and collection; also update any accompanying error
message to reflect that an explicitly provided empty benchmarks list still
counts as supplying benchmarks.
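The suggested fix boils down to testing presence (`is not None`) rather than truthiness, since `bool([]) is False` would let an explicit empty list bypass the exclusivity check. A minimal sketch (the function name and signature here are illustrative, not the actual code in `src/evalhub/mcp/server.py`):

```python
from typing import Optional

def validate_exclusivity(benchmarks: Optional[list], collection: Optional[dict]) -> None:
    # Presence check: `is not None` instead of truthiness, so an explicitly
    # supplied empty list still counts as "benchmarks were provided".
    if benchmarks is not None and collection is not None:
        raise ValueError(
            "Provide either 'benchmarks' or 'collection', not both "
            "(an empty benchmarks list still counts as benchmarks)."
        )

# With truthiness (`if benchmarks and collection:`) this call would slip
# through; with the presence check it is rejected as intended.
try:
    validate_exclusivity([], {"id": "my-collection"})
    outcome = "accepted"
except ValueError:
    outcome = "rejected"
print(outcome)  # rejected
```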
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6be1d37e-6161-43f6-936d-0e906026375b
📒 Files selected for processing (2)
src/evalhub/mcp/server.py
tests/unit/test_mcp_server.py
What and why
Replace parameter types with typed Pydantic models, so that FastMCP generates a congruent JSON Schema for AI agents.
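The point of the change can be sketched as follows: once tool parameters are Pydantic models instead of untyped dicts, the JSON Schema derived from the annotations describes nested objects with named, required fields, which is what an AI agent needs to call the tool correctly. The model shapes below are hypothetical illustrations of the symbols mentioned in the review (the real ones are in `src/evalhub/mcp/server.py`):

```python
from typing import Optional
from pydantic import BaseModel

# Hypothetical parameter models; field names follow the review comments.
class AuthConfig(BaseModel):
    secret_ref: str

class ModelConfig(BaseModel):
    url: str
    name: str
    auth: Optional[AuthConfig] = None

class BenchmarkConfig(BaseModel):
    id: str
    provider_id: str

class CollectionRef(BaseModel):
    id: str

class ExperimentConfig(BaseModel):
    name: str

# Pydantic can emit the JSON Schema directly, so a framework introspecting
# these annotations advertises a precise, congruent schema to agents:
print(BenchmarkConfig.model_json_schema()["required"])  # ['id', 'provider_id']
```

With plain `dict` parameters, the generated schema would only say "object" with no field information, leaving agents to guess the payload shape.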
Type
Testing
Breaking changes
none (this capability is unreleased and still being iterated on)