The first production-ready MCP server for comprehensive AI agent evaluation with real-time observability

Evaluate single agents and multi-agent workflows with specialized metrics, deployed on the AWS AgentCore serverless runtime.

Quick Start • Documentation • Features • Architecture • Contributing

AI agents are moving to production, but evaluation is still stuck in notebooks. This project closes that gap:
- Production-Ready: Deploy evaluation as a service, not a script
- Multi-Agent Aware: Specialized metrics for Agent-to-Agent workflows
- Real-Time Observability: CloudWatch + X-Ray + EMF metrics out of the box
- Developer Experience: Evaluate directly from Kiro IDE via the MCP protocol
- Serverless Scale: AWS AgentCore handles scaling automatically
Single Agent Metrics:
- Accuracy - How correct are responses?
- Relevance - How well does it address the query?
- Helpfulness - How useful is the response?
- Coherence - How well-structured is the output?
- Factuality - How factually accurate is the information?
Multi-Agent Metrics:
- Conversation Quality - Natural inter-agent communication
- Agent Coordination - Task handoff effectiveness
- Workflow Efficiency - Multi-agent system optimization
- A2A Protocol Compliance - Agent-to-Agent standard adherence
- Containerized Deployment on AWS AgentCore
- Real-time Observability with CloudWatch, X-Ray, EMF
- Auto-scaling from 1 to 5 replicas based on load
- Fast Performance - <2s cold start, ~100ms evaluation
- Secure - IAM roles, VPC isolation, encrypted data
- Kiro IDE Integration - Evaluate without leaving your editor
- Natural Language - "Evaluate my customer service agent with accuracy metrics"
- Batch Processing - Test multiple agents simultaneously
- Historical Tracking - Opik platform integration for trends
- AWS CLI configured with appropriate permissions
- Python 3.11+ with UV package manager
- AgentCore CLI installed
- Opik API key
```shell
# Clone and set up
git clone https://github.com/anespo/opik-mcp-server.git
cd opik-mcp-server

# Configure environment
cp .env.example .env
# Edit .env with your credentials

# Deploy to AgentCore (takes ~2 minutes)
./scripts/deploy.sh deploy

# Test the deployment
./scripts/deploy.sh test
```

```shell
# Copy the MCP configuration
cp kiro-integration/mcp.json ~/.kiro/settings/

# Restart Kiro IDE
```

Now you can evaluate agents directly from Kiro:
Evaluate my customer service agent with accuracy and relevance metrics using these test cases:
Test Case 1:
- Input: "I need help with my order"
- Expected: "I'll help you with your order inquiry"
- Context: Customer service scenario
- Developer Layer - Kiro IDE with MCP client integration
- AWS AgentCore - Serverless container runtime
- Opik MCP Server - FastMCP-based evaluation engine
- Opik Platform - Cloud evaluation tracking and analytics
- AWS Observability - CloudWatch, X-Ray, EMF metrics
- Agent Systems - Your AI agents under evaluation
```mermaid
graph LR
    A[Kiro IDE] --> B[MCP Client]
    B --> C[AgentCore Runtime]
    C --> D[Evaluation Engine]
    D --> E[Opik Platform]
    D --> F[AWS Observability]
    G[AI Agents] --> D
```
Production Metrics:
- Cold Start: < 2 seconds on AgentCore
- Evaluation Speed: ~100ms per single agent test case
- Multi-Agent Evaluation: ~200ms per workflow
- Concurrent Processing: Up to 10 parallel evaluations
- Auto-scaling: 1-5 replicas based on load
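The 10-evaluation concurrency cap can be respected on the client side with a simple semaphore-gated batch. This is a hedged sketch: `evaluate_agent` below is a stand-in stub for the real MCP tool call, not the server's actual client API.

```python
import asyncio

# Stand-in for the real MCP evaluation call; replace with the actual client.
async def evaluate_agent(test_case):
    await asyncio.sleep(0)  # placeholder for the network round trip
    return {"case": test_case, "score": 1.0}

async def evaluate_batch(test_cases, limit=10):
    """Run evaluations concurrently, never exceeding `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run_one(case):
        async with sem:
            return await evaluate_agent(case)

    return await asyncio.gather(*(run_one(c) for c in test_cases))
```

With the documented ~100ms per test case, 25 cases at a concurrency of 10 should finish in roughly three waves rather than 25 sequential calls.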
```python
# Via Kiro IDE or a direct MCP call
evaluation_result = await evaluate_agent(
    agent_id="customer-service-agent",
    test_cases=[
        {
            "input": "I need help with my order #12345",
            "expected_output": "I'll help you check your order status",
            "context": {"scenario": "order_inquiry"}
        }
    ],
    evaluators=["accuracy", "relevance", "helpfulness"],
    project_name="customer-service-evaluation"
)
```
```python
# Evaluate Agent-to-Agent coordination
workflow_result = await evaluate_multiagent_workflow(
    workflow_id="customer-support-escalation",
    workflow_type="agent2agent",
    agents=["intake-agent", "technical-agent", "escalation-agent"],
    conversation_messages=[
        {
            "from_agent": "intake-agent",
            "to_agent": "technical-agent",
            "message": "Customer reports login issues with premium account",
            "metadata": {"priority": "high", "account_type": "premium"}
        }
    ],
    evaluators=["conversation_quality", "agent_coordination", "workflow_efficiency"]
)
```

- Deployment Guide - Complete setup and deployment instructions
- Testing Guide - Real-world testing scenarios
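Once an evaluation returns, you will typically want per-metric averages across test cases. The result shape assumed below (a `"scores"` list of `{"metric", "score"}` dicts) is an illustration, not the documented response schema:

```python
# Assumed result shape for illustration only; check the actual response
# schema returned by the evaluation tools before relying on these keys.
def mean_scores(evaluation_result: dict) -> dict:
    """Average each metric's scores across all test cases."""
    by_metric: dict[str, list[float]] = {}
    for entry in evaluation_result.get("scores", []):
        by_metric.setdefault(entry["metric"], []).append(entry["score"])
    return {metric: sum(vals) / len(vals) for metric, vals in by_metric.items()}
```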
```shell
# Opik Configuration
OPIK_API_KEY=YOUR_OPIK_API_KEY
OPIK_WORKSPACE=default
OPIK_BASE_URL=https://www.comet.com/opik/api

# AWS Configuration
AWS_REGION=us-east-1
AWS_PROFILE=default

# AgentCore Configuration
AGENTCORE_RUNTIME_ARN=arn:aws:bedrock-agentcore:REGION:ACCOUNT:runtime/opik_mcp_server-XXXXX
```

The deployment requires these AWS services:
- Bedrock AgentCore - Container runtime
- CloudWatch - Logging and metrics
- X-Ray - Distributed tracing
- IAM - Role and policy management
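A minimal sketch of how a server might load the environment configuration above at startup. The `Settings` class and its defaults are illustrative, not the project's actual code:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    """Illustrative settings object read from the environment variables
    defined in .env; the real server may load configuration differently."""
    opik_api_key: str
    opik_workspace: str
    aws_region: str

    @classmethod
    def from_env(cls) -> "Settings":
        # Fail fast if the required API key is missing; fall back to the
        # documented defaults for the optional values.
        return cls(
            opik_api_key=os.environ["OPIK_API_KEY"],
            opik_workspace=os.environ.get("OPIK_WORKSPACE", "default"),
            aws_region=os.environ.get("AWS_REGION", "us-east-1"),
        )
```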
The system automatically creates CloudWatch dashboards showing:
- Evaluation success rates and performance trends
- Individual agent accuracy over time
- Multi-agent coordination effectiveness
- System health and resource utilization
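EMF metrics are plain JSON log lines with an `_aws` metadata block that CloudWatch converts into metrics automatically. The sketch below shows the record structure; the namespace and dimension names are illustrative, not necessarily the ones this server emits:

```python
import json
import time

def emf_record(metric_name: str, value: float, namespace: str = "OpikMCP") -> str:
    """Build a CloudWatch Embedded Metric Format log line (illustrative)."""
    return json.dumps({
        "_aws": {
            # Epoch milliseconds, required by the EMF specification.
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": metric_name, "Unit": "Milliseconds"}],
            }],
        },
        # Dimension and metric values live at the top level of the record.
        "Service": "opik-mcp-server",
        metric_name: value,
    })
```

Writing such a line to stdout inside an AgentCore container is enough for CloudWatch to pick it up as a metric, with no extra API calls.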
Extend the system with domain-specific evaluators:
```python
class CustomDomainEvaluator(BaseEvaluator):
    """Custom evaluator for your specific domain"""

    async def evaluate(self, input_text: str, output_text: str,
                       expected_output: str, context: dict) -> Score:
        # Your custom evaluation logic
        return Score(metric="custom_metric", score=0.95,
                     explanation="Custom evaluation result")
```

We welcome contributions! This project is designed to be the definitive solution for AI agent evaluation in production.
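Expanding the evaluator skeleton above into a runnable sketch: the `BaseEvaluator` and `Score` stand-ins are defined locally so the example executes on its own; in the real codebase you would import them from the server package.

```python
import asyncio
from dataclasses import dataclass

# Local stand-ins for the project's BaseEvaluator/Score so this runs standalone.
class BaseEvaluator:
    pass

@dataclass
class Score:
    metric: str
    score: float
    explanation: str

class KeywordCoverageEvaluator(BaseEvaluator):
    """Illustrative evaluator: fraction of required keywords in the output."""

    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    async def evaluate(self, input_text: str, output_text: str,
                       expected_output: str, context: dict) -> Score:
        hits = sum(1 for k in self.keywords if k in output_text.lower())
        return Score(metric="keyword_coverage",
                     score=hits / len(self.keywords),
                     explanation=f"{hits}/{len(self.keywords)} keywords found")
```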
- New Evaluators - Domain-specific evaluation metrics
- Framework Integrations - Strands, LangChain, CrewAI, Pydantic
- Analytics Features - Advanced trend analysis and insights
- Performance Optimizations - Faster evaluation algorithms
- Documentation - Tutorials, examples, best practices
```shell
# Clone and set up the development environment
git clone https://github.com/anespo/opik-mcp-server.git
cd opik-mcp-server

# Install dependencies
uv sync

# Run tests
uv run pytest

# Local development server
uv run python -m src.opik_mcp_server.main
```

This project is licensed under the MIT-0 License with commercial use restrictions. See LICENSE for details.
Commercial use requires explicit written permission from the repository owner.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Contact: Repository owner for commercial licensing

Star this repo if you find it useful!

Built with ❤️ for the AI agent community
