
πŸš€ Opik MCP Server - Production-Ready AI Agent Evaluation on AWS AgentCore

Architecture

License: MIT-0 AWS AgentCore MCP Protocol Opik Platform

🎯 The first production-ready MCP server for comprehensive AI agent evaluation with real-time observability

Evaluate single agents and multi-agent workflows with specialized metrics, deployed on AWS AgentCore serverless runtime

πŸš€ Quick Start β€’ πŸ“– Documentation β€’ 🎯 Features β€’ πŸ—οΈ Architecture β€’ 🀝 Contributing


🌟 Why This Matters

AI agents are moving to production, but evaluation is still stuck in notebooks. This project closes that gap:

  • βœ… Production-Ready: Deploy evaluation as a service, not a script
  • βœ… Multi-Agent Aware: Specialized metrics for Agent-to-Agent workflows
  • βœ… Real-Time Observability: CloudWatch + X-Ray + EMF metrics out of the box
  • βœ… Developer Experience: Evaluate directly from Kiro IDE via MCP protocol
  • βœ… Serverless Scale: AWS AgentCore handles scaling automatically

🎯 Features

πŸ” Comprehensive Evaluation Metrics

Single Agent Metrics:

  • Accuracy - How correct are responses?
  • Relevance - How well does it address the query?
  • Helpfulness - How useful is the response?
  • Coherence - How well-structured is the output?
  • Factuality - How factually accurate is the information?

Multi-Agent Metrics:

  • Conversation Quality - Natural inter-agent communication
  • Agent Coordination - Task handoff effectiveness
  • Workflow Efficiency - Multi-agent system optimization
  • A2A Protocol Compliance - Agent-to-Agent standard adherence

πŸ—οΈ Production Architecture

  • 🐳 Containerized Deployment on AWS AgentCore
  • πŸ“Š Real-time Observability with CloudWatch, X-Ray, EMF
  • πŸ”„ Auto-scaling from 1-5 replicas based on load
  • ⚑ Fast Performance - <2s cold start, ~100ms evaluation
  • πŸ”’ Secure - IAM roles, VPC isolation, encrypted data

πŸ› οΈ Developer Experience

  • 🎨 Kiro IDE Integration - Evaluate without leaving your editor
  • πŸ“ Natural Language - "Evaluate my customer service agent with accuracy metrics"
  • πŸ”„ Batch Processing - Test multiple agents simultaneously
  • πŸ“ˆ Historical Tracking - Opik platform integration for trends

πŸš€ Quick Start

Prerequisites

1-Minute Setup

# Clone and setup
git clone https://github.com/anespo/opik-mcp-server.git
cd opik-mcp-server

# Configure environment
cp .env.example .env
# Edit .env with your credentials

# Deploy to AgentCore (takes ~2 minutes)
./scripts/deploy.sh deploy

# Test the deployment
./scripts/deploy.sh test

Kiro IDE Integration

# Copy MCP configuration
cp kiro-integration/mcp.json ~/.kiro/settings/
# Restart Kiro IDE

Now you can evaluate agents directly from Kiro:

Evaluate my customer service agent with accuracy and relevance metrics using these test cases:

Test Case 1:
- Input: "I need help with my order"  
- Expected: "I'll help you with your order inquiry"
- Context: Customer service scenario

πŸ—οΈ Architecture

System Components

  1. 🎨 Developer Layer - Kiro IDE with MCP client integration
  2. ☁️ AWS AgentCore - Serverless container runtime
  3. 🧠 Opik MCP Server - FastMCP-based evaluation engine
  4. πŸ“Š Opik Platform - Cloud evaluation tracking and analytics
  5. πŸ‘οΈ AWS Observability - CloudWatch, X-Ray, EMF metrics
  6. πŸ€– Agent Systems - Your AI agents under evaluation

Evaluation Flow

graph LR
    A[Kiro IDE] --> B[MCP Client]
    B --> C[AgentCore Runtime]
    C --> D[Evaluation Engine]
    D --> E[Opik Platform]
    D --> F[AWS Observability]
    G[AI Agents] --> D

πŸ“Š Real-World Performance

Production Metrics:

  • Cold Start: < 2 seconds on AgentCore
  • Evaluation Speed: ~100ms per single agent test case
  • Multi-Agent Evaluation: ~200ms per workflow
  • Concurrent Processing: Up to 10 parallel evaluations
  • Auto-scaling: 1-5 replicas based on load

🎯 Usage Examples

Single Agent Evaluation

# Via Kiro IDE or direct MCP call
evaluation_result = await evaluate_agent(
    agent_id="customer-service-agent",
    test_cases=[
        {
            "input": "I need help with my order #12345",
            "expected_output": "I'll help you check your order status", 
            "context": {"scenario": "order_inquiry"}
        }
    ],
    evaluators=["accuracy", "relevance", "helpfulness"],
    project_name="customer-service-evaluation"
)

Multi-Agent Workflow Evaluation

# Evaluate Agent-to-Agent coordination
workflow_result = await evaluate_multiagent_workflow(
    workflow_id="customer-support-escalation",
    workflow_type="agent2agent", 
    agents=["intake-agent", "technical-agent", "escalation-agent"],
    conversation_messages=[
        {
            "from_agent": "intake-agent",
            "to_agent": "technical-agent", 
            "message": "Customer reports login issues with premium account",
            "metadata": {"priority": "high", "account_type": "premium"}
        }
    ],
    evaluators=["conversation_quality", "agent_coordination", "workflow_efficiency"]
)

πŸ“– Documentation

πŸ”§ Configuration

Environment Variables

# Opik Configuration
OPIK_API_KEY=YOUR_OPIK_API_KEY
OPIK_WORKSPACE=default
OPIK_BASE_URL=https://www.comet.com/opik/api

# AWS Configuration  
AWS_REGION=us-east-1
AWS_PROFILE=default

# AgentCore Configuration
AGENTCORE_RUNTIME_ARN=arn:aws:bedrock-agentcore:REGION:ACCOUNT:runtime/opik_mcp_server-XXXXX
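
One simple way to consume these settings in Python is to read them from the environment and fail fast when a required one is missing. This helper is a hypothetical sketch, not the project's actual config loader.

```python
import os

# Settings the sketch treats as required; names match the .env block above.
REQUIRED = ["OPIK_API_KEY", "OPIK_WORKSPACE", "AWS_REGION"]

def load_settings() -> dict:
    """Return required settings, raising if any are unset or empty."""
    missing = [name for name in REQUIRED if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED}

# Example values for demonstration only.
os.environ["OPIK_API_KEY"] = "test-key"
os.environ["OPIK_WORKSPACE"] = "default"
os.environ["AWS_REGION"] = "us-east-1"
print(load_settings()["AWS_REGION"])  # us-east-1
```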

AWS Permissions

The deployment requires these AWS services:

  • Bedrock AgentCore - Container runtime
  • CloudWatch - Logging and metrics
  • X-Ray - Distributed tracing
  • IAM - Role and policy management

πŸš€ Advanced Features

Observability Dashboard

The system automatically creates CloudWatch dashboards showing:

  • Evaluation success rates and performance trends
  • Individual agent accuracy over time
  • Multi-agent coordination effectiveness
  • System health and resource utilization

Custom Evaluators

Extend the system with domain-specific evaluators:

# BaseEvaluator and Score are provided by the server's evaluator interfaces
class CustomDomainEvaluator(BaseEvaluator):
    """Custom evaluator for your specific domain."""

    async def evaluate(self, input_text: str, output_text: str,
                       expected_output: str, context: dict) -> Score:
        # Implement your custom evaluation logic here
        return Score(metric="custom_metric", score=0.95,
                     explanation="Custom evaluation result")
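
For a self-contained illustration, the snippet below defines stand-ins for `BaseEvaluator` and `Score` (assumptions; the server's real types may differ) and runs a small keyword-coverage evaluator end to end with asyncio.

```python
import asyncio
from dataclasses import dataclass

# Stand-in types, assumed to resemble the server's evaluator interfaces.
@dataclass
class Score:
    metric: str
    score: float
    explanation: str

class BaseEvaluator:
    async def evaluate(self, input_text: str, output_text: str,
                       expected_output: str, context: dict) -> Score:
        raise NotImplementedError

class KeywordEvaluator(BaseEvaluator):
    """Toy evaluator: fraction of expected keywords present in the output."""

    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    async def evaluate(self, input_text: str, output_text: str,
                       expected_output: str, context: dict) -> Score:
        hits = sum(1 for k in self.keywords if k in output_text.lower())
        ratio = hits / len(self.keywords) if self.keywords else 0.0
        return Score(metric="keyword_coverage", score=ratio,
                     explanation=f"{hits}/{len(self.keywords)} keywords found")

evaluator = KeywordEvaluator(["order", "status"])
result = asyncio.run(evaluator.evaluate(
    "I need help with my order", "I'll check your order status now", "", {}))
print(result.score)  # 1.0
```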

🀝 Contributing

We welcome contributions! This project is designed to be the definitive solution for AI agent evaluation in production.

Areas for Contribution

  • 🧠 New Evaluators - Domain-specific evaluation metrics
  • πŸ”Œ Framework Integrations - Strands, LangChain, CrewAI, Pydantic
  • πŸ“Š Analytics Features - Advanced trend analysis and insights
  • πŸš€ Performance Optimizations - Faster evaluation algorithms
  • πŸ“– Documentation - Tutorials, examples, best practices

Development Setup

# Clone and setup development environment
git clone https://github.com/anespo/opik-mcp-server.git
cd opik-mcp-server

# Install dependencies
uv sync

# Run tests
uv run pytest

# Local development server
uv run python -m src.opik_mcp_server.main

πŸ“„ License

This project is licensed under the MIT-0 License with commercial use restrictions. See LICENSE for details.

Commercial use requires explicit written permission from the repository owner.

πŸ™‹β€β™‚οΈ Support


⭐ Star this repo if you find it useful! ⭐

Built with ❀️ for the AI agent community
