
πŸš€ Opik MCP Server - Production-Ready AI Agent Evaluation on AWS AgentCore

Architecture

License: MIT-0 AWS AgentCore MCP Protocol Opik Platform

🎯 The first production-ready MCP server for comprehensive AI agent evaluation with real-time observability

Evaluate single agents and multi-agent workflows with specialized metrics, deployed on AWS AgentCore serverless runtime

πŸš€ Quick Start β€’ πŸ“– Documentation β€’ 🎯 Features β€’ πŸ—οΈ Architecture β€’ 🀝 Contributing


🌟 Why This Matters

AI agents are moving to production, but evaluation is still stuck in notebooks. This project closes that gap:

  • βœ… Production-Ready: Deploy evaluation as a service, not a script
  • βœ… Multi-Agent Aware: Specialized metrics for Agent-to-Agent workflows
  • βœ… Real-Time Observability: CloudWatch + X-Ray + EMF metrics out of the box
  • βœ… Developer Experience: Evaluate directly from Kiro IDE via MCP protocol
  • βœ… Serverless Scale: AWS AgentCore handles scaling automatically

🎯 Features

πŸ” Comprehensive Evaluation Metrics

Single Agent Metrics:

  • Accuracy - How correct are responses?
  • Relevance - How well does it address the query?
  • Helpfulness - How useful is the response?
  • Coherence - How well-structured is the output?
  • Factuality - How factually accurate is the information?

Multi-Agent Metrics:

  • Conversation Quality - Natural inter-agent communication
  • Agent Coordination - Task handoff effectiveness
  • Workflow Efficiency - Multi-agent system optimization
  • A2A Protocol Compliance - Agent-to-Agent standard adherence

πŸ—οΈ Production Architecture

  • 🐳 Containerized Deployment on AWS AgentCore
  • πŸ“Š Real-time Observability with CloudWatch, X-Ray, EMF
  • πŸ”„ Auto-scaling from 1-5 replicas based on load
  • ⚑ Fast Performance - <2s cold start, ~100ms evaluation
  • πŸ”’ Secure - IAM roles, VPC isolation, encrypted data

πŸ› οΈ Developer Experience

  • 🎨 Kiro IDE Integration - Evaluate without leaving your editor
  • πŸ“ Natural Language - "Evaluate my customer service agent with accuracy metrics"
  • πŸ”„ Batch Processing - Test multiple agents simultaneously
  • πŸ“ˆ Historical Tracking - Opik platform integration for trends

πŸš€ Quick Start

Prerequisites

1-Minute Setup

# Clone and setup
git clone https://github.com/anespo/opik-mcp-server.git
cd opik-mcp-server

# Configure environment
cp .env.example .env
# Edit .env with your credentials

# Deploy to AgentCore (takes ~2 minutes)
./scripts/deploy.sh deploy

# Test the deployment
./scripts/deploy.sh test

Kiro IDE Integration

# Copy MCP configuration
cp kiro-integration/mcp.json ~/.kiro/settings/
# Restart Kiro IDE

Now you can evaluate agents directly from Kiro:

Evaluate my customer service agent with accuracy and relevance metrics using these test cases:

Test Case 1:
- Input: "I need help with my order"  
- Expected: "I'll help you with your order inquiry"
- Context: Customer service scenario

πŸ—οΈ Architecture

System Components

  1. 🎨 Developer Layer - Kiro IDE with MCP client integration
  2. ☁️ AWS AgentCore - Serverless container runtime
  3. 🧠 Opik MCP Server - FastMCP-based evaluation engine
  4. πŸ“Š Opik Platform - Cloud evaluation tracking and analytics
  5. πŸ‘οΈ AWS Observability - CloudWatch, X-Ray, EMF metrics
  6. πŸ€– Agent Systems - Your AI agents under evaluation

Evaluation Flow

graph LR
    A[Kiro IDE] --> B[MCP Client]
    B --> C[AgentCore Runtime]
    C --> D[Evaluation Engine]
    D --> E[Opik Platform]
    D --> F[AWS Observability]
    G[AI Agents] --> D

πŸ“Š Real-World Performance

Production Metrics:

  • Cold Start: < 2 seconds on AgentCore
  • Evaluation Speed: ~100ms per single agent test case
  • Multi-Agent Evaluation: ~200ms per workflow
  • Concurrent Processing: Up to 10 parallel evaluations
  • Auto-scaling: 1-5 replicas based on load

🎯 Usage Examples

Single Agent Evaluation

# Via Kiro IDE or direct MCP call
evaluation_result = await evaluate_agent(
    agent_id="customer-service-agent",
    test_cases=[
        {
            "input": "I need help with my order #12345",
            "expected_output": "I'll help you check your order status", 
            "context": {"scenario": "order_inquiry"}
        }
    ],
    evaluators=["accuracy", "relevance", "helpfulness"],
    project_name="customer-service-evaluation"
)

Multi-Agent Workflow Evaluation

# Evaluate Agent-to-Agent coordination
workflow_result = await evaluate_multiagent_workflow(
    workflow_id="customer-support-escalation",
    workflow_type="agent2agent", 
    agents=["intake-agent", "technical-agent", "escalation-agent"],
    conversation_messages=[
        {
            "from_agent": "intake-agent",
            "to_agent": "technical-agent", 
            "message": "Customer reports login issues with premium account",
            "metadata": {"priority": "high", "account_type": "premium"}
        }
    ],
    evaluators=["conversation_quality", "agent_coordination", "workflow_efficiency"]
)

πŸ“– Documentation

πŸ”§ Configuration

Environment Variables

# Opik Configuration
OPIK_API_KEY=YOUR_OPIK_API_KEY
OPIK_WORKSPACE=default
OPIK_BASE_URL=https://www.comet.com/opik/api

# AWS Configuration  
AWS_REGION=us-east-1
AWS_PROFILE=default

# AgentCore Configuration
AGENTCORE_RUNTIME_ARN=arn:aws:bedrock-agentcore:REGION:ACCOUNT:runtime/opik_mcp_server-XXXXX
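
One simple way to consume these settings in Python is to read them from the environment and fail fast when a required one is missing. This helper is a hypothetical sketch, not the project's actual config loader.

```python
import os

# Settings the sketch treats as required; names match the .env block above.
REQUIRED = ["OPIK_API_KEY", "OPIK_WORKSPACE", "AWS_REGION"]

def load_settings() -> dict:
    """Return required settings, raising if any are unset or empty."""
    missing = [name for name in REQUIRED if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED}

# Example values for demonstration only.
os.environ["OPIK_API_KEY"] = "test-key"
os.environ["OPIK_WORKSPACE"] = "default"
os.environ["AWS_REGION"] = "us-east-1"
print(load_settings()["AWS_REGION"])  # us-east-1
```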

AWS Permissions

The deployment requires these AWS services:

  • Bedrock AgentCore - Container runtime
  • CloudWatch - Logging and metrics
  • X-Ray - Distributed tracing
  • IAM - Role and policy management

πŸš€ Advanced Features

Observability Dashboard

The system automatically creates CloudWatch dashboards showing:

  • Evaluation success rates and performance trends
  • Individual agent accuracy over time
  • Multi-agent coordination effectiveness
  • System health and resource utilization

Custom Evaluators

Extend the system with domain-specific evaluators:

# BaseEvaluator and Score are provided by the server's evaluator interfaces
class CustomDomainEvaluator(BaseEvaluator):
    """Custom evaluator for your specific domain."""

    async def evaluate(self, input_text: str, output_text: str,
                       expected_output: str, context: dict) -> Score:
        # Implement your custom evaluation logic here
        return Score(metric="custom_metric", score=0.95,
                     explanation="Custom evaluation result")
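
For a self-contained illustration, the snippet below defines stand-ins for `BaseEvaluator` and `Score` (assumptions; the server's real types may differ) and runs a small keyword-coverage evaluator end to end with asyncio.

```python
import asyncio
from dataclasses import dataclass

# Stand-in types, assumed to resemble the server's evaluator interfaces.
@dataclass
class Score:
    metric: str
    score: float
    explanation: str

class BaseEvaluator:
    async def evaluate(self, input_text: str, output_text: str,
                       expected_output: str, context: dict) -> Score:
        raise NotImplementedError

class KeywordEvaluator(BaseEvaluator):
    """Toy evaluator: fraction of expected keywords present in the output."""

    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    async def evaluate(self, input_text: str, output_text: str,
                       expected_output: str, context: dict) -> Score:
        hits = sum(1 for k in self.keywords if k in output_text.lower())
        ratio = hits / len(self.keywords) if self.keywords else 0.0
        return Score(metric="keyword_coverage", score=ratio,
                     explanation=f"{hits}/{len(self.keywords)} keywords found")

evaluator = KeywordEvaluator(["order", "status"])
result = asyncio.run(evaluator.evaluate(
    "I need help with my order", "I'll check your order status now", "", {}))
print(result.score)  # 1.0
```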

🀝 Contributing

We welcome contributions! This project is designed to be the definitive solution for AI agent evaluation in production.

Areas for Contribution

  • 🧠 New Evaluators - Domain-specific evaluation metrics
  • πŸ”Œ Framework Integrations - Strands, LangChain, CrewAI, Pydantic
  • πŸ“Š Analytics Features - Advanced trend analysis and insights
  • πŸš€ Performance Optimizations - Faster evaluation algorithms
  • πŸ“– Documentation - Tutorials, examples, best practices

Development Setup

# Clone and setup development environment
git clone https://github.com/anespo/opik-mcp-server.git
cd opik-mcp-server

# Install dependencies
uv sync

# Run tests
uv run pytest

# Local development server
uv run python -m src.opik_mcp_server.main

πŸ“„ License

This project is licensed under the MIT-0 License with commercial use restrictions. See LICENSE for details.

Commercial use requires explicit written permission from the repository owner.

πŸ™‹β€β™‚οΈ Support


⭐ Star this repo if you find it useful! ⭐

Built with ❀️ for the AI agent community
