Description
Hi @awsapm could you look into this?
CRITICAL FINDINGS
1. Service Integration Failure
• Root Cause: The nutrition agent (nutrition_agent.DEFAULT) logged 17,824 errors over 24 hours in which it could not find nutrition information for specific pet types
• Error Pattern: "Nutrition service could not find information for pet: [rabbit/guinea pig/etc.]"
• Impact: When the nutrition service fails, the AI agent fabricates product recommendations instead of gracefully handling the error
2. Data Inconsistency Issue
• What's Happening: The nutrition service returns legitimate products like "PurrfectChoice Premium Feline, WhiskerWell Grain-Free Delight, MeowMaster Senior Formula"
• The Problem: The AI agent is inventing additional products like:
• "RabbitRich Premium Alfalfa Hay"
• "FeatherFeast Young Bunny Alfalfa Blend"
• "PurrfectChoice Premium Feline Kitten Formula" (variant not in database)
3. Service Architecture Problems
• Nutrition Service: Running on nutrition-service-nodejs in EKS, connected to MongoDB
• AI Agent: Running on Bedrock Agent Core, calling nutrition service tools
• Failure Mode: When nutrition service returns errors, the AI agent hallucinates products instead of saying "product not available"
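One way to narrow this failure mode is to handle the error at the tool boundary, so the agent receives an explicit "no data" result rather than a raw error it may try to paper over. The following is a minimal sketch, not the project's actual code: `NutritionLookupError`, `ToolResult`, `safe_nutrition_lookup`, and the `fetch` callable are hypothetical names, and the fallback text is the wording suggested in the recommendations in this issue.

```python
from dataclasses import dataclass, field
from typing import Callable

# Fallback wording taken from the recommendation in this issue.
FALLBACK_MESSAGE = "Please consult our veterinarian for specific recommendations"


class NutritionLookupError(Exception):
    """Hypothetical error for 'could not find information for pet: ...'."""


@dataclass
class ToolResult:
    ok: bool
    products: list[str] = field(default_factory=list)
    message: str = ""


def safe_nutrition_lookup(pet_type: str, fetch: Callable[[str], list[str]]) -> ToolResult:
    """Call the nutrition lookup and turn failures into an explicit result.

    `fetch` stands in for whatever function calls the nutrition service;
    it is an assumption, not a real API.
    """
    try:
        products = fetch(pet_type)
    except NutritionLookupError:
        # Lookup failed: return no products and a fixed fallback message.
        return ToolResult(ok=False, message=FALLBACK_MESSAGE)
    if not products:
        # An empty answer is treated the same way as a failed lookup.
        return ToolResult(ok=False, message=FALLBACK_MESSAGE)
    return ToolResult(ok=True, products=list(products))
```

With a result shaped like this, the system prompt can instruct the agent to repeat `message` verbatim whenever `ok` is false, and never to name products that are not in `products`.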
OPERATIONAL RECOMMENDATIONS
Immediate Actions (Priority 1)
- Fix Error Handling in AI Agent
• Update the agent's system prompt to explicitly handle tool failures
• Configure agent to respond with "Please consult our veterinarian for specific recommendations" when nutrition service fails
• Remove ability to fabricate product names
- Database Validation
• Audit the MongoDB database in nutrition-service-nodejs
• Ensure all pet types (rabbit, guinea pig, etc.) have proper nutrition data
• Add missing pet nutrition information
- Service Monitoring Enhancement
• Set up CloudWatch alarms for nutrition service errors (currently 700+ errors/hour)
• Monitor the GET /nutrition/:pet_type endpoint specifically
• Alert when error rate exceeds 5%
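For the database audit recommended above, a first pass could be a simple set difference between the pet types the application is expected to support and the pet types that actually have nutrition records. This is only a sketch: the `required` and `documented` sets below are illustrative, and in practice `documented` would come from a query against the MongoDB collection, whose name and schema are not given in this issue.

```python
def missing_pet_types(required: set[str], documented: set[str]) -> list[str]:
    """Return required pet types with no nutrition record, sorted for stable output."""
    def normalize(name: str) -> str:
        # Compare case-insensitively and ignore surrounding whitespace.
        return name.strip().lower()

    have = {normalize(p) for p in documented}
    wanted = {normalize(r) for r in required}
    return sorted(wanted - have)


# Illustrative values only; the real lists are assumptions.
required = {"cat", "rabbit", "guinea pig"}
documented = {"Cat"}
print(missing_pet_types(required, documented))  # ['guinea pig', 'rabbit']
```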
Medium-term Fixes (Priority 2)
- Improve Service Reliability
• The nutrition service shows latency spikes (up to 217ms for database queries)
• Consider implementing caching for frequently requested pet nutrition data
• Add retry logic for database connection failures
- Data Governance
• Implement product catalog validation
• Create approved product list that the AI agent can reference
• Add product availability checks before recommendations
- Enhanced Logging
• Add structured logging to track which products are being recommended
• Log when the AI agent falls back to generic recommendations
• Monitor recommendation accuracy
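The approved product list and the structured logging items above could be combined in one small check that runs before a recommendation is shown. The sketch below assumes exact-name matching against an approved catalog. The function name `filter_recommendations`, the logger name, and the log field names are hypothetical; the product names are the ones quoted earlier in this issue.

```python
import json
import logging

logger = logging.getLogger("recommendations")  # hypothetical logger name


def filter_recommendations(candidates, approved, pet_type):
    """Split candidate product names into approved and rejected, and log the outcome.

    Matching is exact, so a variant such as a "Kitten Formula" of an approved
    product is rejected unless it is itself in the approved list.
    """
    approved_set = set(approved)
    accepted = [p for p in candidates if p in approved_set]
    rejected = [p for p in candidates if p not in approved_set]
    record = {
        "event": "product_recommendation",
        "pet_type": pet_type,
        "accepted": accepted,
        "rejected": rejected,
        # True when nothing survived validation and a generic answer is needed.
        "fallback": not accepted,
    }
    logger.info(json.dumps(record))
    return accepted, rejected


approved = [
    "PurrfectChoice Premium Feline",
    "WhiskerWell Grain-Free Delight",
    "MeowMaster Senior Formula",
]
candidates = [
    "PurrfectChoice Premium Feline",
    "PurrfectChoice Premium Feline Kitten Formula",
]
accepted, rejected = filter_recommendations(candidates, approved, "cat")
```

Counting log records with a non-empty `rejected` list gives a rough signal for how often the agent proposes products outside the catalog.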
Long-term Improvements (Priority 3)
- Service Architecture
• Consider implementing a product catalog service separate from nutrition service
• Add circuit breaker pattern for external service calls
• Implement graceful degradation when services are unavailable
- Quality Assurance
• Implement automated testing for AI agent responses
• Add validation rules for product recommendations
• Create approval workflow for new product additions
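To illustrate the circuit breaker pattern mentioned under Service Architecture, here is a minimal single-threaded sketch. The class and parameter names (`CircuitBreaker`, `failure_threshold`, `reset_timeout`) and the default values are assumptions chosen for illustration, not settings taken from the existing services.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the breaker can be tested without sleeping
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Still open: fail fast without touching the downstream service.
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the trial call failed.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller that catches `CircuitOpenError` can return the same fallback response it uses for other nutrition service failures, which is one form of the graceful degradation listed above.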
Monitoring Setup
I recommend setting up these CloudWatch alarms:
• Nutrition service error rate > 5%
• AI agent tool failure rate > 10%
• Database query latency > 500ms
• Invalid product recommendation detection
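The three numeric thresholds above can be written down as data and checked against a metrics snapshot, which is handy for testing alert logic before wiring up real alarms. This is a sketch under stated assumptions: the metric keys (`nutrition_error_rate`, `agent_tool_failure_rate`, `db_query_latency_ms`) are hypothetical names, rates are expressed as fractions, and the fourth alarm (invalid product recommendation detection) is omitted because the issue gives no numeric threshold for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    name: str
    metric: str   # hypothetical metric key in the snapshot
    limit: float  # alarm fires when the value is strictly greater than this


THRESHOLDS = [
    Threshold("Nutrition service error rate", "nutrition_error_rate", 0.05),
    Threshold("AI agent tool failure rate", "agent_tool_failure_rate", 0.10),
    Threshold("Database query latency (ms)", "db_query_latency_ms", 500.0),
]


def rate(failures: int, total: int) -> float:
    """Failure fraction for a window; 0.0 when there was no traffic."""
    return failures / total if total else 0.0


def breached(snapshot: dict[str, float]) -> list[str]:
    """Return the names of thresholds exceeded by the snapshot."""
    return [t.name for t in THRESHOLDS if snapshot.get(t.metric, 0.0) > t.limit]


# Illustrative snapshot; the request counts are made up for the example.
snapshot = {
    "nutrition_error_rate": rate(80, 1000),
    "agent_tool_failure_rate": rate(20, 1000),
    "db_query_latency_ms": 217.0,
}
print(breached(snapshot))  # ['Nutrition service error rate']
```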