Complete Problem Resolution for Distributed Multi-Modal AI System
Generated: 2025-09-10
Coverage: 518 API endpoints across 6 distributed VMs
Resolution Time: ~5 minutes average (with this guide)
Before diving into specific issues, run these diagnostic commands:
# 1. Overall system health check
bash run_autobot.sh --status
# 2. Check all service connectivity
python3 scripts/health_check_comprehensive.py
# 3. View recent error logs
tail -f logs/autobot.log | grep -i error
# 4. Check VM network connectivity
ping -c 3 172.16.168.21 # Frontend
ping -c 3 172.16.168.22 # NPU Worker
ping -c 3 172.16.168.23 # Redis
ping -c 3 172.16.168.24 # AI Stack
ping -c 3 172.16.168.25 # Browser Service
# 5. Verify critical services
curl -s http://127.0.0.1:8001/api/health | jq .
redis-cli -h 172.16.168.23 ping
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"- Backend API unresponsive
- Redis connection failures
- VM communication breakdown
- Authentication system failure
- Multi-modal AI processing failures
- Knowledge base search errors
- WebSocket disconnections
- File upload failures
- Terminal command timeouts
- Browser automation failures
- Model loading issues
- Cache performance problems
- UI display glitches
- Log rotation issues
- Non-critical service warnings
- Performance optimization opportunities
Symptoms:
- Frontend shows "Network Error" or timeouts
curl http://127.0.0.1:8001/api/healthfails- Browser shows "This site can't be reached"
Root Causes & Solutions:
# Check if Redis is running
docker ps | grep redis
redis-cli -h 172.16.168.23 ping
# If Redis is down, restart it
docker restart autobot-redis
# If Redis is up but backend still fails
# Check backend logs for Redis connection errors
tail -f logs/autobot.log | grep -i redis
# Solution: Use fast backend startup
# Edit run_autobot.sh to use fast_app_factory_fix.py (should be default)Quick Fix:
# Restart backend with fast startup
pkill -f "python.*backend"
cd backend && python fast_app_factory_fix.py &# Check if port 8001 is in use
lsof -i :8001
netstat -tulpn | grep :8001
# Kill conflicting process
sudo kill -9 $(lsof -t -i:8001)
# Restart AutoBot
bash run_autobot.sh --dev --build# Check memory usage
free -h
ps aux --sort=-%mem | head -10
# Check if Python processes consuming too much memory
ps aux | grep python | awk '{sum+=$6} END {print "Python memory usage: " sum/1024 " MB"}'
# If memory is low, restart system or clear cache
sudo systemctl restart autobot-backendSymptoms:
- Services can't reach other VMs (timeouts)
- "Connection refused" errors in logs
- Frontend can't connect to backend
Diagnostic Commands:
# Test VM-to-VM connectivity
for vm in 21 22 23 24 25; do
echo "Testing VM 172.16.168.$vm"
nc -zv 172.16.168.$vm 22 2>&1 | head -1
done
# Check firewall rules
sudo iptables -L -n | grep 172.16.168
# Test specific service ports
nc -zv 172.16.168.21 5173 # Frontend dev server
nc -zv 172.16.168.22 8081 # NPU Worker
nc -zv 172.16.168.23 6379 # Redis
nc -zv 172.16.168.24 8080 # AI Stack
nc -zv 172.16.168.25 3000 # Browser ServiceSolutions:
# Generate new SSH keys if needed
ssh-keygen -t ed25519 -f ~/.ssh/autobot_key -N ""
# Copy keys to all VMs
for vm in 21 22 23 24 25; do
ssh-copy-id -i ~/.ssh/autobot_key.pub autobot@172.16.168.$vm
done
# Test SSH connectivity
ssh -i ~/.ssh/autobot_key autobot@172.16.168.21 "echo 'VM1 accessible'"# Check network interface configuration
ip addr show
ip route show
# Verify VM network is properly configured
ping -c 3 172.16.168.1 # Network gateway
# Reset network if needed (WSL2)
wsl --shutdown
# Restart WSL2 and rerun setupSymptoms:
- "Redis connection failed" errors
- Knowledge base search returns empty
- Chat history not saving
Diagnostic Steps:
# 1. Check Redis container status
docker ps | grep redis
docker logs autobot-redis --tail=50
# 2. Test Redis connectivity from different VMs
redis-cli -h 172.16.168.23 -p 6379 ping
# 3. Check Redis memory usage
redis-cli -h 172.16.168.23 info memory
# 4. Check database configuration
redis-cli -h 172.16.168.23 config get databasesSolutions:
# Restart Redis with proper configuration
docker stop autobot-redis
docker rm autobot-redis
# Recreate with correct settings
docker run -d \
--name autobot-redis \
--network autobot \
--ip 172.16.168.23 \
-p 6379:6379 \
-v redis_data:/data \
redis/redis-stack:7.4.0-v1# Check available memory
redis-cli -h 172.16.168.23 info memory | grep used_memory_human
# If Redis is out of memory, clear unnecessary data
redis-cli -h 172.16.168.23
SELECT 3 # Cache database
FLUSHDB # Clear cache only
# Or restart Redis to clear all data (use carefully!)
docker restart autobot-redis# Verify Redis database separation
redis-cli -h 172.16.168.23
INFO keyspace # Should show multiple databases
# If databases are mixed up, run database migration
python3 scripts/migrate_redis_databases.py --fix-separationSymptoms:
- Multi-modal requests timeout after 45+ seconds
- "NPU not available" errors
- Image processing fails with GPU errors
Diagnostic Commands:
# 1. Check NPU Worker status
curl -s http://172.16.168.22:8081/health | jq .
# 2. Verify GPU availability
nvidia-smi
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# 3. Check Intel NPU
python3 -c "from openvino.runtime import Core; print(f'NPU available: {\"NPU\" in Core().available_devices}')"
# 4. Test AI Stack orchestrator
curl -s http://172.16.168.24:8080/health | jq .Solutions:
# Check NPU service logs
ssh autobot@172.16.168.22 'sudo journalctl -u autobot-npu-worker -f'
# Restart NPU service
ssh autobot@172.16.168.22 'sudo systemctl restart autobot-npu-worker'
# If NPU not detected, fallback to GPU
curl -X POST http://172.16.168.22:8081/config \
-H "Content-Type: application/json" \
-d '{"fallback_to_gpu": true, "primary_device": "GPU"}'# Clear GPU cache
python3 -c "import torch; torch.cuda.empty_cache(); print('GPU cache cleared')"
# Check GPU processes
nvidia-smi pmon -c 1
# Kill GPU processes if necessary
sudo fuser -v /dev/nvidia*# Check model cache status
ls -la /opt/autobot/models/cache/
du -sh /opt/autobot/models/cache/
# Clear corrupted models
rm -rf /opt/autobot/models/cache/*.tmp
rm -rf /opt/autobot/models/cache/corrupted/
# Re-download models
python3 scripts/download_ai_models.py --force-refreshSymptoms:
- Search returns no results despite having 3,278 documents
- "Vector database error" messages
- Slow search performance (>5 seconds)
Diagnostic Steps:
# 1. Check knowledge base statistics
curl -s http://127.0.0.1:8001/api/knowledge_base/stats/comprehensive | jq .
# 2. Test direct Redis vector search
redis-cli -h 172.16.168.23
SELECT 8 # Vector database
FT.INFO knowledge_idx
# 3. Check embedding service
curl -s http://172.16.168.22:8081/embeddings/health | jq .Solutions:
# Rebuild vector index
curl -X POST http://127.0.0.1:8001/api/knowledge_base/rebuild_index
# Or manually rebuild
redis-cli -h 172.16.168.23
SELECT 8
FT.DROPINDEX knowledge_idx
# Index will be rebuilt automatically on next search# Restart embedding service on NPU worker
ssh autobot@172.16.168.22 'sudo systemctl restart autobot-embedding-service'
# Test embedding generation
curl -X POST http://172.16.168.22:8081/embeddings/generate \
-H "Content-Type: application/json" \
-d '{"text": "test embedding generation"}'# Check Redis database integrity
redis-cli -h 172.16.168.23 --latency-history -i 1
# If corrupted, restore from backup
redis-cli -h 172.16.168.23 --rdb backup/dump.rdb
# Or rebuild knowledge base from source
python3 scripts/rebuild_knowledge_base.py --from-sourceSymptoms:
- Frontend loads but shows "API connection failed"
- WebSocket "Connection refused" errors
- Real-time features not working
Diagnostic Commands:
# 1. Check frontend service
curl -s http://172.16.168.21:5173/ | head -10
# 2. Test API proxy
curl -s http://172.16.168.21:5173/api/health
# 3. Check WebSocket connection
wscat -c ws://127.0.0.1:8001/ws/testSolutions:
# Check Vite configuration
cd autobot-vue
cat vite.config.ts | grep -A 10 proxy
# Expected proxy configuration:
# proxy: {
# '/api': {
# target: 'https://172.16.168.20:8443',
# changeOrigin: true
# }
# }
# Restart frontend with correct configuration
npm run dev# Check CORS headers
curl -H "Origin: http://172.16.168.21:5173" \
-H "Access-Control-Request-Method: POST" \
-H "Access-Control-Request-Headers: Content-Type" \
-X OPTIONS \
http://127.0.0.1:8001/api/health -v
# Should return CORS headers allowing the frontend origin# Check WebSocket endpoint
curl --include \
--no-buffer \
--header "Connection: Upgrade" \
--header "Upgrade: websocket" \
--header "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==" \
--header "Sec-WebSocket-Version: 13" \
http://127.0.0.1:8001/ws/test
# Test WebSocket proxy through Nginx
curl -I http://172.16.168.21:5173/ws/testSymptoms:
- Commands timeout after 30 seconds
- "Permission denied" for basic commands
- Terminal session hangs or becomes unresponsive
Diagnostic Steps:
# 1. Test terminal endpoint directly
curl -X POST http://127.0.0.1:8001/api/terminal/execute \
-H "Content-Type: application/json" \
-d '{"command": "echo test", "timeout": 10}'
# 2. Check command whitelist
grep -A 20 "COMMAND_WHITELIST" autobot-user-backend/utils/command_security.py
# 3. Verify sandbox configuration
ls -la /tmp/autobot/sandbox/Solutions:
# Add command to whitelist (if safe)
echo "your_command" >> config/allowed_commands.txt
# Or update command security configuration
# Edit autobot-user-backend/utils/command_security.py and add to COMMAND_WHITELIST# Fix sandbox permissions
sudo chown -R $(whoami):$(whoami) /tmp/autobot/sandbox/
chmod 755 /tmp/autobot/sandbox/
# Clear sandbox if corrupted
rm -rf /tmp/autobot/sandbox/*
mkdir -p /tmp/autobot/sandbox/sessions# Clean up hanging terminal sessions
ps aux | grep -i terminal | grep -v grep
pkill -f "terminal_session"
# Clear session storage
redis-cli -h 172.16.168.23
SELECT 2 # Session database
KEYS "terminal_session:*"
DEL terminal_session:* # Clear all terminal sessionsSymptoms:
- Browser automation timeouts
- "No browser sessions available"
- Screenshot capture failures
Diagnostic Commands:
# 1. Check browser service status
curl -s http://172.16.168.25:3000/health | jq .
# 2. Check available browsers
curl -s http://172.16.168.25:3000/browsers | jq .
# 3. List active sessions
curl -s http://172.16.168.25:3000/sessions | jq .Solutions:
# Check active sessions
curl -s http://172.16.168.25:3000/sessions/stats | jq .
# Close idle sessions
curl -X DELETE http://172.16.168.25:3000/sessions/cleanup
# Increase browser pool size
ssh autobot@172.16.168.25 'echo "MAX_BROWSER_SESSIONS=20" >> /opt/autobot/browser/.env'
ssh autobot@172.16.168.25 'sudo systemctl restart autobot-browser-service'# Reinstall Playwright browsers
ssh autobot@172.16.168.25 'cd /opt/autobot/browser && npx playwright install --with-deps'
# Check browser installation
ssh autobot@172.16.168.25 'npx playwright install --dry-run'# Check X11 forwarding
ssh autobot@172.16.168.25 'echo $DISPLAY'
# Start virtual display if needed
ssh autobot@172.16.168.25 'sudo systemctl start xvfb'
# Test browser launch
ssh autobot@172.16.168.25 'timeout 10s chromium-browser --headless --dump-dom http://example.com'Symptoms:
- File uploads fail or timeout
- "File size too large" for small files
- Virus scan failures
Diagnostic Steps:
# 1. Check file upload endpoint
curl -X POST http://127.0.0.1:8001/api/files/upload/test \
-F "file=@small_test_file.txt"
# 2. Check available disk space
df -h /data/autobot/uploads/
# 3. Verify virus scanner
clamdscan --version
systemctl status clamav-daemonSolutions:
# Clean up old uploads
find /data/autobot/uploads/ -type f -mtime +30 -delete
# Check large files
du -sh /data/autobot/uploads/* | sort -hr | head -10
# Clean up temporary files
rm -rf /tmp/autobot/uploads/*.tmp# Update virus definitions
sudo freshclam
# Restart ClamAV
sudo systemctl restart clamav-daemon
# Test virus scanning
clamdscan /tmp/test_file.txt# Fix upload directory permissions
sudo chown -R autobot:autobot /data/autobot/uploads/
chmod 755 /data/autobot/uploads/
chmod 644 /data/autobot/uploads/*Symptoms:
- Slow API response times (>2 seconds)
- High CPU or memory usage
- Cache inefficiency
Performance Diagnostics:
# 1. Check API response times
time curl -s http://127.0.0.1:8001/api/health
# 2. Monitor resource usage
htop
iostat 1 5
free -m
# 3. Check cache hit rates
redis-cli -h 172.16.168.23 info stats | grep cache
# 4. Profile application
python3 -m cProfile -o profile.stats backend/main.pyOptimization Solutions:
# Check slow Redis queries
redis-cli -h 172.16.168.23 SLOWLOG GET 10
# Optimize knowledge base index
curl -X POST http://127.0.0.1:8001/api/knowledge_base/optimize_index
# Clear query cache if needed
redis-cli -h 172.16.168.23
SELECT 3 # Cache database
FLUSHDB# Clear Python caches
python3 -c "import gc; gc.collect(); print('Memory cleanup completed')"
# Restart services with memory limits
docker update --memory=4g autobot-backend
docker update --memory=2g autobot-redis# Check CPU-intensive processes
ps aux --sort=-%cpu | head -10
# Optimize Python threading
export PYTHON_GIL_DISABLED=1 # If using Python 3.13+
export OMP_NUM_THREADS=4 # Limit OpenMP threadsAutomated Health Check Script:
#!/bin/bash
# daily_health_check.sh
echo "=== AutoBot Daily Health Check ==="
echo "Date: $(date)"
echo
# 1. Service Status
echo "=== Service Status ==="
bash run_autobot.sh --status
# 2. Resource Usage
echo "=== Resource Usage ==="
echo "Memory: $(free -h | grep Mem | awk '{print $3 "/" $2}')"
echo "Disk: $(df -h /data | tail -1 | awk '{print $3 "/" $2 " (" $5 " used)"}')"
echo "CPU Load: $(uptime | awk -F'load average:' '{print $2}')"
# 3. Error Count
echo "=== Recent Errors ==="
error_count=$(grep -c "ERROR" logs/autobot.log 2>/dev/null || echo 0)
echo "Errors in last 24h: $error_count"
if [ $error_count -gt 10 ]; then
echo "⚠️ High error count detected"
tail -20 logs/autobot.log | grep ERROR
fi
# 4. Database Health
echo "=== Database Health ==="
redis_ping=$(redis-cli -h 172.16.168.23 ping 2>/dev/null || echo "FAILED")
echo "Redis: $redis_ping"
if [ "$redis_ping" = "PONG" ]; then
kb_count=$(redis-cli -h 172.16.168.23 -n 8 DBSIZE 2>/dev/null || echo 0)
echo "Knowledge base vectors: $kb_count"
fi
# 5. Security Status
echo "=== Security Status ==="
failed_auth=$(grep -c "authentication_failure" logs/security.log 2>/dev/null || echo 0)
echo "Failed auth attempts (24h): $failed_auth"
if [ $failed_auth -gt 5 ]; then
echo "⚠️ High failed authentication count"
fi
echo
echo "=== Health Check Complete ==="Weekly Maintenance Script:
#!/bin/bash
# weekly_maintenance.sh
echo "=== AutoBot Weekly Maintenance ==="
# 1. Log Rotation
echo "Rotating logs..."
sudo logrotate /etc/logrotate.d/autobot
# 2. Clean Temporary Files
echo "Cleaning temporary files..."
find /tmp/autobot/ -type f -mtime +7 -delete
find /data/autobot/temp/ -type f -mtime +7 -delete
# 3. Database Optimization
echo "Optimizing databases..."
redis-cli -h 172.16.168.23 BGREWRITEAOF
curl -X POST http://127.0.0.1:8001/api/knowledge_base/optimize
# 4. Update System Packages
echo "Updating system packages..."
sudo apt update && sudo apt upgrade -y
# 5. Backup Critical Data
echo "Creating backup..."
python3 scripts/create_backup.py --type=incremental
# 6. Security Scan
echo "Running security scan..."
python3 scripts/security_scan.py --report=weekly
echo "Weekly maintenance completed"Complete System Failure Recovery:
-
Assess the Situation:
# Check what's running docker ps ps aux | grep -E "(python|node|redis)" # Check logs for clues tail -100 logs/autobot.log dmesg | tail -20
-
Stop All Services:
# Stop AutoBot services pkill -f "autobot" docker stop $(docker ps -q) # Kill any hanging processes sudo pkill -f "python.*backend" sudo pkill -f "node.*vite"
-
Clean Reset:
# Clean temporary files rm -rf /tmp/autobot/* rm -rf /var/run/autobot/* # Clean Docker docker system prune -f
-
Restart from Clean State:
# Full restart with rebuild bash run_autobot.sh --dev --rebuild # If that fails, try minimal startup bash run_autobot.sh --minimal --build
-
Verify Recovery:
# Check all services bash run_autobot.sh --status # Test basic functionality curl http://127.0.0.1:8001/api/health
Knowledge Base Recovery:
# 1. Check if backup exists
ls -la backups/knowledge_base/
# 2. Restore from backup
python3 scripts/restore_knowledge_base.py --backup=latest
# 3. Rebuild if no backup
python3 scripts/rebuild_knowledge_base.py --from-source --confirmConfiguration Recovery:
# 1. Restore default configuration
cp config/config.yaml.default config/config.yaml
# 2. Restore environment variables
cp .env.example .env
# 3. Regenerate SSH keys
rm ~/.ssh/autobot_key*
ssh-keygen -t ed25519 -f ~/.ssh/autobot_key -N ""Interactive Diagnostics:
# Run comprehensive diagnostic
python3 scripts/autobot_diagnostics.py --interactive
# System health dashboard
python3 scripts/health_dashboard.py --web-interface
# Log analyzer
python3 scripts/log_analyzer.py --last=24h --severity=errorDocumentation:
- Architecture Guide:
docs/architecture/PHASE_5_DISTRIBUTED_ARCHITECTURE.md - API Documentation:
docs/api/COMPREHENSIVE_API_DOCUMENTATION.md - Security Guide:
docs/security/PHASE_5_SECURITY_IMPLEMENTATION.md - Developer Setup:
docs/developer/PHASE_5_DEVELOPER_SETUP.md
Support Channels:
- GitHub Issues: For reproducible bugs
- Internal Wiki: For internal troubleshooting
- Slack #autobot-support: For real-time help
- Email: support@autobot.com
Emergency Contacts:
- On-call Engineer: +1-800-AUTOBOT-1
- System Administrator: admin@autobot.com
- Security Incident: security@autobot.com
Remember: Most issues can be resolved by restarting the affected service or using the --rebuild flag. When in doubt, check the logs first - they usually contain the exact error message and solution path.