Skip to content

Latest commit

 

History

History
892 lines (679 loc) · 20.6 KB

File metadata and controls

892 lines (679 loc) · 20.6 KB

AutoBot Phase 5 - Comprehensive Troubleshooting Guide

Complete Problem Resolution for Distributed Multi-Modal AI System

Generated: 2025-09-10
Coverage: 518 API endpoints across 6 distributed VMs
Resolution Time: ~5 minutes average (with this guide)

Quick Diagnostic Commands

Before diving into specific issues, run these diagnostic commands:

# 1. Overall system health check
bash run_autobot.sh --status

# 2. Check all service connectivity  
python3 scripts/health_check_comprehensive.py

# 3. View recent error logs
tail -f logs/autobot.log | grep -i error

# 4. Check VM network connectivity
ping -c 3 172.16.168.21  # Frontend
ping -c 3 172.16.168.22  # NPU Worker  
ping -c 3 172.16.168.23  # Redis
ping -c 3 172.16.168.24  # AI Stack
ping -c 3 172.16.168.25  # Browser Service

# 5. Verify critical services
curl -s http://127.0.0.1:8001/api/health | jq .
redis-cli -h 172.16.168.23 ping
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

Issue Classification & Priority

🔥 Critical Issues (System Down)

  • Backend API unresponsive
  • Redis connection failures
  • VM communication breakdown
  • Authentication system failure

⚠️ High Priority Issues (Degraded Performance)

  • Multi-modal AI processing failures
  • Knowledge base search errors
  • WebSocket disconnections
  • File upload failures

📝 Medium Priority Issues (Feature Impact)

  • Terminal command timeouts
  • Browser automation failures
  • Model loading issues
  • Cache performance problems

ℹ️ Low Priority Issues (Minor Impact)

  • UI display glitches
  • Log rotation issues
  • Non-critical service warnings
  • Performance optimization opportunities

Critical Issues (🔥)

1. Backend API Unresponsive

Symptoms:

  • Frontend shows "Network Error" or timeouts
  • curl http://127.0.0.1:8001/api/health fails
  • Browser shows "This site can't be reached"

Root Causes & Solutions:

A. Redis Connection Timeout (Most Common)

# Check if Redis is running
docker ps | grep redis
redis-cli -h 172.16.168.23 ping

# If Redis is down, restart it
docker restart autobot-redis

# If Redis is up but backend still fails
# Check backend logs for Redis connection errors
tail -f logs/autobot.log | grep -i redis

# Solution: Use fast backend startup
# Edit run_autobot.sh to use fast_app_factory_fix.py (should be default)

Quick Fix:

# Restart backend with fast startup
pkill -f "python.*backend"
cd backend && python fast_app_factory_fix.py &

B. Port Conflicts

# Check if port 8001 is in use
lsof -i :8001
netstat -tulpn | grep :8001

# Kill conflicting process
sudo kill -9 $(lsof -t -i:8001)

# Restart AutoBot
bash run_autobot.sh --dev --build

C. Out of Memory

# Check memory usage
free -h
ps aux --sort=-%mem | head -10

# Check if Python processes consuming too much memory
ps aux | grep python | awk '{sum+=$6} END {print "Python memory usage: " sum/1024 " MB"}'

# If memory is low, restart system or clear cache
sudo systemctl restart autobot-backend

2. VM Communication Failures

Symptoms:

  • Services can't reach other VMs (timeouts)
  • "Connection refused" errors in logs
  • Frontend can't connect to backend

Diagnostic Commands:

# Test VM-to-VM connectivity
for vm in 21 22 23 24 25; do
    echo "Testing VM 172.16.168.$vm"
    nc -zv 172.16.168.$vm 22 2>&1 | head -1
done

# Check firewall rules
sudo iptables -L -n | grep 172.16.168

# Test specific service ports
nc -zv 172.16.168.21 5173  # Frontend dev server
nc -zv 172.16.168.22 8081  # NPU Worker
nc -zv 172.16.168.23 6379  # Redis
nc -zv 172.16.168.24 8080  # AI Stack
nc -zv 172.16.168.25 3000  # Browser Service

Solutions:

A. SSH Connectivity Issues

# Generate new SSH keys if needed
ssh-keygen -t ed25519 -f ~/.ssh/autobot_key -N ""

# Copy keys to all VMs
for vm in 21 22 23 24 25; do
    ssh-copy-id -i ~/.ssh/autobot_key.pub autobot@172.16.168.$vm
done

# Test SSH connectivity
ssh -i ~/.ssh/autobot_key autobot@172.16.168.21 "echo 'VM1 accessible'"

B. Network Configuration Issues

# Check network interface configuration
ip addr show
ip route show

# Verify VM network is properly configured
ping -c 3 172.16.168.1  # Network gateway

# Reset network if needed (WSL2)
wsl --shutdown
# Restart WSL2 and rerun setup

3. Redis Database Failures

Symptoms:

  • "Redis connection failed" errors
  • Knowledge base search returns empty
  • Chat history not saving

Diagnostic Steps:

# 1. Check Redis container status
docker ps | grep redis
docker logs autobot-redis --tail=50

# 2. Test Redis connectivity from different VMs
redis-cli -h 172.16.168.23 -p 6379 ping

# 3. Check Redis memory usage
redis-cli -h 172.16.168.23 info memory

# 4. Check database configuration
redis-cli -h 172.16.168.23 config get databases

Solutions:

A. Redis Container Not Running

# Restart Redis with proper configuration
docker stop autobot-redis
docker rm autobot-redis

# Recreate with correct settings
docker run -d \
  --name autobot-redis \
  --network autobot \
  --ip 172.16.168.23 \
  -p 6379:6379 \
  -v redis_data:/data \
  redis/redis-stack:7.4.0-v1

B. Redis Memory Issues

# Check available memory
redis-cli -h 172.16.168.23 info memory | grep used_memory_human

# If Redis is out of memory, clear unnecessary data
redis-cli -h 172.16.168.23
SELECT 3  # Cache database
FLUSHDB  # Clear cache only

# Or restart Redis to clear all data (use carefully!)
docker restart autobot-redis

C. Database Configuration Issues

# Verify Redis database separation
redis-cli -h 172.16.168.23
INFO keyspace  # Should show multiple databases

# If databases are mixed up, run database migration
python3 scripts/migrate_redis_databases.py --fix-separation

High Priority Issues (⚠️)

4. Multi-Modal AI Processing Failures

Symptoms:

  • Multi-modal requests timeout after 45+ seconds
  • "NPU not available" errors
  • Image processing fails with GPU errors

Diagnostic Commands:

# 1. Check NPU Worker status
curl -s http://172.16.168.22:8081/health | jq .

# 2. Verify GPU availability
nvidia-smi
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# 3. Check Intel NPU
python3 -c "from openvino.runtime import Core; print(f'NPU available: {\"NPU\" in Core().available_devices}')"

# 4. Test AI Stack orchestrator
curl -s http://172.16.168.24:8080/health | jq .

Solutions:

A. NPU Worker Issues

# Check NPU service logs
ssh autobot@172.16.168.22 'sudo journalctl -u autobot-npu-worker -f'

# Restart NPU service
ssh autobot@172.16.168.22 'sudo systemctl restart autobot-npu-worker'

# If NPU not detected, fallback to GPU
curl -X POST http://172.16.168.22:8081/config \
  -H "Content-Type: application/json" \
  -d '{"fallback_to_gpu": true, "primary_device": "GPU"}'

B. GPU Memory Issues

# Clear GPU cache
python3 -c "import torch; torch.cuda.empty_cache(); print('GPU cache cleared')"

# Check GPU processes
nvidia-smi pmon -c 1

# Kill GPU processes if necessary
sudo fuser -v /dev/nvidia*

C. Model Loading Failures

# Check model cache status
ls -la /opt/autobot/models/cache/
du -sh /opt/autobot/models/cache/

# Clear corrupted models
rm -rf /opt/autobot/models/cache/*.tmp
rm -rf /opt/autobot/models/cache/corrupted/

# Re-download models
python3 scripts/download_ai_models.py --force-refresh

5. Knowledge Base Search Failures

Symptoms:

  • Search returns no results despite having 3,278 documents
  • "Vector database error" messages
  • Slow search performance (>5 seconds)

Diagnostic Steps:

# 1. Check knowledge base statistics
curl -s http://127.0.0.1:8001/api/knowledge_base/stats/comprehensive | jq .

# 2. Test direct Redis vector search
redis-cli -h 172.16.168.23
SELECT 8  # Vector database
FT.INFO knowledge_idx

# 3. Check embedding service
curl -s http://172.16.168.22:8081/embeddings/health | jq .

Solutions:

A. Vector Index Corruption

# Rebuild vector index
curl -X POST http://127.0.0.1:8001/api/knowledge_base/rebuild_index

# Or manually rebuild
redis-cli -h 172.16.168.23
SELECT 8
FT.DROPINDEX knowledge_idx
# Index will be rebuilt automatically on next search

B. Embedding Service Issues

# Restart embedding service on NPU worker
ssh autobot@172.16.168.22 'sudo systemctl restart autobot-embedding-service'

# Test embedding generation
curl -X POST http://172.16.168.22:8081/embeddings/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "test embedding generation"}'

C. Database Corruption

# Check Redis database integrity
redis-cli -h 172.16.168.23 --latency-history -i 1

# If corrupted, restore from backup
redis-cli -h 172.16.168.23 --rdb backup/dump.rdb

# Or rebuild knowledge base from source
python3 scripts/rebuild_knowledge_base.py --from-source

6. Frontend Connection Issues

Symptoms:

  • Frontend loads but shows "API connection failed"
  • WebSocket "Connection refused" errors
  • Real-time features not working

Diagnostic Commands:

# 1. Check frontend service
curl -s http://172.16.168.21:5173/ | head -10

# 2. Test API proxy
curl -s http://172.16.168.21:5173/api/health

# 3. Check WebSocket connection
wscat -c ws://127.0.0.1:8001/ws/test

Solutions:

A. Vite Proxy Configuration Issues

# Check Vite configuration
cd autobot-vue
cat vite.config.ts | grep -A 10 proxy

# Expected proxy configuration:
# proxy: {
#   '/api': {
#     target: 'https://172.16.168.20:8443',
#     changeOrigin: true
#   }
# }

# Restart frontend with correct configuration
npm run dev

B. CORS Issues

# Check CORS headers
curl -H "Origin: http://172.16.168.21:5173" \
     -H "Access-Control-Request-Method: POST" \
     -H "Access-Control-Request-Headers: Content-Type" \
     -X OPTIONS \
     http://127.0.0.1:8001/api/health -v

# Should return CORS headers allowing the frontend origin

C. WebSocket Connection Issues

# Check WebSocket endpoint
curl --include \
     --no-buffer \
     --header "Connection: Upgrade" \
     --header "Upgrade: websocket" \
     --header "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==" \
     --header "Sec-WebSocket-Version: 13" \
     http://127.0.0.1:8001/ws/test

# Test WebSocket proxy through Nginx
curl -I http://172.16.168.21:5173/ws/test

Medium Priority Issues (📝)

7. Terminal Command Execution Issues

Symptoms:

  • Commands timeout after 30 seconds
  • "Permission denied" for basic commands
  • Terminal session hangs or becomes unresponsive

Diagnostic Steps:

# 1. Test terminal endpoint directly
curl -X POST http://127.0.0.1:8001/api/terminal/execute \
  -H "Content-Type: application/json" \
  -d '{"command": "echo test", "timeout": 10}'

# 2. Check command whitelist
grep -A 20 "COMMAND_WHITELIST" autobot-user-backend/utils/command_security.py

# 3. Verify sandbox configuration
ls -la /tmp/autobot/sandbox/

Solutions:

A. Command Whitelist Issues

# Add command to whitelist (if safe)
echo "your_command" >> config/allowed_commands.txt

# Or update command security configuration
# Edit autobot-user-backend/utils/command_security.py and add to COMMAND_WHITELIST

B. Sandbox Permission Issues

# Fix sandbox permissions
sudo chown -R $(whoami):$(whoami) /tmp/autobot/sandbox/
chmod 755 /tmp/autobot/sandbox/

# Clear sandbox if corrupted
rm -rf /tmp/autobot/sandbox/*
mkdir -p /tmp/autobot/sandbox/sessions

C. Terminal Session Cleanup

# Clean up hanging terminal sessions
ps aux | grep -i terminal | grep -v grep
pkill -f "terminal_session"

# Clear session storage
redis-cli -h 172.16.168.23
SELECT 2  # Session database
KEYS "terminal_session:*"
DEL terminal_session:*  # Clear all terminal sessions

8. Browser Service Automation Failures

Symptoms:

  • Browser automation timeouts
  • "No browser sessions available"
  • Screenshot capture failures

Diagnostic Commands:

# 1. Check browser service status
curl -s http://172.16.168.25:3000/health | jq .

# 2. Check available browsers
curl -s http://172.16.168.25:3000/browsers | jq .

# 3. List active sessions
curl -s http://172.16.168.25:3000/sessions | jq .

Solutions:

A. Browser Pool Exhaustion

# Check active sessions
curl -s http://172.16.168.25:3000/sessions/stats | jq .

# Close idle sessions
curl -X DELETE http://172.16.168.25:3000/sessions/cleanup

# Increase browser pool size
ssh autobot@172.16.168.25 'echo "MAX_BROWSER_SESSIONS=20" >> /opt/autobot/browser/.env'
ssh autobot@172.16.168.25 'sudo systemctl restart autobot-browser-service'

B. Playwright Installation Issues

# Reinstall Playwright browsers
ssh autobot@172.16.168.25 'cd /opt/autobot/browser && npx playwright install --with-deps'

# Check browser installation
ssh autobot@172.16.168.25 'npx playwright install --dry-run'

C. Display/X11 Issues

# Check X11 forwarding
ssh autobot@172.16.168.25 'echo $DISPLAY'

# Start virtual display if needed
ssh autobot@172.16.168.25 'sudo systemctl start xvfb'

# Test browser launch
ssh autobot@172.16.168.25 'timeout 10s chromium-browser --headless --dump-dom http://example.com'

9. File Upload & Management Issues

Symptoms:

  • File uploads fail or timeout
  • "File size too large" for small files
  • Virus scan failures

Diagnostic Steps:

# 1. Check file upload endpoint
curl -X POST http://127.0.0.1:8001/api/files/upload/test \
  -F "file=@small_test_file.txt"

# 2. Check available disk space
df -h /data/autobot/uploads/

# 3. Verify virus scanner
clamdscan --version
systemctl status clamav-daemon

Solutions:

A. Disk Space Issues

# Clean up old uploads
find /data/autobot/uploads/ -type f -mtime +30 -delete

# Check large files
du -sh /data/autobot/uploads/* | sort -hr | head -10

# Clean up temporary files
rm -rf /tmp/autobot/uploads/*.tmp

B. Virus Scanner Issues

# Update virus definitions
sudo freshclam

# Restart ClamAV
sudo systemctl restart clamav-daemon

# Test virus scanning
clamdscan /tmp/test_file.txt

C. File Permission Issues

# Fix upload directory permissions
sudo chown -R autobot:autobot /data/autobot/uploads/
chmod 755 /data/autobot/uploads/
chmod 644 /data/autobot/uploads/*

Low Priority Issues (ℹ️)

10. Performance Optimization Issues

Symptoms:

  • Slow API response times (>2 seconds)
  • High CPU or memory usage
  • Cache inefficiency

Performance Diagnostics:

# 1. Check API response times
time curl -s http://127.0.0.1:8001/api/health

# 2. Monitor resource usage
htop
iostat 1 5
free -m

# 3. Check cache hit rates
redis-cli -h 172.16.168.23 info stats | grep cache

# 4. Profile application
python3 -m cProfile -o profile.stats backend/main.py

Optimization Solutions:

A. Database Query Optimization

# Check slow Redis queries
redis-cli -h 172.16.168.23 SLOWLOG GET 10

# Optimize knowledge base index
curl -X POST http://127.0.0.1:8001/api/knowledge_base/optimize_index

# Clear query cache if needed
redis-cli -h 172.16.168.23
SELECT 3  # Cache database
FLUSHDB

B. Memory Optimization

# Clear Python caches
python3 -c "import gc; gc.collect(); print('Memory cleanup completed')"

# Restart services with memory limits
docker update --memory=4g autobot-backend
docker update --memory=2g autobot-redis

C. CPU Optimization

# Check CPU-intensive processes
ps aux --sort=-%cpu | head -10

# Optimize Python threading
export PYTHON_GIL_DISABLED=1  # If using Python 3.13+
export OMP_NUM_THREADS=4       # Limit OpenMP threads

Preventive Maintenance

Daily Health Checks

Automated Health Check Script:

#!/bin/bash
# daily_health_check.sh

echo "=== AutoBot Daily Health Check ==="
echo "Date: $(date)"
echo

# 1. Service Status
echo "=== Service Status ==="
bash run_autobot.sh --status

# 2. Resource Usage
echo "=== Resource Usage ==="
echo "Memory: $(free -h | grep Mem | awk '{print $3 "/" $2}')"
echo "Disk: $(df -h /data | tail -1 | awk '{print $3 "/" $2 " (" $5 " used)"}')"
echo "CPU Load: $(uptime | awk -F'load average:' '{print $2}')"

# 3. Error Count
echo "=== Recent Errors ==="
error_count=$(grep -c "ERROR" logs/autobot.log 2>/dev/null || echo 0)
echo "Errors in last 24h: $error_count"

if [ $error_count -gt 10 ]; then
    echo "⚠️  High error count detected"
    tail -20 logs/autobot.log | grep ERROR
fi

# 4. Database Health
echo "=== Database Health ==="
redis_ping=$(redis-cli -h 172.16.168.23 ping 2>/dev/null || echo "FAILED")
echo "Redis: $redis_ping"

if [ "$redis_ping" = "PONG" ]; then
    kb_count=$(redis-cli -h 172.16.168.23 -n 8 DBSIZE 2>/dev/null || echo 0)
    echo "Knowledge base vectors: $kb_count"
fi

# 5. Security Status
echo "=== Security Status ==="
failed_auth=$(grep -c "authentication_failure" logs/security.log 2>/dev/null || echo 0)
echo "Failed auth attempts (24h): $failed_auth"

if [ $failed_auth -gt 5 ]; then
    echo "⚠️  High failed authentication count"
fi

echo
echo "=== Health Check Complete ==="

Weekly Maintenance Tasks

Weekly Maintenance Script:

#!/bin/bash
# weekly_maintenance.sh

echo "=== AutoBot Weekly Maintenance ==="

# 1. Log Rotation
echo "Rotating logs..."
sudo logrotate /etc/logrotate.d/autobot

# 2. Clean Temporary Files
echo "Cleaning temporary files..."
find /tmp/autobot/ -type f -mtime +7 -delete
find /data/autobot/temp/ -type f -mtime +7 -delete

# 3. Database Optimization
echo "Optimizing databases..."
redis-cli -h 172.16.168.23 BGREWRITEAOF
curl -X POST http://127.0.0.1:8001/api/knowledge_base/optimize

# 4. Update System Packages
echo "Updating system packages..."
sudo apt update && sudo apt upgrade -y

# 5. Backup Critical Data
echo "Creating backup..."
python3 scripts/create_backup.py --type=incremental

# 6. Security Scan
echo "Running security scan..."
python3 scripts/security_scan.py --report=weekly

echo "Weekly maintenance completed"

Emergency Recovery Procedures

System Recovery Checklist

Complete System Failure Recovery:

  1. Assess the Situation:

    # Check what's running
    docker ps
    ps aux | grep -E "(python|node|redis)"
    
    # Check logs for clues
    tail -100 logs/autobot.log
    dmesg | tail -20
  2. Stop All Services:

    # Stop AutoBot services
    pkill -f "autobot"
    docker stop $(docker ps -q)
    
    # Kill any hanging processes
    sudo pkill -f "python.*backend"
    sudo pkill -f "node.*vite"
  3. Clean Reset:

    # Clean temporary files
    rm -rf /tmp/autobot/*
    rm -rf /var/run/autobot/*
    
    # Clean Docker
    docker system prune -f
  4. Restart from Clean State:

    # Full restart with rebuild
    bash run_autobot.sh --dev --rebuild
    
    # If that fails, try minimal startup
    bash run_autobot.sh --minimal --build
  5. Verify Recovery:

    # Check all services
    bash run_autobot.sh --status
    
    # Test basic functionality
    curl http://127.0.0.1:8001/api/health

Data Recovery Procedures

Knowledge Base Recovery:

# 1. Check if backup exists
ls -la backups/knowledge_base/

# 2. Restore from backup
python3 scripts/restore_knowledge_base.py --backup=latest

# 3. Rebuild if no backup
python3 scripts/rebuild_knowledge_base.py --from-source --confirm

Configuration Recovery:

# 1. Restore default configuration
cp config/config.yaml.default config/config.yaml

# 2. Restore environment variables
cp .env.example .env

# 3. Regenerate SSH keys
rm ~/.ssh/autobot_key*
ssh-keygen -t ed25519 -f ~/.ssh/autobot_key -N ""

Getting Additional Help

Built-in Diagnostic Tools

Interactive Diagnostics:

# Run comprehensive diagnostic
python3 scripts/autobot_diagnostics.py --interactive

# System health dashboard
python3 scripts/health_dashboard.py --web-interface

# Log analyzer
python3 scripts/log_analyzer.py --last=24h --severity=error

External Resources

Documentation:

  • Architecture Guide: docs/architecture/PHASE_5_DISTRIBUTED_ARCHITECTURE.md
  • API Documentation: docs/api/COMPREHENSIVE_API_DOCUMENTATION.md
  • Security Guide: docs/security/PHASE_5_SECURITY_IMPLEMENTATION.md
  • Developer Setup: docs/developer/PHASE_5_DEVELOPER_SETUP.md

Support Channels:

  • GitHub Issues: For reproducible bugs
  • Internal Wiki: For internal troubleshooting
  • Slack #autobot-support: For real-time help
  • Email: support@autobot.com

Emergency Contacts:


Remember: Most issues can be resolved by restarting the affected service or using the --rebuild flag. When in doubt, check the logs first - they usually contain the exact error message and solution path.