AutoBot Phase 5 - Comprehensive Troubleshooting Guide

Complete Problem Resolution for Distributed Multi-Modal AI System

Generated: 2025-09-10
Coverage: 518 API endpoints across 6 distributed VMs
Resolution Time: ~5 minutes average (with this guide)

Quick Diagnostic Commands

Before diving into specific issues, run these diagnostic commands:

# 1. Overall system health check
bash run_autobot.sh --status

# 2. Check all service connectivity  
python3 scripts/health_check_comprehensive.py

# 3. View recent error logs
tail -f logs/autobot.log | grep -i error

# 4. Check VM network connectivity
ping -c 3 172.16.168.21  # Frontend
ping -c 3 172.16.168.22  # NPU Worker  
ping -c 3 172.16.168.23  # Redis
ping -c 3 172.16.168.24  # AI Stack
ping -c 3 172.16.168.25  # Browser Service

# 5. Verify critical services
curl -s http://127.0.0.1:8001/api/health | jq .
redis-cli -h 172.16.168.23 ping
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

Issue Classification & Priority

🔥 Critical Issues (System Down)

Backend API unresponsive
Redis connection failures
VM communication breakdown
Authentication system failure

⚠️ High Priority Issues (Degraded Performance)

Multi-modal AI processing failures
Knowledge base search errors
WebSocket disconnections
File upload failures

📝 Medium Priority Issues (Feature Impact)

Terminal command timeouts
Browser automation failures
Model loading issues
Cache performance problems

ℹ️ Low Priority Issues (Minor Impact)

UI display glitches
Log rotation issues
Non-critical service warnings
Performance optimization opportunities

Critical Issues (🔥)

1. Backend API Unresponsive

Symptoms:

Frontend shows "Network Error" or timeouts
curl http://127.0.0.1:8001/api/health fails
Browser shows "This site can't be reached"

Root Causes & Solutions:

A. Redis Connection Timeout (Most Common)

# Check if Redis is running
docker ps | grep redis
redis-cli -h 172.16.168.23 ping

# If Redis is down, restart it
docker restart autobot-redis

# If Redis is up but backend still fails
# Check backend logs for Redis connection errors
tail -f logs/autobot.log | grep -i redis

# Solution: Use fast backend startup
# Edit run_autobot.sh to use fast_app_factory_fix.py (should be default)

Quick Fix:

# Restart backend with fast startup
pkill -f "python.*backend"
cd backend && python fast_app_factory_fix.py &

B. Port Conflicts

# Check if port 8001 is in use
lsof -i :8001
netstat -tulpn | grep :8001

# Kill conflicting process
sudo kill -9 $(lsof -t -i:8001)

# Restart AutoBot
bash run_autobot.sh --dev --build

C. Out of Memory

# Check memory usage
free -h
ps aux --sort=-%mem | head -10

# Check if Python processes consuming too much memory
ps aux | grep python | awk '{sum+=$6} END {print "Python memory usage: " sum/1024 " MB"}'

# If memory is low, restart system or clear cache
sudo systemctl restart autobot-backend

2. VM Communication Failures

Symptoms:

Services can't reach other VMs (timeouts)
"Connection refused" errors in logs
Frontend can't connect to backend

Diagnostic Commands:

# Test VM-to-VM connectivity
for vm in 21 22 23 24 25; do
    echo "Testing VM 172.16.168.$vm"
    nc -zv 172.16.168.$vm 22 2>&1 | head -1
done

# Check firewall rules
sudo iptables -L -n | grep 172.16.168

# Test specific service ports
nc -zv 172.16.168.21 5173  # Frontend dev server
nc -zv 172.16.168.22 8081  # NPU Worker
nc -zv 172.16.168.23 6379  # Redis
nc -zv 172.16.168.24 8080  # AI Stack
nc -zv 172.16.168.25 3000  # Browser Service

Solutions:

A. SSH Connectivity Issues

# Generate new SSH keys if needed
ssh-keygen -t ed25519 -f ~/.ssh/autobot_key -N ""

# Copy keys to all VMs
for vm in 21 22 23 24 25; do
    ssh-copy-id -i ~/.ssh/autobot_key.pub autobot@172.16.168.$vm
done

# Test SSH connectivity
ssh -i ~/.ssh/autobot_key autobot@172.16.168.21 "echo 'VM1 accessible'"

B. Network Configuration Issues

# Check network interface configuration
ip addr show
ip route show

# Verify VM network is properly configured
ping -c 3 172.16.168.1  # Network gateway

# Reset network if needed (WSL2)
wsl --shutdown
# Restart WSL2 and rerun setup

3. Redis Database Failures

Symptoms:

"Redis connection failed" errors
Knowledge base search returns empty
Chat history not saving

Diagnostic Steps:

# 1. Check Redis container status
docker ps | grep redis
docker logs autobot-redis --tail=50

# 2. Test Redis connectivity from different VMs
redis-cli -h 172.16.168.23 -p 6379 ping

# 3. Check Redis memory usage
redis-cli -h 172.16.168.23 info memory

# 4. Check database configuration
redis-cli -h 172.16.168.23 config get databases

Solutions:

A. Redis Container Not Running

# Restart Redis with proper configuration
docker stop autobot-redis
docker rm autobot-redis

# Recreate with correct settings
docker run -d \
  --name autobot-redis \
  --network autobot \
  --ip 172.16.168.23 \
  -p 6379:6379 \
  -v redis_data:/data \
  redis/redis-stack:7.4.0-v1

B. Redis Memory Issues

# Check available memory
redis-cli -h 172.16.168.23 info memory | grep used_memory_human

# If Redis is out of memory, clear unnecessary data
redis-cli -h 172.16.168.23
SELECT 3  # Cache database
FLUSHDB  # Clear cache only

# Or restart Redis to clear all data (use carefully!)
docker restart autobot-redis

C. Database Configuration Issues

# Verify Redis database separation
redis-cli -h 172.16.168.23
INFO keyspace  # Should show multiple databases

# If databases are mixed up, run database migration
python3 scripts/migrate_redis_databases.py --fix-separation

High Priority Issues (⚠️)

4. Multi-Modal AI Processing Failures

Symptoms:

Multi-modal requests timeout after 45+ seconds
"NPU not available" errors
Image processing fails with GPU errors

Diagnostic Commands:

# 1. Check NPU Worker status
curl -s http://172.16.168.22:8081/health | jq .

# 2. Verify GPU availability
nvidia-smi
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# 3. Check Intel NPU
python3 -c "from openvino.runtime import Core; print(f'NPU available: {\"NPU\" in Core().available_devices}')"

# 4. Test AI Stack orchestrator
curl -s http://172.16.168.24:8080/health | jq .

Solutions:

A. NPU Worker Issues

# Check NPU service logs
ssh autobot@172.16.168.22 'sudo journalctl -u autobot-npu-worker -f'

# Restart NPU service
ssh autobot@172.16.168.22 'sudo systemctl restart autobot-npu-worker'

# If NPU not detected, fallback to GPU
curl -X POST http://172.16.168.22:8081/config \
  -H "Content-Type: application/json" \
  -d '{"fallback_to_gpu": true, "primary_device": "GPU"}'

B. GPU Memory Issues

# Clear GPU cache
python3 -c "import torch; torch.cuda.empty_cache(); print('GPU cache cleared')"

# Check GPU processes
nvidia-smi pmon -c 1

# Kill GPU processes if necessary
sudo fuser -v /dev/nvidia*

C. Model Loading Failures

# Check model cache status
ls -la /opt/autobot/models/cache/
du -sh /opt/autobot/models/cache/

# Clear corrupted models
rm -rf /opt/autobot/models/cache/*.tmp
rm -rf /opt/autobot/models/cache/corrupted/

# Re-download models
python3 scripts/download_ai_models.py --force-refresh

5. Knowledge Base Search Failures

Symptoms:

Search returns no results despite having 3,278 documents
"Vector database error" messages
Slow search performance (>5 seconds)

Diagnostic Steps:

# 1. Check knowledge base statistics
curl -s http://127.0.0.1:8001/api/knowledge_base/stats/comprehensive | jq .

# 2. Test direct Redis vector search
redis-cli -h 172.16.168.23
SELECT 8  # Vector database
FT.INFO knowledge_idx

# 3. Check embedding service
curl -s http://172.16.168.22:8081/embeddings/health | jq .

Solutions:

A. Vector Index Corruption

# Rebuild vector index
curl -X POST http://127.0.0.1:8001/api/knowledge_base/rebuild_index

# Or manually rebuild
redis-cli -h 172.16.168.23
SELECT 8
FT.DROPINDEX knowledge_idx
# Index will be rebuilt automatically on next search

B. Embedding Service Issues

# Restart embedding service on NPU worker
ssh autobot@172.16.168.22 'sudo systemctl restart autobot-embedding-service'

# Test embedding generation
curl -X POST http://172.16.168.22:8081/embeddings/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "test embedding generation"}'

C. Database Corruption

# Check Redis database integrity
redis-cli -h 172.16.168.23 --latency-history -i 1

# If corrupted, restore from backup
redis-cli -h 172.16.168.23 --rdb backup/dump.rdb

# Or rebuild knowledge base from source
python3 scripts/rebuild_knowledge_base.py --from-source

6. Frontend Connection Issues

Symptoms:

Frontend loads but shows "API connection failed"
WebSocket "Connection refused" errors
Real-time features not working

Diagnostic Commands:

# 1. Check frontend service
curl -s http://172.16.168.21:5173/ | head -10

# 2. Test API proxy
curl -s http://172.16.168.21:5173/api/health

# 3. Check WebSocket connection
wscat -c ws://127.0.0.1:8001/ws/test

Solutions:

A. Vite Proxy Configuration Issues

# Check Vite configuration
cd autobot-vue
cat vite.config.ts | grep -A 10 proxy

# Expected proxy configuration:
# proxy: {
#   '/api': {
#     target: 'https://172.16.168.20:8443',
#     changeOrigin: true
#   }
# }

# Restart frontend with correct configuration
npm run dev

B. CORS Issues

# Check CORS headers
curl -H "Origin: http://172.16.168.21:5173" \
     -H "Access-Control-Request-Method: POST" \
     -H "Access-Control-Request-Headers: Content-Type" \
     -X OPTIONS \
     http://127.0.0.1:8001/api/health -v

# Should return CORS headers allowing the frontend origin

C. WebSocket Connection Issues

# Check WebSocket endpoint
curl --include \
     --no-buffer \
     --header "Connection: Upgrade" \
     --header "Upgrade: websocket" \
     --header "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==" \
     --header "Sec-WebSocket-Version: 13" \
     http://127.0.0.1:8001/ws/test

# Test WebSocket proxy through Nginx
curl -I http://172.16.168.21:5173/ws/test

Medium Priority Issues (📝)

7. Terminal Command Execution Issues

Symptoms:

Commands timeout after 30 seconds
"Permission denied" for basic commands
Terminal session hangs or becomes unresponsive

Diagnostic Steps:

# 1. Test terminal endpoint directly
curl -X POST http://127.0.0.1:8001/api/terminal/execute \
  -H "Content-Type: application/json" \
  -d '{"command": "echo test", "timeout": 10}'

# 2. Check command whitelist
grep -A 20 "COMMAND_WHITELIST" autobot-user-backend/utils/command_security.py

# 3. Verify sandbox configuration
ls -la /tmp/autobot/sandbox/

Solutions:

A. Command Whitelist Issues

# Add command to whitelist (if safe)
echo "your_command" >> config/allowed_commands.txt

# Or update command security configuration
# Edit autobot-user-backend/utils/command_security.py and add to COMMAND_WHITELIST

B. Sandbox Permission Issues

# Fix sandbox permissions
sudo chown -R $(whoami):$(whoami) /tmp/autobot/sandbox/
chmod 755 /tmp/autobot/sandbox/

# Clear sandbox if corrupted
rm -rf /tmp/autobot/sandbox/*
mkdir -p /tmp/autobot/sandbox/sessions

C. Terminal Session Cleanup

# Clean up hanging terminal sessions
ps aux | grep -i terminal | grep -v grep
pkill -f "terminal_session"

# Clear session storage
redis-cli -h 172.16.168.23
SELECT 2  # Session database
KEYS "terminal_session:*"
DEL terminal_session:*  # Clear all terminal sessions

8. Browser Service Automation Failures

Symptoms:

Browser automation timeouts
"No browser sessions available"
Screenshot capture failures

Diagnostic Commands:

# 1. Check browser service status
curl -s http://172.16.168.25:3000/health | jq .

# 2. Check available browsers
curl -s http://172.16.168.25:3000/browsers | jq .

# 3. List active sessions
curl -s http://172.16.168.25:3000/sessions | jq .

Solutions:

A. Browser Pool Exhaustion

# Check active sessions
curl -s http://172.16.168.25:3000/sessions/stats | jq .

# Close idle sessions
curl -X DELETE http://172.16.168.25:3000/sessions/cleanup

# Increase browser pool size
ssh autobot@172.16.168.25 'echo "MAX_BROWSER_SESSIONS=20" >> /opt/autobot/browser/.env'
ssh autobot@172.16.168.25 'sudo systemctl restart autobot-browser-service'

B. Playwright Installation Issues

# Reinstall Playwright browsers
ssh autobot@172.16.168.25 'cd /opt/autobot/browser && npx playwright install --with-deps'

# Check browser installation
ssh autobot@172.16.168.25 'npx playwright install --dry-run'

C. Display/X11 Issues

# Check X11 forwarding
ssh autobot@172.16.168.25 'echo $DISPLAY'

# Start virtual display if needed
ssh autobot@172.16.168.25 'sudo systemctl start xvfb'

# Test browser launch
ssh autobot@172.16.168.25 'timeout 10s chromium-browser --headless --dump-dom http://example.com'

9. File Upload & Management Issues

Symptoms:

File uploads fail or timeout
"File size too large" for small files
Virus scan failures

Diagnostic Steps:

# 1. Check file upload endpoint
curl -X POST http://127.0.0.1:8001/api/files/upload/test \
  -F "file=@small_test_file.txt"

# 2. Check available disk space
df -h /data/autobot/uploads/

# 3. Verify virus scanner
clamdscan --version
systemctl status clamav-daemon

Solutions:

A. Disk Space Issues

# Clean up old uploads
find /data/autobot/uploads/ -type f -mtime +30 -delete

# Check large files
du -sh /data/autobot/uploads/* | sort -hr | head -10

# Clean up temporary files
rm -rf /tmp/autobot/uploads/*.tmp

B. Virus Scanner Issues

# Update virus definitions
sudo freshclam

# Restart ClamAV
sudo systemctl restart clamav-daemon

# Test virus scanning
clamdscan /tmp/test_file.txt

C. File Permission Issues

# Fix upload directory permissions
sudo chown -R autobot:autobot /data/autobot/uploads/
chmod 755 /data/autobot/uploads/
chmod 644 /data/autobot/uploads/*

Low Priority Issues (ℹ️)

10. Performance Optimization Issues

Symptoms:

Slow API response times (>2 seconds)
High CPU or memory usage
Cache inefficiency

Performance Diagnostics:

# 1. Check API response times
time curl -s http://127.0.0.1:8001/api/health

# 2. Monitor resource usage
htop
iostat 1 5
free -m

# 3. Check cache hit rates
redis-cli -h 172.16.168.23 info stats | grep cache

# 4. Profile application
python3 -m cProfile -o profile.stats backend/main.py

Optimization Solutions:

A. Database Query Optimization

# Check slow Redis queries
redis-cli -h 172.16.168.23 SLOWLOG GET 10

# Optimize knowledge base index
curl -X POST http://127.0.0.1:8001/api/knowledge_base/optimize_index

# Clear query cache if needed
redis-cli -h 172.16.168.23
SELECT 3  # Cache database
FLUSHDB

B. Memory Optimization

# Clear Python caches
python3 -c "import gc; gc.collect(); print('Memory cleanup completed')"

# Restart services with memory limits
docker update --memory=4g autobot-backend
docker update --memory=2g autobot-redis

C. CPU Optimization

# Check CPU-intensive processes
ps aux --sort=-%cpu | head -10

# Optimize Python threading
export PYTHON_GIL_DISABLED=1  # If using Python 3.13+
export OMP_NUM_THREADS=4       # Limit OpenMP threads

Preventive Maintenance

Daily Health Checks

Automated Health Check Script:

#!/bin/bash
# daily_health_check.sh

echo "=== AutoBot Daily Health Check ==="
echo "Date: $(date)"
echo

# 1. Service Status
echo "=== Service Status ==="
bash run_autobot.sh --status

# 2. Resource Usage
echo "=== Resource Usage ==="
echo "Memory: $(free -h | grep Mem | awk '{print $3 "/" $2}')"
echo "Disk: $(df -h /data | tail -1 | awk '{print $3 "/" $2 " (" $5 " used)"}')"
echo "CPU Load: $(uptime | awk -F'load average:' '{print $2}')"

# 3. Error Count
echo "=== Recent Errors ==="
error_count=$(grep -c "ERROR" logs/autobot.log 2>/dev/null || echo 0)
echo "Errors in last 24h: $error_count"

if [ $error_count -gt 10 ]; then
    echo "⚠️  High error count detected"
    tail -20 logs/autobot.log | grep ERROR
fi

# 4. Database Health
echo "=== Database Health ==="
redis_ping=$(redis-cli -h 172.16.168.23 ping 2>/dev/null || echo "FAILED")
echo "Redis: $redis_ping"

if [ "$redis_ping" = "PONG" ]; then
    kb_count=$(redis-cli -h 172.16.168.23 -n 8 DBSIZE 2>/dev/null || echo 0)
    echo "Knowledge base vectors: $kb_count"
fi

# 5. Security Status
echo "=== Security Status ==="
failed_auth=$(grep -c "authentication_failure" logs/security.log 2>/dev/null || echo 0)
echo "Failed auth attempts (24h): $failed_auth"

if [ $failed_auth -gt 5 ]; then
    echo "⚠️  High failed authentication count"
fi

echo
echo "=== Health Check Complete ==="

Weekly Maintenance Tasks

Weekly Maintenance Script:

#!/bin/bash
# weekly_maintenance.sh

echo "=== AutoBot Weekly Maintenance ==="

# 1. Log Rotation
echo "Rotating logs..."
sudo logrotate /etc/logrotate.d/autobot

# 2. Clean Temporary Files
echo "Cleaning temporary files..."
find /tmp/autobot/ -type f -mtime +7 -delete
find /data/autobot/temp/ -type f -mtime +7 -delete

# 3. Database Optimization
echo "Optimizing databases..."
redis-cli -h 172.16.168.23 BGREWRITEAOF
curl -X POST http://127.0.0.1:8001/api/knowledge_base/optimize

# 4. Update System Packages
echo "Updating system packages..."
sudo apt update && sudo apt upgrade -y

# 5. Backup Critical Data
echo "Creating backup..."
python3 scripts/create_backup.py --type=incremental

# 6. Security Scan
echo "Running security scan..."
python3 scripts/security_scan.py --report=weekly

echo "Weekly maintenance completed"

Emergency Recovery Procedures

System Recovery Checklist

Complete System Failure Recovery:

Assess the Situation:

# Check what's running
docker ps
ps aux | grep -E "(python|node|redis)"

# Check logs for clues
tail -100 logs/autobot.log
dmesg | tail -20

Stop All Services:

# Stop AutoBot services
pkill -f "autobot"
docker stop $(docker ps -q)

# Kill any hanging processes
sudo pkill -f "python.*backend"
sudo pkill -f "node.*vite"

Clean Reset:

# Clean temporary files
rm -rf /tmp/autobot/*
rm -rf /var/run/autobot/*

# Clean Docker
docker system prune -f

Restart from Clean State:

# Full restart with rebuild
bash run_autobot.sh --dev --rebuild

# If that fails, try minimal startup
bash run_autobot.sh --minimal --build

Verify Recovery:

# Check all services
bash run_autobot.sh --status

# Test basic functionality
curl http://127.0.0.1:8001/api/health

Data Recovery Procedures

Knowledge Base Recovery:

# 1. Check if backup exists
ls -la backups/knowledge_base/

# 2. Restore from backup
python3 scripts/restore_knowledge_base.py --backup=latest

# 3. Rebuild if no backup
python3 scripts/rebuild_knowledge_base.py --from-source --confirm

Configuration Recovery:

# 1. Restore default configuration
cp config/config.yaml.default config/config.yaml

# 2. Restore environment variables
cp .env.example .env

# 3. Regenerate SSH keys
rm ~/.ssh/autobot_key*
ssh-keygen -t ed25519 -f ~/.ssh/autobot_key -N ""

Getting Additional Help

Built-in Diagnostic Tools

Interactive Diagnostics:

# Run comprehensive diagnostic
python3 scripts/autobot_diagnostics.py --interactive

# System health dashboard
python3 scripts/health_dashboard.py --web-interface

# Log analyzer
python3 scripts/log_analyzer.py --last=24h --severity=error

External Resources

Documentation:

Architecture Guide: docs/architecture/PHASE_5_DISTRIBUTED_ARCHITECTURE.md
API Documentation: docs/api/COMPREHENSIVE_API_DOCUMENTATION.md
Security Guide: docs/security/PHASE_5_SECURITY_IMPLEMENTATION.md
Developer Setup: docs/developer/PHASE_5_DEVELOPER_SETUP.md

Support Channels:

GitHub Issues: For reproducible bugs
Internal Wiki: For internal troubleshooting
Slack #autobot-support: For real-time help
Email: support@autobot.com

Emergency Contacts:

On-call Engineer: +1-800-AUTOBOT-1
System Administrator: admin@autobot.com
Security Incident: security@autobot.com

Remember: Most issues can be resolved by restarting the affected service or using the --rebuild flag. When in doubt, check the logs first - they usually contain the exact error message and solution path.

FilesExpand file tree

COMPREHENSIVE_TROUBLESHOOTING_GUIDE.md

Latest commit

History

COMPREHENSIVE_TROUBLESHOOTING_GUIDE.md

File metadata and controls

AutoBot Phase 5 - Comprehensive Troubleshooting Guide

Quick Diagnostic Commands

Issue Classification & Priority

🔥 Critical Issues (System Down)

⚠️ High Priority Issues (Degraded Performance)

📝 Medium Priority Issues (Feature Impact)

ℹ️ Low Priority Issues (Minor Impact)

Critical Issues (🔥)

1. Backend API Unresponsive

A. Redis Connection Timeout (Most Common)

B. Port Conflicts

C. Out of Memory

2. VM Communication Failures

A. SSH Connectivity Issues

B. Network Configuration Issues

3. Redis Database Failures

A. Redis Container Not Running

B. Redis Memory Issues

C. Database Configuration Issues

High Priority Issues (⚠️)

4. Multi-Modal AI Processing Failures

A. NPU Worker Issues

B. GPU Memory Issues

C. Model Loading Failures

5. Knowledge Base Search Failures

A. Vector Index Corruption

B. Embedding Service Issues

C. Database Corruption

6. Frontend Connection Issues

A. Vite Proxy Configuration Issues

B. CORS Issues

C. WebSocket Connection Issues

Medium Priority Issues (📝)

7. Terminal Command Execution Issues

A. Command Whitelist Issues

B. Sandbox Permission Issues

C. Terminal Session Cleanup

8. Browser Service Automation Failures

A. Browser Pool Exhaustion

B. Playwright Installation Issues

C. Display/X11 Issues

9. File Upload & Management Issues

A. Disk Space Issues

B. Virus Scanner Issues

C. File Permission Issues

Low Priority Issues (ℹ️)

10. Performance Optimization Issues

A. Database Query Optimization

B. Memory Optimization

C. CPU Optimization

Preventive Maintenance

Daily Health Checks

Weekly Maintenance Tasks

Emergency Recovery Procedures

System Recovery Checklist

Data Recovery Procedures

Getting Additional Help

Built-in Diagnostic Tools

External Resources