This repository contains operational runbooks for diagnosing and resolving common issues with web applications, API gateways, and databases.
Runbooks provide step-by-step procedures for troubleshooting and resolving common operational issues. Each runbook includes:
- Symptom identification
- Diagnostic commands
- Resolution steps
- Prevention strategies
- Escalation criteria
Web Applications
Comprehensive troubleshooting guide for web application issues including:
- High CPU/Memory usage
- Application not responding
- Slow response times
- Connection issues
- Certificate errors
- Deployment problems
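For the "Application not responding" case in the list above, a single probe can be misleading while a service is restarting. The following is a minimal sketch of a health probe with retries; the function name `wait_healthy`, the default of 5 attempts, and the 2-second delay are illustrative assumptions, not values taken from the runbooks.

```shell
# Poll a health endpoint until it stops returning an HTTP error or attempts run out.
# Usage: wait_healthy URL [ATTEMPTS] [DELAY_SECONDS]
# (wait_healthy and its defaults are hypothetical, chosen for illustration.)
wait_healthy() {
  local url="$1"
  local attempts="${2:-5}"
  local delay="${3:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: exit non-zero on HTTP status >= 400, -s: quiet,
    # -o /dev/null: discard the body, --max-time: cap each probe at 5 seconds.
    if curl -fs -o /dev/null --max-time 5 "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "unhealthy after $attempts attempt(s)" >&2
  return 1
}
```

A call such as `wait_healthy "http://localhost:<port>/health" 10 3` would probe up to ten times, three seconds apart, and its exit status can gate the next step of a procedure.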
Quick Command Reference:
# Check application status
systemctl status <service-name>
# Monitor logs in real-time
tail -f /var/log/application/*.log
# Check port availability
netstat -tulpn | grep <port>
# Test application health
curl -I http://localhost:<port>/health
API Gateways
Troubleshooting guide for API gateway issues including:
- High latency
- Rate limiting problems
- Authentication failures
- Gateway timeouts (504)
- Bad gateway errors (502)
- Service unavailable (503)
- Routing issues
- SSL/TLS problems
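For the high-latency case in the list above, it often helps to see which phase of a request is slow rather than only the total. The sketch below uses curl's cumulative `--write-out` timers and converts them into per-phase durations; the function names `curl_timings` and `phase_durations` are hypothetical, and the phase labels are one possible breakdown.

```shell
# Print cumulative curl timers for one request, space-separated, in seconds:
# namelookup connect appconnect starttransfer total
curl_timings() {
  curl -s -o /dev/null \
    -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n' \
    "$1"
}

# Convert one line of cumulative timers (stdin) into per-phase durations.
phase_durations() {
  awk '{
    dns = $1
    tcp = $2 - $1
    # time_appconnect stays at 0 for plain HTTP, where no TLS handshake happens.
    tls = ($3 > 0) ? $3 - $2 : 0
    handshake_done = ($3 > 0) ? $3 : $2
    wait = $4 - handshake_done      # time until the first response byte
    transfer = $5 - $4              # time to download the body
    printf "dns=%.3f tcp=%.3f tls=%.3f wait=%.3f transfer=%.3f\n", dns, tcp, tls, wait, transfer
  }'
}
```

Running `curl_timings https://api.example.com/endpoint | phase_durations` gives a one-line breakdown. A large `wait` value points at the upstream service, while large `dns`, `tcp`, or `tls` values point at the network path or the gateway itself.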
Quick Command Reference:
# Test API endpoint with timing
curl -w "%{time_total}\n" -o /dev/null -s https://api.example.com/endpoint
# Monitor API gateway logs
tail -f /var/log/nginx/access.log
# Count established TCP connections (all peers, including upstreams)
netstat -an | grep ESTABLISHED | wc -l
# Test SSL certificate
openssl s_client -connect api.example.com:443 -servername api.example.com
Databases
Comprehensive database troubleshooting guide covering PostgreSQL, MySQL, MongoDB, and Redis:
- Slow query performance
- Connection issues
- High CPU/Memory usage
- Disk space problems
- Replication lag
- Deadlocks
- Backup and recovery procedures
Quick Command Reference:
# PostgreSQL: Check active queries
psql -U postgres -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"
# MySQL: Check running queries
mysql -u root -p -e "SHOW FULL PROCESSLIST;"
# MongoDB: Check current operations (use the legacy mongo shell where mongosh is not installed)
mongosh --eval "db.currentOp()"
# Redis: Monitor commands (prints every command the server receives; use briefly on busy instances)
redis-cli MONITOR
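The PostgreSQL command above lists every non-idle session, which can be noisy on a busy server. As a sketch of one way to narrow it to long-running queries, the functions below build a filtered query and run it with the same `psql -U postgres` invocation; the names `long_queries_sql` and `show_long_queries` and the 60-second default are assumptions for illustration.

```shell
# Print a query that lists non-idle sessions running longer than N seconds.
# (long_queries_sql and the 60-second default are hypothetical.)
long_queries_sql() {
  local min_seconds="${1:-60}"
  # Accept only a plain integer so nothing unexpected reaches the SQL text.
  case "$min_seconds" in
    ''|*[!0-9]*)
      echo "min_seconds must be a non-negative integer" >&2
      return 1
      ;;
  esac
  cat <<EOF
SELECT pid, usename, state, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '${min_seconds} seconds'
ORDER BY runtime DESC;
EOF
}

# Run the generated query with the same connection options as the quick reference.
show_long_queries() {
  local sql
  sql=$(long_queries_sql "$1") || return 1
  psql -U postgres -c "$sql"
}
```

`show_long_queries 300` would list sessions whose current query has been running for more than five minutes, longest first.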
Start by identifying which system is experiencing problems:
- Is it the web application itself?
- Is it the API gateway or load balancer?
- Is it the database backend?
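The three questions above can be turned into an ordered set of probes. The following sketch runs one check per layer and names the first one that fails; the function name `first_failing_layer`, the layer names, and the example check commands are all assumptions to adapt to the environment.

```shell
# Usage: first_failing_layer NAME CMD [NAME CMD ...]
# Runs each CMD through `sh -c` in the order given and prints the first NAME
# whose check exits non-zero. Prints "none" and returns 0 when every check passes;
# returns 1 when a failing layer was found.
first_failing_layer() {
  while [ "$#" -ge 2 ]; do
    if ! sh -c "$2" >/dev/null 2>&1; then
      echo "$1"
      return 1
    fi
    shift 2
  done
  echo "none"
  return 0
}

# Example call (hypothetical checks; the endpoints and the bottom-up order
# are assumptions, not part of the runbooks):
#   first_failing_layer \
#     database 'pg_isready -q' \
#     web-app  'curl -fs -o /dev/null --max-time 5 http://localhost:<port>/health' \
#     gateway  'curl -fs -o /dev/null --max-time 5 https://api.example.com/endpoint'
```

Checking the deepest dependency first, as in the commented example, means the reported layer is more likely to be the cause than a symptom of a failure further down.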
Run basic health checks:
# System resources
top
free -h
df -h
# Network connectivity
ping <host>
netstat -tulpn
# Service status
systemctl status <service-name>
Check relevant logs for errors:
# Application logs
tail -100 /var/log/application/*.log
# System logs
journalctl -xe
# Specific service logs
journalctl -u <service-name> -n 100
Navigate to the relevant runbook and follow the diagnostic and resolution steps.
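As one way to script the disk check from the basic health checks above, the sketch below reads `df -P` output and prints only the mount points at or above a usage threshold. The function name `disks_over` and the 90% default are assumptions, not thresholds defined by these runbooks.

```shell
# Read `df -P` output on stdin and print mount points at or above a usage threshold.
# Usage: df -P | disks_over [THRESHOLD_PERCENT]
# (disks_over and the default of 90 are hypothetical.)
# Note: mount points containing spaces are not handled by this simple parser.
disks_over() {
  awk -v limit="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)                      # "91%" -> "91"
    if (use + 0 >= limit + 0) print $6, use "%"
  }'
}
```

`df -P | disks_over 85` prints one line per filesystem at 85% or more, and prints nothing when all filesystems are below the threshold, which makes it easy to use in a scheduled check.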
┌─────────────────────────────────────────┐
│             Issue Detected              │
│    (Alert, User Report, Monitoring)     │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│           Initial Assessment            │
│  • Check monitoring dashboards          │
│  • Review recent changes                │
│  • Identify affected components         │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│         Run Diagnostic Commands         │
│  • System health checks                 │
│  • Application-specific diagnostics     │
│  • Log analysis                         │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│          Follow Runbook Steps           │
│  • Execute resolution procedures        │
│  • Document actions taken               │
│  • Verify fix                           │
└────────────────┬────────────────────────┘
                 │
                 ▼
        ┌────────┴────────┐
        │                 │
        ▼                 ▼
    Resolved          Escalate
- Stay Calm: Follow the runbook systematically
- Document: Record all commands run and observations
- Communicate: Keep stakeholders informed of progress
- Verify: Always verify the fix before marking as resolved
- Post-Mortem: Document lessons learned after resolution
- Monitor Proactively: Set up alerts for key metrics
- Test Regularly: Conduct regular load and failover testing
- Keep Updated: Maintain runbooks with new findings
- Automate: Automate routine checks and remediation where possible
- Review: Regularly review and update procedures
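For the "Document" practice in the list above, recording commands by hand is easy to forget under pressure. The following is a minimal sketch of a wrapper that appends each command, its output, and its exit status to an incident log; the function name `run_logged` and the `INCIDENT_LOG` variable are hypothetical.

```shell
# File that collects the incident record (hypothetical default location).
INCIDENT_LOG="${INCIDENT_LOG:-./incident-$(date +%Y%m%d).log}"

# Run a command and append a UTC timestamp, the command line, its output,
# and its exit status to $INCIDENT_LOG. Output is captured first and then
# printed, so this suits short commands rather than `tail -f`.
run_logged() {
  local output
  local status
  printf '%s $ %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >>"$INCIDENT_LOG"
  output=$("$@" 2>&1)
  status=$?
  if [ -n "$output" ]; then
    # Show the output on the terminal and keep a copy in the log.
    printf '%s\n' "$output" | tee -a "$INCIDENT_LOG"
  fi
  printf '# exit status: %s\n' "$status" >>"$INCIDENT_LOG"
  return "$status"
}
```

During an incident, `run_logged systemctl status <service-name>` behaves like the plain command while leaving a timestamped trail that can be attached to the post-mortem.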
# CPU usage
top -b -n 1 | head -20
ps aux --sort=-%cpu | head -10
# Memory usage
free -h
ps aux --sort=-%mem | head -10
# Disk usage
df -h
du -sh /var/log/*
# Network
netstat -tulpn
ss -tulpn
# Check service status
systemctl status <service-name>
# Restart service
systemctl restart <service-name>
# View service logs
journalctl -u <service-name> -f
# Tail logs with follow
tail -f /var/log/application/*.log
# Search for errors
grep -i error /var/log/application/*.log | tail -50
# Count matching lines per file
grep -c "ERROR" /var/log/application/*.log
Escalate to senior engineers or management when:
- Issue persists after following runbook procedures
- Multiple services affected (cascading failure)
- Data loss or corruption suspected
- Security breach suspected
- SLA breach imminent or occurred
- Issue requires architectural changes
- Critical business impact
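Most of the criteria above need human judgment, but "multiple services affected" can be partly checked on a single host. The sketch below counts failed systemd units and reports whether the count reaches a threshold; the function names and the threshold of 2 are assumptions, and the check says nothing about services on other hosts.

```shell
# Count units listed by `systemctl --failed --no-legend --plain` (read from stdin).
count_failed_units() {
  awk 'NF > 0 { n++ } END { print n + 0 }'
}

# Succeed when at least THRESHOLD units have failed (default 2, an assumed value).
# Usage: systemctl --failed --no-legend --plain | multiple_services_affected [THRESHOLD]
multiple_services_affected() {
  [ "$(count_failed_units)" -ge "${1:-2}" ]
}
```

For example, `systemctl --failed --no-legend --plain | multiple_services_affected && echo "consider escalating"` prints the hint only when two or more units are in the failed state.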
To update or add to these runbooks:
- Test procedures in a non-production environment
- Document all steps clearly with example commands
- Include both diagnostic and resolution procedures
- Add escalation criteria
- Submit changes via pull request
- On-Call Engineer: [Insert contact information]
- Database Team: [Insert contact information]
- Network Team: [Insert contact information]
- Security Team: [Insert contact information]
- Monitoring Dashboards
- Architecture Diagrams
- Incident Management Process
- Change Management Process
- Disaster Recovery Plan
Last Updated: 2026-02-11
Maintained By: Operations Team