This repository contains operational runbooks for diagnosing and resolving common issues with web applications, API gateways, and databases.
Runbooks provide step-by-step procedures for troubleshooting and resolving common operational issues. Each runbook includes:
- Symptom identification
- Diagnostic commands
- Resolution steps
- Prevention strategies
- Escalation criteria
Comprehensive troubleshooting guide for web application issues including:
- High CPU/Memory usage
- Application not responding
- Slow response times
- Connection issues
- Certificate errors
- Deployment problems
Quick Command Reference:
# Check application status
systemctl status <service-name>
# Monitor logs in real-time
tail -f /var/log/application/*.log
# Check port availability
netstat -tulpn | grep <port>
# Test application health
curl -I http://localhost:<port>/health
Troubleshooting guide for API gateway issues including:
- High latency
- Rate limiting problems
- Authentication failures
- Gateway timeouts (504)
- Bad gateway errors (502)
- Service unavailable (503)
- Routing issues
- SSL/TLS problems
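For the latency and 5xx symptoms listed above, the following is a minimal sketch of a helper that breaks one request into timing phases and labels the status code. The function names `time_request` and `describe_gateway_status` are hypothetical and not part of this repository, and the mapping of 401/403 and 429 to the authentication and rate-limiting symptoms is an assumption about how the gateway reports them.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and status mapping are illustrative assumptions.

# Print the HTTP status and a per-phase timing breakdown for one request.
time_request() {
  curl --silent --output /dev/null --max-time 10 \
    --write-out 'status=%{http_code} dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s first_byte=%{time_starttransfer}s total=%{time_total}s\n' \
    "$1"
}

# Map a status code to the symptom named in the list above.
describe_gateway_status() {
  case "$1" in
    401|403) echo "authentication failure" ;;
    429)     echo "rate limited" ;;
    502)     echo "bad gateway" ;;
    503)     echo "service unavailable" ;;
    504)     echo "gateway timeout" ;;
    *)       echo "other" ;;
  esac
}

# Example (placeholder URL taken from the command reference):
#   time_request https://api.example.com/endpoint
```

A large gap between `connect` and `first_byte` usually points at the upstream service rather than the network path, which helps decide whether the gateway or the backend runbook applies.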
Quick Command Reference:
# Test API endpoint with timing
curl -w "%{time_total}\n" -o /dev/null -s https://api.example.com/endpoint
# Monitor API gateway logs
tail -f /var/log/nginx/access.log
# Count established TCP connections (all peers, not only upstreams)
netstat -an | grep ESTABLISHED | wc -l
# Test SSL certificate
openssl s_client -connect api.example.com:443 -servername api.example.com
🗄️ Databases
Comprehensive database troubleshooting guide covering PostgreSQL, MySQL, MongoDB, and Redis:
- Slow query performance
- Connection issues
- High CPU/Memory usage
- Disk space problems
- Replication lag
- Deadlocks
- Backup and recovery procedures
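The disk space item in the list above lends itself to a scripted check. The sketch below filters `df -P` output for filesystems at or above a usage threshold. The function name `over_threshold` and the 90% figure are assumptions for illustration, and mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name and threshold are illustrative assumptions.

# Read `df -P` output on stdin and print "mountpoint usage%" for every
# filesystem whose usage is at or above the given percentage.
over_threshold() {
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)          # strip the trailing percent sign
      if (use + 0 >= limit) print $6, $5
    }'
}

# Example:
#   df -P | over_threshold 90
```

Reading from stdin keeps the filter separate from `df`, so the same function can be pointed at saved output collected during an incident.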
Quick Command Reference:
# PostgreSQL: Check active queries
psql -U postgres -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"
# MySQL: Check running queries
mysql -u root -p -e "SHOW FULL PROCESSLIST;"
# MongoDB: Check current operations (mongosh replaces the legacy mongo shell, removed in MongoDB 6.0)
mongosh --eval "db.currentOp()"
# Redis: Monitor commands
redis-cli MONITOR
Start by identifying which system is experiencing problems:
- Is it the web application itself?
- Is it the API gateway or load balancer?
- Is it the database backend?
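The three questions above can be turned into a quick probe of each layer. This is a rough sketch under stated assumptions: the function names, the example endpoints, and the rule of opening the runbook for the deepest failing dependency first are all illustrative, not something this repository prescribes.

```shell
#!/usr/bin/env bash
# Hypothetical triage helpers; endpoints and ordering are assumptions.

# Return 0 if an HTTP endpoint answers without an error status within 5 seconds.
probe_http() {
  curl --silent --fail --max-time 5 --output /dev/null "$1"
}

# Return 0 if a TCP port accepts a connection within 5 seconds.
probe_tcp() {
  timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Map three probe results (0 = healthy) to the runbook to open first.
# Assumption: investigate the deepest failing dependency first.
suggest_runbook() {
  local web=$1 gateway=$2 db=$3
  if [ "$db" -ne 0 ]; then
    echo "database"
  elif [ "$web" -ne 0 ]; then
    echo "web-application"
  elif [ "$gateway" -ne 0 ]; then
    echo "api-gateway"
  else
    echo "none"
  fi
}

# Example (placeholder hosts and ports):
#   probe_http "http://localhost:8080/health";      web=$?
#   probe_http "https://api.example.com/endpoint";  gateway=$?
#   probe_tcp  "db.internal" 5432;                  db=$?
#   suggest_runbook "$web" "$gateway" "$db"
```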
Run basic health checks:
# System resources
top
free -h
df -h
# Network connectivity
ping <host>
netstat -tulpn
# Service status
systemctl status <service-name>
Check relevant logs for errors:
# Application logs
tail -100 /var/log/application/*.log
# System logs
journalctl -xe
# Specific service logs
journalctl -u <service-name> -n 100
Navigate to the relevant runbook and follow the diagnostic and resolution steps.
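When several log files are involved in the log review step, it can help to rank them by error volume before reading any of them. A minimal sketch, assuming plain-text logs where error lines contain the word "error" in any case; the function name `error_counts` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper; assumes file paths do not contain a colon.

# Print "file:count" for each log file, highest error count first.
# Matching is case-insensitive and counts lines, not individual matches.
error_counts() {
  grep -Hci -- 'error' "$@" | sort -t: -k2,2nr
}

# Example:
#   error_counts /var/log/application/*.log | head -5
```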
┌─────────────────────────────────────────┐
│           Issue Detected                │
│    (Alert, User Report, Monitoring)     │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│         Initial Assessment              │
│  • Check monitoring dashboards          │
│  • Review recent changes                │
│  • Identify affected components         │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│       Run Diagnostic Commands           │
│  • System health checks                 │
│  • Application-specific diagnostics     │
│  • Log analysis                         │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│         Follow Runbook Steps            │
│  • Execute resolution procedures        │
│  • Document actions taken               │
│  • Verify fix                           │
└────────────────┬────────────────────────┘
                 │
                 ▼
         ┌───────┴────────┐
         │                │
         ▼                ▼
     Resolved         Escalate
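The final branch of the workflow above (verify the fix, then either resolve or escalate) can be sketched as a small retry loop. The function name `verify_fix`, the retry count, and the delay are assumptions for illustration; the verification command is whatever health check the relevant runbook specifies.

```shell
#!/usr/bin/env bash
# Hypothetical helper; retry policy is an illustrative assumption.

# Run a verification command up to ATTEMPTS times, DELAY seconds apart.
# Prints "Resolved" and returns 0 on the first success,
# otherwise prints "Escalate" and returns 1.
verify_fix() {
  local attempts=$1 delay=$2
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "Resolved"
      return 0
    fi
    # Wait between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  echo "Escalate"
  return 1
}

# Example (placeholder port):
#   verify_fix 3 10 curl --silent --fail --output /dev/null "http://localhost:8080/health"
```

Retrying a few times before escalating avoids flagging a service that is still starting up, at the cost of a short delay before the escalation decision.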
- Stay Calm: Follow the runbook systematically
- Document: Record all commands run and observations
- Communicate: Keep stakeholders informed of progress
- Verify: Always verify the fix before marking as resolved
- Post-Mortem: Document lessons learned after resolution
- Monitor Proactively: Set up alerts for key metrics
- Test Regularly: Conduct regular load and failover testing
- Keep Updated: Maintain runbooks with new findings
- Automate: Automate routine checks and remediation where possible
- Review: Regularly review and update procedures
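The "Document" practice above is easier to follow when commands are recorded as they run. The wrapper below is a minimal sketch, assuming a writable log file chosen by the operator; the function name `run_logged` and the log format are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper; log path and format are illustrative assumptions.

# Run a command, append a UTC timestamp, the command line, its combined
# output, and its exit status to LOG, then echo the output to the terminal.
run_logged() {
  local log=$1 output status
  shift
  output=$("$@" 2>&1) && status=0 || status=$?
  {
    printf '[%s] $ %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"
    printf '%s\n' "$output"
    printf '[exit %s]\n' "$status"
  } >> "$log"
  printf '%s\n' "$output"
  return "$status"
}

# Example (placeholder log path):
#   run_logged /tmp/incident.log df -h
```

Output is captured before it is printed, so this suits short diagnostic commands rather than long-running ones such as `tail -f`.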
# CPU usage
top -b -n 1 | head -20
ps aux --sort=-%cpu | head -10
# Memory usage
free -h
ps aux --sort=-%mem | head -10
# Disk usage
df -h
du -sh /var/log/*
# Network
netstat -tulpn
ss -tulpn
# Check service status
systemctl status <service-name>
# Restart service
systemctl restart <service-name>
# View service logs
journalctl -u <service-name> -f
# Tail logs with follow
tail -f /var/log/application/*.log
# Search for errors
grep -i error /var/log/application/*.log | tail -50
# Count occurrences
grep -c "ERROR" /var/log/application/*.log
Escalate to senior engineers or management when:
- Issue persists after following runbook procedures
- Multiple services affected (cascading failure)
- Data loss or corruption suspected
- Security breach suspected
- SLA breach imminent or occurred
- Issue requires architectural changes
- Critical business impact
To update or add to these runbooks:
- Test procedures in a non-production environment
- Document all steps clearly with example commands
- Include both diagnostic and resolution procedures
- Add escalation criteria
- Submit changes via pull request
- On-Call Engineer: [Insert contact information]
- Database Team: [Insert contact information]
- Network Team: [Insert contact information]
- Security Team: [Insert contact information]
- Monitoring Dashboards
- Architecture Diagrams
- Incident Management Process
- Change Management Process
- Disaster Recovery Plan
Last Updated: 2026-02-11
Maintained By: Operations Team