Operational Runbooks

This repository contains operational runbooks for diagnosing and resolving common issues with web applications, API gateways, and databases.

English Documentation

Overview

Runbooks provide step-by-step procedures for troubleshooting and resolving common operational issues. Each runbook includes:

  • Symptom identification
  • Diagnostic commands
  • Resolution steps
  • Prevention strategies
  • Escalation criteria

Available Runbooks

Web Applications

Comprehensive troubleshooting guide for web application issues, including:

  • High CPU/Memory usage
  • Application not responding
  • Slow response times
  • Connection issues
  • Certificate errors
  • Deployment problems

Quick Command Reference:

# Check application status
systemctl status <service-name>

# Monitor logs in real-time
tail -f /var/log/application/*.log

# Check port availability
netstat -tulpn | grep <port>

# Test application health
curl -I http://localhost:<port>/health
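The health check above can be wrapped so that scripts get a simple verdict instead of raw headers. This is a minimal sketch under stated assumptions: the function names `classify_status` and `check_health`, the 5-second timeout, and the rule that any 2xx response counts as healthy are illustrative choices, not part of the runbooks.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and thresholds are assumptions for illustration.

# Map an HTTP status code to a coarse health verdict.
# curl reports 000 when no HTTP response was received at all.
classify_status() {
  case "$1" in
    2[0-9][0-9]) echo "healthy" ;;
    000)         echo "unreachable" ;;
    *)           echo "unhealthy" ;;
  esac
}

# Fetch the status code of a health endpoint and classify it.
# Usage: check_health http://localhost:<port>/health
check_health() {
  local url="$1" code
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code,
  # --max-time: give up after 5 seconds (assumed timeout).
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  classify_status "$code"
}
```

A caller can then branch on the result, for example `[ "$(check_health "$url")" = "healthy" ] || echo "health check failed"`.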

API Gateway

Troubleshooting guide for API gateway issues, including:

  • High latency
  • Rate limiting problems
  • Authentication failures
  • Gateway timeouts (504)
  • Bad gateway errors (502)
  • Service unavailable (503)
  • Routing issues
  • SSL/TLS problems

Quick Command Reference:

# Test API endpoint with timing
curl -w "%{time_total}\n" -o /dev/null -s https://api.example.com/endpoint

# Monitor API gateway logs
tail -f /var/log/nginx/access.log

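# Count established TCP connections (client and upstream combined)
netstat -an | grep ESTABLISHED | wc -l

# Test SSL certificate (stdin from /dev/null so the command exits after the handshake)
openssl s_client -connect api.example.com:443 -servername api.example.com </dev/null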
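When latency is the symptom, a single `time_total` figure does not show where the time goes. The sketch below uses curl's `--write-out` timing variables to break one request into phases. The helper names `time_request` and `phase_durations` and the labels (`dns`, `connect`, and so on) are assumptions made for illustration, and the arithmetic assumes an HTTPS endpoint, since `time_appconnect` is zero for plain HTTP.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the label names are arbitrary.

# curl reports each value as seconds elapsed since the request started,
# so the numbers are cumulative rather than per-phase.
CURL_TIMING_FORMAT='dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n'

# Print one line of cumulative timings for a request.
# Usage: time_request https://api.example.com/endpoint
time_request() {
  curl -sS -o /dev/null -w "$CURL_TIMING_FORMAT" "$1"
}

# Convert the cumulative timings on stdin into per-phase durations:
# each value becomes the difference from the previous field.
phase_durations() {
  LC_ALL=C awk '{
    prev = 0
    for (i = 1; i <= NF; i++) {
      split($i, kv, "=")
      printf "%s=%.3f%s", kv[1], kv[2] - prev, (i < NF ? " " : "\n")
      prev = kv[2]
    }
  }'
}
```

Running `time_request https://api.example.com/endpoint | phase_durations` then shows, per phase, how long DNS lookup, TCP connect, TLS handshake, waiting for the first byte, and body transfer each took. A large `ttfb` value points at the upstream service rather than the network.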

🗄️ Databases

Comprehensive database troubleshooting guide covering PostgreSQL, MySQL, MongoDB, and Redis:

  • Slow query performance
  • Connection issues
  • High CPU/Memory usage
  • Disk space problems
  • Replication lag
  • Deadlocks
  • Backup and recovery procedures

Quick Command Reference:

# PostgreSQL: Check active queries
psql -U postgres -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"

# MySQL: Check running queries
mysql -u root -p -e "SHOW FULL PROCESSLIST;"

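# MongoDB: Check current operations (mongosh replaces the legacy mongo shell, removed in MongoDB 6.0)
mongosh --eval "db.currentOp()"

# Redis: Monitor commands (streams every command and adds load; use briefly in production)
redis-cli MONITOR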
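For slow-query investigations, the PostgreSQL check above is often more useful when limited to sessions that have been running for a while. The following is a sketch, assuming a hypothetical helper named `long_running_sql` and a default threshold of 30 seconds; it only builds the SQL text, which is then passed to `psql` as in the reference command.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name and the 30-second default are assumptions.

# Print a query that lists non-idle sessions running longer than N seconds.
long_running_sql() {
  local seconds="${1:-30}"
  # Accept digits only, so the value can be embedded in the SQL safely.
  case "$seconds" in
    *[!0-9]*)
      echo "threshold must be a whole number of seconds" >&2
      return 1
      ;;
  esac
  cat <<SQL
SELECT pid, usename, state,
       now() - query_start AS runtime,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '${seconds} seconds'
ORDER BY runtime DESC;
SQL
}
```

Usage: `psql -U postgres -c "$(long_running_sql 60)"` lists sessions that have been active for more than a minute, longest first.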

Quick Start Guide

1. Identify the Issue

Start by identifying which system is experiencing problems:

  • Is it the web application itself?
  • Is it the API gateway or load balancer?
  • Is it the database backend?

2. Check System Health

Run basic health checks:

# System resources
top
free -h
df -h

# Network connectivity
ping <host>
netstat -tulpn

# Service status
systemctl status <service-name>
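The disk check in particular is easy to misread under pressure, so it can help to filter `df` output down to the filesystems that are actually close to full. This is a minimal sketch: the function name `disks_over_threshold` and the 90% default are assumptions, and it assumes mount points contain no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name and the 90% default are assumptions.

# Print "<mount point> <usage>" for filesystems at or above a usage threshold.
# Reads POSIX-format `df -P` output on stdin, so it also works on saved output.
disks_over_threshold() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" 'NR > 1 {
    use = $5               # Capacity column, e.g. "95%"
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6, $5
  }'
}
```

Usage: `df -P | disks_over_threshold 85` prints nothing when every filesystem is below 85% full.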

3. Review Logs

Check relevant logs for errors:

# Application logs
tail -n 100 /var/log/application/*.log

# System logs
journalctl -xe

# Specific service logs
journalctl -u <service-name> -n 100

4. Follow the Appropriate Runbook

Navigate to the relevant runbook and follow the diagnostic and resolution steps.

General Troubleshooting Workflow

┌─────────────────────────────────────────┐
│   Issue Detected                        │
│   (Alert, User Report, Monitoring)      │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│   Initial Assessment                    │
│   • Check monitoring dashboards         │
│   • Review recent changes               │
│   • Identify affected components        │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│   Run Diagnostic Commands               │
│   • System health checks                │
│   • Application-specific diagnostics    │
│   • Log analysis                        │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│   Follow Runbook Steps                  │
│   • Execute resolution procedures       │
│   • Document actions taken              │
│   • Verify fix                          │
└────────────────┬────────────────────────┘
                 │
                 ▼
         ┌───────┴────────┐
         │                │
         ▼                ▼
    Resolved          Escalate

Best Practices

During Incidents

  1. Stay Calm: Follow the runbook systematically
  2. Document: Record all commands run and observations
  3. Communicate: Keep stakeholders informed of progress
  4. Verify: Always verify the fix before marking as resolved
  5. Post-Mortem: Document lessons learned after resolution
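Recording every command by hand is easy to forget mid-incident. One way to make it automatic is a small wrapper that logs each command, its output, and its exit status. This is a sketch under stated assumptions: the `run_logged` name, the `INCIDENT_LOG` variable, and the default log path are hypothetical, and it relies on bash for `PIPESTATUS`.

```shell
#!/usr/bin/env bash
# Hypothetical incident log location; override by exporting INCIDENT_LOG.
INCIDENT_LOG="${INCIDENT_LOG:-./incident-$(date +%Y%m%d).log}"

# Run a command, appending a UTC timestamp, the command line, its combined
# output, and its exit status to the incident log. Output is also shown on
# screen, and the command's own exit status is returned to the caller.
run_logged() {
  local status
  printf '[%s] $ %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$INCIDENT_LOG"
  "$@" 2>&1 | tee -a "$INCIDENT_LOG"
  status=${PIPESTATUS[0]}   # exit status of the command, not of tee
  printf '[exit %s]\n' "$status" >> "$INCIDENT_LOG"
  return "$status"
}
```

Usage: `run_logged systemctl status <service-name>` behaves like the plain command but leaves a timestamped record that can be pasted into the post-mortem.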

Preventive Measures

  1. Monitor Proactively: Set up alerts for key metrics
  2. Test Regularly: Conduct regular load and failover testing
  3. Keep Updated: Add new findings to the runbooks as they are discovered
  4. Automate: Automate routine checks and remediation where possible
  5. Review: Review procedures on a regular schedule and correct any that are out of date

Common Commands Cheat Sheet

System Diagnostics

# CPU usage
top -b -n 1 | head -20
ps aux --sort=-%cpu | head -10

# Memory usage
free -h
ps aux --sort=-%mem | head -10

# Disk usage
df -h
du -sh /var/log/*

# Network: listening sockets (ss is the modern replacement for netstat)
netstat -tulpn
ss -tulpn

Service Management

# Check service status
systemctl status <service-name>

# Restart service
systemctl restart <service-name>

# View service logs
journalctl -u <service-name> -f
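A restart that returns without error does not guarantee the service stays up, so it is worth polling its state afterwards. The sketch below shows one way to do that. The helper names `wait_until` and `restart_and_verify`, and the choice of 10 attempts spaced 3 seconds apart, are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names, attempt count, and delay are assumptions.

# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command...>
wait_until() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Restart a unit, then confirm it reaches the active state.
# Usage: restart_and_verify <service-name>
restart_and_verify() {
  local unit="$1"
  systemctl restart "$unit" &&
    wait_until 10 3 systemctl is-active --quiet "$unit"
}
```

`restart_and_verify` returns non-zero if the unit never becomes active, which is a reasonable signal to move on to the logs rather than restarting again.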

Log Analysis

# Tail logs with follow
tail -f /var/log/application/*.log

# Search for errors
grep -i error /var/log/application/*.log | tail -50

# Count matching lines per file
grep -c "ERROR" /var/log/application/*.log
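Raw counts say how many errors there are but not which ones dominate. The following sketch groups identical error lines and ranks them by frequency. The function name `top_errors` is hypothetical, and the timestamp pattern assumes lines begin with an ISO-8601-style date and time such as `2026-02-11 10:00:01`; other log formats would need a different pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the timestamp format is an assumption about the logs.

# Print the ten most frequent error lines across the given log files.
# The leading timestamp is stripped first, since it differs on every line
# and would otherwise prevent identical messages from being grouped.
top_errors() {
  grep -h -i 'error' "$@" \
    | sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}[T ][0-9:.,]+Z? *//' \
    | sort | uniq -c | sort -rn | head -n 10
}
```

Usage: `top_errors /var/log/application/*.log` prints each distinct message prefixed by its count, most frequent first.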

Escalation Guidelines

Escalate to senior engineers or management when:

  • Issue persists after following runbook procedures
  • Multiple services affected (cascading failure)
  • Data loss or corruption suspected
  • Security breach suspected
  • SLA breach imminent or occurred
  • Issue requires architectural changes
  • Critical business impact

Contributing

To update or add to these runbooks:

  1. Test procedures in a non-production environment
  2. Document all steps clearly with example commands
  3. Include both diagnostic and resolution procedures
  4. Add escalation criteria
  5. Submit changes via pull request

Emergency Contacts

  • On-Call Engineer: [Insert contact information]
  • Database Team: [Insert contact information]
  • Network Team: [Insert contact information]
  • Security Team: [Insert contact information]

Last Updated: 2026-02-11

Maintained By: Operations Team