Operational Runbooks / オペレーショナルランブック

This repository contains operational runbooks for diagnosing and resolving common issues with web applications, API gateways, and databases.

このリポジトリには、Webアプリケーション、APIゲートウェイ、データベースの一般的な問題を診断・解決するためのオペレーショナルランブックが含まれています。

Language / 言語

English Documentation (英語版)
日本語ドキュメント (Japanese Documentation)

English Documentation

Overview

Runbooks provide step-by-step procedures for troubleshooting and resolving common operational issues. Each runbook includes:

Symptom identification
Diagnostic commands
Resolution steps
Prevention strategies
Escalation criteria

Available Runbooks

🌐 Web Applications

Comprehensive troubleshooting guide for web application issues including:

High CPU/Memory usage
Application not responding
Slow response times
Connection issues
Certificate errors
Deployment problems

Quick Command Reference:

# Check application status
systemctl status <service-name>

# Monitor logs in real-time
tail -f /var/log/application/*.log

# Check port availability
netstat -tulpn | grep <port>

# Test application health
curl -I http://localhost:<port>/health
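
A health endpoint often fails for a few seconds right after a restart, so a single curl can be misleading. The sketch below wraps any command in a simple retry loop. The function name `retry` and its argument order are illustrative assumptions, not part of the runbooks.

```shell
# retry ATTEMPTS DELAY COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS runs out.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $n/$attempts failed: $*" >&2
        n=$((n + 1))
        # Wait between attempts, but not after the final one.
        if [ "$n" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (placeholder port, as in the commands above):
#   retry 5 2 curl -fsS -o /dev/null http://localhost:<port>/health
```

With `-f`, curl exits non-zero on HTTP 4xx/5xx responses, so the loop keeps retrying until the endpoint returns a success status.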

🔌 API Gateways

Troubleshooting guide for API gateway issues including:

High latency
Rate limiting problems
Authentication failures
Gateway timeouts (504)
Bad gateway errors (502)
Service unavailable (503)
Routing issues
SSL/TLS problems

Quick Command Reference:

# Test API endpoint with timing
curl -w "%{time_total}\n" -o /dev/null -s https://api.example.com/endpoint

# Monitor API gateway logs
tail -f /var/log/nginx/access.log

# Count established TCP connections (all peers, not only upstreams)
netstat -an | grep ESTABLISHED | wc -l

# Test SSL certificate
openssl s_client -connect api.example.com:443 -servername api.example.com
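
When chasing 502/503/504 errors, a per-status count from the access log is usually a quick first signal. This is a minimal sketch that assumes nginx's default "combined" log format, where the status code is the ninth whitespace-separated field; a custom `log_format` would need a different field number. The function names are hypothetical.

```shell
# status_summary: read access-log lines on stdin and
# print "<status> <count>" for each HTTP status code seen.
status_summary() {
    awk '$9 ~ /^[0-9][0-9][0-9]$/ { count[$9]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# count_5xx: read access-log lines on stdin and print the number of 5xx responses.
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { n++ } END { print n + 0 }'
}

# Example:
#   tail -n 1000 /var/log/nginx/access.log | status_summary
#   tail -n 1000 /var/log/nginx/access.log | count_5xx
```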

🗄️ Databases

Comprehensive database troubleshooting guide covering PostgreSQL, MySQL, MongoDB, and Redis:

Slow query performance
Connection issues
High CPU/Memory usage
Disk space problems
Replication lag
Deadlocks
Backup and recovery procedures

Quick Command Reference:

# PostgreSQL: Check active queries
psql -U postgres -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"

# MySQL: Check running queries
mysql -u root -p -e "SHOW FULL PROCESSLIST;"

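# MongoDB: Check current operations (use the legacy mongo shell where mongosh is not installed)
mongosh --eval "db.currentOp()"

# Redis: Monitor commands (MONITOR streams every command and reduces throughput; run it only briefly)
redis-cli MONITOR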

日本語ドキュメント

概要

ランブックは、一般的な運用上の問題をトラブルシューティングし解決するための手順を段階的に提供します。各ランブックには以下が含まれています：

症状の特定
診断コマンド
解決手順
予防策
エスカレーション基準

利用可能なランブック

🌐 Webアプリケーション

Webアプリケーションの問題に関する包括的なトラブルシューティングガイド：

CPU/メモリ使用率が高い
アプリケーションが応答しない
レスポンスタイムが遅い
接続の問題
証明書エラー
デプロイメントの問題

クイックコマンドリファレンス：

# アプリケーションステータスを確認
systemctl status <service-name>

# リアルタイムでログを監視
tail -f /var/log/application/*.log

# ポートの可用性を確認
netstat -tulpn | grep <port>

# アプリケーションヘルスをテスト
curl -I http://localhost:<port>/health

🔌 APIゲートウェイ

APIゲートウェイの問題に関するトラブルシューティングガイド：

高レイテンシ
レート制限の問題
認証失敗
ゲートウェイタイムアウト (504)
Bad Gatewayエラー (502)
サービス利用不可 (503)
ルーティングの問題
SSL/TLSの問題

クイックコマンドリファレンス：

# タイミング付きでAPIエンドポイントをテスト
curl -w "%{time_total}\n" -o /dev/null -s https://api.example.com/endpoint

# APIゲートウェイログを監視
tail -f /var/log/nginx/access.log

# 確立済みTCP接続の数を確認（アップストリーム以外の接続も含む）
netstat -an | grep ESTABLISHED | wc -l

# SSL証明書をテスト
openssl s_client -connect api.example.com:443 -servername api.example.com

🗄️ データベース

PostgreSQL、MySQL、MongoDB、Redisをカバーする包括的なデータベーストラブルシューティングガイド：

クエリパフォーマンスが遅い
接続の問題
CPU/メモリ使用率が高い
ディスク容量の問題
レプリケーション遅延
デッドロック
バックアップとリカバリ手順

クイックコマンドリファレンス：

# PostgreSQL: アクティブクエリを確認
psql -U postgres -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"

# MySQL: 実行中のクエリを確認
mysql -u root -p -e "SHOW FULL PROCESSLIST;"

# MongoDB: 現在の操作を確認（mongoshがない環境では従来のmongoシェルを使用）
mongosh --eval "db.currentOp()"

# Redis: コマンドを監視（MONITORは全コマンドを出力しスループットを低下させるため、短時間のみ実行）
redis-cli MONITOR

Quick Start Guide

1. Identify the Issue

Start by identifying which system is experiencing problems:

Is it the web application itself?
Is it the API gateway or load balancer?
Is it the database backend?

2. Check System Health

Run basic health checks:

# System resources
top
free -h
df -h

# Network connectivity
ping <host>
netstat -tulpn

# Service status
systemctl status <service-name>
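
The `df -h` output above is meant for reading by eye. For a scripted check, the sketch below filters `df -P` output (the POSIX single-line format) down to filesystems at or above a usage threshold. The function name `disks_over` is an assumption for illustration, and mount points containing spaces are not handled.

```shell
# disks_over THRESHOLD
# Reads `df -P` output on stdin and prints "<mount> <use%>" for every
# filesystem whose usage is at or above THRESHOLD percent.
disks_over() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)          # strip the trailing percent sign
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}

# Example:
#   df -P | disks_over 90
```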

3. Review Logs

Check relevant logs for errors:

# Application logs
tail -n 100 /var/log/application/*.log

# System logs
journalctl -xe

# Specific service logs
journalctl -u <service-name> -n 100

4. Follow the Appropriate Runbook

Navigate to the relevant runbook and follow the diagnostic and resolution steps.

General Troubleshooting Workflow

┌─────────────────────────────────────────┐
│   Issue Detected                        │
│   (Alert, User Report, Monitoring)      │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│   Initial Assessment                    │
│   • Check monitoring dashboards         │
│   • Review recent changes               │
│   • Identify affected components        │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│   Run Diagnostic Commands               │
│   • System health checks                │
│   • Application-specific diagnostics    │
│   • Log analysis                        │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│   Follow Runbook Steps                  │
│   • Execute resolution procedures       │
│   • Document actions taken              │
│   • Verify fix                          │
└────────────────┬────────────────────────┘
                 │
                 ▼
         ┌───────┴────────┐
         │                │
         ▼                ▼
    Resolved          Escalate

Best Practices

During Incidents

Stay Calm: Follow the runbook systematically
Document: Record all commands run and observations
Communicate: Keep stakeholders informed of progress
Verify: Always verify the fix before marking as resolved
Post-Mortem: Document lessons learned after resolution

Preventive Measures

Monitor Proactively: Set up alerts for key metrics
Test Regularly: Conduct regular load and failover testing
Keep Updated: Maintain runbooks with new findings
Automate: Automate routine checks and remediation where possible
Review: Regularly review and update procedures
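
As one way to automate routine checks, the sketch below runs a list of commands and prints a pass/fail line for each. The `check` helper and the `failures` counter are hypothetical names, and the example checks in the comments reuse the placeholders from this README.

```shell
failures=0

# check NAME COMMAND [ARGS...]
# Runs COMMAND quietly, prints "OK" or "FAIL" with NAME, and counts failures.
check() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK   $name"
    else
        echo "FAIL $name"
        failures=$((failures + 1))
    fi
}

# Example routine run (placeholders must be filled in first):
#   check "service active" systemctl is-active --quiet <service-name>
#   check "health endpoint" curl -fsS http://localhost:<port>/health
#   [ "$failures" -eq 0 ] || echo "$failures check(s) failed"
```

A non-zero `failures` count can drive an alert or a non-zero exit status when the script runs from cron or a systemd timer.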

Common Commands Cheat Sheet

System Diagnostics

# CPU usage
top -b -n 1 | head -n 20
ps aux --sort=-%cpu | head -n 10

# Memory usage
free -h
ps aux --sort=-%mem | head -n 10

# Disk usage
df -h
du -sh /var/log/*

# Network (ss is the replacement on systems where netstat is not installed)
netstat -tulpn
ss -tulpn

Service Management

# Check service status
systemctl status <service-name>

# Restart service
systemctl restart <service-name>

# View service logs
journalctl -u <service-name> -f

Log Analysis

# Tail logs with follow
tail -f /var/log/application/*.log

# Search for errors
grep -i error /var/log/application/*.log | tail -n 50

# Count occurrences
grep -c "ERROR" /var/log/application/*.log
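
Raw error counts do not show which message dominates. The sketch below groups similar error lines by replacing every run of digits with `N` (so timestamps, IDs, and durations collapse together) and lists the most frequent ones. The function name `top_errors` and the digit-masking heuristic are assumptions and may need tuning for a given log format.

```shell
# top_errors [LIMIT]
# Reads log lines on stdin, keeps lines containing "error" (case-insensitive),
# masks digit runs as "N", and prints the LIMIT most frequent lines with counts.
top_errors() {
    grep -i 'error' \
        | sed 's/[0-9][0-9]*/N/g' \
        | sort | uniq -c | sort -rn \
        | head -n "${1:-10}"
}

# Example:
#   cat /var/log/application/*.log | top_errors 5
```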

Escalation Guidelines

Escalate to senior engineers or management when:

Issue persists after following runbook procedures
Multiple services affected (cascading failure)
Data loss or corruption suspected
Security breach suspected
SLA breach imminent or occurred
Issue requires architectural changes
Critical business impact

Contributing

To update or add to these runbooks:

Test procedures in a non-production environment
Document all steps clearly with example commands
Include both diagnostic and resolution procedures
Add escalation criteria
Submit changes via pull request

Emergency Contacts

On-Call Engineer: [Insert contact information]
Database Team: [Insert contact information]
Network Team: [Insert contact information]
Security Team: [Insert contact information]

Additional Resources

Last Updated: 2026-02-11

Maintained By: Operations Team
