pd-mhori/runbook-examples

Operational Runbooks

This repository contains operational runbooks for diagnosing and resolving common issues with web applications, API gateways, and databases.

ใ“ใฎใƒชใƒใ‚ธใƒˆใƒชใซใฏใ€Webใ‚ขใƒ—ใƒชใ‚ฑใƒผใ‚ทใƒงใƒณใ€APIใ‚ฒใƒผใƒˆใ‚ฆใ‚งใ‚คใ€ใƒ‡ใƒผใ‚ฟใƒ™ใƒผใ‚นใฎไธ€่ˆฌ็š„ใชๅ•้กŒใ‚’่จบๆ–ญใƒป่งฃๆฑบใ™ใ‚‹ใŸใ‚ใฎใ‚ชใƒšใƒฌใƒผใ‚ทใƒงใƒŠใƒซใƒฉใƒณใƒ–ใƒƒใ‚ฏใŒๅซใพใ‚Œใฆใ„ใพใ™ใ€‚

Language / ่จ€่ชž


English Documentation

Overview

Runbooks provide step-by-step procedures for troubleshooting and resolving common operational issues. Each runbook includes:

  • Symptom identification
  • Diagnostic commands
  • Resolution steps
  • Prevention strategies
  • Escalation criteria

Available Runbooks

Web Applications

Comprehensive troubleshooting guide for web application issues, including:

  • High CPU/Memory usage
  • Application not responding
  • Slow response times
  • Connection issues
  • Certificate errors
  • Deployment problems

Quick Command Reference:

# Check application status
systemctl status <service-name>

# Monitor logs in real-time
tail -f /var/log/application/*.log

# Check port availability
netstat -tulpn | grep <port>

# Test application health
curl -I http://localhost:<port>/health
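A single health probe can fail transiently right after a restart or deploy. The sketch below wraps any check in a small retry loop. The `retry` helper name, the attempt count, the delay, and port 8080 in the commented example are illustrative assumptions, not part of these runbooks.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    # Only sleep when another attempt will follow.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (8080 stands in for <port>; -f makes curl fail on HTTP 4xx/5xx):
# retry 5 2 curl -fsS -o /dev/null http://localhost:8080/health
```

The example uses a GET request with `-f` rather than `curl -I`, since some health endpoints do not answer HEAD requests.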

🔌 API Gateways

Troubleshooting guide for API gateway issues including:

  • High latency
  • Rate limiting problems
  • Authentication failures
  • Gateway timeouts (504)
  • Bad gateway errors (502)
  • Service unavailable (503)
  • Routing issues
  • SSL/TLS problems

Quick Command Reference:

# Test API endpoint with timing
curl -w "%{time_total}\n" -o /dev/null -s https://api.example.com/endpoint

# Monitor API gateway logs
tail -f /var/log/nginx/access.log

# Count established TCP connections (all peers, not only upstreams)
netstat -an | grep ESTABLISHED | wc -l

# Test SSL certificate
openssl s_client -connect api.example.com:443 -servername api.example.com
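When latency is the symptom, a per-phase breakdown helps separate DNS, TCP connect, TLS handshake, and backend time. The following is a minimal sketch using curl's `--write-out` variables. The helper names `curl_timing` and `exceeds_threshold` and the 1.0 second limit in the example are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Print a timing breakdown for one request to URL.
# Each value is cumulative seconds measured from the start of the request.
curl_timing() {
  curl -sS -o /dev/null \
    -w 'dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
    "$1"
}

# exceeds_threshold SECONDS LIMIT
# Exits 0 when SECONDS is strictly greater than LIMIT (awk does the float comparison).
exceeds_threshold() {
  awk -v t="$1" -v limit="$2" 'BEGIN { exit !((t + 0) > (limit + 0)) }'
}

# Example:
# curl_timing https://api.example.com/endpoint
# total=$(curl -sS -o /dev/null -w '%{time_total}' https://api.example.com/endpoint)
# if exceeds_threshold "$total" 1.0; then echo "slow response: ${total}s"; fi
```

A large gap between `connect` and `tls` points at the TLS handshake, while a large gap between `tls` and `ttfb` usually means the time is spent in the gateway or its upstream.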

🗄️ Databases

Comprehensive database troubleshooting guide covering PostgreSQL, MySQL, MongoDB, and Redis:

  • Slow query performance
  • Connection issues
  • High CPU/Memory usage
  • Disk space problems
  • Replication lag
  • Deadlocks
  • Backup and recovery procedures

Quick Command Reference:

# PostgreSQL: Check active queries
psql -U postgres -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"

# MySQL: Check running queries
mysql -u root -p -e "SHOW FULL PROCESSLIST;"

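# MongoDB: Check current operations (mongosh replaces the legacy mongo shell, removed in MongoDB 6.0)
mongosh --eval "db.currentOp()"

# Redis: Monitor commands (run briefly; MONITOR adds significant overhead on busy servers)
redis-cli MONITOR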
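For slow-query triage on PostgreSQL, it often helps to narrow `pg_stat_activity` to sessions that have been busy for a while. The sketch below only builds the SQL text so it can be reviewed before it is run. The `long_query_sql` name, the 30 second default, and the 80 character truncation are assumptions.

```shell
#!/usr/bin/env bash
# long_query_sql [SECONDS]
# Prints SQL that lists non-idle sessions whose most recent query
# started more than SECONDS ago (default 30), longest first.
long_query_sql() {
  local secs=${1:-30}
  # Accept digits only, so the value is safe to embed in the SQL text.
  case $secs in
    ''|*[!0-9]*) echo "seconds must be a non-negative integer" >&2; return 1 ;;
  esac
  cat <<SQL
SELECT pid, usename, now() - query_start AS runtime, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '${secs} seconds'
ORDER BY runtime DESC;
SQL
}

# Example:
# psql -U postgres -c "$(long_query_sql 60)"
```

Sessions in the `idle in transaction` state also appear in this output, which is usually desirable because they can hold locks.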

Quick Start Guide

1. Identify the Issue

Start by identifying which system is experiencing problems:

  • Is it the web application itself?
  • Is it the API gateway or load balancer?
  • Is it the database backend?

2. Check System Health

Run basic health checks:

# System resources
top
free -h
df -h

# Network connectivity
ping <host>
netstat -tulpn

# Service status
systemctl status <service-name>
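The `df -h` output above is meant for reading by eye. For a scripted check, one option is to filter the POSIX-format output of `df -P` against a usage threshold, as in this sketch. The `disks_over` name and the 90% limit are assumptions, and mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# disks_over PERCENT
# Reads `df -P` output on stdin and prints "<mount point> <use%>"
# for every filesystem at or above PERCENT.
disks_over() {
  awk -v limit="$1" '
    NR > 1 {                      # skip the header row
      use = $5                    # Capacity column, e.g. "95%"
      sub(/%/, "", use)
      if (use + 0 >= limit + 0) print $6 " " use "%"
    }'
}

# Example:
# df -P | disks_over 90
```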

3. Review Logs

Check relevant logs for errors:

# Application logs (last 100 lines of each file)
tail -n 100 /var/log/application/*.log

# System logs
journalctl -xe

# Specific service logs
journalctl -u <service-name> -n 100

4. Follow the Appropriate Runbook

Navigate to the relevant runbook and follow the diagnostic and resolution steps.

General Troubleshooting Workflow

┌─────────────────────────────────────────┐
│   Issue Detected                        │
│   (Alert, User Report, Monitoring)      │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│   Initial Assessment                    │
│   • Check monitoring dashboards         │
│   • Review recent changes               │
│   • Identify affected components        │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│   Run Diagnostic Commands               │
│   • System health checks                │
│   • Application-specific diagnostics    │
│   • Log analysis                        │
└────────────────┬────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────┐
│   Follow Runbook Steps                  │
│   • Execute resolution procedures       │
│   • Document actions taken              │
│   • Verify fix                          │
└────────────────┬────────────────────────┘
                 │
                 ▼
         ┌───────┴────────┐
         │                │
         ▼                ▼
    Resolved          Escalate

Best Practices

During Incidents

  1. Stay Calm: Follow the runbook systematically
  2. Document: Record all commands run and observations
  3. Communicate: Keep stakeholders informed of progress
  4. Verify: Always verify the fix before marking as resolved
  5. Post-Mortem: Document lessons learned after resolution
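Recording commands and observations is easier to keep up during an incident when it happens automatically. The helper below is a minimal sketch of that idea: it appends a UTC timestamp, the command line, its output, and its exit status to a log file. The `run_logged` name, the `INCIDENT_LOG` variable, and the default file name are assumptions.

```shell
#!/usr/bin/env bash
# Log file location; set INCIDENT_LOG beforehand to override the assumed default.
INCIDENT_LOG="${INCIDENT_LOG:-./incident-$(date +%Y%m%d).log}"

# run_logged COMMAND [ARGS...]
# Runs COMMAND, shows its output, and appends a record to $INCIDENT_LOG.
# Output is captured first, so it appears only after the command finishes.
run_logged() {
  local output status
  printf '[%s] $ %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$INCIDENT_LOG"
  output=$("$@" 2>&1) && status=0 || status=$?
  if [ -n "$output" ]; then
    printf '%s\n' "$output" | tee -a "$INCIDENT_LOG"
  fi
  printf '[exit %s]\n' "$status" >> "$INCIDENT_LOG"
  return "$status"
}

# Example:
# run_logged df -h
# run_logged journalctl -u <service-name> -n 100
```

Because the helper returns the original exit status, it can be used inside conditionals without changing how a check behaves.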

Preventive Measures

  1. Monitor Proactively: Set up alerts for key metrics
  2. Test Regularly: Conduct regular load and failover testing
  3. Keep Updated: Maintain runbooks with new findings
  4. Automate: Automate routine checks and remediation where possible
  5. Review: Regularly review and update procedures

Common Commands Cheat Sheet

System Diagnostics

# CPU usage
top -b -n 1 | head -20
ps aux --sort=-%cpu | head -10

# Memory usage
free -h
ps aux --sort=-%mem | head -10

# Disk usage
df -h
du -sh /var/log/*

# Network
netstat -tulpn
ss -tulpn

Service Management

# Check service status
systemctl status <service-name>

# Restart service
systemctl restart <service-name>

# View service logs
journalctl -u <service-name> -f

Log Analysis

# Tail logs with follow
tail -f /var/log/application/*.log

# Search for errors (case-insensitive, last 50 matches)
grep -i error /var/log/application/*.log | tail -n 50

# Count occurrences
grep -c "ERROR" /var/log/application/*.log
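A raw error count says how noisy the logs are, but not which error dominates. As a sketch, the function below groups identical `ERROR` lines after stripping a leading timestamp. The `top_errors` name and the assumed timestamp layout (`YYYY-MM-DD HH:MM:SS` at the start of each line) are illustrative, so adjust the pattern to the real log format.

```shell
#!/usr/bin/env bash
# top_errors [N]
# Reads log lines on stdin and prints the N most frequent ERROR lines
# (default 10), each prefixed with its count.
top_errors() {
  grep 'ERROR' \
    | sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}[T ][0-9:.,]+ *//' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -n "${1:-10}"
}

# Example:
# cat /var/log/application/*.log | top_errors 5
```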

Escalation Guidelines

Escalate to senior engineers or management when:

  • Issue persists after following runbook procedures
  • Multiple services affected (cascading failure)
  • Data loss or corruption suspected
  • Security breach suspected
  • SLA breach imminent or occurred
  • Issue requires architectural changes
  • Critical business impact

Contributing

To update or add to these runbooks:

  1. Test procedures in a non-production environment
  2. Document all steps clearly with example commands
  3. Include both diagnostic and resolution procedures
  4. Add escalation criteria
  5. Submit changes via pull request

Emergency Contacts

  • On-Call Engineer: [Insert contact information]
  • Database Team: [Insert contact information]
  • Network Team: [Insert contact information]
  • Security Team: [Insert contact information]

Last Updated: 2026-02-11

Maintained By: Operations Team
