Etcd Troubleshooting Skill for Two-Node Clusters

This directory contains the etcd/Pacemaker troubleshooting skill for Claude Code, designed specifically for two-node OpenShift clusters with fencing topology.

Overview

The etcd troubleshooting skill enables Claude to interactively diagnose and resolve etcd and Pacemaker issues on two-node clusters. It provides:

Automated diagnostic data collection from cluster VMs and OpenShift
Systematic analysis frameworks for identifying root causes
Step-by-step remediation procedures
Verification and prevention recommendations

Directory Structure

.claude/commands/etcd/
├── README.md                           # This file
├── PROJECT.md                          # Project specification and checklist
├── QUICK_REFERENCE.md                  # Fast troubleshooting guide (START HERE)
├── TROUBLESHOOTING_SKILL.md           # Detailed skill definition and guidelines
├── ../../../helpers/etcd/              # Helper scripts and playbooks
│   ├── validate-cluster-access.sh      # Validate both Ansible and oc access
│   ├── collect-all-diagnostics.sh      # Master orchestration script
│   ├── oc-wrapper.sh                   # oc wrapper with proxy.env handling
│   └── playbooks/                      # Ansible playbooks
│       ├── validate-access.yml         # Validate Ansible connectivity
│       └── collect-diagnostics.yml     # Collect VM-level diagnostics
├── etcd-ops-guide/                     # Etcd operations documentation
│   ├── clustering.md
│   ├── recovery.md
│   ├── monitoring.md
│   ├── failures.md
│   └── ... (other etcd docs)
└── pacemaker/                          # Pacemaker documentation
    ├── podman-etcd.txt                 # Resource agent (fetched from upstream)
    └── Pacemaker_Administration/       # Pacemaker admin guides

Quick Start

For Fast Troubleshooting

Start with QUICK_REFERENCE.md for common issues and immediate fixes.

The quick reference covers:

Common failure patterns with instant fixes
One-command diagnostics
Step-by-step remediation for 7 most frequent issues
Quick verification checklist

For Complex Issues

Use the detailed TROUBLESHOOTING_SKILL.md when:

Issue doesn't match common patterns
Multiple components are failing
Need deeper architectural understanding
Automated fixes don't resolve the problem

Activating the Skill

Prerequisite: You must run Claude Code CLI from inside a local clone of this repository. The skill files are loaded from the .claude/commands/ directory.

In Claude Code, reference the troubleshooting skill in your request:

"Help me troubleshoot etcd issues on my two-node cluster. Use the etcd troubleshooting skill."

Running Diagnostic Collection

The fastest way to gather all diagnostics:

# From repository root
./helpers/etcd/collect-all-diagnostics.sh

This will:

Validate Ansible access to cluster VMs
Validate OpenShift cluster access (with proxy detection)
Collect VM-level diagnostics (Pacemaker, etcd, containers, logs)
Collect OpenShift cluster-level diagnostics (operators, nodes, events)
Generate a summary report with analysis guidance

Individual Components

Validate Access Only:

./helpers/etcd/validate-cluster-access.sh

Collect VM-Level Diagnostics Only:

ansible-playbook helpers/etcd/playbooks/collect-diagnostics.yml \
  -i deploy/openshift-clusters/inventory.ini

Use oc with Automatic Proxy Handling:

./helpers/etcd/oc-wrapper.sh get nodes
./helpers/etcd/oc-wrapper.sh get co etcd

Prerequisites

Environment Requirements

Ansible inventory at deploy/openshift-clusters/inventory.ini
SSH access to cluster VMs (usually via ProxyJump through bastion)
oc command in PATH
Optional: deploy/openshift-clusters/proxy.env for cluster access

Cluster Requirements

Two-node OpenShift cluster with fencing topology
Pacemaker and Corosync running on both nodes
Etcd running as Podman containers managed by Pacemaker
Stonith (fencing) configured

Usage Patterns

Pattern 1: Interactive Troubleshooting

When working with Claude interactively:

Describe the issue to Claude
Claude will follow the decision tree in TROUBLESHOOTING_SKILL.md
Claude will collect necessary data using playbooks/scripts
Claude will analyze the data systematically
Claude will propose remediation steps
You execute or approve the remediation
Claude helps verify the fix worked
Claude provides prevention recommendations

Pattern 2: Automated Diagnostics

When you want to gather all data first:

Run collect-all-diagnostics.sh
Share the output directory with Claude
Claude analyzes the collected data
Claude provides diagnosis and remediation plan

Pattern 3: Specific Issue Investigation

When you know the general area of the problem:

Tell Claude the symptoms (e.g., "etcd container won't start on node-1")
Claude uses targeted data collection
Claude applies component-specific analysis (see TROUBLESHOOTING_SKILL.md)
Claude provides focused remediation

Key Features

Host Group Targeting

IMPORTANT: All etcd and Pacemaker diagnostics must target the correct Ansible host group:

cluster_vms - Use for all etcd, Pacemaker, and cluster diagnostics
- All pcs commands
- All podman commands for etcd containers
- All etcdctl commands
- All journalctl commands for cluster logs
hypervisor - Only for VM lifecycle management
- virsh commands to start/stop VMs
- kcli commands for cluster management
- Do NOT use for etcd-related operations

Example:

# Correct - targets cluster VMs
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m shell -a "pcs status" -b

# Incorrect - would target hypervisor instead of cluster nodes
ansible hypervisor -i deploy/openshift-clusters/inventory.ini -m shell -a "pcs status" -b

Proxy Handling

All scripts automatically detect and handle proxy requirements:

Direct cluster access is tried first
Falls back to proxy.env if needed
Gracefully handles missing proxy.env with warnings
oc-wrapper.sh can be used for all oc commands

Comprehensive Data Collection

The collect-diagnostics playbook gathers:

Pacemaker:

Cluster status and resource status
Constraints and failed actions
CIB attributes (cluster_id, standalone_node, etc.)

Etcd:

Container status and logs
Member list and endpoint health
Cluster health and leadership info

Logs:

Pacemaker, Corosync, and etcd journalctl logs
Configurable timeframe and line limits

OpenShift:

Node status and conditions
Etcd operator status
Cluster operator health
Recent events

Systematic Analysis

Claude follows structured analysis frameworks (see TROUBLESHOOTING_SKILL.md):

Component-specific analysis functions
Decision tree for systematic diagnosis
Error pattern matching guidelines
Common issue recognition

Common Scenarios

Scenario: Different Cluster IDs

Symptoms: Etcd won't start, nodes show different cluster_id in CIB attributes

Quick Fix:

# Use the force-new-cluster helper
ansible-playbook helpers/force-new-cluster.yml \
  -i deploy/openshift-clusters/inventory.ini

This auto-detects the etcd leader and forces the follower to resync.

Scenario: Resource Failed to Start

Symptoms: pcs status shows "Failed Resource Actions"

Quick Fix:

# On cluster VMs via Ansible
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini \
  -m shell -a "pcs resource cleanup etcd" -b

Scenario: Fencing Failures

Symptoms: Node shows UNCLEAN, fencing failed errors

Investigation:

# Check stonith status
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini \
  -m shell -a "pcs stonith status" -b

# Test fence agent manually
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini \
  -m shell -a "fence_redfish -a <bmc_ip> -l <user> -p <pass> -o status" -b

Remediation Tools

pcs resource cleanup

Clears failed resource states and retries operations:

sudo pcs resource cleanup etcd              # All nodes
sudo pcs resource cleanup etcd <node-name>  # Specific node

When to use:

After fixing underlying issues
Resource shows as failed but root cause is resolved
After manual CIB attribute changes

force-new-cluster Helper

Automated cluster recovery playbook at helpers/force-new-cluster.yml:

ansible-playbook helpers/force-new-cluster.yml \
  -i deploy/openshift-clusters/inventory.ini

When to use:

Different etcd cluster IDs between nodes
Etcd won't start on either node
After ungraceful disruptions
Manual recovery attempts failed

See TROUBLESHOOTING_SKILL.md for detailed documentation.

Reference Documentation

Troubleshooting Guides (by detail level)

QUICK_REFERENCE.md - Start here for common issues
- 7 most frequent failure patterns with fixes
- Quick diagnostics commands
- Fast verification checklist
TROUBLESHOOTING_SKILL.md - Detailed methodology
- Systematic analysis frameworks
- Component-specific diagnosis
- Decision trees and error patterns
Etcd Operations - Deep reference via slash commands:
- /etcd:etcd-ops-guide:clustering - Cluster membership operations
- /etcd:etcd-ops-guide:recovery - Recovery procedures
- /etcd:etcd-ops-guide:monitoring - Monitoring and health checks
- /etcd:etcd-ops-guide:failures - Failure scenarios
- /etcd:etcd-ops-guide:data_corruption - Data corruption handling
Or read files directly in .claude/commands/etcd/etcd-ops-guide/
Pacemaker Administration - Deep reference in .claude/commands/etcd/pacemaker/Pacemaker_Administration/:
- troubleshooting.rst - Pacemaker troubleshooting guide
- tools.rst - Command-line tools
- agents.rst - Resource agents
- administrative.rst - Administrative tasks
Podman-etcd Resource Agent - To consult the resource agent source:
```
# Fetch latest from upstream before reading
./helpers/etcd/fetch-podman-etcd.sh
```
Then read .claude/commands/etcd/pacemaker/podman-etcd.txt

Development and Testing

See PROJECT.md for:

Implementation checklist
Technical approach and architecture
Testing scenarios
Success criteria

Environment Variables

INVENTORY_PATH: Override inventory location (default: deploy/openshift-clusters/inventory.ini)

INVENTORY_PATH=/custom/path/inventory.ini ./scripts/validate-cluster-access.sh

PROXY_ENV_PATH: Override proxy.env location (default: deploy/openshift-clusters/proxy.env)

PROXY_ENV_PATH=/custom/path/proxy.env ./scripts/oc-wrapper.sh get nodes

Troubleshooting the Troubleshooter

If the diagnostic scripts themselves fail:

Ansible connectivity issues:

# Test basic connectivity
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m ping

# Check inventory syntax
ansible-inventory -i deploy/openshift-clusters/inventory.ini --list

oc access issues:

# Test direct access
oc version

# Test with proxy
source deploy/openshift-clusters/proxy.env && oc version

# Verify KUBECONFIG
echo $KUBECONFIG

Permission issues:

# Ensure scripts are executable
chmod +x helpers/etcd/*.sh

Permission Configuration

To speed up diagnostics, you can configure Claude Code to automatically approve read-only operations without prompting for permission. See PERMISSIONS.md for:

Complete list of safe read-only commands that can be auto-approved
Operations that always require user approval
How to configure permissions in Claude Code
Safety considerations and boundaries

Quick summary of auto-approved operations:

File reading: cat, tail, head, grep, ls
Ansible read-only: pcs status, podman ps, etcdctl queries, journalctl
OpenShift read-only: oc get, oc describe, oc logs
Validation scripts (no state changes)

Always requires approval:

Ansible playbooks (including diagnostics collection)
Pacemaker operations: pcs resource cleanup, restart, disable/enable
Etcd operations: member add/remove, put/delete
Force-new-cluster recovery
Any system modifications

Contributing

When adding new diagnostic capabilities:

Update TROUBLESHOOTING_SKILL.md with new analysis patterns
Add collection steps to collect-diagnostics.yml if needed
Update decision tree and error patterns
Document new remediation tools
Add examples to this README
Update PROJECT.md checklist

License

This is part of the two-node-toolbox project. See repository root for license information.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Etcd Troubleshooting Skill for Two-Node Clusters

Overview

Directory Structure

Quick Start

For Fast Troubleshooting

For Complex Issues

Activating the Skill

Running Diagnostic Collection

Individual Components

Prerequisites

Environment Requirements

Cluster Requirements

Usage Patterns

Pattern 1: Interactive Troubleshooting

Pattern 2: Automated Diagnostics

Pattern 3: Specific Issue Investigation

Key Features

Host Group Targeting

Proxy Handling

Comprehensive Data Collection

Systematic Analysis

Common Scenarios

Scenario: Different Cluster IDs

Scenario: Resource Failed to Start

Scenario: Fencing Failures

Remediation Tools

pcs resource cleanup

force-new-cluster Helper

Reference Documentation

Troubleshooting Guides (by detail level)

Development and Testing

Environment Variables

Troubleshooting the Troubleshooter

Permission Configuration

Contributing

License

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Etcd Troubleshooting Skill for Two-Node Clusters

Overview

Directory Structure

Quick Start

For Fast Troubleshooting

For Complex Issues

Activating the Skill

Running Diagnostic Collection

Individual Components

Prerequisites

Environment Requirements

Cluster Requirements

Usage Patterns

Pattern 1: Interactive Troubleshooting

Pattern 2: Automated Diagnostics

Pattern 3: Specific Issue Investigation

Key Features

Host Group Targeting

Proxy Handling

Comprehensive Data Collection

Systematic Analysis

Common Scenarios

Scenario: Different Cluster IDs

Scenario: Resource Failed to Start

Scenario: Fencing Failures

Remediation Tools

pcs resource cleanup

force-new-cluster Helper

Reference Documentation

Troubleshooting Guides (by detail level)

Development and Testing

Environment Variables

Troubleshooting the Troubleshooter

Permission Configuration

Contributing

License