This directory contains the etcd/Pacemaker troubleshooting skill for Claude Code, designed specifically for two-node OpenShift clusters with fencing topology.
The etcd troubleshooting skill enables Claude to interactively diagnose and resolve etcd and Pacemaker issues on two-node clusters. It provides:
- Automated diagnostic data collection from cluster VMs and OpenShift
- Systematic analysis frameworks for identifying root causes
- Step-by-step remediation procedures
- Verification and prevention recommendations
.claude/commands/etcd/
├── README.md                         # This file
├── PROJECT.md                        # Project specification and checklist
├── QUICK_REFERENCE.md                # Fast troubleshooting guide (START HERE)
├── TROUBLESHOOTING_SKILL.md          # Detailed skill definition and guidelines
├── ../../../helpers/etcd/            # Helper scripts and playbooks
│   ├── validate-cluster-access.sh    # Validate both Ansible and oc access
│   ├── collect-all-diagnostics.sh    # Master orchestration script
│   ├── oc-wrapper.sh                 # oc wrapper with proxy.env handling
│   └── playbooks/                    # Ansible playbooks
│       ├── validate-access.yml       # Validate Ansible connectivity
│       └── collect-diagnostics.yml   # Collect VM-level diagnostics
├── etcd-ops-guide/                   # Etcd operations documentation
│   ├── clustering.md
│   ├── recovery.md
│   ├── monitoring.md
│   ├── failures.md
│   └── ... (other etcd docs)
└── pacemaker/                        # Pacemaker documentation
    ├── podman-etcd.txt               # Resource agent (fetched from upstream)
    └── Pacemaker_Administration/     # Pacemaker admin guides
Start with QUICK_REFERENCE.md for common issues and immediate fixes.
The quick reference covers:
- Common failure patterns with instant fixes
- One-command diagnostics
- Step-by-step remediation for the 7 most frequent issues
- Quick verification checklist
Use the detailed TROUBLESHOOTING_SKILL.md when:
- Issue doesn't match common patterns
- Multiple components are failing
- Need deeper architectural understanding
- Automated fixes don't resolve the problem
Prerequisite: You must run the Claude Code CLI from inside a local clone of this repository, because the skill files are loaded from the .claude/commands/ directory.
In Claude Code, reference the troubleshooting skill in your request:
"Help me troubleshoot etcd issues on my two-node cluster. Use the etcd troubleshooting skill."
The fastest way to gather all diagnostics:
# From repository root
./helpers/etcd/collect-all-diagnostics.sh
This will:
- Validate Ansible access to cluster VMs
- Validate OpenShift cluster access (with proxy detection)
- Collect VM-level diagnostics (Pacemaker, etcd, containers, logs)
- Collect OpenShift cluster-level diagnostics (operators, nodes, events)
- Generate a summary report with analysis guidance
Validate Access Only:
./helpers/etcd/validate-cluster-access.sh
Collect VM-Level Diagnostics Only:
ansible-playbook helpers/etcd/playbooks/collect-diagnostics.yml \
  -i deploy/openshift-clusters/inventory.ini
Use oc with Automatic Proxy Handling:
./helpers/etcd/oc-wrapper.sh get nodes
./helpers/etcd/oc-wrapper.sh get co etcd
- Ansible inventory at deploy/openshift-clusters/inventory.ini
- SSH access to cluster VMs (usually via ProxyJump through bastion)
- oc command in PATH
- Optional: deploy/openshift-clusters/proxy.env for cluster access
- Two-node OpenShift cluster with fencing topology
- Pacemaker and Corosync running on both nodes
- Etcd running as Podman containers managed by Pacemaker
- Stonith (fencing) configured
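The local access requirements above (inventory file, oc in PATH, optional proxy.env) can be checked before running any diagnostics. The following is a minimal sketch, not part of the shipped helpers: the function name `check_prereqs` and the message wording are hypothetical, and it assumes the INVENTORY_PATH and PROXY_ENV_PATH overrides described in this README. It does not verify SSH reachability or the cluster-side requirements.

```shell
# Hypothetical preflight check; not one of the helpers in helpers/etcd/.
# Defaults mirror the paths documented in this README.
INVENTORY_PATH="${INVENTORY_PATH:-deploy/openshift-clusters/inventory.ini}"
PROXY_ENV_PATH="${PROXY_ENV_PATH:-deploy/openshift-clusters/proxy.env}"

check_prereqs() {
  failed=0
  # Required: Ansible inventory file.
  if [ ! -f "$INVENTORY_PATH" ]; then
    echo "MISSING: Ansible inventory at $INVENTORY_PATH"
    failed=1
  fi
  # Required: oc available on PATH.
  if ! command -v oc >/dev/null 2>&1; then
    echo "MISSING: oc command in PATH"
    failed=1
  fi
  # Optional: proxy.env, so only a note is printed.
  if [ ! -f "$PROXY_ENV_PATH" ]; then
    echo "NOTE: $PROXY_ENV_PATH not found (optional)"
  fi
  return "$failed"
}
```

Running `check_prereqs` from the repository root returns non-zero if a required item is missing, so it can gate a later call to the diagnostic scripts.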
When working with Claude interactively:
- Describe the issue to Claude
- Claude will follow the decision tree in TROUBLESHOOTING_SKILL.md
- Claude will collect necessary data using playbooks/scripts
- Claude will analyze the data systematically
- Claude will propose remediation steps
- You execute or approve the remediation
- Claude helps verify the fix worked
- Claude provides prevention recommendations
When you want to gather all data first:
- Run collect-all-diagnostics.sh
- Share the output directory with Claude
- Claude analyzes the collected data
- Claude provides diagnosis and remediation plan
When you know the general area of the problem:
- Tell Claude the symptoms (e.g., "etcd container won't start on node-1")
- Claude uses targeted data collection
- Claude applies component-specific analysis (see TROUBLESHOOTING_SKILL.md)
- Claude provides focused remediation
IMPORTANT: All etcd and Pacemaker diagnostics must target the correct Ansible host group:
- cluster_vms - Use for all etcd, Pacemaker, and cluster diagnostics
  - All pcs commands
  - All podman commands for etcd containers
  - All etcdctl commands
  - All journalctl commands for cluster logs
- hypervisor - Only for VM lifecycle management
  - virsh commands to start/stop VMs
  - kcli commands for cluster management
  - Do NOT use for etcd-related operations
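The mapping above can be made mechanical so that a command is never sent to the wrong group by accident. This is a rough sketch under stated assumptions: `host_group_for` and `run_on_group` are hypothetical helper names, the command families are taken from the list above, and anything unrecognized is refused rather than guessed.

```shell
# Hypothetical helpers; the command-to-group mapping follows the list above.
host_group_for() {
  case "$1" in
    pcs|pcs\ *|podman|podman\ *|etcdctl|etcdctl\ *|journalctl|journalctl\ *)
      echo cluster_vms ;;
    virsh|virsh\ *|kcli|kcli\ *)
      echo hypervisor ;;
    *)
      return 1 ;;   # unknown command family: let the caller decide
  esac
}

run_on_group() {
  cmd="$1"
  group="$(host_group_for "$cmd")" || {
    echo "refusing to guess a host group for: $cmd" >&2
    return 1
  }
  # Same ad-hoc invocation style as the examples in this README.
  ansible "$group" \
    -i "${INVENTORY_PATH:-deploy/openshift-clusters/inventory.ini}" \
    -m shell -a "$cmd" -b
}
```

With this in place, `run_on_group "pcs status"` would target cluster_vms, while `run_on_group "virsh list --all"` would target hypervisor.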
Example:
# Correct - targets cluster VMs
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m shell -a "pcs status" -b
# Incorrect - would target hypervisor instead of cluster nodes
ansible hypervisor -i deploy/openshift-clusters/inventory.ini -m shell -a "pcs status" -b
All scripts automatically detect and handle proxy requirements:
- Direct cluster access is tried first
- Falls back to proxy.env if needed
- Gracefully handles missing proxy.env with warnings
- oc-wrapper.sh can be used for all oc commands
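The fallback order described above can be illustrated with a short function. This is only a sketch of the idea, not the actual contents of oc-wrapper.sh: the name `run_oc` is hypothetical, `oc version` is used as the connectivity probe because this README uses it for the same purpose, and proxy.env is assumed to be a file that can be sourced.

```shell
# Illustrative sketch of the direct-then-proxy fallback; not oc-wrapper.sh itself.
PROXY_ENV_PATH="${PROXY_ENV_PATH:-deploy/openshift-clusters/proxy.env}"

run_oc() {
  # 1. Try direct cluster access first.
  if oc version >/dev/null 2>&1; then
    oc "$@"
    return
  fi
  # 2. Fall back to proxy.env when it exists. The subshell keeps the
  #    proxy variables from leaking into the caller's environment.
  if [ -f "$PROXY_ENV_PATH" ]; then
    ( . "$PROXY_ENV_PATH" && oc "$@" )
    return
  fi
  # 3. No proxy.env: warn instead of failing silently.
  echo "WARNING: direct access failed and $PROXY_ENV_PATH was not found" >&2
  return 1
}
```

For example, `run_oc get nodes` behaves like `oc get nodes` on a directly reachable cluster and retries through the proxy settings otherwise.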
The collect-diagnostics playbook gathers:
Pacemaker:
- Cluster status and resource status
- Constraints and failed actions
- CIB attributes (cluster_id, standalone_node, etc.)
Etcd:
- Container status and logs
- Member list and endpoint health
- Cluster health and leadership info
Logs:
- Pacemaker, Corosync, and etcd journalctl logs
- Configurable timeframe and line limits
OpenShift:
- Node status and conditions
- Etcd operator status
- Cluster operator health
- Recent events
Claude follows structured analysis frameworks (see TROUBLESHOOTING_SKILL.md):
- Component-specific analysis functions
- Decision tree for systematic diagnosis
- Error pattern matching guidelines
- Common issue recognition
Symptoms: Etcd won't start, nodes show different cluster_id in CIB attributes
Quick Fix:
# Use the force-new-cluster helper
ansible-playbook helpers/force-new-cluster.yml \
  -i deploy/openshift-clusters/inventory.ini
This auto-detects the etcd leader and forces the follower to resync.
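Before forcing a new cluster, it can help to confirm that the cluster_id values really differ. The sketch below only does the comparison; collecting the values is left out because the exact query depends on how the resource agent stores them. The function name `cluster_ids_consistent` and the input format (one `<node-name> <cluster_id>` pair per line) are assumptions made for illustration.

```shell
# Hypothetical helper. Reads "<node-name> <cluster_id>" pairs on stdin
# (an assumed format) and succeeds only if every node reports the same ID.
cluster_ids_consistent() {
  awk 'NF >= 2 { ids[$2] = 1 }
       END { n = 0; for (id in ids) n++; exit (n == 1 ? 0 : 1) }'
}
```

A non-zero exit status means the nodes disagree (or no values were supplied), which matches the symptom described above.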
Symptoms: pcs status shows "Failed Resource Actions"
Quick Fix:
# On cluster VMs via Ansible
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini \
  -m shell -a "pcs resource cleanup etcd" -b
Symptoms: Node shows UNCLEAN, fencing failed errors
Investigation:
# Check stonith status
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini \
-m shell -a "pcs stonith status" -b
# Test fence agent manually
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini \
  -m shell -a "fence_redfish -a <bmc_ip> -l <user> -p <pass> -o status" -b
Clears failed resource states and retries operations:
sudo pcs resource cleanup etcd # All nodes
sudo pcs resource cleanup etcd <node-name> # Specific node
When to use:
- After fixing underlying issues
- Resource shows as failed but root cause is resolved
- After manual CIB attribute changes
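One way to decide whether a cleanup is warranted is to look for the "Failed Resource Actions" heading in `pcs status` output, as described in the symptoms earlier in this README. The sketch below is an assumption-laden illustration: `suggest_cleanup` is a hypothetical name, it reads status text on stdin, and it only prints the command to run instead of executing it, since cleanup changes cluster state.

```shell
# Hypothetical helper. Reads `pcs status` output on stdin and prints the
# cleanup command only when failed actions are reported.
suggest_cleanup() {
  resource="${1:-etcd}"
  if grep -q "Failed Resource Actions"; then
    echo "sudo pcs resource cleanup $resource"
  else
    echo "no failed resource actions reported; cleanup not needed"
  fi
}
```

For example, `sudo pcs status | suggest_cleanup etcd` on a cluster node prints the cleanup command only if there is something to clear.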
Automated cluster recovery playbook at helpers/force-new-cluster.yml:
ansible-playbook helpers/force-new-cluster.yml \
  -i deploy/openshift-clusters/inventory.ini
When to use:
- Different etcd cluster IDs between nodes
- Etcd won't start on either node
- After ungraceful disruptions
- Manual recovery attempts failed
See TROUBLESHOOTING_SKILL.md for detailed documentation.
- QUICK_REFERENCE.md - Start here for common issues
  - 7 most frequent failure patterns with fixes
  - Quick diagnostics commands
  - Fast verification checklist
- TROUBLESHOOTING_SKILL.md - Detailed methodology
  - Systematic analysis frameworks
  - Component-specific diagnosis
  - Decision trees and error patterns
- Etcd Operations - Deep reference via slash commands:
  - /etcd:etcd-ops-guide:clustering - Cluster membership operations
  - /etcd:etcd-ops-guide:recovery - Recovery procedures
  - /etcd:etcd-ops-guide:monitoring - Monitoring and health checks
  - /etcd:etcd-ops-guide:failures - Failure scenarios
  - /etcd:etcd-ops-guide:data_corruption - Data corruption handling
  Or read files directly in .claude/commands/etcd/etcd-ops-guide/
- Pacemaker Administration - Deep reference in .claude/commands/etcd/pacemaker/Pacemaker_Administration/:
  - troubleshooting.rst - Pacemaker troubleshooting guide
  - tools.rst - Command-line tools
  - agents.rst - Resource agents
  - administrative.rst - Administrative tasks
- Podman-etcd Resource Agent - To consult the resource agent source:
  # Fetch latest from upstream before reading
  ./helpers/etcd/fetch-podman-etcd.sh
  Then read .claude/commands/etcd/pacemaker/podman-etcd.txt
See PROJECT.md for:
- Implementation checklist
- Technical approach and architecture
- Testing scenarios
- Success criteria
INVENTORY_PATH: Override inventory location (default: deploy/openshift-clusters/inventory.ini)
INVENTORY_PATH=/custom/path/inventory.ini ./helpers/etcd/validate-cluster-access.sh
PROXY_ENV_PATH: Override proxy.env location (default: deploy/openshift-clusters/proxy.env)
PROXY_ENV_PATH=/custom/path/proxy.env ./helpers/etcd/oc-wrapper.sh get nodes
If the diagnostic scripts themselves fail:
Ansible connectivity issues:
# Test basic connectivity
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m ping
# Check inventory syntax
ansible-inventory -i deploy/openshift-clusters/inventory.ini --list
oc access issues:
# Test direct access
oc version
# Test with proxy
source deploy/openshift-clusters/proxy.env && oc version
# Verify KUBECONFIG
echo $KUBECONFIG
Permission issues:
# Ensure scripts are executable
chmod +x helpers/etcd/*.sh
To speed up diagnostics, you can configure Claude Code to automatically approve read-only operations without prompting for permission. See PERMISSIONS.md for:
- Complete list of safe read-only commands that can be auto-approved
- Operations that always require user approval
- How to configure permissions in Claude Code
- Safety considerations and boundaries
Quick summary of auto-approved operations:
- File reading: cat, tail, head, grep, ls
- Ansible read-only: pcs status, podman ps, etcdctl queries, journalctl
- OpenShift read-only: oc get, oc describe, oc logs
- Validation scripts (no state changes)
Always requires approval:
- Ansible playbooks (including diagnostics collection)
- Pacemaker operations: pcs resource cleanup, restart, disable/enable
- Etcd operations: member add/remove, put/delete
- Force-new-cluster recovery
- Any system modifications
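The split between auto-approved and approval-required operations can be pictured as a default-deny classifier. The sketch below is illustrative only and is not how Claude Code permissions are actually configured (see PERMISSIONS.md for that). The function name `classify_command` is hypothetical, the allow-list is copied from the summary above, and "etcdctl queries" is narrowed here to member list and endpoint health/status as an assumption.

```shell
# Hypothetical classifier: prints "auto" for the read-only commands listed
# above and "approval" for everything else (default deny).
classify_command() {
  case "$1" in
    # Anything that chains, pipes, redirects, or substitutes needs approval,
    # even if it starts with a read-only command.
    *\;*|*\|*|*\&*|*\<*|*\>*|*\`*|*\$\(*) echo approval ;;
    # File reading.
    "cat "*|"tail "*|"head "*|"grep "*|ls|"ls "*) echo auto ;;
    # Read-only cluster queries (normally run through Ansible).
    "pcs status"|"pcs status "*|"podman ps"|"podman ps "*) echo auto ;;
    journalctl|"journalctl "*) echo auto ;;
    "etcdctl member list"*|"etcdctl endpoint health"*|"etcdctl endpoint status"*) echo auto ;;
    # Read-only OpenShift queries.
    "oc get "*|"oc describe "*|"oc logs "*) echo auto ;;
    # Default: require approval.
    *) echo approval ;;
  esac
}
```

The design choice worth noting is the default branch: a command that is not explicitly recognized as read-only falls through to `approval`, which keeps new or unusual operations on the safe side.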
When adding new diagnostic capabilities:
- Update TROUBLESHOOTING_SKILL.md with new analysis patterns
- Add collection steps to collect-diagnostics.yml if needed
- Update decision tree and error patterns
- Document new remediation tools
- Add examples to this README
- Update PROJECT.md checklist
This is part of the two-node-toolbox project. See repository root for license information.