NO-JIRA: Claude tool - etcd troubleshooting skill #32

fonta-rh · 2025-10-29T10:27:09Z

Add etcd troubleshooting skill for Claude Code

Adds a comprehensive Claude Code skill that helps troubleshoot etcd issues on two-node fencing clusters.

The skill enables automated diagnosis and remediation of common etcd/Pacemaker problems.

New feature: Claude Code Skill (.claude/commands/etcd/):

- Interactive troubleshooting capability with systematic decision trees
- Automated diagnostic data collection from Pacemaker, etcd, and OpenShift
- Analysis guidelines for cluster state, resource failures, and error patterns
- Remediation recommendations with verification steps
- Permission framework that allows safe read-only operations automatically and requires approval for state changes
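
The read-only versus approval-required split can be pictured with a small sketch. This is not the skill's actual permission configuration: the command prefixes and the `classify` helper below are hypothetical and chosen only for illustration. Anything not on the allow-list, or containing shell metacharacters, is treated as needing approval.

```python
from dataclasses import dataclass

# Hypothetical allow-list for illustration only; the skill's real
# permission configuration may list different commands.
READ_ONLY_PREFIXES = (
    "pcs status",
    "etcdctl member list",
    "etcdctl endpoint status",
    "oc get",
)

# Characters that could chain or redirect commands; their presence
# always forces an approval prompt in this sketch.
SHELL_METACHARACTERS = ";|&`$<>"


@dataclass(frozen=True)
class Decision:
    command: str
    needs_approval: bool


def classify(command: str) -> Decision:
    """Auto-allow known read-only commands; everything else needs approval."""
    normalized = " ".join(command.split())
    has_metachar = any(ch in normalized for ch in SHELL_METACHARACTERS)
    read_only = any(
        normalized == prefix or normalized.startswith(prefix + " ")
        for prefix in READ_ONLY_PREFIXES
    )
    return Decision(normalized, needs_approval=has_metachar or not read_only)


if __name__ == "__main__":
    for cmd in ("pcs status --full", "pcs resource cleanup etcd"):
        decision = classify(cmd)
        label = "ask first" if decision.needs_approval else "auto-allowed"
        print(f"{decision.command}: {label}")
```

Defaulting to "needs approval" keeps the sketch fail-safe: a command nobody anticipated is never run silently.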

Diagnostic Tools:

- Ansible playbooks for validation and data collection
- Shell scripts with automatic proxy.env detection for cluster access
- Master orchestration script collecting both VM and cluster-level diagnostics

Helper:

force-new-cluster.yml - Automated recovery for cluster ID mismatches and split-brain scenarios

Documentation:

- Etcd operations guide (clustering, recovery, monitoring, failures)
- Pacemaker administration reference
- Comprehensive skill usage and permission configuration docs

Tested with:

- Cluster ID mismatch recovery
- Resource failure cleanup
- Failed learner rejoins
- Transient CIB communication errors

openshift-ci · 2025-10-29T10:27:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fonta-rh


Needs approval from an approver in each of these files:

~~OWNERS~~ [fonta-rh]


openshift-ci-robot · 2025-10-29T10:28:35Z

@fonta-rh: This pull request explicitly references no jira issue.


clobrano

I left some comments

.claude/commands/etcd/README.md

.claude/commands/etcd/pacemaker/podman-etcd.txt

.claude/commands/etcd/playbooks/collect-diagnostics.yml

release-notes.md

…ools

Relocate scripts and playbooks from .claude/commands/etcd/ to helpers/etcd/ so they can be used by any tool, not just the Claude skill. This aligns with the existing helpers/ directory structure.

- Move 3 scripts to helpers/etcd/
- Move 2 playbooks to helpers/etcd/playbooks/
- Update internal path calculations (REPO_ROOT, playbook paths)
- Update all documentation references

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Add guidance on choosing between quick manual triage and full diagnostic collection. Quick triage is recommended for initial assessment, with the comprehensive script reserved for complex issues where root cause is unclear.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Instead of tracking a static copy of the podman-etcd resource agent, add a script to fetch it from the ClusterLabs repository when needed. This ensures the reference stays current with upstream changes.

- Add helpers/etcd/fetch-podman-etcd.sh to fetch from GitHub
- Add podman-etcd.txt to .gitignore
- Update documentation references

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

The etcd container logs are not excessively large, so collect all of them for more complete diagnostics.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Instead of assuming the first node in inventory is the leader, the playbook now:

1. Checks which node has etcd running and is the actual leader
2. Falls back to first node with running etcd if no leader found
3. Falls back to inventory order only if no etcd is running

This prevents data loss from incorrectly designating a follower as the recovery leader.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
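
The three-step fallback described in this commit can be sketched in a few lines. The playbook itself is Ansible; the Python below only illustrates the selection order, and the `NodeState` type and `select_recovery_leader` function are hypothetical names, not code from the PR. It assumes each node's etcd state has already been probed and that `nodes` is in inventory order.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class NodeState:
    """Probed state of one node (hypothetical type for this sketch)."""
    name: str
    etcd_running: bool
    is_leader: bool


def select_recovery_leader(nodes: Sequence[NodeState]) -> Optional[NodeState]:
    """Pick the recovery leader; `nodes` must be in inventory order."""
    # 1. Prefer the node that runs etcd and reports itself as leader.
    for node in nodes:
        if node.etcd_running and node.is_leader:
            return node
    # 2. Otherwise take the first node that has etcd running at all.
    for node in nodes:
        if node.etcd_running:
            return node
    # 3. Only when no etcd is running, fall back to inventory order.
    return nodes[0] if nodes else None


if __name__ == "__main__":
    inventory = [
        NodeState("node-0", etcd_running=True, is_leader=False),
        NodeState("node-1", etcd_running=True, is_leader=True),
    ]
    chosen = select_recovery_leader(inventory)
    print(chosen.name if chosen else "no nodes in inventory")
```

In this example the second node is chosen even though it is listed last, which is the case the old "first node in inventory" assumption got wrong.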

clobrano

I left a few comments, but overall it looks good. Great work!

clobrano · 2026-01-09T08:49:46Z

.claude/commands/etcd/PERMISSIONS.md

@@ -0,0 +1,212 @@
+# Etcd Troubleshooting Skill - Permission Configuration


This file is awesome! Does it always work, or is the agent picky? 😁
I need something similar in my environment

clobrano · 2026-01-09T08:58:09Z