@fonta-rh
Contributor

Add etcd troubleshooting skill for Claude Code

Adds a comprehensive Claude Code skill that helps troubleshoot etcd issues on two-node fencing clusters.

The skill enables automated diagnosis and remediation of common etcd/Pacemaker problems.

New feature: Claude Code Skill (.claude/commands/etcd/):

  • Interactive troubleshooting capability with systematic decision trees
  • Automated diagnostic data collection from Pacemaker, etcd, and OpenShift
  • Analysis guidelines for cluster state, resource failures, and error patterns
  • Remediation recommendations with verification steps
  • Permission framework that allows safe read-only operations and requires approval for state changes

Diagnostic Tools:

  • Ansible playbooks for validation and data collection
  • Shell scripts with automatic proxy.env detection for cluster access
  • Master orchestration script collecting both VM and cluster-level diagnostics

Helper:

  • force-new-cluster.yml - Automated recovery for cluster ID mismatches and split-brain scenarios

Documentation:

  • Etcd operations guide (clustering, recovery, monitoring, failures)
  • Pacemaker administration reference
  • Comprehensive skill usage and permission configuration docs

Tested with:

  • Cluster ID mismatch recovery
  • Resource failure cleanup
  • Failed learner rejoins
  • Transient CIB communication errors

@openshift-ci openshift-ci bot requested review from clobrano and eggfoobar October 29, 2025 10:27
@openshift-ci
openshift-ci bot commented Oct 29, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fonta-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 29, 2025
@fonta-rh fonta-rh changed the title Claude tool: etcd troubleshooting skill NO-JIRA: Claude tool - etcd troubleshooting skill Oct 29, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 29, 2025
@openshift-ci-robot

@fonta-rh: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@fonta-rh fonta-rh added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. and removed jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Oct 29, 2025
Contributor

@clobrano clobrano left a comment

I left some comments

fonta-rh and others added 8 commits January 8, 2026 20:40

Relocate scripts and playbooks from .claude/commands/etcd/ to helpers/etcd/
so they can be used by any tool, not just the Claude skill. This aligns with
the existing helpers/ directory structure.

- Move 3 scripts to helpers/etcd/
- Move 2 playbooks to helpers/etcd/playbooks/
- Update internal path calculations (REPO_ROOT, playbook paths)
- Update all documentation references

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Add guidance on choosing between quick manual triage and full
diagnostic collection. Quick triage is recommended for initial
assessment, with the comprehensive script reserved for complex
issues where root cause is unclear.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Instead of tracking a static copy of the podman-etcd resource agent,
add a script to fetch it from the ClusterLabs repository when needed.
This ensures the reference stays current with upstream changes.

- Add helpers/etcd/fetch-podman-etcd.sh to fetch from GitHub
- Add podman-etcd.txt to .gitignore
- Update documentation references

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

The etcd container logs are not excessively large, so collect all
of them for more complete diagnostics.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Instead of assuming the first node in inventory is the leader,
the playbook now:
1. Checks which node has etcd running and is the actual leader
2. Falls back to first node with running etcd if no leader found
3. Falls back to inventory order only if no etcd is running

This prevents data loss from incorrectly designating a follower
as the recovery leader.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
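
The fallback order described in the last commit message can be sketched as a small shell function. This is only an illustration of the three-step selection, not the playbook's actual implementation: the function name `pick_recovery_leader` and the input format (one line per node, in inventory order, as `<node> <etcd_running> <is_leader>`) are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the recovery leader from node states on stdin.
# Each input line is "<node> <etcd_running:yes|no> <is_leader:yes|no>",
# listed in inventory order. The format is an assumption for this sketch.
pick_recovery_leader() {
    local first_any="" first_running="" node running leader
    while read -r node running leader; do
        if [ -z "$node" ]; then
            continue
        fi
        # Remember the first node in inventory order (last-resort fallback).
        if [ -z "$first_any" ]; then
            first_any="$node"
        fi
        if [ "$running" = "yes" ]; then
            # 1. A node with etcd running that is the actual leader wins.
            if [ "$leader" = "yes" ]; then
                echo "$node"
                return 0
            fi
            # Remember the first node with etcd running (second choice).
            if [ -z "$first_running" ]; then
                first_running="$node"
            fi
        fi
    done
    if [ -n "$first_running" ]; then
        # 2. No leader found: first node with running etcd.
        echo "$first_running"
    elif [ -n "$first_any" ]; then
        # 3. No etcd running anywhere: fall back to inventory order.
        echo "$first_any"
    else
        return 1
    fi
}
```

For example, `printf 'master-0 yes no\nmaster-1 yes yes\n' | pick_recovery_leader` selects `master-1`, even though `master-0` comes first in the inventory.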
Contributor

@clobrano clobrano left a comment

I left a few comments, but overall it looks good. Great work!

@@ -0,0 +1,212 @@
# Etcd Troubleshooting Skill - Permission Configuration
Contributor

This file is awesome! Does it always work, or is the agent picky? 😁
I need something similar in my environment


**Symptoms:**
- `pcs status` shows: `etcd start on <node> returned 'error'`
- Pacemaker logs show: `crm_attribute: Error performing operation: No such device or address`
Contributor

Please note: this message can refer to an attribute that is intentionally unset. I have implemented a change to prevent this message from appearing, so it will be removed in a future update. I'm not asking to remove this from "common issues", however, as it also points to the "IS_LEARNER" value, which might be helpful.


---

### 2. Split-Brain: "master-X must force a new cluster"
Contributor

Hopefully, this has never happened, right? 😱

Comment on lines +97 to +106
**Diagnosis:**
```bash
# Check cluster IDs on both nodes
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m shell -a \
"sudo crm_attribute -G -n cluster_id" -b

# Check which node is standalone
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m shell -a \
"sudo crm_attribute -G -n standalone_node" -b
```
Contributor

I wonder if we can make this more "reliable" by asking the etcd instances running on each node directly, with something like "etcdctl member list". It would need to be double-checked, however, and testing it is quite some work, so feel free to ignore this comment.
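
A rough sketch of what that check could look like, assuming the default (simple) output of `etcdctl member list`, where each line is `ID, status, name, peer URLs, client URLs, is-learner`. The helper names `list_learners` and `member_ids` are hypothetical, and how `etcdctl` is reached on each node (container name, certificates, endpoints) is left out because it depends on the deployment.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that read `etcdctl member list` (simple format) on stdin.
# Assumed columns, separated by ", ":
#   ID, status, name, peer URLs, client URLs, is-learner

# Print the names of members that are still learners.
list_learners() {
    awk -F', ' '$6 == "true" { print $3 }'
}

# Print the sorted member IDs as one comma-separated line, so the views
# reported by two nodes can be compared with a plain string comparison.
member_ids() {
    awk -F', ' 'NF >= 5 { print $1 }' | sort | paste -sd, -
}
```

Running `member_ids` on the output collected from each node and comparing the two strings would show whether both etcd instances report the same membership; `list_learners` shows which member, if any, has not been promoted yet.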

Comment on lines +209 to +211
**Root Cause:**
OpenShift etcd operator learner promotion workflow stalled or conditions not met.

Contributor

Luckily, this should never happen anymore, as it is podman-etcd that promotes the member now :)

Comment on lines +268 to +279
**Fix:**
```bash
# Force an etcd redeployment so the operator refreshes certificates
# (double quotes so that $(date +%s) is expanded by the shell)
oc patch etcd cluster --type=merge -p "{\"spec\": {\"forceRedeploymentReason\": \"cert-refresh-$(date +%s)\"}}"

# Or manually trigger cert rotation
oc delete secret -n openshift-etcd etcd-all-certs
oc delete pod -n openshift-etcd-operator -l name=etcd-operator

# Wait for operator to regenerate certs and restart etcd
oc get pods -n openshift-etcd -w
```

### Activating the Skill

In Claude Code, reference the troubleshooting skill in your request:
Contributor

While it might seem excessive, for someone who is inexperienced with Claude (or agents in general) it is worth mentioning that the Claude Code CLI must be started inside a local copy of this repository.
