NO-JIRA: Claude tool - etcd troubleshooting skill #32
base: main
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fonta-rh
@fonta-rh: This pull request explicitly references no jira issue.
a4c5145 to 50f415d
e527c8e to 8b1da20
clobrano left a comment:
I left some comments
8b1da20 to 8f455fb
Relocate scripts and playbooks from .claude/commands/etcd/ to helpers/etcd/ so they can be used by any tool, not just the Claude skill. This aligns with the existing helpers/ directory structure.
- Move 3 scripts to helpers/etcd/
- Move 2 playbooks to helpers/etcd/playbooks/
- Update internal path calculations (REPO_ROOT, playbook paths)
- Update all documentation references
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
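The path recalculation mentioned in this commit might look roughly like the following. This is a minimal sketch, not the PR's actual code: the function name `repo_root_from` and the script name `collect.sh` are hypothetical; only the `helpers/etcd/` location and the `REPO_ROOT` variable name come from the commit message.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the repository root from the path of a script
# that lives in <repo>/helpers/etcd/ (two directory levels below the root).
repo_root_from() {
  local script_dir
  script_dir="$(dirname "$1")"          # <repo>/helpers/etcd
  dirname "$(dirname "$script_dir")"    # <repo>
}

# Example path is made up; a real script would typically pass "${BASH_SOURCE[0]}".
REPO_ROOT="$(repo_root_from "/src/repo/helpers/etcd/collect.sh")"
echo "$REPO_ROOT"
```

Moving a script one level deeper or shallower changes the number of `dirname` hops, which is why a relocation like this one needs the path logic revisited.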
Add guidance on choosing between quick manual triage and full diagnostic collection. Quick triage is recommended for initial assessment, with the comprehensive script reserved for complex issues where root cause is unclear.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Instead of tracking a static copy of the podman-etcd resource agent, add a script to fetch it from the ClusterLabs repository when needed. This ensures the reference stays current with upstream changes.
- Add helpers/etcd/fetch-podman-etcd.sh to fetch from GitHub
- Add podman-etcd.txt to .gitignore
- Update documentation references
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
The etcd container logs are not excessively large, so collect all of them for more complete diagnostics.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Instead of assuming the first node in inventory is the leader, the playbook now:
1. Checks which node has etcd running and is the actual leader
2. Falls back to first node with running etcd if no leader found
3. Falls back to inventory order only if no etcd is running
This prevents data loss from incorrectly designating a follower as the recovery leader.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
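The three-step fallback in this commit can be sketched as a small shell function. This is an illustration under stated assumptions, not the playbook's actual implementation: the function name `pick_recovery_leader` and the input format (one line per node in inventory order, `<node> <etcd_running> <is_leader>` with `yes`/`no` values) are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the recovery-leader selection order.
# stdin: one line per node, in inventory order:
#   <node> <etcd_running: yes|no> <is_leader: yes|no>
# stdout: the node to treat as the recovery leader.
pick_recovery_leader() {
  awk '
    NR == 1                                    { first = $1 }    # inventory-order fallback
    $2 == "yes" && running == ""               { running = $1 }  # first node with etcd running
    $2 == "yes" && $3 == "yes" && leader == "" { leader = $1 }   # running node that is the leader
    END {
      if (leader != "")       print leader
      else if (running != "") print running
      else                    print first
    }'
}

# Example: master-1 is the actual leader, so it wins over inventory order.
printf 'master-0 yes no\nmaster-1 yes yes\n' | pick_recovery_leader
```

Keeping the decision in one pure function makes each fallback easy to check in isolation before any destructive recovery step runs.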
clobrano left a comment:
I left a few comments, but overall it looks good. Great work!
@@ -0,0 +1,212 @@
# Etcd Troubleshooting Skill - Permission Configuration
This file is awesome! Does it always work, or is the agent picky? 😁
I need something similar in my environment.
**Symptoms:**
- `pcs status` shows: `etcd start on <node> returned 'error'`
- Pacemaker logs show: `crm_attribute: Error performing operation: No such device or address`
Please note: this message can refer to an attribute that is intentionally unset. I have implemented a change to prevent this message from appearing, so it will be removed in a future update. I am not asking to remove this from "common issues", however, as it also points to the "IS_LEARNER" value, which might be helpful.
---

### 2. Split-Brain: "master-X must force a new cluster"
Hopefully, this has never happened, right? 😱
**Diagnosis:**
```bash
# Check cluster IDs on both nodes
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m shell -a \
  "sudo crm_attribute -G -n cluster_id" -b

# Check which node is standalone
ansible cluster_vms -i deploy/openshift-clusters/inventory.ini -m shell -a \
  "sudo crm_attribute -G -n standalone_node" -b
```
I wonder if we can make it more "reliable" by asking the etcd instances running on each node directly, with something like `etcdctl member list`. It would need to be double-checked, however, and testing it is quite some work, so feel free to ignore this comment.
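One way to explore this suggestion is to summarize the output of `etcdctl member list` per member. The sketch below is untested against a real two-node cluster and makes two assumptions: the default "simple" output format of etcd 3.4 or later, where fields are separated by `, ` and the last field is the learner flag, and a hypothetical function name `summarize_members`. The sample input line is made up.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: condense `etcdctl member list` (simple output) to
# "<name> status=<status> learner=<true|false>" per member.
# Assumed field order: ID, status, name, peer URLs, client URLs, is-learner.
summarize_members() {
  awk -F', ' 'NF >= 6 { printf "%s status=%s learner=%s\n", $3, $2, $NF }'
}

# Made-up sample line standing in for real etcdctl output.
printf '8e9e05c52164694d, started, master-0, https://192.0.2.10:2380, https://192.0.2.10:2379, false\n' \
  | summarize_members
```

On a node, the input would come from running `etcdctl member list` wherever etcd is reachable (for example inside the etcd container); how to reach it depends on the deployment and is not shown in this PR.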
**Root Cause:**
OpenShift etcd operator learner promotion workflow stalled or conditions not met.
Luckily, this should never happen anymore, as it is podman-etcd that promotes the member now :)
**Fix:**
```bash
# Force an etcd redeployment via the etcd operator to trigger certificate regeneration.
# The patch is double-quoted so that $(date +%s) is expanded by the shell.
oc patch etcd cluster --type=merge \
  -p "{\"spec\": {\"forceRedeploymentReason\": \"cert-refresh-$(date +%s)\"}}"

# Or manually trigger cert rotation
oc delete secret -n openshift-etcd etcd-all-certs
oc delete pod -n openshift-etcd-operator -l name=etcd-operator

# Wait for operator to regenerate certs and restart etcd
oc get pods -n openshift-etcd -w
```
⭐
### Activating the Skill

In Claude Code, reference the troubleshooting skill in your request:
While it might seem excessive, for someone inexperienced with Claude (or agents in general) it is worth mentioning that the Claude Code CLI must be started inside a local copy of this repository.
Add etcd troubleshooting skill for Claude Code
Adds a comprehensive Claude Code skill that helps troubleshoot etcd issues on two-node fencing clusters.
The skill enables automated diagnosis and remediation of common etcd/Pacemaker problems.
New feature: Claude Code Skill (.claude/commands/etcd/):
Diagnostic Tools:
Helper:
Documentation:
Tested with: