Skip to content

Implement etcd diagnostics admin endpoints#4668

Open
ventifus wants to merge 7 commits intomasterfrom
ventifus/ARO-23146/etcd-admin-actions
Open

Implement etcd diagnostics admin endpoints#4668
ventifus wants to merge 7 commits intomasterfrom
ventifus/ARO-23146/etcd-admin-actions

Conversation

@ventifus
Copy link
Copy Markdown
Collaborator

@ventifus ventifus commented Mar 9, 2026

Adds admin API endpoints for etcd diagnostics on Azure Red Hat OpenShift clusters.

Which issue this PR addresses:

Fixes ARO-23146

What this PR does / why we need it:

Four new admin endpoints for etcd diagnostics without requiring direct cluster access:

  • POST /admin/.../exec: Streams stdout/stderr of a shell command from a named pod container. Restricted to OpenShift system namespaces.
  • POST /admin/.../runjob: Accepts a Kubernetes Job manifest, creates it, streams pod logs, reports the final result, and cleans up. Restricted to OpenShift system namespaces.
  • POST /admin/.../etcdkeycount: Reports the top-10 etcd key counts by Kubernetes namespace via etcdctl get --prefix /kubernetes.io/ on the named node's etcd pod.
  • POST /admin/.../etcdanalysis: Takes an etcd snapshot via exec, runs the octosql-etcd analysis Job against it via a HostPath volume, streams results, and cleans up the snapshot and Job on completion or cancellation.

Design decisions

etcdanalysis image tag (quay.io/redhat_emp1/octosql-etcd:latest): Currently mutable and hosted on a personal Quay namespace; a TODO comment marks this as a supply-chain risk. The reference will move to a Microsoft/Red Hat-controlled registry and be pinned to a digest before GA.

KubeExecStream uses golang.org/x/net/websocket: KubeExecStream dials the Kubernetes WebSocket exec subprotocol (v4.channel.k8s.io) directly. We use golang.org/x/net/websocket because the project currently pins k8s.io/client-go to v0.25.16 via a replace directive. remotecommand.NewWebSocketExecutor (the maintained client-go WebSocket transport) was added in v0.28; NewSPDYExecutor (the older transport) is not an option as SPDY has been removed from all supported OpenShift versions. Once the project upgrades to client-go v0.28+, KubeExecStream can be simplified to use remotecommand.NewWebSocketExecutor. A TODO comment marks the spot. Auth supports both BearerToken and BearerTokenFile rest.Config fields.

Test plan for issue:

Unit tests cover all validation paths, streaming success/failure cases, and edge cases:

  • TestLimitedWriter (6 cases): boundary, truncation, post-truncation silence, and notice formatting.
  • TestWaitForJobPod_ErrorBranches (4 cases): closed channel, watch.Error with *metav1.Status, watch.Error with non-Status object, context cancellation.
  • TestWaitForJobTerminal_MaxErrors: 10 consecutive KubeGet failures trigger max-error exit.
  • TestRunJobStream_CleanupFailure / TestRunJobStream_CleanupNotFound: non-NotFound delete error writes "Cleanup failed:"; 404 is treated as success and writes "Cleanup complete.".

E2e tests (guarded by skipIfNotInDevelopmentEnv) cover all four endpoints against a real cluster via adminRequest(). All four pass against a live ARO cluster.

Is there any documentation that needs to be updated for this PR?

SRE runbooks and SOPs will be written at eng.ms before GA, covering when to use each endpoint, how to interpret output, and remediation steps for common findings.

How do you know this will function as expected in production?

All new code paths have unit test coverage. Watch-before-create ordering avoids pod event races; bounded output protects RP memory; retry logic and cleanup timeouts using fresh contexts handle transient apiserver failures and ensure privileged objects are cleaned up even when the request context is cancelled. All four e2e tests pass against a live ARO cluster.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR (WIP) is adding plumbing needed for new admin diagnostics endpoints, including the ability to execute commands in Kubernetes pods while streaming output back to callers.

Changes:

  • Extend KubeActions with a KubeExecStream method for streaming kubectl exec-style output.
  • Store the *rest.Config in kubeActions to support SPDY exec.
  • Regenerate/update the GoMock KubeActions mock to include the new method.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
pkg/frontend/adminactions/kubeactions.go Adds KubeExecStream API + implementation using SPDY exec and stores rest.Config on the struct.
pkg/util/mocks/adminactions/kubeactions.go Updates generated mock to match the expanded KubeActions interface.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ventifus ventifus force-pushed the ventifus/ARO-23146/etcd-admin-actions branch 2 times, most recently from 0aea0d4 to 12d3b6d Compare March 9, 2026 23:04
@ventifus ventifus requested a review from Copilot March 9, 2026 23:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ventifus ventifus force-pushed the ventifus/ARO-23146/etcd-admin-actions branch 2 times, most recently from 749e0c4 to 94a720d Compare March 10, 2026 00:02
Copilot AI review requested due to automatic review settings March 10, 2026 00:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ventifus ventifus force-pushed the ventifus/ARO-23146/etcd-admin-actions branch from 94a720d to dfb33a2 Compare March 10, 2026 00:11
@ventifus
Copy link
Copy Markdown
Collaborator Author

/azp run ci

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@tuxerrante
Copy link
Copy Markdown
Collaborator

Hi,
just a fast mention that I've recently introduced a validateEtcdHealth func that you may find useful to reuse :)

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@azure-pipelines
Copy link
Copy Markdown

There was an error handling pipeline event 826f2607-647b-4680-97c7-91be6cfed042.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// Close the underlying TLS connection to unblock the goroutine's Receive.
// The goroutine's defer wsConn.Close() handles WebSocket-level teardown.
tlsConn.Close()
<-resultCh // wait for goroutine to finish all writes before returning
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In KubeExecStream, the ctx-cancellation path waits on <-resultCh after closing tlsConn. If the receive goroutine is blocked on stdout.Write/stderr.Write (e.g., slow client causing the io.PipeWriter to block), it will never send on resultCh, and cancellation/timeout won’t return promptly. Consider making the cancel path non-blocking (return ctx.Err() without waiting), or ensure writes can’t block indefinitely (e.g., write via a buffered channel/goroutine and stop draining on cancel).

Suggested change
<-resultCh // wait for goroutine to finish all writes before returning
// Do not wait on resultCh here: the goroutine may be blocked on writes to
// user-provided stdout/stderr, and we want to return promptly on cancel.

Copilot uses AI. Check for mistakes.
ventifus and others added 7 commits March 27, 2026 16:13
Dials the Kubernetes WebSocket exec subprotocol (v4.channel.k8s.io), demuxing
channel frames into stdout (ch 1), stderr (ch 2), and exit status (ch 3). On
context cancellation, closes the WebSocket and drains the internal result
channel before returning ctx.Err(), ensuring all writes to caller io.Writers
are complete.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ction constants

Fix two bugs in the existing etcd recovery privilege management:
- kubeServiceAccount constant was missing the colon separator between
  the namespace and service account name
- newClusterRoleBinding roleRef used kubeServiceAccount (a Subject string)
  as the ClusterRole name instead of usersAccount

Also refactor createPrivilegedServiceAccount cleanup to use a fresh context
(so request cancellation cannot prevent RBAC/SCC teardown), retry transient
API server errors up to 3 times, and collect all delete errors with
errors.Join rather than failing on the first.

Extract shared admin-action constants (execOutputLimit, adminActionStreamTimeout,
adminActionCleanupTimeout) into common.go. These are used by all four new
admin endpoints and by the fixetcd cleanup path.

Add etcdContainerName constant for the etcd management sidecar container.
Remove unused cluster parameter from newClusterRole and newClusterRoleBinding.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
POST /admin/.../exec streams stdout/stderr of a shell command from a named pod
container. Validates namespace (must be an OpenShift namespace), pod name,
container name, and command. Caps each stream at 1 MiB. Exposes
execContainerStream for reuse by higher-level actions.

Adds limitedWriter (truncates at 1 MiB with a notice) and nopWriteCloser
(no-op Close for composed writers) to common.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
POST /admin/.../runjob accepts a Kubernetes Job manifest, creates it, streams
pod logs as they arrive, reports the terminal state, and deletes the Job. Adds
KubeFollowPodLogs to KubeActions. Uses retry.OnError with an isTransient
predicate (k8s.io/client-go/util/retry) for transient-error retry on Job
cleanup; retryDelay is a package-level var for test injection. Watch is
established before job creation to avoid missing the pod-created event.
parseAndValidateJob normalises the manifest: clamps backoffLimit to 0, requires
exactly one container, truncates names over 57 chars, and restricts namespaces
to OpenShift system namespaces.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
POST /admin/.../etcdkeycount?nodeName=<node> execs a fixed etcdctl command in
the etcd pod on the named node and streams the top-10 key counts by Kubernetes
namespace. The command is hardcoded in the RP so callers cannot inject
arbitrary shell commands.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
POST /admin/.../etcdanalysis?nodeName=<node> orchestrates three steps:
1. Execs etcdctl snapshot save in the etcdctl container on the target node.
2. Creates a privileged ServiceAccount (SA, ClusterRole, CRB, SCC) for the Job pod.
3. Runs a Job (octosql-etcd image) against the snapshot via a HostPath volume,
   streaming results.

The snapshot is removed on all exit paths via a deferred rm -f using a fresh
context. Privileged RBAC/SCC objects are deleted via the cleanup closure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…admin endpoints

Tests follow the existing e2e pattern (skipIfNotInDevelopmentEnv, adminRequest()).
Covers success paths, failing jobs, stderr passthrough, 400/403 for invalid
input and forbidden namespaces, and post-job cleanup verification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

chainsaw Pull requests or issues owned by Team Chainsaw ready-for-review size-large Size large

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants