Implement etcd diagnostics admin endpoints by ventifus · Pull Request #4668 · Azure/ARO-RP

ventifus · 2026-03-09T22:50:28Z

Adds admin API endpoints for etcd diagnostics on Azure Red Hat OpenShift clusters.

Which issue this PR addresses:

What this PR does / why we need it:

Four new admin endpoints for etcd diagnostics without requiring direct cluster access:

POST /admin/.../exec: Streams stdout/stderr of a shell command from a named pod container. Restricted to OpenShift system namespaces.
POST /admin/.../runjob: Accepts a Kubernetes Job manifest, creates it, streams pod logs, reports the final result, and cleans up. Restricted to OpenShift system namespaces.
POST /admin/.../etcdkeycount: Reports the top-10 etcd key counts by Kubernetes namespace via etcdctl get --prefix /kubernetes.io/ on the named node's etcd pod.
POST /admin/.../etcdanalysis: Takes an etcd snapshot via exec, runs the octosql-etcd analysis Job against it via a HostPath volume, streams results, and cleans up the snapshot and Job on completion or cancellation.

Design decisions

etcdanalysis image tag (quay.io/redhat_emp1/octosql-etcd:latest): Currently mutable and hosted on a personal Quay namespace; a TODO comment marks this as a supply-chain risk. The reference will move to a Microsoft/Red Hat-controlled registry and be pinned to a digest before GA.

KubeExecStream uses golang.org/x/net/websocket: KubeExecStream dials the Kubernetes WebSocket exec subprotocol (v4.channel.k8s.io) directly. We use golang.org/x/net/websocket because the project currently pins k8s.io/client-go to v0.25.16 via a replace directive. remotecommand.NewWebSocketExecutor (the maintained client-go WebSocket transport) was added in v0.28; NewSPDYExecutor (the older transport) is not an option as SPDY has been removed from all supported OpenShift versions. Once the project upgrades to client-go v0.28+, KubeExecStream can be simplified to use remotecommand.NewWebSocketExecutor. A TODO comment marks the spot. Auth supports both BearerToken and BearerTokenFile rest.Config fields.

Test plan for issue:

Unit tests cover all validation paths, streaming success/failure cases, and edge cases:

TestLimitedWriter (6 cases): boundary, truncation, post-truncation silence, and notice formatting.
TestWaitForJobPod_ErrorBranches (4 cases): closed channel, watch.Error with *metav1.Status, watch.Error with non-Status object, context cancellation.
TestWaitForJobTerminal_MaxErrors: 10 consecutive KubeGet failures trigger max-error exit.
TestRunJobStream_CleanupFailure / TestRunJobStream_CleanupNotFound: non-NotFound delete error writes "Cleanup failed:"; 404 is treated as success and writes "Cleanup complete.".

E2e tests (guarded by skipIfNotInDevelopmentEnv) cover all four endpoints against a real cluster via adminRequest(). All four pass against a live ARO cluster.

Is there any documentation that needs to be updated for this PR?

SRE runbooks and SOPs will be written at eng.ms before GA, covering when to use each endpoint, how to interpret output, and remediation steps for common findings.

How do you know this will function as expected in production?

All new code paths have unit test coverage. Watch-before-create ordering avoids pod event races; bounded output protects RP memory; retry logic and cleanup timeouts using fresh contexts handle transient apiserver failures and ensure privileged objects are cleaned up even when the request context is cancelled. All four e2e tests pass against a live ARO cluster.

Copilot

Pull request overview

This PR (WIP) is adding plumbing needed for new admin diagnostics endpoints, including the ability to execute commands in Kubernetes pods while streaming output back to callers.

Changes:

Extend KubeActions with a KubeExecStream method for streaming kubectl exec-style output.
Store the *rest.Config in kubeActions to support SPDY exec.
Regenerate/update the GoMock KubeActions mock to include the new method.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
pkg/frontend/adminactions/kubeactions.go	Adds `KubeExecStream` API + implementation using SPDY exec and stores `rest.Config` on the struct.
pkg/util/mocks/adminactions/kubeactions.go	Updates generated mock to match the expanded `KubeActions` interface.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/adminactions/kubeactions.go

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/admin_openshiftcluster_exec.go

pkg/frontend/frontend.go

pkg/frontend/adminactions/kubeactions.go

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/admin_openshiftcluster_runjob.go

pkg/frontend/adminactions/kubeactions.go

pkg/frontend/admin_openshiftcluster_runjob.go

ventifus · 2026-03-12T22:22:07Z

/azp run ci

azure-pipelines · 2026-03-12T22:22:20Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/adminactions/kubeactions.go

pkg/frontend/adminactions/spdy.go

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/adminactions/kubeactions.go

pkg/frontend/fixetcd.go

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/admin_openshiftcluster_etcdanalysis.go

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/util/kubernetes/retry.go

pkg/frontend/fixetcd_test.go

pkg/frontend/adminactions/kubeactions.go

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/admin_openshiftcluster_etcdanalysis.go

test/e2e/adminapi_runjob.go

pkg/frontend/adminactions/kubeactions.go

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

test/e2e/adminapi_exec.go

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

test/e2e/adminapi_exec.go

test/e2e/adminapi_runjob.go

pkg/frontend/adminactions/kubeactions.go

pkg/frontend/admin_openshiftcluster_etcdkeycount.go

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/admin_openshiftcluster_exec.go

pkg/frontend/admin_openshiftcluster_etcdkeycount.go

pkg/frontend/admin_openshiftcluster_etcdanalysis.go

pkg/frontend/adminactions/kubeactions.go

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/fixetcd.go

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tuxerrante · 2026-03-16T10:51:49Z

Hi,
just a fast mention that I've recently introduced a validateEtcdHealth func that you may find useful to reuse :)

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/util/kubernetes/retry.go

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/adminactions/kubeactions.go

pkg/frontend/admin_openshiftcluster_runjob.go

pkg/frontend/fixetcd_test.go

Copilot

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/adminactions/kubeactions.go

pkg/frontend/admin_openshiftcluster_runjob_test.go

azure-pipelines · 2026-03-16T18:01:10Z

There was an error handling pipeline event 826f2607-647b-4680-97c7-91be6cfed042.

Copilot

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated no new comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/frontend/common.go

pkg/util/kubernetes/retry.go

pkg/frontend/admin_openshiftcluster_exec.go

pkg/frontend/scripts/etcdkeycount.sh

Copilot

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/util/kubernetes/retry.go

pkg/frontend/adminactions/kubeactions.go

pkg/frontend/common.go

pkg/frontend/fixetcd.go

Copilot

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/util/kubernetes/retry.go

pkg/frontend/common.go

pkg/frontend/adminactions/kubeactions.go

pkg/frontend/admin_openshiftcluster_etcdanalysis.go

pkg/frontend/scripts/etcdkeycount.sh

Copilot

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-27T21:27:20Z

pkg/frontend/adminactions/kubeactions.go

+		// Close the underlying TLS connection to unblock the goroutine's Receive.
+		// The goroutine's defer wsConn.Close() handles WebSocket-level teardown.
+		tlsConn.Close()
+		<-resultCh // wait for goroutine to finish all writes before returning


In KubeExecStream, the ctx-cancellation path waits on <-resultCh after closing tlsConn. If the receive goroutine is blocked on stdout.Write/stderr.Write (e.g., slow client causing the io.PipeWriter to block), it will never send on resultCh, and cancellation/timeout won’t return promptly. Consider making the cancel path non-blocking (return ctx.Err() without waiting), or ensure writes can’t block indefinitely (e.g., write via a buffered channel/goroutine and stop draining on cancel).

Suggested change

<-resultCh // wait for goroutine to finish all writes before returning

// Do not wait on resultCh here: the goroutine may be blocked on writes to

// user-provided stdout/stderr, and we want to return promptly on cancel.

Dials the Kubernetes WebSocket exec subprotocol (v4.channel.k8s.io), demuxing channel frames into stdout (ch 1), stderr (ch 2), and exit status (ch 3). On context cancellation, closes the WebSocket and drains the internal result channel before returning ctx.Err(), ensuring all writes to caller io.Writers are complete. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ction constants Fix two bugs in the existing etcd recovery privilege management: - kubeServiceAccount constant was missing the colon separator between the namespace and service account name - newClusterRoleBinding roleRef used kubeServiceAccount (a Subject string) as the ClusterRole name instead of usersAccount Also refactor createPrivilegedServiceAccount cleanup to use a fresh context (so request cancellation cannot prevent RBAC/SCC teardown), retry transient API server errors up to 3 times, and collect all delete errors with errors.Join rather than failing on the first. Extract shared admin-action constants (execOutputLimit, adminActionStreamTimeout, adminActionCleanupTimeout) into common.go. These are used by all four new admin endpoints and by the fixetcd cleanup path. Add etcdContainerName constant for the etcd management sidecar container. Remove unused cluster parameter from newClusterRole and newClusterRoleBinding. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

POST /admin/.../exec streams stdout/stderr of a shell command from a named pod container. Validates namespace (must be an OpenShift namespace), pod name, container name, and command. Caps each stream at 1 MiB. Exposes execContainerStream for reuse by higher-level actions. Adds limitedWriter (truncates at 1 MiB with a notice) and nopWriteCloser (no-op Close for composed writers) to common.go. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

POST /admin/.../runjob accepts a Kubernetes Job manifest, creates it, streams pod logs as they arrive, reports the terminal state, and deletes the Job. Adds KubeFollowPodLogs to KubeActions. Uses retry.OnError with an isTransient predicate (k8s.io/client-go/util/retry) for transient-error retry on Job cleanup; retryDelay is a package-level var for test injection. Watch is established before job creation to avoid missing the pod-created event. parseAndValidateJob normalises the manifest: clamps backoffLimit to 0, requires exactly one container, truncates names over 57 chars, and restricts namespaces to OpenShift system namespaces. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

POST /admin/.../etcdkeycount?nodeName=<node> execs a fixed etcdctl command in the etcd pod on the named node and streams the top-10 key counts by Kubernetes namespace. The command is hardcoded in the RP so callers cannot inject arbitrary shell commands. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

POST /admin/.../etcdanalysis?nodeName=<node> orchestrates three steps: 1. Execs etcdctl snapshot save in the etcdctl container on the target node. 2. Creates a privileged ServiceAccount (SA, ClusterRole, CRB, SCC) for the Job pod. 3. Runs a Job (octosql-etcd image) against the snapshot via a HostPath volume, streaming results. The snapshot is removed on all exit paths via a deferred rm -f using a fresh context. Privileged RBAC/SCC objects are deleted via the cleanup closure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…admin endpoints Tests follow the existing e2e pattern (skipIfNotInDevelopmentEnv, adminRequest()). Covers success paths, failing jobs, stderr passthrough, 400/403 for invalid input and forbidden namespaces, and post-job cleanup verification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings March 9, 2026 22:50

ventifus requested review from alcasim, bennerv, cadenmarchese, hawkowl, hlipsig, jharrington22, kimorris27, mociarain, mrWinston, rogbas, sankur-codes, tiguelu, tsatam, tuxerrante, wanghaoran1988 and yjst2012 as code owners March 9, 2026 22:50

Copilot started reviewing on behalf of ventifus March 9, 2026 22:51 View session

Copilot AI reviewed Mar 9, 2026

View reviewed changes

pkg/frontend/adminactions/kubeactions.go Outdated Show resolved Hide resolved

ventifus force-pushed the ventifus/ARO-23146/etcd-admin-actions branch 2 times, most recently from 0aea0d4 to 12d3b6d Compare March 9, 2026 23:04

ventifus requested a review from Copilot March 9, 2026 23:07

Copilot started reviewing on behalf of ventifus March 9, 2026 23:07 View session

Copilot AI reviewed Mar 9, 2026

View reviewed changes

ventifus force-pushed the ventifus/ARO-23146/etcd-admin-actions branch 2 times, most recently from 749e0c4 to 94a720d Compare March 10, 2026 00:02

Copilot AI review requested due to automatic review settings March 10, 2026 00:02

Copilot started reviewing on behalf of ventifus March 10, 2026 00:03 View session

Copilot AI reviewed Mar 10, 2026

View reviewed changes

ventifus force-pushed the ventifus/ARO-23146/etcd-admin-actions branch from 94a720d to dfb33a2 Compare March 10, 2026 00:11

Copilot AI reviewed Mar 13, 2026

View reviewed changes

pkg/frontend/adminactions/kubeactions.go Show resolved Hide resolved

pkg/frontend/adminactions/spdy.go Outdated Show resolved Hide resolved

Copilot AI reviewed Mar 13, 2026

View reviewed changes

pkg/frontend/adminactions/kubeactions.go Outdated Show resolved Hide resolved

pkg/frontend/fixetcd.go Outdated Show resolved Hide resolved

Copilot AI reviewed Mar 13, 2026

View reviewed changes

pkg/frontend/admin_openshiftcluster_etcdanalysis.go Outdated Show resolved Hide resolved

Copilot AI reviewed Mar 13, 2026

View reviewed changes

pkg/util/kubernetes/retry.go Outdated Show resolved Hide resolved

pkg/frontend/fixetcd_test.go Outdated Show resolved Hide resolved

pkg/frontend/adminactions/kubeactions.go Show resolved Hide resolved

Copilot AI reviewed Mar 13, 2026

View reviewed changes

pkg/frontend/admin_openshiftcluster_etcdanalysis.go Outdated Show resolved Hide resolved

test/e2e/adminapi_runjob.go Show resolved Hide resolved

pkg/frontend/adminactions/kubeactions.go Show resolved Hide resolved

pkg/frontend/adminactions/kubeactions.go Show resolved Hide resolved

Copilot AI reviewed Mar 13, 2026

View reviewed changes

test/e2e/adminapi_exec.go Show resolved Hide resolved

Copilot AI reviewed Mar 13, 2026

View reviewed changes

test/e2e/adminapi_exec.go Outdated Show resolved Hide resolved

test/e2e/adminapi_runjob.go Outdated Show resolved Hide resolved

pkg/frontend/adminactions/kubeactions.go Show resolved Hide resolved

pkg/frontend/admin_openshiftcluster_etcdkeycount.go Outdated Show resolved Hide resolved

Copilot AI reviewed Mar 14, 2026

View reviewed changes

Copilot AI reviewed Mar 15, 2026

View reviewed changes

pkg/frontend/fixetcd.go Outdated Show resolved Hide resolved

Copilot AI reviewed Mar 15, 2026

View reviewed changes

Copilot AI reviewed Mar 16, 2026

View reviewed changes

pkg/util/kubernetes/retry.go Outdated Show resolved Hide resolved

Copilot AI reviewed Mar 16, 2026

View reviewed changes

pkg/frontend/adminactions/kubeactions.go Show resolved Hide resolved

pkg/frontend/admin_openshiftcluster_runjob.go Outdated Show resolved Hide resolved

pkg/frontend/fixetcd_test.go Outdated Show resolved Hide resolved

Copilot AI reviewed Mar 16, 2026

View reviewed changes

pkg/frontend/adminactions/kubeactions.go Outdated Show resolved Hide resolved

pkg/frontend/admin_openshiftcluster_runjob_test.go Show resolved Hide resolved

Copilot AI reviewed Mar 16, 2026

View reviewed changes

Copilot AI reviewed Mar 19, 2026

View reviewed changes

pkg/frontend/common.go Show resolved Hide resolved

pkg/util/kubernetes/retry.go Outdated Show resolved Hide resolved

pkg/frontend/admin_openshiftcluster_exec.go Show resolved Hide resolved

pkg/frontend/scripts/etcdkeycount.sh Outdated Show resolved Hide resolved

Copilot AI reviewed Mar 19, 2026

View reviewed changes

pkg/util/kubernetes/retry.go Outdated Show resolved Hide resolved

pkg/frontend/adminactions/kubeactions.go Show resolved Hide resolved

pkg/frontend/common.go Show resolved Hide resolved

pkg/frontend/fixetcd.go Outdated Show resolved Hide resolved

Copilot AI reviewed Mar 19, 2026

View reviewed changes

Copilot AI reviewed Mar 27, 2026

View reviewed changes

ventifus and others added 7 commits March 27, 2026 16:13

	<-resultCh // wait for goroutine to finish all writes before returning
	// Do not wait on resultCh here: the goroutine may be blocked on writes to
	// user-provided stdout/stderr, and we want to return promptly on cancel.

Conversation

ventifus commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue this PR addresses:

What this PR does / why we need it:

Design decisions

Test plan for issue:

Is there any documentation that needs to be updated for this PR?

How do you know this will function as expected in production?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ventifus commented Mar 12, 2026

Uh oh!

azure-pipelines bot commented Mar 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

ventifus commented Mar 9, 2026 •

edited

Loading