Conversation
There was a problem hiding this comment.
Pull request overview
This PR (WIP) is adding plumbing needed for new admin diagnostics endpoints, including the ability to execute commands in Kubernetes pods while streaming output back to callers.
Changes:
- Extend
KubeActionswith aKubeExecStreammethod for streamingkubectl exec-style output. - Store the
*rest.ConfiginkubeActionsto support SPDY exec. - Regenerate/update the GoMock
KubeActionsmock to include the new method.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/frontend/adminactions/kubeactions.go | Adds KubeExecStream API + implementation using SPDY exec and stores rest.Config on the struct. |
| pkg/util/mocks/adminactions/kubeactions.go | Updates generated mock to match the expanded KubeActions interface. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
0aea0d4 to
12d3b6d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
749e0c4 to
94a720d
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
94a720d to
dfb33a2
Compare
|
/azp run ci |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Hi, |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
There was an error handling pipeline event 826f2607-647b-4680-97c7-91be6cfed042. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 22 out of 22 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Close the underlying TLS connection to unblock the goroutine's Receive. | ||
| // The goroutine's defer wsConn.Close() handles WebSocket-level teardown. | ||
| tlsConn.Close() | ||
| <-resultCh // wait for goroutine to finish all writes before returning |
There was a problem hiding this comment.
In KubeExecStream, the ctx-cancellation path waits on <-resultCh after closing tlsConn. If the receive goroutine is blocked on stdout.Write/stderr.Write (e.g., slow client causing the io.PipeWriter to block), it will never send on resultCh, and cancellation/timeout won’t return promptly. Consider making the cancel path non-blocking (return ctx.Err() without waiting), or ensure writes can’t block indefinitely (e.g., write via a buffered channel/goroutine and stop draining on cancel).
| <-resultCh // wait for goroutine to finish all writes before returning | |
| // Do not wait on resultCh here: the goroutine may be blocked on writes to | |
| // user-provided stdout/stderr, and we want to return promptly on cancel. |
Dials the Kubernetes WebSocket exec subprotocol (v4.channel.k8s.io), demuxing channel frames into stdout (ch 1), stderr (ch 2), and exit status (ch 3). On context cancellation, closes the WebSocket and drains the internal result channel before returning ctx.Err(), ensuring all writes to caller io.Writers are complete. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ction constants Fix two bugs in the existing etcd recovery privilege management: - kubeServiceAccount constant was missing the colon separator between the namespace and service account name - newClusterRoleBinding roleRef used kubeServiceAccount (a Subject string) as the ClusterRole name instead of usersAccount Also refactor createPrivilegedServiceAccount cleanup to use a fresh context (so request cancellation cannot prevent RBAC/SCC teardown), retry transient API server errors up to 3 times, and collect all delete errors with errors.Join rather than failing on the first. Extract shared admin-action constants (execOutputLimit, adminActionStreamTimeout, adminActionCleanupTimeout) into common.go. These are used by all four new admin endpoints and by the fixetcd cleanup path. Add etcdContainerName constant for the etcd management sidecar container. Remove unused cluster parameter from newClusterRole and newClusterRoleBinding. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
POST /admin/.../exec streams stdout/stderr of a shell command from a named pod container. Validates namespace (must be an OpenShift namespace), pod name, container name, and command. Caps each stream at 1 MiB. Exposes execContainerStream for reuse by higher-level actions. Adds limitedWriter (truncates at 1 MiB with a notice) and nopWriteCloser (no-op Close for composed writers) to common.go. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
POST /admin/.../runjob accepts a Kubernetes Job manifest, creates it, streams pod logs as they arrive, reports the terminal state, and deletes the Job. Adds KubeFollowPodLogs to KubeActions. Uses retry.OnError with an isTransient predicate (k8s.io/client-go/util/retry) for transient-error retry on Job cleanup; retryDelay is a package-level var for test injection. Watch is established before job creation to avoid missing the pod-created event. parseAndValidateJob normalises the manifest: clamps backoffLimit to 0, requires exactly one container, truncates names over 57 chars, and restricts namespaces to OpenShift system namespaces. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
POST /admin/.../etcdkeycount?nodeName=<node> execs a fixed etcdctl command in the etcd pod on the named node and streams the top-10 key counts by Kubernetes namespace. The command is hardcoded in the RP so callers cannot inject arbitrary shell commands. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
POST /admin/.../etcdanalysis?nodeName=<node> orchestrates three steps: 1. Execs etcdctl snapshot save in the etcdctl container on the target node. 2. Creates a privileged ServiceAccount (SA, ClusterRole, CRB, SCC) for the Job pod. 3. Runs a Job (octosql-etcd image) against the snapshot via a HostPath volume, streaming results. The snapshot is removed on all exit paths via a deferred rm -f using a fresh context. Privileged RBAC/SCC objects are deleted via the cleanup closure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…admin endpoints Tests follow the existing e2e pattern (skipIfNotInDevelopmentEnv, adminRequest()). Covers success paths, failing jobs, stderr passthrough, 400/403 for invalid input and forbidden namespaces, and post-job cleanup verification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds admin API endpoints for etcd diagnostics on Azure Red Hat OpenShift clusters.
Which issue this PR addresses:
Fixes ARO-23146
What this PR does / why we need it:
Four new admin endpoints for etcd diagnostics without requiring direct cluster access:
POST /admin/.../exec: Streams stdout/stderr of a shell command from a named pod container. Restricted to OpenShift system namespaces.POST /admin/.../runjob: Accepts a KubernetesJobmanifest, creates it, streams pod logs, reports the final result, and cleans up. Restricted to OpenShift system namespaces.POST /admin/.../etcdkeycount: Reports the top-10 etcd key counts by Kubernetes namespace viaetcdctl get --prefix /kubernetes.io/on the named node's etcd pod.POST /admin/.../etcdanalysis: Takes an etcd snapshot via exec, runs theoctosql-etcdanalysis Job against it via a HostPath volume, streams results, and cleans up the snapshot and Job on completion or cancellation.Design decisions
etcdanalysisimage tag (quay.io/redhat_emp1/octosql-etcd:latest): Currently mutable and hosted on a personal Quay namespace; a TODO comment marks this as a supply-chain risk. The reference will move to a Microsoft/Red Hat-controlled registry and be pinned to a digest before GA.KubeExecStreamusesgolang.org/x/net/websocket:KubeExecStreamdials the Kubernetes WebSocket exec subprotocol (v4.channel.k8s.io) directly. We usegolang.org/x/net/websocketbecause the project currently pinsk8s.io/client-goto v0.25.16 via areplacedirective.remotecommand.NewWebSocketExecutor(the maintained client-go WebSocket transport) was added in v0.28;NewSPDYExecutor(the older transport) is not an option as SPDY has been removed from all supported OpenShift versions. Once the project upgrades to client-go v0.28+,KubeExecStreamcan be simplified to useremotecommand.NewWebSocketExecutor. A TODO comment marks the spot. Auth supports bothBearerTokenandBearerTokenFilerest.Config fields.Test plan for issue:
Unit tests cover all validation paths, streaming success/failure cases, and edge cases:
TestLimitedWriter(6 cases): boundary, truncation, post-truncation silence, and notice formatting.TestWaitForJobPod_ErrorBranches(4 cases): closed channel,watch.Errorwith*metav1.Status,watch.Errorwith non-Status object, context cancellation.TestWaitForJobTerminal_MaxErrors: 10 consecutiveKubeGetfailures trigger max-error exit.TestRunJobStream_CleanupFailure/TestRunJobStream_CleanupNotFound: non-NotFound delete error writes "Cleanup failed:"; 404 is treated as success and writes "Cleanup complete.".E2e tests (guarded by
skipIfNotInDevelopmentEnv) cover all four endpoints against a real cluster viaadminRequest(). All four pass against a live ARO cluster.Is there any documentation that needs to be updated for this PR?
SRE runbooks and SOPs will be written at eng.ms before GA, covering when to use each endpoint, how to interpret output, and remediation steps for common findings.
How do you know this will function as expected in production?
All new code paths have unit test coverage. Watch-before-create ordering avoids pod event races; bounded output protects RP memory; retry logic and cleanup timeouts using fresh contexts handle transient apiserver failures and ensure privileged objects are cleaned up even when the request context is cancelled. All four e2e tests pass against a live ARO cluster.