[Do Not Merge] Bootstrap SSH diagnostics and load balancer health for install failures#4690
[Do Not Merge] Bootstrap SSH diagnostics and load balancer health for install failures#4690
Conversation
There was a problem hiding this comment.
Pull request overview
This PR layers install-failure diagnostics (bootstrap-node SSH command execution + internal load balancer health via Azure Monitor metrics) on top of the E2E default OpenShift version bump to 4.18.34, while centralizing SSH algorithm allow-lists for FIPS/security-baseline alignment.
Changes:
- Set default install stream to OCP 4.18.34 and update pullspec.
- Add Azure Monitor Metrics client wiring (RP + E2E) and implement ILB metrics logging during install failures.
- Add bootstrap-node SSH diagnostics that JIT-configure the ILB for SSH access, run an embedded command list, and centralize SSH algorithm allow-lists in
pkg/util/ssh.
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/e2e/setup.go | Wires Azure Monitor metrics client into E2E clientSet for diagnostics. |
| pkg/util/version/const.go | Updates the default install stream to OCP 4.18.34. |
| pkg/util/ssh/algorithms.go | Centralizes SSH algorithm allow-lists for reuse across components. |
| pkg/util/azureclient/azuresdk/armmonitor/metrics.go | Adds ARM Monitor metrics client wrapper interface + constructor. |
| pkg/util/azureclient/azuresdk/armmonitor/generate.go | Adds mock generation directives for the metrics wrapper. |
| pkg/util/mocks/azureclient/azuresdk/armmonitor/armmonitor.go | Adds generated gomock for the metrics client interface. |
| pkg/portal/ssh/ssh.go | Switches portal SSH server config to use centralized SSH algorithm lists. |
| pkg/portal/ssh/proxy.go | Switches portal-to-cluster SSH client config to centralized algorithm lists. |
| pkg/portal/ssh/proxy_test.go | Updates SSH proxy tests to use centralized algorithm lists. |
| pkg/cluster/cluster.go | Creates/threads an Azure Monitor metrics client into the cluster manager. |
| pkg/cluster/gatherlogs.go | Runs new install-only diagnostics on failure: ILB metrics + bootstrap SSH diag. |
| pkg/cluster/install_test.go | Updates expected failure-diagnostics logging to include new steps. |
| pkg/cluster/failurediagnostics/diagnostics.go | Extends diagnostics manager to accept NIC/LB/Monitor clients + TOFU host key state. |
| pkg/cluster/failurediagnostics/loadbalancers.go | Implements ILB config dump + Dip/VipAvailability metrics logging. |
| pkg/cluster/failurediagnostics/loadbalancers_test.go | Adds unit tests for ILB metrics logging behavior. |
| pkg/cluster/failurediagnostics/bootstrapnode.go | Implements JIT ILB+NIC SSH access + command execution over SSH with TOFU. |
| pkg/cluster/failurediagnostics/bootstrapnode_test.go | Adds unit tests for bootstrap SSH access configuration helpers. |
| pkg/cluster/failurediagnostics/scripts.go | Embeds the bootstrap diagnostic command list JSON. |
| pkg/cluster/failurediagnostics/scripts/bootstrap-node-diag.json | Defines the bootstrap diagnostic commands executed over SSH. |
| go.mod / go.sum | Adds Azure Monitor armmonitor SDK dependency. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…all failure Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the azmetrics batch data plane API (metrics.monitor.azure.com) with the armmonitor single-resource ARM API for querying load balancer health probe metrics. The data plane API requires a separate OAuth2 audience (metrics.monitor.azure.com/.default) and subscription-level Monitoring Reader, which the FPSP may not have in customer tenants. The ARM control plane API uses the standard ARM audience and checks RBAC at the resource level, where the FPSP already has Owner on the managed resource group. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The dropped commit added LogLoadBalancers and LogBootstrapNode expected log entries to both install-path test cases. Re-add them here so they stay with the commit that actually calls those functions.
2b89a32 to
d316ec0
Compare
…back Add unit tests for: - bashQuote: plain strings, empty string, embedded single quotes, multiple embedded single quotes - toFUHostKeyCallback: first key accepted and recorded, same key accepted on second call, different key rejected on second call
…ling Fix gofumpt struct field alignment in TestTOFUHostKeyCallback (the linter wanted one fewer padding space on each field). Fix SSH command timeout to be best-effort: when commandTimeout elapses, the session is closed and sess.Run returns a non-ExitError. The previous code treated that as a hard failure and aborted remaining diagnostic commands. Now an atomic.Bool flag is set in the timer callback; on timeout, the per-command timeout is logged and runSSHCommand returns nil so runDiagCommands continues to the next command. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
conn.SetDeadline(time.Time{}) could theoretically fail on some net.Conn
implementations. Handle the error consistently with the initial deadline
set: close the connection and return.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Short-lived integration PR that layers OCP 4.18.34 E2E-default changes with additional install-failure diagnostics (bootstrap SSH and ILB health/metrics) to improve triage of MCS/22623-related install failures.
Changes:
- Add install-only failure diagnostics: query ILB config + Azure Monitor
DipAvailability/VipAvailability, and JIT-enable bootstrap SSH via ILB to run scripted remote checks. - Centralize FIPS-aligned SSH algorithm allow-lists in
pkg/util/sshand reuse them across portal SSH and bootstrap diagnostics. - Introduce an Azure Monitor ARM metrics client wrapper + mocks, and wire client creation into cluster manager and E2E setup.
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/e2e/setup.go | Adds ARM Monitor metrics client to E2E client set for diagnostics. |
| pkg/util/version/const.go | Updates default local-dev install stream to OCP 4.18.34 pullspec. |
| pkg/util/ssh/algorithms.go | New centralized SSH algorithm lists (KEX/ciphers/MACs/host/public-key). |
| pkg/util/mocks/azureclient/azuresdk/armmonitor/armmonitor.go | Generated gomock for the new armmonitor.MetricsClient wrapper interface. |
| pkg/util/azureclient/azuresdk/armmonitor/metrics.go | Adds wrapper interface + constructor for Azure Monitor ARM metrics client. |
| pkg/util/azureclient/azuresdk/armmonitor/generate.go | Adds go:generate target for mock generation for armmonitor. |
| pkg/portal/ssh/ssh.go | Switches portal SSH server config + login command algorithm selection to pkg/util/ssh. |
| pkg/portal/ssh/proxy_test.go | Updates tests to use centralized SSH algorithm lists. |
| pkg/portal/ssh/proxy.go | Switches downstream (portal→cluster) SSH client config algorithms to pkg/util/ssh. |
| pkg/cluster/install_test.go | Updates expected install failure log output to include new diagnostics steps. |
| pkg/cluster/gatherlogs.go | Wires new failure diagnostics (LB metrics + bootstrap SSH) into install-only log gathering. |
| pkg/cluster/failurediagnostics/scripts/bootstrap-node-diag.json | Adds embedded command list for bootstrap SSH diagnostics. |
| pkg/cluster/failurediagnostics/scripts.go | Embeds the bootstrap diagnostics JSON script. |
| pkg/cluster/failurediagnostics/loadbalancers_test.go | Adds unit tests for ILB state + metrics logging behavior. |
| pkg/cluster/failurediagnostics/loadbalancers.go | Implements ILB config logging and Azure Monitor metrics querying/segmented logging. |
| pkg/cluster/failurediagnostics/diagnostics.go | Extends diagnostics manager with armnetwork + armmonitor clients and bootstrap TOFU host key. |
| pkg/cluster/failurediagnostics/bootstrapnode_test.go | Adds unit tests for ILB bootstrap SSH setup helpers and TOFU/quoting helpers. |
| pkg/cluster/failurediagnostics/bootstrapnode.go | Implements ILB reconfiguration + bootstrap SSH connection + command execution with timeouts. |
| pkg/cluster/cluster.go | Creates ARM Monitor metrics client (best-effort) and stores it on the cluster manager. |
| go.sum | Adds checksums for the Azure Monitor ARM SDK dependency. |
| go.mod | Adds github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/monitor/armmonitor. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/azp run ci |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…gnostics Add journalctl commands to the bootstrap node SSH diagnostics: - `journalctl -u bootkube`: captures the full bootkube service log, showing image pull progress and any errors before containers start - `journalctl -n 100`: captures the 100 most recent system-wide journal messages, useful when bootkube logs are sparse or missing These were identified as gaps while investigating a CI failure where bootkube was running but had launched zero CRI containers after 16+ minutes, making it impossible to determine what was blocking progress.
bootkube.sh calls authentication-operator render, the ose-cluster-authentication-rhel9-operator image built at commit 882f879 on 2026-02-14 does not have a render subcommand, and every bootkube cycle crashes at this step. This is why no masters ever pulled ignition and the cluster never formed. The auth operator commit 882f879 in the 4.18 payload predates the auth-api-bootstrap feature being added to the installer (Feb 3, 2025 in the installer repo). The fix needs to go to the cluster-authentication-operator 4.18 branch to add the render subcommand — or the auth-api-bootstrap stage in bootkube.sh needs to be conditionalized/backported correctly to 4.18. |
Short-lived integration PR: #4655 (OCP 4.18.34 as E2E default) with #4268 (bootstrap SSH diagnostics and LB health) layered on top.
Commits beyond base (#4655)
pkg/cluster: add load balancer and bootstrap node diagnostics on install failurepkg/cluster: switch LB metrics from data plane to ARM control plane APIpkg/cluster: fix install_test after dropping CI-unconditional commitpkg/cluster/failurediagnostics: add tests for bashQuote and TOFU callbackpkg/cluster/failurediagnostics: fix lint and SSH command timeout handlingpkg/cluster/failurediagnostics: handle error from clearing SSH deadlinepkg/cluster/failurediagnostics: add bootkube journal to bootstrap diagnosticsWhat the diagnostics collect on install failure
Load balancer (ARM control plane API):
Bootstrap node (via SSH):
systemctl is-system-running/list-unitscrictl ps --all/podman ps --allss -tlnpcurl localhost:22623to check MCS reachabilityjournalctl -u bootkube— full bootkube service log (image pulls, errors)journalctl -n 100— last 100 system-wide journal messages