Skip to content

[Do Not Merge] Bootstrap SSH diagnostics and load balancer health for install failures#4690

Closed
ventifus wants to merge 10 commits intomasterfrom
ventifus/ARO-15916/lb-diags-on-4188
Closed

[Do Not Merge] Bootstrap SSH diagnostics and load balancer health for install failures#4690
ventifus wants to merge 10 commits intomasterfrom
ventifus/ARO-15916/lb-diags-on-4188

Conversation

@ventifus
Copy link
Copy Markdown
Collaborator

@ventifus ventifus commented Mar 18, 2026

Short-lived integration PR: #4655 (OCP 4.18.34 as E2E default) with #4268 (bootstrap SSH diagnostics and LB health) layered on top.

Commits beyond base (#4655)

  • pkg/cluster: add load balancer and bootstrap node diagnostics on install failure
  • pkg/cluster: switch LB metrics from data plane to ARM control plane API
  • pkg/cluster: fix install_test after dropping CI-unconditional commit
  • pkg/cluster/failurediagnostics: add tests for bashQuote and TOFU callback
  • pkg/cluster/failurediagnostics: fix lint and SSH command timeout handling
  • pkg/cluster/failurediagnostics: handle error from clearing SSH deadline
  • pkg/cluster/failurediagnostics: add bootkube journal to bootstrap diagnostics

What the diagnostics collect on install failure

Load balancer (ARM control plane API):

  • Full LB JSON config for each cluster LB
  • VipAvailability and DipAvailability metrics per frontend port

Bootstrap node (via SSH):

  • systemctl is-system-running / list-units
  • crictl ps --all / podman ps --all
  • ss -tlnp
  • MCS container logs (if present)
  • curl localhost:22623 to check MCS reachability
  • journalctl -u bootkube — full bootkube service log (image pulls, errors)
  • journalctl -n 100 — last 100 system-wide journal messages

@ventifus ventifus changed the base branch from tof1973/41834-as-default to master March 18, 2026 17:02
@ventifus ventifus changed the title Bootstrap SSH diagnostics and load balancer health for install failures [Do Not Merge] Bootstrap SSH diagnostics and load balancer health for install failures Mar 18, 2026
Copilot AI review requested due to automatic review settings March 18, 2026 17:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR layers install-failure diagnostics (bootstrap-node SSH command execution + internal load balancer health via Azure Monitor metrics) on top of the E2E default OpenShift version bump to 4.18.34, while centralizing SSH algorithm allow-lists for FIPS/security-baseline alignment.

Changes:

  • Set default install stream to OCP 4.18.34 and update pullspec.
  • Add Azure Monitor Metrics client wiring (RP + E2E) and implement ILB metrics logging during install failures.
  • Add bootstrap-node SSH diagnostics that JIT-configure the ILB for SSH access, run an embedded command list, and centralize SSH algorithm allow-lists in pkg/util/ssh.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/e2e/setup.go Wires Azure Monitor metrics client into E2E clientSet for diagnostics.
pkg/util/version/const.go Updates the default install stream to OCP 4.18.34.
pkg/util/ssh/algorithms.go Centralizes SSH algorithm allow-lists for reuse across components.
pkg/util/azureclient/azuresdk/armmonitor/metrics.go Adds ARM Monitor metrics client wrapper interface + constructor.
pkg/util/azureclient/azuresdk/armmonitor/generate.go Adds mock generation directives for the metrics wrapper.
pkg/util/mocks/azureclient/azuresdk/armmonitor/armmonitor.go Adds generated gomock for the metrics client interface.
pkg/portal/ssh/ssh.go Switches portal SSH server config to use centralized SSH algorithm lists.
pkg/portal/ssh/proxy.go Switches portal-to-cluster SSH client config to centralized algorithm lists.
pkg/portal/ssh/proxy_test.go Updates SSH proxy tests to use centralized algorithm lists.
pkg/cluster/cluster.go Creates/threads an Azure Monitor metrics client into the cluster manager.
pkg/cluster/gatherlogs.go Runs new install-only diagnostics on failure: ILB metrics + bootstrap SSH diag.
pkg/cluster/install_test.go Updates expected failure-diagnostics logging to include new steps.
pkg/cluster/failurediagnostics/diagnostics.go Extends diagnostics manager to accept NIC/LB/Monitor clients + TOFU host key state.
pkg/cluster/failurediagnostics/loadbalancers.go Implements ILB config dump + Dip/VipAvailability metrics logging.
pkg/cluster/failurediagnostics/loadbalancers_test.go Adds unit tests for ILB metrics logging behavior.
pkg/cluster/failurediagnostics/bootstrapnode.go Implements JIT ILB+NIC SSH access + command execution over SSH with TOFU.
pkg/cluster/failurediagnostics/bootstrapnode_test.go Adds unit tests for bootstrap SSH access configuration helpers.
pkg/cluster/failurediagnostics/scripts.go Embeds the bootstrap diagnostic command list JSON.
pkg/cluster/failurediagnostics/scripts/bootstrap-node-diag.json Defines the bootstrap diagnostic commands executed over SSH.
go.mod / go.sum Adds Azure Monitor armmonitor SDK dependency.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Tof1973 and others added 6 commits March 18, 2026 14:54
…all failure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the azmetrics batch data plane API (metrics.monitor.azure.com)
with the armmonitor single-resource ARM API for querying load balancer
health probe metrics. The data plane API requires a separate OAuth2
audience (metrics.monitor.azure.com/.default) and subscription-level
Monitoring Reader, which the FPSP may not have in customer tenants. The
ARM control plane API uses the standard ARM audience and checks RBAC at
the resource level, where the FPSP already has Owner on the managed
resource group.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The dropped commit added LogLoadBalancers and LogBootstrapNode expected
log entries to both install-path test cases. Re-add them here so they
stay with the commit that actually calls those functions.
@ventifus ventifus force-pushed the ventifus/ARO-15916/lb-diags-on-4188 branch from 2b89a32 to d316ec0 Compare March 18, 2026 21:55
ventifus and others added 2 commits March 18, 2026 15:00
…back

Add unit tests for:
- bashQuote: plain strings, empty string, embedded single quotes,
  multiple embedded single quotes
- toFUHostKeyCallback: first key accepted and recorded, same key
  accepted on second call, different key rejected on second call
…ling

Fix gofumpt struct field alignment in TestTOFUHostKeyCallback (the
linter wanted one fewer padding space on each field).

Fix SSH command timeout to be best-effort: when commandTimeout elapses,
the session is closed and sess.Run returns a non-ExitError.  The previous
code treated that as a hard failure and aborted remaining diagnostic
commands.  Now an atomic.Bool flag is set in the timer callback; on
timeout, the per-command timeout is logged and runSSHCommand returns nil
so runDiagCommands continues to the next command.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
conn.SetDeadline(time.Time{}) could theoretically fail on some net.Conn
implementations. Handle the error consistently with the initial deadline
set: close the connection and return.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 18, 2026 22:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Short-lived integration PR that layers OCP 4.18.34 E2E-default changes with additional install-failure diagnostics (bootstrap SSH and ILB health/metrics) to improve triage of MCS/22623-related install failures.

Changes:

  • Add install-only failure diagnostics: query ILB config + Azure Monitor DipAvailability/VipAvailability, and JIT-enable bootstrap SSH via ILB to run scripted remote checks.
  • Centralize FIPS-aligned SSH algorithm allow-lists in pkg/util/ssh and reuse them across portal SSH and bootstrap diagnostics.
  • Introduce an Azure Monitor ARM metrics client wrapper + mocks, and wire client creation into cluster manager and E2E setup.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/e2e/setup.go Adds ARM Monitor metrics client to E2E client set for diagnostics.
pkg/util/version/const.go Updates default local-dev install stream to OCP 4.18.34 pullspec.
pkg/util/ssh/algorithms.go New centralized SSH algorithm lists (KEX/ciphers/MACs/host/public-key).
pkg/util/mocks/azureclient/azuresdk/armmonitor/armmonitor.go Generated gomock for the new armmonitor.MetricsClient wrapper interface.
pkg/util/azureclient/azuresdk/armmonitor/metrics.go Adds wrapper interface + constructor for Azure Monitor ARM metrics client.
pkg/util/azureclient/azuresdk/armmonitor/generate.go Adds go:generate target for mock generation for armmonitor.
pkg/portal/ssh/ssh.go Switches portal SSH server config + login command algorithm selection to pkg/util/ssh.
pkg/portal/ssh/proxy_test.go Updates tests to use centralized SSH algorithm lists.
pkg/portal/ssh/proxy.go Switches downstream (portal→cluster) SSH client config algorithms to pkg/util/ssh.
pkg/cluster/install_test.go Updates expected install failure log output to include new diagnostics steps.
pkg/cluster/gatherlogs.go Wires new failure diagnostics (LB metrics + bootstrap SSH) into install-only log gathering.
pkg/cluster/failurediagnostics/scripts/bootstrap-node-diag.json Adds embedded command list for bootstrap SSH diagnostics.
pkg/cluster/failurediagnostics/scripts.go Embeds the bootstrap diagnostics JSON script.
pkg/cluster/failurediagnostics/loadbalancers_test.go Adds unit tests for ILB state + metrics logging behavior.
pkg/cluster/failurediagnostics/loadbalancers.go Implements ILB config logging and Azure Monitor metrics querying/segmented logging.
pkg/cluster/failurediagnostics/diagnostics.go Extends diagnostics manager with armnetwork + armmonitor clients and bootstrap TOFU host key.
pkg/cluster/failurediagnostics/bootstrapnode_test.go Adds unit tests for ILB bootstrap SSH setup helpers and TOFU/quoting helpers.
pkg/cluster/failurediagnostics/bootstrapnode.go Implements ILB reconfiguration + bootstrap SSH connection + command execution with timeouts.
pkg/cluster/cluster.go Creates ARM Monitor metrics client (best-effort) and stores it on the cluster manager.
go.sum Adds checksums for the Azure Monitor ARM SDK dependency.
go.mod Adds github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/monitor/armmonitor.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ventifus
Copy link
Copy Markdown
Collaborator Author

/azp run ci

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

…gnostics

Add journalctl commands to the bootstrap node SSH diagnostics:
- `journalctl -u bootkube`: captures the full bootkube service log,
  showing image pull progress and any errors before containers start
- `journalctl -n 100`: captures the 100 most recent system-wide journal
  messages, useful when bootkube logs are sparse or missing

These were identified as gaps while investigating a CI failure where
bootkube was running but had launched zero CRI containers after 16+
minutes, making it impossible to determine what was blocking progress.
@ventifus
Copy link
Copy Markdown
Collaborator Author

  1. bootkube.sh announces what it's about to do (line 2):
    Mar 19 16:19:53 ... bootkube.sh[4211]: Rendering auth api manifests...

  2. podman creates the auth-api-render container with the authentication operator image (line 3):
    container create ... name=auth-api-render,
    image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9bd818e...
    name=openshift/ose-cluster-authentication-rhel9-operator,
    version=v4.18.0,
    release=202602132343.p2.g882f879.assembly.stream.el9,
    build-date=2026-02-14T00:56:01Z,
    io.openshift.build.commit.id=882f8799b8a550605e4812479272691de3f5c0d2
    The image is ose-cluster-authentication-rhel9-operator from the 4.18 payload, built 2026-02-14 at commit 882f879.

  3. The container runs, immediately emits the error (line 8):
    Mar 19 16:19:55 ... auth-api-render[5714]: Error: unknown command "render" for "authentication-operator"

  4. Container dies 310ms after it started (line 9):
    container died ... name=auth-api-render

  5. Repeats identically on the next bootkube restart (lines 11, 17, 20, 26...):
    Same pattern, every ~14 seconds, for 77 iterations.

bootkube.sh calls authentication-operator render, the ose-cluster-authentication-rhel9-operator image built at commit 882f879 on 2026-02-14 does not have a render subcommand, and every bootkube cycle crashes at this step. This is why no masters ever pulled ignition and the cluster never formed.

The auth operator commit 882f879 in the 4.18 payload predates the auth-api-bootstrap feature being added to the installer (Feb 3, 2025 in the installer repo). The fix needs to go to the cluster-authentication-operator 4.18 branch to add the render subcommand — or the auth-api-bootstrap stage in bootkube.sh needs to be conditionalized/backported correctly to 4.18.

@ventifus ventifus closed this Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants