
MGMT-23191: Ensure MCO configs are present before rebooting the bootstrap node#2039

Open
pastequo wants to merge 1 commit into openshift:master from pastequo:fix/wait-for-nodes-to-be-mco-healthy

Conversation

@pastequo
Contributor

@pastequo pastequo commented Mar 3, 2026

This work is based on https://github.com/glennswest/assisted-installer

Ensure MCO configs are present before rebooting the bootstrap node

cc @glennswest

Summary by CodeRabbit

  • New Features

    • Added bootstrap-safe synchronization to ensure MachineConfig annotations are consistent across all master nodes before reboot, improving cluster stability during node configuration changes.
  • Chores

    • Updated Kubernetes API client to support newer API versions for improved compatibility.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 3, 2026
@openshift-ci-robot

openshift-ci-robot commented Mar 3, 2026

@pastequo: This pull request references MGMT-23191 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Details

In response to this:

This work is based on https://github.com/glennswest/assisted-installer

Ensure MCO configs are present before rebooting the bootstrap node

cc @glennswest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Mar 3, 2026

Walkthrough

The changes implement a bootstrap-safe synchronization mechanism for MachineConfig annotation consistency. When bootstrapping or with single control plane nodes, the installer waits for all master nodes to converge on consistent MachineConfig annotations (currentConfig, desiredConfig, and state) before proceeding with reboot. Supporting changes expand the k8s_client interface to retrieve MachineConfig objects and update API version references.

Changes

Cohort / File(s) Summary
MC Annotation Synchronization
src/installer/installer.go, src/installer/installer_test.go
Introduces waitForMCAnnotationsConsistent() method to poll and validate MachineConfig annotation consistency across master nodes during bootstrap. Adds three new annotation constant definitions and integrates wait mechanism into InstallNode flow. Includes test helper to verify annotation checks occur at intended points in bootstrap/master/worker sequences. Note: MC annotation constants are duplicated in two locations within installer.go.
K8S Client Enhancement
src/k8s_client/k8s_client.go, src/k8s_client/mock_k8s_client.go
Adds GetMachineConfig() method to K8SClient interface and implementation. Reworks NewK8SClient to unify scheme setup and runtime client creation, expanding scheme with metal3v1alpha1, machinev1beta1, and mcfgv1 types. Updates mock client method signatures to use newer Kubernetes API versions (v13 for core resources, v12 for CSR/Event, v11 for Jobs) and adds corresponding GetMachineConfig mock method.

Sequence Diagram

sequenceDiagram
    participant Installer as Installer
    participant K8SClient as K8S Client
    participant API as Kubernetes API
    
    Installer->>Installer: During bootstrap/single CP node
    Installer->>K8SClient: waitForMCAnnotationsConsistent()
    loop Until consistent
        K8SClient->>API: ListNodesByRole("master")
        API-->>K8SClient: Master node list
        K8SClient->>API: GetMachineConfig(name) for each node
        API-->>K8SClient: MachineConfig objects
        K8SClient->>K8SClient: Validate annotations exist<br/>and reference valid MachineConfigs
        K8SClient->>K8SClient: Check all nodes converged<br/>to same desiredConfig
        Note over K8SClient: Continue polling if inconsistent
    end
    K8SClient-->>Installer: Success (consistent state)
    Installer->>Installer: Proceed with reboot
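The consistency check described in the walkthrough and diagram above can be sketched in Go. This is a simplified model, not the PR's actual code: the annotation keys are the standard MCO node annotations, the `"Done"` state value follows MCO convention, and `annotationsConsistent` is a hypothetical stand-in for the predicate inside `waitForMCAnnotationsConsistent` (the real implementation also verifies that the referenced MachineConfig objects exist via `GetMachineConfig`).

```go
package main

import "fmt"

// Standard MCO node annotation keys (assumed to match the constants the PR adds).
const (
	mcCurrentConfigAnnotation = "machineconfiguration.openshift.io/currentConfig"
	mcDesiredConfigAnnotation = "machineconfiguration.openshift.io/desiredConfig"
	mcStateAnnotation         = "machineconfiguration.openshift.io/state"
)

// annotationsConsistent reports whether every master node carries all three
// MCO annotations, has finished applying its config (state "Done" and
// currentConfig == desiredConfig), and agrees with every other master on
// the desiredConfig name.
func annotationsConsistent(nodes []map[string]string) bool {
	if len(nodes) == 0 {
		return false
	}
	reference := ""
	for _, ann := range nodes {
		current := ann[mcCurrentConfigAnnotation]
		desired := ann[mcDesiredConfigAnnotation]
		state := ann[mcStateAnnotation]
		if current == "" || desired == "" || state != "Done" || current != desired {
			return false
		}
		if reference == "" {
			reference = desired
		} else if desired != reference {
			return false
		}
	}
	return true
}

func main() {
	converged := []map[string]string{
		{mcCurrentConfigAnnotation: "rendered-master-12345", mcDesiredConfigAnnotation: "rendered-master-12345", mcStateAnnotation: "Done"},
		{mcCurrentConfigAnnotation: "rendered-master-12345", mcDesiredConfigAnnotation: "rendered-master-12345", mcStateAnnotation: "Done"},
	}
	updating := []map[string]string{
		{mcCurrentConfigAnnotation: "rendered-master-11111", mcDesiredConfigAnnotation: "rendered-master-12345", mcStateAnnotation: "Working"},
	}
	fmt.Println(annotationsConsistent(converged)) // true: all masters done on the same config
	fmt.Println(annotationsConsistent(updating))  // false: node still applying a new config
}
```

The installer would run this predicate inside a polling loop, only proceeding to reboot once it returns true.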

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name | Status | Explanation | Resolution
Test Structure And Quality | ❓ Inconclusive | PR describes test additions for waitForMCAnnotationsConsistent, but the actual test files do not contain these changes. | Ensure the PR code is available in the repository to assess test structure, setup/cleanup patterns, timeouts, assertions, and codebase consistency.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Title Check | ✅ Passed | The PR title clearly describes the main objective: ensuring MCO configs are present before rebooting the bootstrap node, which aligns with the core changes adding bootstrap-safe synchronization and MC annotation consistency checks.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Stable And Deterministic Test Names | ✅ Passed | All test names added in the PR are static and deterministic, with no dynamic information that changes between runs.



Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 3, 2026
@openshift-ci-robot

openshift-ci-robot commented Mar 3, 2026

@pastequo: This pull request references MGMT-23191 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Details

In response to this:

This work is based on https://github.com/glennswest/assisted-installer

Ensure MCO configs are present before rebooting the bootstrap node

cc @glennswest

Summary by CodeRabbit

  • New Features

  • Added bootstrap-safe synchronization to ensure MachineConfig annotations are consistent across all master nodes before reboot, improving cluster stability during node configuration changes.

  • Chores

  • Updated Kubernetes API client to support newer API versions for improved compatibility.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci bot commented Mar 3, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pastequo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2026

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
src/installer/installer.go (1)

929-985: Bound the new MC-annotation wait to avoid indefinite install hangs.

Line 929 uses waitForeverTimeout; if masters never converge, installation can stall indefinitely before finalize/reboot.

♻️ Suggested timeout guard
 const (
 	InstallDir                   = "/opt/install-dir"
 	KubeconfigPath               = "/opt/openshift/auth/kubeconfig"
@@
 	registryDataDirOnMedia       = "/run/media/iso/registry"
+	waitForMCAnnotationsTimeout  = 30 * time.Minute
 )
@@
 func (i *installer) waitForMCAnnotationsConsistent(ctx context.Context, kc k8s_client.K8SClient) error {
 	i.log.Info("Waiting for MachineConfig annotations to be consistent on all master nodes")
 
-	return utils.WaitForPredicateWithContext(ctx, waitForeverTimeout, generalWaitInterval, func() bool {
+	return utils.WaitForPredicateWithContext(ctx, waitForMCAnnotationsTimeout, generalWaitInterval, func() bool {
 		nodes, err := kc.ListNodesByRole("master")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/installer/installer.go` around lines 929 - 985, The wait loop in the
method calling WaitForPredicateWithContext currently uses waitForeverTimeout and
can hang indefinitely; change the timeout argument to a bounded duration (e.g.,
introduce a new constant like mcAnnotationsWaitTimeout or use an existing
install timeout) and make it configurable if appropriate, then pass that bounded
timeout into WaitForPredicateWithContext instead of waitForeverTimeout; update
any related tests or callers and keep the predicate logic (which references
kc.ListNodesByRole and annotations mcCurrentConfigAnnotation,
mcDesiredConfigAnnotation, mcStateAnnotation and calls kc.GetMachineConfig)
unchanged.
src/installer/installer_test.go (1)

146-167: Tighten MC lookup assertions to prevent false-positive test passes.

The helper currently expects MinTimes(1) for both ListNodesByRole and GetMachineConfig, which allows regressions to slip through. The actual implementation calls ListNodesByRole("master") once and GetMachineConfig twice (for currentConfig and desiredConfig). With one node in the test data, these should be pinned to exact call counts and the config parameter should be verified.

Suggested stricter expectations
-				mockk8sclient.EXPECT().ListNodesByRole("master").Return(masterNodesWithMCAnnotations, nil).MinTimes(1)
-				mockk8sclient.EXPECT().GetMachineConfig(gomock.Any(), gomock.Any()).Return(&mcfgv1.MachineConfig{}, nil).MinTimes(1)
+				mockk8sclient.EXPECT().ListNodesByRole("master").Return(masterNodesWithMCAnnotations, nil).Times(1)
+				mockk8sclient.EXPECT().GetMachineConfig(gomock.Any(), "rendered-master-12345").Return(&mcfgv1.MachineConfig{}, nil).Times(2)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/installer/installer_test.go` around lines 146 - 167, In
waitForMCAnnotationsConsistentSuccess tighten the mock expectations: replace
MinTimes(1) on mockk8sclient.ListNodesByRole("master") with an exact single call
expectation (Times(1)) and change the GetMachineConfig expectation to expect
exactly two calls (Times(2)); additionally assert the GetMachineConfig parameter
equals the config names from the test node annotations ("rendered-master-12345")
(use gomock.Eq or equivalent) so each of the currentConfig and desiredConfig
lookups are verified.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/installer/installer_test.go`:
- Around line 146-167: In waitForMCAnnotationsConsistentSuccess tighten the mock
expectations: replace MinTimes(1) on mockk8sclient.ListNodesByRole("master")
with an exact single call expectation (Times(1)) and change the GetMachineConfig
expectation to expect exactly two calls (Times(2)); additionally assert the
GetMachineConfig parameter equals the config names from the test node
annotations ("rendered-master-12345") (use gomock.Eq or equivalent) so each of
the currentConfig and desiredConfig lookups are verified.

In `@src/installer/installer.go`:
- Around line 929-985: The wait loop in the method calling
WaitForPredicateWithContext currently uses waitForeverTimeout and can hang
indefinitely; change the timeout argument to a bounded duration (e.g., introduce
a new constant like mcAnnotationsWaitTimeout or use an existing install timeout)
and make it configurable if appropriate, then pass that bounded timeout into
WaitForPredicateWithContext instead of waitForeverTimeout; update any related
tests or callers and keep the predicate logic (which references
kc.ListNodesByRole and annotations mcCurrentConfigAnnotation,
mcDesiredConfigAnnotation, mcStateAnnotation and calls kc.GetMachineConfig)
unchanged.

ℹ️ Review info

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 72bf418 and 2e9624a.

⛔ Files ignored due to path filters (4)
  • vendor/modules.txt is excluded by !**/vendor/**, !vendor/**
  • vendor/sigs.k8s.io/controller-runtime/pkg/client/config/config.go is excluded by !**/vendor/**, !vendor/**
  • vendor/sigs.k8s.io/controller-runtime/pkg/client/config/doc.go is excluded by !**/vendor/**, !vendor/**
  • vendor/sigs.k8s.io/controller-runtime/pkg/internal/log/log.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (4)
  • src/installer/installer.go
  • src/installer/installer_test.go
  • src/k8s_client/k8s_client.go
  • src/k8s_client/mock_k8s_client.go

@codecov

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 25.00000% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.34%. Comparing base (9a9222e) to head (2e9624a).
⚠️ Report is 5 commits behind head on master.

Files with missing lines | Patch % | Lines
src/installer/installer.go | 38.09% | 18 missing and 8 partials ⚠️
src/k8s_client/k8s_client.go | 0.00% | 22 missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2039      +/-   ##
==========================================
- Coverage   48.48%   48.34%   -0.15%     
==========================================
  Files          20       20              
  Lines        4333     4379      +46     
==========================================
+ Hits         2101     2117      +16     
- Misses       2011     2033      +22     
- Partials      221      229       +8     
Files with missing lines | Coverage Δ
src/k8s_client/k8s_client.go | 0.00% <0.00%> (ø)
src/installer/installer.go | 67.09% <38.09%> (-1.88%) ⬇️

@pastequo
Contributor Author

pastequo commented Mar 4, 2026

/retest

@pastequo
Contributor Author

pastequo commented Mar 6, 2026

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 6, 2026
@openshift-ci

openshift-ci bot commented Mar 11, 2026

@pastequo: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
