OCPBUGS-94187: Backport HA topology deployment scaling to release-4.22 #758

Open
tmshort wants to merge 3 commits into openshift:release-4.22 from tmshort:ocpbugs-94187-release-4.22

tmshort commented Jun 30, 2026

Summary

Backport of OCPBUGS-62517 fix to release-4.22.

Bug: During OCP upgrades, ClusterOperator olm goes Available=False because the operator-controller and catalogd deployments each run with only 1 replica. When the single pod is replaced during a rolling update, there is a brief window where no pod is available.

Fix: For HighlyAvailable topologies, cluster-olm-operator overrides replicas to 2 and enables a PodDisruptionBudget with minAvailable: 1. Pod anti-affinity spreads replicas across different nodes.
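
The replica count and minAvailable value interact through simple arithmetic. The following is a minimal sketch, assuming the usual Kubernetes rule that a PDB with an integer minAvailable permits healthy pods minus minAvailable voluntary disruptions, floored at zero. The function name is hypothetical and is not part of this PR.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions that a PDB with an integer minAvailable would permit."""
    return max(0, healthy_pods - min_available)


# HA topology after this fix: 2 replicas with minAvailable: 1.
# One pod can be evicted while the other keeps serving.
ha = disruptions_allowed(healthy_pods=2, min_available=1)

# The same PDB on a single-replica deployment would permit no voluntary
# evictions, which is plausibly one reason the static default leaves the
# PDB disabled (an inference, not stated in the PR).
single = disruptions_allowed(healthy_pods=1, min_available=1)

print(ha, single)
```

With two replicas the budget leaves room for exactly one eviction at a time, which matches the goal of always keeping one pod available during a drain.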

Commits

  1. UPSTREAM: <carry>: OCPBUGS-62517: Set replicas=1, PDB, and pod anti-affinity for HA topology

    • Adds replicas: 1 default to openshift/helm/catalogd.yaml and openshift/helm/operator-controller.yaml (cluster-olm-operator overrides to 2 for HA)
    • Adds podDisruptionBudget.enabled: false default (cluster-olm-operator enables for HA)
    • Adds pod anti-affinity (preferredDuringSchedulingIgnoredDuringExecution, weight: 100, by hostname) to manifests
  2. UPSTREAM: <carry>: use AlwaysAllow UnhealthyPodEvictionPolicy option in PDBs (#2688)

    • Adds unhealthyPodEvictionPolicy: AlwaysAllow to PDB Helm templates
    • Allows eviction of unhealthy pods during node drain even when PDB would otherwise block it
  3. UPSTREAM: <carry>: add OLMv1 topology-based deployment scaling e2e test

    • Verifies HA topologies get replicas=2 + PDB, non-HA topologies keep replicas=1 + no PDB

Test plan

  • Verify CI passes for upgrade jobs
  • Check that periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade passes the "clusteroperator/olm should not change condition/Available" test
  • On HA cluster: operator-controller and catalogd deployments have 2 replicas and a PDB
  • On SNO cluster: replicas=1, no PDB (non-HA behavior unchanged)

Paired PR: cluster-olm-operator release-4.22 (HA topology detection and scaling logic)

Jira: https://redhat.atlassian.net/browse/OCPBUGS-94187

🤖 Generated with Claude Code

tmshort and others added 2 commits June 30, 2026 15:06
…ffinity for HA topology

Rolling updates in HighlyAvailable clusters leave catalogd and
operator-controller unavailable when the only running pod is evicted
before its replacement is ready.

Fix by defaulting replicas=1 and PDB disabled in the static Helm values
(safe for SNO/External topologies, passes the SNO conformance test that
asserts exactly one replica in SingleReplica topology mode). Add pod
anti-affinity to prefer scheduling replicas on different nodes.

cluster-olm-operator detects the cluster's ControlPlaneTopology at
startup and overrides these values to replicas=2 and PDB enabled when a
HighlyAvailable topology is detected, then re-renders the manifests
before starting controllers. When a topology change is observed at
runtime (exceedingly rare), the operator exits so its deployment
controller restarts it, triggering a fresh Helm render with the correct
values for the new topology.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
…in PDBs (#2688)

Allow eviction of unhealthy (not ready) pods even if there are no disruptions
allowed on a PodDisruptionBudget. This can help to drain/maintain a node and
recover without a manual intervention when multiple instances of nodes or pods
are misbehaving.

Upstream commit: 869124a
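
The effect of the policy described in this commit message can be sketched as a small decision function. This is a simplified model of the documented Kubernetes eviction semantics, not the API server's code: the PdbStatus class and may_evict function are hypothetical names, and "unhealthy" is reduced to a single ready flag.

```python
from dataclasses import dataclass


@dataclass
class PdbStatus:
    """Simplified stand-in for the currentHealthy/desiredHealthy status fields."""
    current_healthy: int
    desired_healthy: int

    @property
    def disruptions_allowed(self) -> int:
        return max(0, self.current_healthy - self.desired_healthy)


def may_evict(pod_ready: bool, policy: str, status: PdbStatus) -> bool:
    """Decide whether an eviction request for one pod would be admitted."""
    if pod_ready:
        # Healthy pods are always subject to the disruption budget.
        return status.disruptions_allowed > 0
    if policy == "AlwaysAllow":
        # Unhealthy pods can be evicted regardless of the budget.
        return True
    if policy == "IfHealthyBudget":
        # Unhealthy pods can be evicted only if the application is not disrupted.
        return status.current_healthy >= status.desired_healthy
    raise ValueError(f"unknown unhealthyPodEvictionPolicy: {policy}")


# Both replicas are not ready, so the budget (minAvailable: 1) is already violated.
degraded = PdbStatus(current_healthy=0, desired_healthy=1)
print(may_evict(False, "IfHealthyBudget", degraded))
print(may_evict(False, "AlwaysAllow", degraded))
```

In the degraded case, the default policy would refuse the eviction and stall the node drain, while AlwaysAllow lets the unhealthy pod be evicted so the node can be serviced.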
openshift-ci-robot added the jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) and jira/valid-bug (indicates that a referenced Jira bug is valid for the branch this PR is targeting) labels Jun 30, 2026
@openshift-ci-robot

@tmshort: This pull request references Jira Issue OCPBUGS-94187, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note type set to "Release Note Not Required"
  • dependent bug Jira Issue OCPBUGS-62517 is in the state Verified, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Jira Issue OCPBUGS-62517 targets the "5.0.0" version, which is one of the valid target versions: 5.0.0
  • bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Jira (jkeister@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

openshift-ci Bot requested review from bentito and fgiudici June 30, 2026 19:17
openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jun 30, 2026
coderabbitai Bot commented Jun 30, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

tmshort commented Jun 30, 2026

/test e2e-gcp-ovn-upgrade

openshift-ci Bot commented Jun 30, 2026

@tmshort: This PR was included in a payload test run from openshift/cluster-olm-operator#215
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b3ce9cb0-74b9-11f1-8508-8fb8e861b82f-0

tmshort commented Jul 1, 2026

/retest

tmshort commented Jul 1, 2026

These tests will fail until openshift/cluster-olm-operator#215 is merged.

openshift-ci Bot commented Jul 1, 2026

@tmshort: This PR was included in a payload test run from openshift/cluster-olm-operator#215
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a4ca57d0-7500-11f1-812b-c867d5ec9129-0

tmshort commented Jul 1, 2026

Adds a new test that verifies cluster-olm-operator correctly configures
operator-controller and catalogd deployments based on the cluster's
control plane topology:
- HA topologies (HighlyAvailable, HighlyAvailableArbiter, DualReplica):
  replicas=2 with a PodDisruptionBudget present
- Non-HA topologies (SingleReplica/SNO, External): replicas=1, no PDB

Also registers policyv1 in the test scheme to support PDB list queries.
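
The expectations this test encodes can be summarized as a lookup from control plane topology to Helm value overrides. The following is a Python illustration only: the real logic lives in cluster-olm-operator, and the function name and the nested dictionary shape are assumptions. The topology names and the replicas and podDisruptionBudget.enabled values come from this PR.

```python
# Topologies this PR treats as highly available.
HA_TOPOLOGIES = frozenset({"HighlyAvailable", "HighlyAvailableArbiter", "DualReplica"})


def helm_overrides(control_plane_topology: str) -> dict:
    """Return the value overrides expected for a given ControlPlaneTopology.

    Hypothetical helper: the nesting of the returned dict is assumed.
    """
    if control_plane_topology in HA_TOPOLOGIES:
        return {"replicas": 2, "podDisruptionBudget": {"enabled": True}}
    # SingleReplica (SNO), External, and anything unrecognized keep the static defaults.
    return {"replicas": 1, "podDisruptionBudget": {"enabled": False}}


for topology in ("HighlyAvailable", "DualReplica", "SingleReplica", "External"):
    print(topology, helm_overrides(topology))
```

Falling back to the static defaults for unrecognized topologies is a design choice in this sketch; the PR does not say how the operator handles values outside the five listed.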

Assisted-by: claude
Signed-off-by: Todd Short <tshort@redhat.com>
tmshort force-pushed the ocpbugs-94187-release-4.22 branch from 1d73978 to a786e3a July 1, 2026 13:30
tmshort commented Jul 1, 2026

/test default-catalog-consistency

pedjak commented Jul 1, 2026

/lgtm

openshift-ci Bot added the lgtm label (indicates that a PR is ready to be merged) Jul 1, 2026

rashmigottipati left a comment

/lgtm
/approve

openshift-ci Bot commented Jul 1, 2026

@tmshort: all tests passed!

openshift-ci Bot commented Jul 1, 2026

@tmshort: This PR was included in a payload test run from openshift/cluster-olm-operator#215
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c8134f30-7578-11f1-8e7d-1d1eca22b8ec-0

tmshort commented Jul 2, 2026

rashmigottipati left a comment

/verified by @rashmigottipati

openshift-ci-robot added the verified label (signifies that the PR passed pre-merge verification criteria) Jul 2, 2026
@openshift-ci-robot

@rashmigottipati: This PR has been marked as verified by @rashmigottipati.

openshift-ci Bot commented Jul 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rashmigottipati, tmshort
