Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
337 changes: 337 additions & 0 deletions enhancements/ha-policy-management/ha-policy-management.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,337 @@
title: ha-policy-management
authors:
- "@nahorigu"
reviewers:
- "@dmesser"
- "@dgoodiwn"
approvers:
- TBD
api-approvers: None
creation-date: 2026-01-29
last-updated: 2026-01-29
tracking-link: N/A
see-also:
- https://issues.redhat.com/browse/OCPSTRAT-2649
replaces: N/A
superseded-by: N/A
---

# HA Policy Management

## Summary

This enhancement improves guideline compliance checks within the CI process
(the Red Hat-internal pipeline for OpenShift) to improve overall HA.
Specifically, it integrates a mechanism to evaluate HA levels based on
implementation status and developers' input. By notifying developers of
non-compliant components, the management process encourages developers to follow the
guidelines. All data will be stored in a common repository, allowing both
developers and partners to grasp the overall HA status early and easily.

## Motivation

A service on an OpenShift cluster relies on multiple components, so overall
cluster availability depends on the product of each component's availability.
If any component in the dependency chain lacks high availability (HA),
total availability is degraded.

Currently, HA implementation is often left to developers’ discretion,
leading to inconsistent or insufficient HA configurations.
Although general guidelines exist ([CONVENTIONS.md](https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability)).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it fair to say this is the set of conventions you want to enforce with this framework? Are there additional items you would like added? If so I would suggest a PR to that linked enhancement. It helps to have agreed conventions before we start enforcing.

they are often overlooked and overall conformance status is unknown.
As a result, HA levels can be inconsistent across components, especially
when new components are added or existing ones undergo major changes.

Therefore, an automated checking process and enforcement mechanism are needed.
This proposal aims to introduce a cluster-wide mechanism to ensure consistent
HA implementation across components.

### User Stories

* As an OpenShift Product Manager, I want a clear overview of HA
implementation status across components, so I can identify issues
from overall HA quality earlier.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you define the list of statuses you envision? Is it just compliant and non-compliant? What other levels/statuses do you envision?

* As an OpenShift component developer, I want to be alerted to any HA gaps
and have the opportunity to explain why HA is lacking, unnecessary, or
when planned, reducing repetitive queries from end users.
* As a service provider on OpenShift, I want a stable, reliable platform
with consistent HA, enabling easier service development without repeatedly
consulting each component team about HA status and plans.

### Goals

Collect HA policy data (defined below) during the CI process and notify
component owners of any guideline violations.

### Non-Goals

* Strict enforcement of guidelines that block product releases is out of scope.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be easy to do when we're ready with the approaches I will spell out below.

* Extending HA policy management to cover general guideline compliance beyond
HA is also out of scope for now.
* This proposal targets only all core and infrastructure-related components,
and the other components are out of scope.

## Proposal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose radical simplification of the proposal. I believe we can meet your goals with well established precedents, without the need for lots of new processes and systems, and you can get up and running quite quickly.

We've done this sort of things many times, the process is as follows:

Establish the Tests

Typically these are implemented as monitortests, they will run at the end of most of our hundreds of CI jobs. The monitortests generate junit test results per openshift component namespace, and per HA check you'd like to implement.

Example: https://github.com/openshift/origin/blob/00eaaf722f71858b3af6091af44b7225b5f8a6d7/pkg/monitortests/kubelet/containerfailures/container_failures.go#L137

Typically these kinds of tests encode exceptions linked to jiras. So you write the tests, do some preliminary testing in the PR (we can help), see what violations it finds, then write a Jira for each. (more below)

I suggest having the tests only flake when they find a problem for now, so we do not merge the PR and cause mass failures. Once all the problems are identified with bugs filed and exceptions added, the test can be moved to a state where it's allowed to fail.

In this case envision:

[Monitor:ha-compliance][Jira:"console"] pods in ns/openshift-console should define health checks
[Monitor:ha-compliance][Jira:"console"] pods in ns/openshift-console should sufficient replicas for HA

etc.

File Bugs for Violations

Sippy provides the dashboard of current state. Example for the monitortest linked above.

As problems are identified, someone will need to file bugs and add exceptions within the test. Typically we'll label the jiras with a specific label to help keep track. For any approved exception the test will usually permanently flake.

In the event the jiras is closed as not applicable or can't be fixed by engineering or PM, those should d likely transition from exceptions to just permanently approved whitelist with a comment explaining why, or a link to the jira that explains.

Once the test is stable in the wild, new violations will immediately start failing jobs and we have ample provisions for that to make it's way to dev teams. This prevents new components from coming in without the capability unless someone explicitly approves it, as well as regressions for existing components.

It can take time and effort for someone to find all the exceptions to be added and allow the test to start failing on regressions/problems, but in the interim the tests are live, gathering data, and not causing mass failures/panic.


* Create test cases to collect HA policy information from running OpenShift clusters.
* Define HA configs to define the type of HA feature to be handled
(redundancy and health check in the first proposal).
* Define the data structure of input and output of "HA level check" process.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Covered by junit test results.

* Create test cases to assess the output of "HA level check" process and
detect degradations in the HA implementation status.
* Define how to store the result of HA level check of each OpenShift version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled automatically by virtue of plugging into our testing framework.

to track the record of previous check results.
* Introduce a mechanism to notify the degradations to component owners whose
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled automatically. Tests map to component teams via flags in their test names. Problems then surface in component readiness. https://sippy-auth.dptools.openshift.org/sippy-ng/component_readiness/main?view=4.22-main

projects have failed test cases.
* Define the criteria which conditions should be met for each component pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Encoded in the test logic.

an HA level check for an HA config.
* Define the criteria that must be met to pass the HA level check for each
component and for each HA config.
* Define the workflow of how to collect the responses from notified component owners.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jira collects the responses from component owners.


### Workflow Description

The following figure shows the overview of HA policy management process.

![](./process-overview-ha-policy-management.png)

As shown in this figure, this management process is fundamentally based on
collecting and centrally managing the HA implementation status of each
component that makes up OpenShift cluster.
Then, utilizing the collected information, a core function of the management
process (called "HA level check") requests the component owners for additional
information, and encourages them to implement the relevant HA features.

HA policy information is defined as info which is needed by HA level check
process, and there are 2 types:

- Type 1: HA implementation status info which can be collected
programmatically from the actual OpenShift cluster.
- Type 2: Component specific info which is collected by questions
for component owners about HA design or development plan.

#### Actors in the workflow

- **Project owner** is a human responsible for the decision over anything
about development of the component which the project is about.
- **Process owner of HA policy management** is a human or non-human user
responsible for checking the result of HA level check and interacting with
component owners to encourage to implement HA or reason the decision.

#### Steps of how HA policy management works

- Trigger CI process.
- CI process runs a test case that do HA level check process.
- In HA level check process, the tool collect HA-related information from a
running cluster (like probe settings and redundancy settings).
- The result is stored in storage with assessment result.
- Compared with previous check results, the tool identifies newly found
failed test cases.
- The tool sends notifications to the component owners whose components
have failed the HA level check.
- Record a summary of the results of the current check results.
- Terminate the current HA level check process.
- Component owners who received notifications of new failed test cases,
have 2 options to respond:
- If the product manager (PM) of the component determines that it is
unnecessary for implementation, design, or operational reasons, the HA
feature will not be implemented, and the reasons for this determination
will be documented.
- If the HA feature has not yet been implemented solely due to development
priority or resource constraints, the reason and the planned
implementation version shall be recorded.
- The reason and plan given by component owners will be used in the next
run as justification of leaving failed testcases.

#### HA level check

HA level check uses these types of input information to judge whether each
component properly covers HA configs or not, then the result is output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This storage specifically is a concern, we need this to fit existing processes, and introducing new storage mechanisms and formats is probably beyond what we can undertake and fit into our existing org workflows. The good news however is that with the above we can get you up and running and working towards these goals much more quickly.

in JSON data format so that it can be stored in some shared repository (like
GitHub or some internal repository) for later use. There’re multiple HA
configs in each component, such as healthCheck and redundancy.
Generally, HA level check obeys the flowchart in the following diagram.

![](./general-flowchart-ha-level-check.png)

The check is done for each component for each HA config, then returns
one of the three values: pass, fail, and skip. Each config has its own
HA implementation status info and component specific info.

This flowchart is essential for HA policy management, so detailed explanations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be replaced by the states in the test:

  • pass
  • flake (because the component is permanently whitelisted)
  • flake (because a pending jira is awaiting a response)
  • fail (no exception/whitelist entry exists, and the violation appears unapproved)

about the intentions follow:
- The 1st condition class *“the HA feature is already implemented?”* is judged
only with HA implementation status info, which returns true if the target
HA feature is already implemented or else false.
- The 2nd condition class *“the HA feature is not necessary for typical
reasons?”* helps component owners easily judge whether HA implementation is
actually needed or not. This condition class includes typical viewpoint
and saves time and effort of component owners to think about how to judge.
- The 3rd condition class *“the reason and plan are given?”* asks why the HA
is not implemented for the reason other than the answers in the 2nd
condition class, and when to implement the HA feature. These are judged by
open-ended answer, which covers very component-specific technical reason
and/or development-related issues. If these are left unfulfilled, the HA
level check will fail, and the component owner will be warned with an
information request.

#### How component owners respond?

A component owner whose component failed the HA Level Check will receive a
notification containing the following data (details are omitted for brevity):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hoping to avoid any new notification mechanisms, the above outlines how we notify component owners when they have a problem that needs addressing.


```
{
"kind": "StatefulSet",
"name": "lokistack-index-gateway",
"namespace": "openshift-logging",
"container": "loki-index-gateway",
"healthCheck": {
"hasReadinessProbe": "true",
"hasLivenessProbe": "true",
"hasStartupProbe": "false",
"hasRouterOrK8sService": "true",
"hasMultiReplicas": "true"
},
"haLevelCheckResult": {
"healthCheckReadinessProbe": "pass",
"healthCheckLivenessProbe": "pass",
"healthCheckStartupProbe": "fail"
}
}
```

Here `healthCheck` is set by info collected programmatically in CI process.
The result of HA level check is set in `haLevelCheckResult`.
In this case, `haLevelCheckResults.healthCheckStartupProbe` is `fail`.

There're two primary ways that recipients are expected to respond the
notification. The first is to simply implement the required HA config
(a startup probe in this example).
The second is to provide component-specific information.
For example, if a component owner believes that a startup probe is
unnecessary for their container, following response would be expected:

```
{
"kind": "StatefulSet",
"name": "lokistack-index-gateway",
"namespace": "openshift-logging",
"container": "loki-index-gateway",
"healthCheck": {
"componentSpecific": {
"_ignore": "startup probe is not required for design reasons (...more details...)"
}
}
}
```

Alternatively, if the component owner agrees that a startup probe is
necessary but cannot implement it immediately due to constraints such as
resource issues, the expected response would be as follows:

```
{
"kind": "StatefulSet",
"name": "lokistack-index-gateway",
"namespace": "openshift-logging",
"container": "loki-index-gateway",
"healthCheck": {
"componentSpecific": {
"_ignore": "the team is busy for higher priority tasks",
"targetVersion": "v4.22"
}
}
}
```

In this case, the response must also include the timeframe for resolving
the blocking issues, specified in the `componentSpecific.targetVersion` field.

### API Extensions

N/A

### Topology Considerations

No specific changes required

#### Hypershift / Hosted Control Planes

No specific changes required

#### Standalone Clusters

No specific changes required

#### Single-node Deployments or MicroShift

N/A

### Implementation Details/Notes/Constraints

None (HA policy management is implemented in CI process outside OpenShift components)

### Risks and Mitigations

Risk: Development teams bear the burden of responding to notifications
in a timely manner to prioritize and plan the development of HA features.
Mitigation: The management process will only issue warnings without
blocking the actual release process.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, we can accommodate this while the test is in flake mode only, and cannot fail. In future, once all exceptions look covered, we can make the test official and let it fail for anything new.

While an exception is granted with an open jira, the test will flake and not fail.

Periodic monitoring or automation is required to check the list of exception jiras to see if they were closed, and take appropriate action. (either reopen in disagreement, or moving the exception to the permanent whitelist) At this point I would recommend claude command helper in origin repo to help maintain this aspect.


### Drawbacks

None

## Alternatives (Not Implemented)

Not known

## Open Questions [optional]

- How to determine the exact coverage of target components of HA policy management?
- How to maintain and publish the result of HA policy management?
- Currently all defined HA configs are healthCheck and redundancy, but is there any
other possible HA configs?

## Test Plan

We start with limited enforcement (for example only on select components)
to verify that the management process works properly, and then gradually
expand the scope.

## Graduation Criteria

N/A.

### Dev Preview -> Tech Preview

N/A

### Tech Preview -> GA

N/A

### Removing a deprecated feature

N/A

## Upgrade / Downgrade Strategy

N/A

## Version Skew Strategy

N/A

## Operational Aspects of API Extensions

N/A

## Support Procedures

N/A

## Infrastructure Needed [optional]

N/A
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.