
OSSM-4815: Document HA for a mesh #96010


Merged
merged 1 commit into from
Aug 12, 2025

Contributor

@rh-tokeefe rh-tokeefe commented Jul 11, 2025

Affects:

service-mesh-docs-main
service-mesh-docs-3.0
service-mesh-docs-3.1

This PR must be merged to service-mesh-docs-main and cherry-picked back to the 3.0 and 3.1 branches.

Version(s): 3.1

Issue: https://issues.redhat.com/browse/OSSM-4815

Link to docs preview:
https://96010--ocpdocs-pr.netlify.app/openshift-service-mesh/latest/install/ossm-installing-openshift-service-mesh.html#ossm-about-istio-high-availability_ossm-customizing-istio-configuration

QE review:

  • QE has approved this change.

Additional information:

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label Jul 11, 2025
openshift-ci-robot commented Jul 11, 2025

@rh-tokeefe: This pull request references OSSM-4815 which is a valid jira issue.

In response to this:

Version(s): 3.1

Issue: https://issues.redhat.com/browse/OSSM-4815

Link to docs preview:

QE review:

  • QE has approved this change.

Additional information:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the size/L label Jul 11, 2025

@fjglira fjglira left a comment

I left some minor changes


Running the {istio} control plane in High Availability (HA) mode prevents single points of failure, and ensures continuous mesh operation even if an `istiod` pod fails. By using HA, if one `istiod` pod becomes unavailable, another one continues to manage and configure the {istio} control plane, preventing service outages or disruptions. HA provides scalability by distributing the control plane workload, enables graceful upgrades, supports disaster recovery operations, and protects against zone-wide mesh outages.

There are two ways for a system administrator to configure HA: by defining replica count or by using autoscaling.

Suggested change
There are two ways for a system administrator to configure HA: by defining replica count or by using autoscaling.
There are two ways for a system administrator to configure HA for the `istiod` deployment:
* Defining a static replica count: This involves setting a fixed number of `istiod` pods, providing a consistent level of redundancy.
* Using autoscaling: This dynamically adjusts the number of `istiod` pods based on observed resource utilization or custom metrics, offering more efficient resource consumption for fluctuating workloads.


I think adding a short preview here would give users a first look at what the configuration types are.

Contributor Author

done

[id="ossm-configuring-istio-ha-autoscaling_{context}"]
= Configuring Istio HA by using autoscaling

Configure the {istio} control plane in High Availability (HA) mode to prevent a single point of failure, and ensure continuous mesh operation even if one of the `istiod` pods fails. Autoscaling defines the minimum and maximum number of {istio} control plane pods that can operate. {ocp-product-title} uses these values to scale the number of control planes in operation in response to the varying number of workloads in the mesh.

Suggested change
Configure the {istio} control plane in High Availability (HA) mode to prevent a single point of failure, and ensure continuous mesh operation even if one of the `istiod` pods fails. Autoscaling defines the minimum and maximum number of {istio} control plane pods that can operate. {ocp-product-title} uses these values to scale the number of control planes in operation in response to the varying number of workloads in the mesh.
Configure the {istio} control plane in High Availability (HA) mode to prevent a single point of failure, and ensure continuous mesh operation even if one of the `istiod` pods fails. Autoscaling defines the minimum and maximum number of {istio} control plane pods that can operate. {ocp-product-title} uses these values to scale the number of control planes in operation based on observed resource utilization (such as CPU or memory) or custom metrics, effectively responding to the varying number of workloads and overall traffic patterns within the mesh.

Contributor Author

done

Comment on lines 48 to 51
<1> Defines the minimum number of {istio} control plane replicas that always run.
<2> Defines the maximum number of {istio} control plane replicas, allowing for scaling based on load. To support HA, there must be at least two replicas.

It would be highly beneficial to add a note here describing the specific metrics that can be used to configure istiod autoscaling (scale up/down). For example, users can set `spec.values.pilot.cpu.targetAverageUtilization` and `spec.values.pilot.memory.targetAverageUtilization` to define the CPU and memory thresholds that trigger scaling actions. Sorry for not adding this to the upstream docs as well; I'll add it there. I think it's good to point users to the configuration that triggers this.

Contributor Author

Done
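
For illustration, the autoscaling settings discussed in the thread above might look like the following in the `Istio` resource. This is a minimal sketch under stated assumptions, not the content of the PR: the `apiVersion`, the `istio-system` namespace, and every numeric value are assumptions. The `cpu` and `memory` field paths come from the review comment, and the resource name `default` comes from the command output quoted in this PR. The keys are spelled `autoscaleMin` and `autoscaleMax` following the upstream Istio Helm values, whereas the table under review writes `autoScaleMin`.

```yaml
# Sketch only: apiVersion, namespace, and all numeric values are assumptions.
apiVersion: sailoperator.io/v1
kind: Istio
metadata:
  name: default              # resource name shown in the PR's command output
spec:
  namespace: istio-system    # assumed control plane namespace
  values:
    pilot:
      autoscaleMin: 2        # minimum istiod replicas; two or more keeps HA
      autoscaleMax: 5        # upper bound the HPA can scale to
      cpu:
        targetAverageUtilization: 80   # CPU threshold that triggers scaling
      memory:
        targetAverageUtilization: 80   # memory threshold that triggers scaling
```

Setting at least one utilization target matters here, because the text under review states that the autoscaler does not scale up or down when no metrics are configured.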

istiod-7c7b6564c9-xkmsl 1/1 Running 0 85s
----
+
Two `istiod` pods are running, which indicates HA was successfully configured.

Suggested change
Two `istiod` pods are running, which indicates HA was successfully configured.
Two `istiod` pods are running, which is the minimum requirement for a highly available Istio control plane and indicates a basic HA setup is in place.

Contributor Author

done

[id="ossm-configuring-istio-ha-replicacount_{context}"]
= Configuring Istio HA by using replica count

Configure the {istio} control plane in High Availability (HA) mode to prevent a single point of failure, and ensure continuous mesh operation even if one of the `istiod` pods fails. The replica count defines a fixed number of {istio} control plane pods that can operate. Use replica count for mesh environments in which the number of workloads does not scale.

Suggested change
Configure the {istio} control plane in High Availability (HA) mode to prevent a single point of failure, and ensure continuous mesh operation even if one of the `istiod` pods fails. The replica count defines a fixed number of {istio} control plane pods that can operate. Use replica count for mesh environments in which the number of workloads does not scale.
Configure the {istio} control plane in High Availability (HA) mode to prevent a single point of failure, and ensure continuous mesh operation even if one of the `istiod` pods fails. The replica count defines a fixed number of {istio} control plane pods that can operate. Use replica count for mesh environments where the control plane workload is relatively stable or predictable, or when manual scaling of the `istiod` is preferred.

Contributor Author

done

@rh-tokeefe rh-tokeefe force-pushed the OSSM-4815 branch 2 times, most recently from 6e0042d to 9062ebe on July 16, 2025 19:31
@fjglira fjglira left a comment

lgtm, this looks great with the updated procedure

[id="ossm-about-istio-high-availability_{context}"]
= About Istio High Availability

Running the {istio} control plane in High Availability (HA) mode prevents single points of failure, and ensures continuous mesh operation even if an `istiod` pod fails. By using HA, if one `istiod` pod becomes unavailable, another one continues to manage and configure the {istio} control plane, preventing service outages or disruptions. HA provides scalability by distributing the control plane workload, enables graceful upgrades, supports disaster recovery operations, and protects against zone-wide mesh outages.

"another one continues to manage and configure the Istio control plane"
Should that be "continues to manage and configure the Istio data plane"?


Yes, my bad. We should say here:

Running the {istio} control plane in High Availability (HA) mode prevents single points of failure, and ensures continuous mesh operation even if an `istiod` pod fails. By using HA, if one `istiod` pod becomes unavailable, another one continues to manage and configure the {istio} data plane, preventing service outages or disruptions. HA provides scalability by distributing the control plane workload, enables graceful upgrades, supports disaster recovery operations, and protects against zone-wide mesh outages.

Contributor Author

@FilipB and @fjglira just want to check with you about this part:

"HA provides scalability by distributing the control plane workload..."

Do we keep control plane here or also change this to data plane?


The "HA provides scalability by distributing the control plane workload..." part is correct. There are multiple istiod instances, and istiod == control plane. Istiod (the control plane) manages the data plane.


|`autoScaleMin` | Defines the minimum number of `istiod` pods for an istio deployment. Each pod contains one instance of the Istio control plane.

{ocp-short-name} only uses this parameter when the {istio} deployment uses the Horizontal Pod Autoscaler (HPA) configuration.

This note is a bit confusing. The creation of the HPA is controlled by Values.autoscaleEnabled.


Yeah, and this is a table that shows all the parameters available to configure for HA deployment. Maybe we should say:

{ocp-short-name} only uses this parameter when Horizontal Pod Autoscaler (HPA) is enabled for the {istio} deployment, which is the default behavior (Values.autoscaleEnabled: true).

Contributor Author

done




You must also configure metrics for autoscaling to work properly. If no metrics are configured, the autoscaler does not scale up or down.

{ocp-short-name} only uses this parameter when the {istio} deployment uses the HPA configuration.

Same here then:

{ocp-short-name} only uses this parameter when Horizontal Pod Autoscaler (HPA) is enabled for the {istio} deployment, which is the default behavior (Values.autoscaleEnabled: true).

Contributor Author

done

[id="ossm-about-istio-high-availability_{context}"]
= About Istio High Availability

Running the {istio} control plane in High Availability (HA) mode prevents single points of failure, and ensures continuous mesh operation even if an `istiod` pod fails. By using HA, if one `istiod` pod becomes unavailable, another one continues to manage and configure the {istio} data plane, preventing service outages or disruptions. HA provides scalability by distributing the control plane workload, enables graceful upgrades, supports disaster recovery operations, and protects against zone-wide mesh outages.

"By using HA, if one istiod pod becomes unavailable, another one continues..."
@rh-tokeefe does this make sense, or should it be rephrased?

Contributor Author

@FilipB "By using" conforms to the style guide.

All I'm trying to say here is "When using HA, if one istiod pod becomes unavailable, another one continues to manage and configure the {istio} data plane, preventing service outages or disruptions."

Contributor Author

We can sync and chat if you'd like.


It's fine. I just wasn't sure whether it was grammatically correct.

Contributor

@shreyasiddhartha shreyasiddhartha left a comment

Hi @rh-tokeefe!
I only have a few comments related to the OCP guidelines. Overall, LGTM.
Please reach out to me if you have any questions. :)


[role="_additional-resources-pod-scaling"]
.Additional resources
* link:https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/[Horizontal Pod Autoscaling]
Contributor

Suggested change
* link:https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/[Horizontal Pod Autoscaling]
* link:https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/[Horizontal Pod Autoscaling (Kubernetes documentation)]
* link:https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior[Configurable Scaling Behavior (Kubernetes documentation)]

Suggestion to state the source of external links when the link is not from our product docs.
I also added a suggestion to add a link here from another instance where a link was used, in the module ossm-api-settings-mesh-ha-autoscaling.adoc.

Contributor Author

done

|===
|Parameter |Description

|`autoScaleMin` | Defines the minimum number of `istiod` pods for an istio deployment. Each pod contains one instance of the Istio control plane.
Contributor

Suggested change
|`autoScaleMin` | Defines the minimum number of `istiod` pods for an istio deployment. Each pod contains one instance of the Istio control plane.
|`autoScaleMin` | Defines the minimum number of `istiod` pods for an {istio} deployment. Each pod contains one instance of the {istio} control plane.

Suggestion to use the {istio} attribute here.

Contributor Author

done

|`memory.targetAverageUtilization` | Defines the target memory utilization for the `istiod` pod. If the average memory usage exceeds the threshold that this parameter defines, the HPA automatically increases the number of replica pods.
|`behavior` | You can use the `behavior` field to define additional policies that {ocp-short-name} uses to scale {istio} resources up or down.

For more information, see link:https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior[Configurable Scaling Behavior].
Contributor

Suggested change
For more information, see link:https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior[Configurable Scaling Behavior].
For more information, see "Configurable Scaling Behavior".

Just a suggestion to provide the link in the "Additional resources" section rather than the module. I will update the suggestion in the assembly file as well.

|===
|Parameter |Description

|`replicaCount` | Defines the number of `istiod` pods for an istio deployment. Each pod contains one instance of the `istio` control plane. The default setting is `1`.
Contributor

Suggested change
|`replicaCount` | Defines the number of `istiod` pods for an istio deployment. Each pod contains one instance of the `istio` control plane. The default setting is `1`.
|`replicaCount` | Defines the number of `istiod` pods for an {istio} deployment. Each pod contains one instance of the {istio} control plane. The default setting is `1`.

I am unsure about the usage of the following:

  • Istio control plane or Istio control plane
  • Istio deployment or Istio deployment

I will leave it up to you to decide. My only suggestion is to make it consistent throughout and also the usage of {istio} attribute.


* You installed the {SMProductName} Operator.

* You deployed the {istio} resource.
Contributor

Suggested change
* You deployed the {istio} resource.
* You have deployed the `{istio}` resource.

Suggestion to add code blocks here.

Contributor Author

done

default 1 1 0 default Healthy v1.24.6 24m
----
+
The name of the {istio} resource is `default`.
Contributor

Suggested change
The name of the {istio} resource is `default`.
The name of the `{istio}` resource is `default`.

Contributor Author

done

+
The name of the {istio} resource is `default`.

. Update the {istio} custom resource by adding the `autoscaleEnabled` and `replicaCount` parameters by running the following command:
Contributor

Suggested change
. Update the {istio} custom resource by adding the `autoscaleEnabled` and `replicaCount` parameters by running the following command:
. Update the `{istio}` custom resource (CR) by adding the `autoscaleEnabled` and `replicaCount` parameters by running the following command:

Contributor Author

done

Comment on lines 48 to 49
<1> Disables autoscaling and ensures that the number of replicas remains fixed.
<2> Defines the number of {istio} control plane replicas. To support HA, there must be at least two replicas.
Contributor

Suggested change
<1> Disables autoscaling and ensures that the number of replicas remains fixed.
<2> Defines the number of {istio} control plane replicas. To support HA, there must be at least two replicas.
<1> Specifies a setting that disables autoscaling and ensures a fixed number of replicas.
<2> Specifies the number of {istio} control plane replicas. To support HA, there must be at least two replicas.

Contributor Author

done
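
As a rough illustration of the two callouts in the thread above, the fixed replica configuration might be expressed as the following fragment of the `Istio` custom resource. This is a sketch, not the command from the PR: the `spec.values.pilot` placement is an assumption based on the field paths a reviewer quoted for the autoscaling metrics, and the replica value of `2` is simply the minimum that the callout says HA requires.

```yaml
# Sketch only: the spec.values.pilot placement is an assumption.
spec:
  values:
    pilot:
      autoscaleEnabled: false  # <1> disables autoscaling so the count stays fixed
      replicaCount: 2          # <2> at least two replicas are needed for HA
```

Because the review comments describe HPA as enabled by default, `autoscaleEnabled: false` appears alongside `replicaCount` here, matching the procedure step that adds both parameters together.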


.Verification

. Verify the status of the {istio} control pods by running the following command:
Contributor

Suggested change
. Verify the status of the {istio} control pods by running the following command:
* Verify the status of the {istio} control pods by running the following command:

Since this is a single-step procedure, I think it is okay not to make it a numbered list. Here's the guideline.


.Verification

. Verify the status of the {Istio} control pods by running the following command:
Contributor

Suggested change
. Verify the status of the {Istio} control pods by running the following command:
* Verify the status of the {Istio} control pods by running the following command:

Contributor Author

done

@rh-tokeefe rh-tokeefe force-pushed the OSSM-4815 branch 2 times, most recently from 285c65c to 7fa34f0 on August 11, 2025 14:36

openshift-ci bot commented Aug 12, 2025

@rh-tokeefe: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@rh-tokeefe rh-tokeefe merged commit 4754a15 into openshift:service-mesh-docs-main Aug 12, 2025
2 checks passed
@rh-tokeefe
Contributor Author

/cherrypick service-mesh-docs-3.1

@rh-tokeefe
Contributor Author

/cherrypick service-mesh-docs-3.0

@openshift-cherrypick-robot

@rh-tokeefe: new pull request created: #97438

In response to this:

/cherrypick service-mesh-docs-3.1


@openshift-cherrypick-robot

@rh-tokeefe: new pull request created: #97439

In response to this:

/cherrypick service-mesh-docs-3.0


@rh-tokeefe
Contributor Author

/cherrypick service-mesh-docs-3.1

@openshift-cherrypick-robot

@rh-tokeefe: new pull request could not be created: failed to create pull request against openshift/openshift-docs#service-mesh-docs-3.1 from head openshift-cherrypick-robot:cherry-pick-96010-to-service-mesh-docs-3.1: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for openshift-cherrypick-robot:cherry-pick-96010-to-service-mesh-docs-3.1."}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request","status":"422"}

In response to this:

/cherrypick service-mesh-docs-3.1


Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

7 participants