Merge pull request #62506 from apinnick/bz2219552-node-health-check-ha

apinnick · web-flow · commit 24c52e5d9e2c · 2023-07-20T11:31:56.000+03:00
BZ#2219552: HA and node health checks
diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml
@@ -3642,8 +3642,6 @@ Topics:
     File: virt-accessing-vm-consoles
   - Name: Automating Windows installation with sysprep
     File: virt-automating-windows-sysprep
-  - Name: Triggering virtual machine failover by resolving a failed node
-    File: virt-triggering-vm-failover-resolving-failed-node
   - Name: Installing the QEMU guest agent and VirtIO drivers
     File: virt-installing-qemu-guest-agent
   - Name: Viewing the QEMU guest agent information for virtual machines
@@ -3682,6 +3680,8 @@ Topics:
       File: virt-configuring-mediated-devices
     - Name: Enabling descheduler evictions on virtual machines
       File: virt-enabling-descheduler-evictions
+    - Name: About high availability for virtual machines
+      File: virt-high-availability-for-vms
 # Importing virtual machines
   - Name: Importing virtual machines
     Dir: importing_vms
@@ -3817,6 +3817,8 @@ Topics:
     File: virt-managing-node-labeling-obsolete-cpu-models
   - Name: Preventing node reconciliation
     File: virt-preventing-node-reconciliation
+  - Name: Deleting a failed node to trigger virtual machine failover
+    File: virt-triggering-vm-failover-resolving-failed-node
 - Name: Monitoring
   Dir: monitoring
   Topics:
diff --git a/modules/virt-about-node-maintenance.adoc b/modules/virt-about-node-maintenance.adoc
@@ -5,16 +5,18 @@
 [id="virt-about-node-maintenance_{context}"]
 = About node maintenance mode
 
-Nodes can be placed into maintenance mode using the `oc adm` utility, or using `NodeMaintenance` custom resources (CRs).
+Nodes can be placed into maintenance mode by using the `oc adm` utility or `NodeMaintenance` custom resources (CRs).
 
 [NOTE]
 ====
-The `node-maintenance-operator` (NMO) is no longer shipped with {VirtProductName}. It is now available to deploy as a standalone Operator from the *OperatorHub* in the {product-title} web console, or by using the OpenShift CLI (`oc`).
+The `node-maintenance-operator` (NMO) is no longer shipped with {VirtProductName}. It is deployed as a standalone Operator from the *OperatorHub* in the {product-title} web console or by using the OpenShift CLI (`oc`).
+
+For more information on remediation, fencing, and maintaining nodes, see the link:https://access.redhat.com/documentation/en-us/workload_availability_for_red_hat_openshift/23.2/html-single/remediation_fencing_and_maintenance/index#about-remediation-fencing-maintenance[Workload Availability for Red Hat OpenShift] documentation.
 ====
 
 Placing a node into maintenance marks the node as unschedulable and drains all the virtual machines and pods from it. Virtual machine instances that have a `LiveMigrate` eviction strategy are live migrated to another node without loss of service. This eviction strategy is configured by default in virtual machines created from common templates but must be configured manually for custom virtual machines.
 
-Virtual machine instances without an eviction strategy are shut down. Virtual machines with a `RunStrategy` of `Running` or `RerunOnFailure` are recreated on another node. Virtual machines with a `RunStrategy` of `Manual` are not automatically restarted.
+Virtual machine instances without an eviction strategy are shut down. Virtual machines with a `runStrategy` of `Running` or `RerunOnFailure` are recreated on another node. Virtual machines with a `runStrategy` of `Manual` are not automatically restarted.
 
 [IMPORTANT]
 ====
diff --git a/modules/virt-about-runstrategies-vms.adoc b/modules/virt-about-runstrategies-vms.adoc
@@ -1,31 +1,55 @@
 // Module included in the following assemblies:
 //
-// * virt/virtual_machines/virt-create-vms.adoc
+// * virt/node_maintenance/virt-about-node-maintenance.adoc
 
 :_content-type: CONCEPT
 [id="virt-about-runstrategies-vms_{context}"]
-= About RunStrategies for virtual machines
+= About run strategies for virtual machines
 
-A `RunStrategy` for virtual machines determines a virtual machine instance's (VMI) behavior, depending on a series of conditions. The `spec.runStrategy` setting exists in the virtual machine configuration process as an alternative to the `spec.running` setting.
-The `spec.runStrategy` setting allows greater flexibility for how VMIs are created and managed, in contrast to the `spec.running` setting with only `true` or `false` responses. However, the two settings are mutually exclusive. Only either `spec.running` or `spec.runStrategy` can be used. An error occurs if both are used.
+Run strategies for virtual machines (VMs) determine how virtual machine instances (VMIs) behave under certain conditions.
 
-There are four defined RunStrategies.
+You configure a run strategy by assigning a value to the `runStrategy` key in the `VirtualMachine` manifest as in the following example:
+
+.Example run strategy
+[source,yaml]
+----
+apiVersion: kubevirt.io/v1
+kind: VirtualMachine
+spec:
+  runStrategy: Always
+  template:
+# ...
+----
+
+[IMPORTANT]
+====
+The `runStrategy` and the `running` keys are mutually exclusive. Only one of them can be used.
+====
+
+The `runStrategy` key gives you more flexibility because it has four values, unlike the `running` key, which has a Boolean value.
+
+.`runStrategy` key values
 
 `Always`::
-A VMI is always present when a virtual machine is created. A new VMI is created if the original stops for any reason, which is the same behavior as `spec.running: true`.
+The VMI is always present when a virtual machine is created. A new VMI is created if the original stops for any reason. This is the same behavior as `running: true`.
+
 `RerunOnFailure`::
-A VMI is re-created if the previous instance fails due to an error. The instance is not re-created if the virtual machine stops successfully, such as when it shuts down.
+The VMI is re-created if the previous instance fails. The instance is not re-created if the virtual machine stops successfully, such as when it is shut down.
+
 `Manual`::
-The `start`, `stop`, and `restart` virtctl client commands can be used to control the VMI's state and existence.
+You control the VMI state manually with the `start`, `stop`, and `restart` virtctl client commands.
+
 `Halted`::
-No VMI is present when a virtual machine is created, which is the same behavior as `spec.running: false`.
+No VMI is present when a virtual machine is created. This is the same behavior as `running: false`.
 
-Different combinations of the `start`, `stop` and `restart` virtctl commands affect which `RunStrategy` is used.
+Different combinations of the `start`, `stop`, and `restart` virtctl commands affect the run strategy.
 
-The following table follows a VM's transition from different states. The first column shows the VM's initial `RunStrategy`. Each additional column shows a virtctl command and the new `RunStrategy` after that command is run.
+The following table describes a VM's transition from different states. The first column shows the VM's initial run strategy. The remaining columns show a virtctl command and the new run strategy after that command is run.
 
+.Run strategy before and after `virtctl` commands
+[options="header"]
 |===
-|Initial RunStrategy |start |stop |restart
+|Initial run strategy |Start |Stop |Restart
 
 |Always
 |-
@@ -50,16 +74,6 @@ The following table follows a VM's transition from different states. The first c
 
 [NOTE]
 ====
-In {VirtProductName} clusters installed using installer-provisioned infrastructure, when a node fails the MachineHealthCheck and becomes unavailable to the cluster, VMs with a RunStrategy of `Always` or `RerunOnFailure` are rescheduled on a new node.
+If a node in a cluster installed by using installer-provisioned infrastructure fails the machine health check and is unavailable, VMs with `runStrategy: Always` or `runStrategy: RerunOnFailure` are rescheduled on a new node.
 ====
 
-[source,yaml]
-----
-apiVersion: kubevirt.io/v1
-kind: VirtualMachine
-spec:
-  RunStrategy: Always <1>
-  template:
-# ...
-----
-<1> The VMI's current `RunStrategy` setting.
diff --git a/modules/virt-about-workload-updates.adoc b/modules/virt-about-workload-updates.adoc
@@ -31,7 +31,7 @@ If you enable both `LiveMigrate` and `Evict`:
 
 * VMIs that support live migration use the `LiveMigrate` update strategy.
 
-* VMIs that do not support live migration use the `Evict` update strategy. If a VMI is controlled by a `VirtualMachine` object that has a `runStrategy` value of `always`, a new VMI is created in a new pod with updated components.
+* VMIs that do not support live migration use the `Evict` update strategy. If a VMI is controlled by a `VirtualMachine` object that has `runStrategy: Always` set, a new VMI is created in a new pod with updated components.
 
 [discrete]
 [id="migration-attempts-timeouts_{context}"]
diff --git a/modules/virt-configuring-workload-update-methods.adoc b/modules/virt-configuring-workload-update-methods.adoc
@@ -46,7 +46,7 @@ spec:
 <1> The methods that can be used to perform automated workload updates. The available values are `LiveMigrate` and `Evict`. If you enable both options as shown in this example, updates use `LiveMigrate` for VMIs that support live migration and `Evict` for any VMIs that do not support live migration. To disable automatic workload updates, you can either remove the `workloadUpdateStrategy` stanza or set `workloadUpdateMethods: []` to leave the array empty.
 //NOTE: in 4.10, removing the stanza will not disable the feature.
 <2> The least disruptive update method. VMIs that support live migration are updated by migrating the virtual machine (VM) guest into a new pod with the updated components enabled. If `LiveMigrate` is the only workload update method listed, VMIs that do not support live migration are not disrupted or updated.
-<3> A disruptive method that shuts down VMI pods during upgrade. `Evict` is the only update method available if live migration is not enabled in the cluster. If a VMI is controlled by a `VirtualMachine` object that has `runStrategy: always` configured, a new VMI is created in a new pod with updated components.
+<3> A disruptive method that shuts down VMI pods during upgrade. `Evict` is the only update method available if live migration is not enabled in the cluster. If a VMI is controlled by a `VirtualMachine` object that has `runStrategy: Always` configured, a new VMI is created in a new pod with updated components.
 <4> The number of VMIs that can be forced to be updated at a time by using the `Evict` method. This does not apply to the `LiveMigrate` method.
 <5> The interval to wait before evicting the next batch of workloads. This does not apply to the `LiveMigrate` method.
 +
diff --git a/modules/virt-runbook-outdatedvirtualmachineinstanceworkloads.adoc b/modules/virt-runbook-outdatedvirtualmachineinstanceworkloads.adoc
@@ -85,7 +85,7 @@ Update the `HyperConverged` CR to enable automatic workload updates.
 [id="stopping-a-vm-associated-with-a-non-live-migratable-vmi-outdatedvirtualmachineinstanceworkloads"]
 === Stopping a VM associated with a non-live-migratable VMI
 
-* If a VMI is not live-migratable and if `runStrategy: always` is
+* If a VMI is not live-migratable and if `runStrategy: Always` is
 set in the corresponding `VirtualMachine` object, you can update the
 VMI by manually stopping the virtual machine (VM):
 +
diff --git a/virt/install/preparing-cluster-for-virt.adoc b/virt/install/preparing-cluster-for-virt.adoc
@@ -143,7 +143,7 @@ You can configure one of the following high-availability (HA) options for your c
 +
 [NOTE]
 ====
-In {product-title} clusters installed using installer-provisioned infrastructure and with MachineHealthCheck properly configured, if a node fails the MachineHealthCheck and becomes unavailable to the cluster, it is recycled. What happens next with VMs that ran on the failed node depends on a series of conditions. See xref:../../virt/virtual_machines/virt-create-vms.adoc#virt-about-runstrategies-vms_virt-create-vms[About RunStrategies for virtual machines] for more detailed information about the potential outcomes and how RunStrategies affect those outcomes.
+In {product-title} clusters installed using installer-provisioned infrastructure and with MachineHealthCheck properly configured, if a node fails the MachineHealthCheck and becomes unavailable to the cluster, it is recycled. What happens next with VMs that ran on the failed node depends on a series of conditions. See xref:../../virt/node_maintenance/virt-about-node-maintenance.adoc#virt-about-runstrategies-vms_virt-about-node-maintenance[About run strategies for virtual machines] for more detailed information about the potential outcomes and how run strategies affect those outcomes.
 ====
 
 * Automatic high availability for both IPI and non-IPI is available by using the *Node Health Check Operator* on the {product-title} cluster to deploy the `NodeHealthCheck` controller. The controller identifies unhealthy nodes and uses the Self Node Remediation Operator to remediate the unhealthy nodes. For more information on remediation, fencing, and maintaining nodes, see the link:https://access.redhat.com/documentation/en-us/workload_availability_for_red_hat_openshift/23.2/html-single/remediation_fencing_and_maintenance/index#about-remediation-fencing-maintenance[Workload Availability for Red Hat OpenShift] documentation.
diff --git a/virt/node_maintenance/virt-about-node-maintenance.adoc b/virt/node_maintenance/virt-about-node-maintenance.adoc
@@ -8,14 +8,12 @@ toc::[]
 
 include::modules/virt-about-node-maintenance.adoc[leveloffset=+1]
 
+include::modules/virt-about-runstrategies-vms.adoc[leveloffset=+1]
+
 include::modules/virt-maintaining-bare-metal-nodes.adoc[leveloffset=+1]
 
 [role="_additional-resources"]
 [id="additional-resources_virt-about-node-maintenance"]
 == Additional resources
-* xref:../../nodes/nodes/nodes-remediating-fencing-maintaining-rhwa.adoc#nodes-remediating-fencing-maintaining-rhwa[Installing the Node Maintenance Operator by using the CLI]
-* xref:../../nodes/nodes/nodes-remediating-fencing-maintaining-rhwa.adoc#nodes-remediating-fencing-maintaining-rhwa[Setting a node to maintenance mode]
-* xref:../../nodes/nodes/nodes-remediating-fencing-maintaining-rhwa.adoc#nodes-remediating-fencing-maintaining-rhwa[Resuming a node from maintenance mode]
-* xref:../../virt/virtual_machines/virt-create-vms.adoc#virt-about-runstrategies-vms_virt-create-vms[About RunStrategies for virtual machines]
 * xref:../../virt/live_migration/virt-live-migration.adoc#virt-live-migration[Virtual machine live migration]
 * xref:../../virt/live_migration/virt-configuring-vmi-eviction-strategy.adoc#virt-configuring-vmi-eviction-strategy[Configuring virtual machine eviction strategy]
diff --git a/virt/node_maintenance/virt-triggering-vm-failover-resolving-failed-node.adoc b/virt/node_maintenance/virt-triggering-vm-failover-resolving-failed-node.adoc
@@ -1,26 +1,26 @@
 :_content-type: ASSEMBLY
 [id="virt-triggering-vm-failover-resolving-failed-node"]
-= Triggering virtual machine failover by resolving a failed node
+= Deleting a failed node to trigger virtual machine failover
 include::_attributes/common-attributes.adoc[]
 :context: virt-triggering-vm-failover-resolving-failed-node
 
 toc::[]
 
-If a node fails and xref:../../machine_management/deploying-machine-health-checks.adoc#machine-health-checks-about_deploying-machine-health-checks[machine health checks] are not deployed on your cluster, virtual machines (VMs) with `RunStrategy: Always` configured are not automatically relocated to healthy nodes. To trigger VM failover, you must manually delete the `Node` object.
+If a node fails and xref:../../machine_management/deploying-machine-health-checks.adoc#machine-health-checks-about_deploying-machine-health-checks[machine health checks] are not deployed on your cluster, virtual machines (VMs) with `runStrategy: Always` configured are not automatically relocated to healthy nodes. To trigger VM failover, you must manually delete the `Node` object.
 
 [NOTE]
 ====
-If you installed your cluster by using xref:../../installing/installing_bare_metal_ipi/ipi-install-overview.adoc#ipi-install-overview[installer-provisioned infrastructure] and you properly configured machine health checks:
+If you installed your cluster by using xref:../../installing/installing_bare_metal_ipi/ipi-install-overview.adoc#ipi-install-overview[installer-provisioned infrastructure] and you properly configured machine health checks, the following events occur:
 
 * Failed nodes are automatically recycled.
-* Virtual machines with xref:../../virt/virtual_machines/virt-create-vms.adoc#virt-about-runstrategies-vms_virt-create-vms[`RunStrategy`] set to `Always` or `RerunOnFailure` are automatically scheduled on healthy nodes.
+* Virtual machines with xref:../../virt/node_maintenance/virt-about-node-maintenance.adoc#virt-about-runstrategies-vms_virt-about-node-maintenance[`runStrategy`] set to `Always` or `RerunOnFailure` are automatically scheduled on healthy nodes.
 ====
 
 [id="prerequisites_{context}"]
 == Prerequisites
 
 * A node where a virtual machine was running has the `NotReady` xref:../../nodes/nodes/nodes-nodes-viewing.adoc#nodes-nodes-viewing-listing_nodes-nodes-viewing[condition].
-* The virtual machine that was running on the failed node has `RunStrategy` set to `Always`.
+* The virtual machine that was running on the failed node has `runStrategy` set to `Always`.
 * You have installed the OpenShift CLI (`oc`).
 
 include::modules/nodes-nodes-working-deleting-bare-metal.adoc[leveloffset=+1]
diff --git a/virt/virtual_machines/advanced_vm_management/virt-high-availability-for-vms.adoc b/virt/virtual_machines/advanced_vm_management/virt-high-availability-for-vms.adoc
@@ -0,0 +1,21 @@
+:_content-type: ASSEMBLY
+[id="virt-high-availability-for-vms"]
+= About high availability for virtual machines
+include::_attributes/common-attributes.adoc[]
+:context: virt-high-availability-for-vms
+
+toc::[]
+
+You can enable high availability for virtual machines (VMs) by manually deleting a failed node to trigger VM failover or by configuring remediating nodes.
+
+.Manually deleting a failed node
+
+If a node fails and machine health checks are not deployed on your cluster, virtual machines with `runStrategy: Always` configured are not automatically relocated to healthy nodes. To trigger VM failover, you must manually delete the `Node` object.
+
+See xref:../../../virt/node_maintenance/virt-triggering-vm-failover-resolving-failed-node.adoc#virt-triggering-vm-failover-resolving-failed-node[Deleting a failed node to trigger virtual machine failover].
+
+.Configuring remediating nodes
+
+You can configure remediating nodes by installing the Self Node Remediation Operator from the OperatorHub and enabling machine health checks or node remediation checks.
+
+For more information on remediation, fencing, and maintaining nodes, see the link:https://access.redhat.com/documentation/en-us/workload_availability_for_red_hat_openshift/23.2/html-single/remediation_fencing_and_maintenance/index#about-remediation-fencing-maintenance[Workload Availability for Red Hat OpenShift] documentation.
diff --git a/virt/virtual_machines/virt-create-vms.adoc b/virt/virtual_machines/virt-create-vms.adoc
@@ -45,9 +45,6 @@ include::modules/virt-creating-vm-instancetype.adoc[leveloffset=+2]
 
 include::modules/virt-creating-vm-cli.adoc[leveloffset=+1]
 
-// This should probably be moved somewhere else because it's a config.
-include::modules/virt-about-runstrategies-vms.adoc[leveloffset=+2]
-
 [id="additional-resources_virt-create-vms_{context}"]
 [role="_additional-resources"]
 == Additional resources
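
Reviewer note: the rules stated in the updated `virt-about-runstrategies-vms` module (four `runStrategy` values, mutual exclusivity with `running`, and which strategies are rescheduled after a failed machine health check) can be sketched as a small check. This is a minimal illustration and not part of the PR. The function names are hypothetical, the spec is assumed to be an already-parsed mapping, and the values are assumed to be case-sensitive, as the `always` to `Always` corrections in this change suggest.

```python
from typing import Any, List, Mapping, Optional

# The four values listed in the module text.
RUN_STRATEGIES = {"Always", "RerunOnFailure", "Manual", "Halted"}

# Per the NOTE in the module: these are rescheduled on a new node when a node
# fails the machine health check on installer-provisioned infrastructure.
RESCHEDULED_ON_NODE_FAILURE = {"Always", "RerunOnFailure"}


def check_run_settings(spec: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in a VirtualMachine spec (hypothetical helper)."""
    problems = []
    if "runStrategy" in spec and "running" in spec:
        problems.append("runStrategy and running are mutually exclusive")
    strategy = spec.get("runStrategy")
    # Assumption: values are case-sensitive, so `always` is rejected.
    if strategy is not None and strategy not in RUN_STRATEGIES:
        problems.append(f"unknown runStrategy value: {strategy!r}")
    return problems


def effective_strategy(spec: Mapping[str, Any]) -> Optional[str]:
    """Map `running: true/false` to the equivalent run strategy named in the module."""
    if "runStrategy" in spec:
        return spec["runStrategy"]
    if "running" in spec:
        return "Always" if spec["running"] else "Halted"
    return None


def is_rescheduled_after_node_failure(spec: Mapping[str, Any]) -> bool:
    """True if the module's NOTE says the VM is rescheduled on a new node."""
    return effective_strategy(spec) in RESCHEDULED_ON_NODE_FAILURE
```

For example, `check_run_settings({"runStrategy": "always"})` reports an unknown value under the case-sensitivity assumption, which is the kind of mismatch the capitalization fixes in this PR address in the docs.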
