Commit 99db71f

Update new Data Path Availability metric behavior
1 parent 90b6e64 commit 99db71f

File tree: 1 file changed (+18, -20 lines)


articles/load-balancer/troubleshoot-rhc.md

This article is a guide to investigating issues impacting the availability of your load balancer frontend IP and backend resources.

The Resource Health Check (RHC) for Azure Load Balancer is used to determine the health of your load balancer. It analyzes the Data Path Availability metric to determine whether the load-balancing endpoints, the combinations of frontend IP and frontend ports with load-balancing rules, are available.
> [!NOTE]
> RHC is not supported for Basic SKU Load Balancer.

The following table describes the RHC logic used to determine the health status of your load balancer.

| Resource health status | Description |
| --- | --- |
| Available | Your load balancer resource is healthy and available. |
| Degraded | Your load balancer has platform or user-initiated events impacting performance. The Data Path Availability metric reported less than 90% but greater than 25% health for at least two minutes. You may be experiencing moderate to severe performance degradation. |
| Unavailable | Your load balancer resource isn't healthy. The Data Path Availability metric reported less than 25% health for at least two minutes. You may be experiencing significant performance degradation or a lack of availability for inbound connectivity. There can be user or platform events causing unavailability. |
| Unknown | Resource health status for your load balancer resource hasn't updated or received Data Path Availability information in the last 10 minutes. This state may be transient, or your load balancer might not support RHC. |

## Monitoring your load balancer availability

The two metrics used are *Data Path Availability* and *Health Probe Status*. It's important to understand what each measures to derive correct insights.

## Data Path Availability

The Data Path Availability metric is generated by a TCP ping every 25 seconds on all frontend ports that have load-balancing rules configured. This TCP ping is routed to any of the healthy (probed up) backend instances. The metric is an aggregated percentage success rate of TCP pings on each frontend IP:port combination for each of your load-balancing rules, across a sample period of time.
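As a rough mental model (not the platform's actual implementation), the average aggregation of this metric behaves like a success ratio over the sampled TCP pings:

```python
def data_path_availability(ping_results):
    """Average aggregation of Data Path Availability over a sample period.

    ping_results: one boolean per 25-second TCP ping (True = response
    received). Returns the percentage success rate, or None when no
    samples exist. Illustrative model only.
    """
    if not ping_results:
        return None  # no samples -> no data point, not 0%
    return 100.0 * sum(ping_results) / len(ping_results)

# Example: a 5-minute window yields 12 pings; 3 go unanswered.
window = [True] * 9 + [False] * 3
print(data_path_availability(window))  # 75.0
```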

## Health Probe Status

The Health Probe Status metric is generated by a ping of the protocol defined in the health probe. This ping is sent to each instance in the backend pool, on the port defined in the health probe. For HTTP and HTTPS probes, a successful ping requires an HTTP 200 OK response, whereas with TCP probes any response is considered successful. The health of each backend instance is determined once the probe reaches the number of consecutive successes or failures configured in the probe threshold property. The health status of each backend instance determines whether or not the instance is allowed to receive traffic. Similar to the Data Path Availability metric, the Health Probe Status metric aggregates the average successful/total pings during the sampling interval. The Health Probe Status value indicates the backend health in isolation from your load balancer by probing your backend instances without sending traffic through the frontend.
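The consecutive-success/failure behavior can be sketched as a small state machine. The `threshold` parameter stands in for the probe threshold property mentioned above; this is an illustrative sketch, not the platform's actual logic:

```python
def evaluate_probe_history(results, threshold=1):
    """Walk a sequence of probe results (True = success) and return the
    final health state of a backend instance, flipping state only after
    `threshold` consecutive opposite results. Illustrative sketch only."""
    state = "up"  # assume the instance starts healthy
    last = None
    streak = 0
    for ok in results:
        outcome = "up" if ok else "down"
        if outcome == last:
            streak += 1
        else:
            last, streak = outcome, 1
        if outcome != state and streak >= threshold:
            state = outcome
    return state

# With threshold=2, a single failed probe doesn't mark the instance down:
print(evaluate_probe_history([True, False, True], threshold=2))   # up
print(evaluate_probe_history([True, False, False], threshold=2))  # down
```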

> [!IMPORTANT]
> Health Probe Status is sampled on a one-minute basis. This can lead to minor fluctuations in an otherwise steady value. For example, in active/passive scenarios where there are two backend instances, one probed up and one probed down, the health probe service may capture 7 samples for the unhealthy instance and 6 for the healthy instance. This leads to a previously steady value of 50 showing as 46.15 for a one-minute interval.

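The arithmetic behind that fluctuation, with the sample counts assumed here, works out as follows:

```python
# One-minute window, two backend instances: the unhealthy instance happens
# to be sampled 7 times and the healthy instance 6 times (assumed counts).
healthy_successes = 6
unhealthy_failures = 7
total_samples = healthy_successes + unhealthy_failures  # 13

value = 100 * healthy_successes / total_samples
print(round(value, 2))  # 46.15

# With an even 6/6 split the value would be the steady 50.0.
print(100 * 6 / 12)  # 50.0
```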
## Diagnose degraded and unavailable load balancers

As outlined in the [resource health article](load-balancer-standard-diagnostics.md#resource-health-status), a degraded load balancer is one that shows between 25% and 90% Data Path Availability. An unavailable load balancer is one with less than 25% Data Path Availability, over a two-minute period. The same steps can be taken to investigate the failure you see in any Health Probe Status or Data Path Availability alerts you've configured. We explore the case where we've checked our resource health and found our load balancer to be unavailable with a Data Path Availability of 0%: our service is down.
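The thresholds above can be summarized in a small helper. The exact boundary handling (for example at exactly 25% or 90%) is an assumption here, not documented behavior:

```python
def resource_health(avg_availability):
    """Map a two-minute average Data Path Availability percentage to an
    RHC status, per the table earlier in this article. Boundary handling
    at exactly 25% and 90% is assumed for illustration."""
    if avg_availability is None:
        return "Unknown"  # no data received recently
    if avg_availability >= 90:
        return "Available"
    if avg_availability > 25:
        return "Degraded"
    return "Unavailable"

print(resource_health(100))  # Available
print(resource_health(60))   # Degraded
print(resource_health(0))    # Unavailable
```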

First, we go to the detailed metrics view of our load balancer insights page in the Azure portal. Access the view from your load balancer resource page or the link in your resource health message. Next, we navigate to the Frontend and Backend availability tab and review a thirty-minute window of the time period when the degraded or unavailable state occurred. If we see our Data Path Availability is 0%, we know there's an issue preventing traffic for all of our load-balancing rules, and we can see how long this issue has lasted.

The next place we need to look is our Health Probe Status metric, to determine whether our data path is unavailable because we have no healthy backend instances to serve traffic. If we have at least one healthy backend instance for all of our load-balancing and inbound rules, we know it isn't our configuration causing our data paths to be unavailable. This scenario indicates an Azure platform issue. While platform issues are rare, an automated alert is sent to our team to rapidly resolve all platform issues.

## Diagnose health probe failures

If your Health Probe Status metric reflects that your backend instances are unhealthy, we recommend following this checklist to rule out common configuration errors:

* Check the CPU utilization for your resources to determine if they're under high load.
  * You can check this by viewing the resource's Percentage CPU metric via the Metrics page. Learn how to [Troubleshoot high-CPU issues for Azure virtual machines](/troubleshoot/azure/virtual-machines/troubleshoot-high-cpu-issues-azure-windows-vm).
* If using an HTTP or HTTPS probe, check whether the application is healthy and responsive.
  * Validate your application is functional by directly accessing it through the private IP address or instance-level public IP address associated with your backend instance.
* Review the Network Security Groups applied to your backend resources. Ensure that there are no rules with a higher priority than *AllowAzureLoadBalancerInBound* that block the health probe.
  * You can do this by visiting the Networking settings of your backend VMs or Virtual Machine Scale Sets.
  * If you find this NSG issue is the case, move the existing Allow rule or create a new high-priority rule to allow AzureLoadBalancer traffic.
* Ensure you're using the right protocol. For example, a probe using HTTP to probe a port listening for a non-HTTP application fails.
* Azure Firewall shouldn't be placed in the backend pool of load balancers. See [Integrate Azure Firewall with Azure Standard Load Balancer](../firewall/integrate-lb.md) to properly integrate Azure Firewall with load balancer.
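When working through the checklist, it can help to reproduce from another machine in the virtual network what the probe would see. This sketch mirrors, rather than reuses, the probe semantics described earlier; the addresses in the example are placeholders:

```python
import http.client
import socket

def tcp_probe(host, port, timeout=5):
    """A TCP probe succeeds if the connection is accepted at all."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_probe(host, port, path="/", timeout=5):
    """An HTTP probe succeeds only on an HTTP 200 OK response."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", path)
        ok = conn.getresponse().status == 200
        conn.close()
        return ok
    except OSError:
        return False

# Example (placeholder backend address and probe path):
# print(tcp_probe("10.0.0.4", 80), http_probe("10.0.0.4", 80, "/health"))
```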

## Next steps

* [Learn more about the Azure Load Balancer health probe](load-balancer-custom-probe-overview.md)
