
Commit a13159d

Merge pull request #95058 from MGoedtel/Task1632556
Task1632556 CI Health Feature
2 parents ff47be2 + a335001 commit a13159d

17 files changed, 189 additions and 3 deletions

articles/azure-monitor/insights/container-insights-analyze.md

Lines changed: 2 additions & 0 deletions
@@ -315,3 +315,5 @@ You access these workbooks by selecting each one from the **View Workbooks** dro
315315
- Review [Create performance alerts with Azure Monitor for containers](container-insights-alerts.md) to learn how to create alerts for high CPU and memory utilization to support your DevOps or operational processes and procedures.
316316

317317
- View [log query examples](container-insights-log-search.md#search-logs-to-analyze-data) to see predefined queries and examples to evaluate or customize to alert, visualize, or analyze your clusters.
318+
319+
- View [monitor cluster health](container-insights-health.md) to learn about viewing the health status of your Kubernetes cluster.
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
1+
---
2+
title: Azure Monitor for containers health monitors configuration | Microsoft Docs
3+
description: This article provides content describing the detailed configuration of the health monitors in Azure Monitor for containers.
4+
services: azure-monitor
5+
documentationcenter: ''
6+
author: mgoedtel
7+
manager: carmonm
8+
editor:
9+
ms.assetid:
10+
ms.service: azure-monitor
11+
ms.topic: conceptual
12+
ms.workload: infrastructure-services
13+
ms.date: 11/12/2019
14+
ms.author: magoedte
15+
---
16+
17+
# Azure Monitor for containers health monitor configuration guide
18+
19+
Monitors are the primary element for measuring health and detecting errors in Azure Monitor for containers. This article helps you understand the concepts of how health is measured and the elements that comprise the health model to monitor and report on the health of your Kubernetes cluster with the [Health feature](container-insights-health.md).
20+
21+
## Monitors
22+
23+
A monitor measures the health of some aspect of a managed object. Each monitor has either two or three health states, and is in exactly one of them at any given time. When a monitor is loaded by the containerized agent, it is initialized to a healthy state. The state changes only if the specified conditions for another state are detected.
24+
25+
The overall health of a particular object is determined from the health of each of its monitors. This hierarchy is illustrated in the Health Hierarchy pane in Azure Monitor for containers. The policy for how health is rolled up is part of the configuration of the aggregate monitors.
26+
27+
## Types of monitors
28+
29+
|Monitor | Description |
30+
|--------|-------------|
31+
| Unit monitor |A unit monitor measures some aspect of a resource or application. This might be checking a performance counter to determine the performance of the resource, or its availability. |
32+
|Aggregate monitor | Aggregate monitors group multiple monitors to provide a single aggregated health state. Unit monitors are typically configured under a particular aggregate monitor. For example, a Node aggregate monitor rolls up the status of the Node CPU utilization, memory utilization, and Node status. |
34+
35+
### Aggregate monitor health rollup policy
36+
37+
Each aggregate monitor defines a health rollup policy, which is the logic that is used to determine the health of the aggregate monitor based on the health of the monitors under it. The possible health rollup policies for an aggregate monitor are as follows:
38+
39+
#### Worst state policy
40+
41+
The state of the aggregate monitor matches the state of the child monitor with the worst health state. This is the most common policy used by aggregate monitors.
42+
43+
![Example of aggregate monitor rollup worst state](./media/container-insights-health-monitoring-cfg/aggregate-monitor-rollup-worstof.png)
44+
45+
#### Percentage policy
46+
47+
The aggregate monitor matches the worst state found among a specified percentage of its child monitors that are in the best state. This policy is used when a certain percentage of child monitors must be healthy for the aggregate monitor to be considered healthy. The percentage policy sorts the child monitors in descending order of severity of state, and the aggregate monitor's state is computed as the worst state of N% of them (N is set by the configuration parameter *StateThresholdPercentage*).
48+
49+
For example, suppose there are five container instances of a container image, and their individual states are **Critical**, **Warning**, **Healthy**, **Healthy**, **Healthy**. The status of the container CPU utilization monitor will be **Critical**, since the worst state of 90% of the containers is **Critical** when sorted in descending order of severity.
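The two rollup policies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the agent's implementation: the function names are hypothetical, the member count is assumed to be rounded up (which is consistent with the five-container example above), and **Unknown** is ranked above **Critical** as described for the container monitors later in this article.

```python
import math

# Severity order, worst first. Ranking "Unknown" above "Critical" follows the
# container monitor description in this article.
SEVERITY = ["Unknown", "Critical", "Warning", "Healthy"]


def worst_of(states):
    """Worst state policy: return the most severe child state."""
    return min(states, key=SEVERITY.index)


def percentage(states, state_threshold_percentage):
    """Percentage policy: worst state among the best N% of the members."""
    ordered = sorted(states, key=SEVERITY.index)  # descending severity
    # Assumption: the number of members to consider is rounded up.
    count = max(math.ceil(len(ordered) * state_threshold_percentage / 100), 1)
    # The best `count` members are the last ones; the first of them is the worst.
    return ordered[len(ordered) - count]


containers = ["Critical", "Warning", "Healthy", "Healthy", "Healthy"]
print(worst_of(containers))        # Critical
print(percentage(containers, 90))  # Critical (90% of 5 rounds up to all 5)
print(percentage(containers, 80))  # Warning (the single worst member is ignored)
```

With a threshold of 90% and only five members, every member counts, so the percentage policy gives the same result as the worst state policy. Lowering the threshold to 80% lets one unhealthy member be tolerated.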
50+
51+
## Understand the monitoring configuration
52+
53+
Azure Monitor for containers includes a number of key monitoring scenarios that are configured as follows.
54+
55+
### Unit monitors
56+
57+
|**Monitor name** | **Monitor type** | **Description** | **Parameter** | **Value** |
58+
|-----------------|--------------|-----------------|---------------|-----------|
59+
|Node Memory Utilization |Unit monitor |This monitor evaluates the memory utilization of a node every minute, using the data reported by cAdvisor. |ConsecutiveSamplesForStateTransition<br> FailIfGreaterThanPercentage<br> WarnIfGreaterThanPercentage | 3<br> 90<br> 80 ||
60+
|Node CPU Utilization |Unit monitor |This monitor checks the CPU utilization of the node every minute, using the data reported by cAdvisor. | ConsecutiveSamplesForStateTransition<br> FailIfGreaterThanPercentage<br> WarnIfGreaterThanPercentage | 3<br> 90<br> 80 ||
61+
|Node Status |Unit monitor |This monitor checks node conditions reported by Kubernetes.<br> Currently the following node conditions are checked: Disk Pressure, Memory Pressure, PID Pressure, Out of Disk, Network unavailable, Ready status for the node.<br> Out of the above conditions, if either *Out of Disk* or *Network Unavailable* is **true**, the monitor changes to **Critical** state.<br> If any other conditions equal **true**, other than a **Ready** status, the monitor changes to a **Warning** state. | NodeConditionTypeForFailedState | outofdisk,networkunavailable ||
62+
|Container memory utilization |Unit monitor |This monitor reports the combined health status of the memory utilization (RSS) of the instances of the container.<br> It compares each sample to a threshold, and the number of consecutive samples required for a state transition is specified by the configuration parameter **ConsecutiveSamplesForStateTransition**.<br> Its state is calculated as the worst state of 90% (StateThresholdPercentage) of the container instances, sorted in descending order of severity of container health state (that is, Critical, Warning, Healthy).<br> If no record is received from a container instance, then the health state of the container instance is reported as **Unknown**, which has higher precedence in the sorting order than the **Critical** state.<br> Each individual container instance's state is calculated using the thresholds specified in the configuration. If the usage is over the critical threshold (90%), the instance is in a **Critical** state. If it is below the critical threshold (90%) but greater than the warning threshold (80%), the instance is in a **Warning** state. Otherwise, it is in a **Healthy** state. |ConsecutiveSamplesForStateTransition<br> FailIfLessThanPercentage<br> StateThresholdPercentage<br> WarnIfGreaterThanPercentage| 3<br> 90<br> 90<br> 80 ||
63+
|Container CPU utilization |Unit monitor |This monitor reports the combined health status of the CPU utilization of the instances of the container.<br> It compares each sample to a threshold, and the number of consecutive samples required for a state transition is specified by the configuration parameter **ConsecutiveSamplesForStateTransition**.<br> Its state is calculated as the worst state of 90% (StateThresholdPercentage) of the container instances, sorted in descending order of severity of container health state (that is, Critical, Warning, Healthy).<br> If no record is received from a container instance, then the health state of the container instance is reported as **Unknown**, which has higher precedence in the sorting order than the **Critical** state.<br> Each individual container instance's state is calculated using the thresholds specified in the configuration. If the usage is over the critical threshold (90%), the instance is in a **Critical** state. If it is below the critical threshold (90%) but greater than the warning threshold (80%), the instance is in a **Warning** state. Otherwise, it is in a **Healthy** state. |ConsecutiveSamplesForStateTransition<br> FailIfLessThanPercentage<br> StateThresholdPercentage<br> WarnIfGreaterThanPercentage| 3<br> 90<br> 90<br> 80 ||
64+
|System workload pods ready |Unit monitor |This monitor reports status based on the percentage of pods in a ready state in a given workload. Its state is set to **Critical** if less than 100% of the pods are in a **Healthy** state. |ConsecutiveSamplesForStateTransition<br> FailIfLessThanPercentage |2<br> 100 ||
65+
|Kube API status |Unit monitor |This monitor reports the status of the Kube API service. The monitor is in a critical state if the Kube API endpoint is unavailable. For this particular monitor, the state is determined by making a query to the 'nodes' endpoint for the kube-api server. Anything other than an OK response code changes the monitor to a **Critical** state. | No configuration properties |||
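The threshold and node condition logic described in the table can be sketched as follows. This is an illustrative sketch only: the class, function, and argument names are hypothetical, the exact way the agent applies *ConsecutiveSamplesForStateTransition* is assumed (here, a state change requires that many identical results in a row), and the threshold comparison follows the prose description (usage over the threshold). The article does not say how a false **Ready** condition is treated, so the sketch ignores it.

```python
from dataclasses import dataclass, field


def classify(percent, warn_if_greater_than=80.0, fail_if_greater_than=90.0):
    """Map one utilization sample to a state using the two thresholds."""
    if percent > fail_if_greater_than:
        return "Critical"
    if percent > warn_if_greater_than:
        return "Warning"
    return "Healthy"


@dataclass
class UnitMonitor:
    consecutive_samples: int = 3   # ConsecutiveSamplesForStateTransition
    state: str = "Healthy"         # monitors are initialized to a healthy state
    recent: list = field(default_factory=list)

    def add_sample(self, percent):
        """Record a sample; change state only after N identical results in a row."""
        self.recent = (self.recent + [classify(percent)])[-self.consecutive_samples:]
        if len(self.recent) == self.consecutive_samples and len(set(self.recent)) == 1:
            self.state = self.recent[0]
        return self.state


def node_status(conditions, failed_types=("outofdisk", "networkunavailable")):
    """Evaluate node conditions given as {condition name: bool}."""
    active = {name.lower() for name, value in conditions.items() if value}
    if active & set(failed_types):      # NodeConditionTypeForFailedState
        return "Critical"
    if active - {"ready"}:              # any other condition that is true
        return "Warning"
    return "Healthy"


monitor = UnitMonitor()
for sample in [95, 96, 85, 97, 98, 99]:
    print(sample, monitor.add_sample(sample))
print(node_status({"Ready": True, "MemoryPressure": True}))
```

In this run the monitor stays **Healthy** until the last sample, because the 85% sample interrupts the first run of critical readings. Requiring several consecutive samples keeps a single spike from changing the reported state.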
66+
67+
### Aggregate monitors
68+
69+
|**Monitor name** | **Description** | **Algorithm** |
70+
|-----------------|-----------------|---------------|
71+
|Node |This monitor is an aggregate of all the node monitors. It matches the state of the child monitor with the worst health state:<br> Node CPU utilization<br> Node memory utilization<br> Node Status | Worst of|
72+
|Node pool |This monitor reports combined health status of all nodes in the node pool *agentpool*. This is a three state monitor, whose state is based on the worst state of 80% of the nodes in the node pool, sorted in descending order of severity of node states (that is, Critical, Warning, Healthy).|Percentage |
73+
|Nodes (parent of Node pool) |This is an aggregate monitor of all the node pools. Its state is based on the worst state of its child monitors (that is, the node pools present in the cluster). |Worst of |
74+
|Cluster (parent of nodes/<br> Kubernetes infrastructure) |This is the parent monitor that matches the state of the child monitor with the worst health state, that is, Kubernetes infrastructure and nodes. |Worst of |
75+
|Kubernetes infrastructure |This monitor reports the combined health status of the managed infrastructure components of the cluster. Its status is calculated as the 'worst of' its child monitor states, that is, kube-system workloads and API Server status. |Worst of|
76+
|System workload |This monitor reports the health status of a kube-system workload. This monitor matches the state of the child monitor with the worst health state, that is, the **Pods in ready state** monitor and the containers in the workload. |Worst of |
77+
|Container |This monitor reports overall health status of a container in a given workload. This monitor matches the state of the child monitor with the worst health state, that is the **CPU utilization** and **Memory utilization** monitors. |Worst of |
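To show how the aggregate monitors in the table combine, the following sketch rolls up a small, simplified hierarchy. The dictionary layout and the `rollup` and `node` helpers are hypothetical, the rounding of the percentage is an assumption, and the Kubernetes infrastructure branch is reduced to two child states for brevity.

```python
import math

SEVERITY = ["Unknown", "Critical", "Warning", "Healthy"]  # worst first


def rollup(monitor):
    """Return a monitor's state, computing aggregates from their children."""
    if "state" in monitor:                     # unit monitor: state is given
        return monitor["state"]
    ordered = sorted((rollup(child) for child in monitor["children"]),
                     key=SEVERITY.index)
    percent = monitor.get("percentage", 100)   # 100 behaves like "worst of"
    count = max(math.ceil(len(ordered) * percent / 100), 1)
    return ordered[len(ordered) - count]


def node(cpu, memory, status):
    """Node aggregate monitor (worst of) with its three unit monitors."""
    return {"children": [{"state": cpu}, {"state": memory}, {"state": status}]}


node_pool = {"percentage": 80, "children": [   # Node pool: percentage policy
    node("Critical", "Healthy", "Healthy"),
    node("Healthy", "Healthy", "Healthy"),
    node("Healthy", "Healthy", "Healthy"),
    node("Healthy", "Healthy", "Healthy"),
    node("Healthy", "Healthy", "Healthy"),
]}
cluster = {"children": [
    {"children": [{"state": "Healthy"}, {"state": "Healthy"}]},  # Kubernetes infrastructure
    {"children": [node_pool]},                                   # Nodes
]}
print(rollup(cluster))  # Healthy: one critical node out of five is tolerated at 80%
```

If a second node in the pool became critical, the node pool, the Nodes monitor, and the cluster would all roll up to **Critical** under these assumptions.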
78+
79+
## Next steps
80+
81+
View [monitor cluster health](container-insights-health.md) to learn about viewing the health status of your Kubernetes cluster.
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
1+
---
2+
title: Monitor Kubernetes cluster health with Azure Monitor for containers | Microsoft Docs
3+
description: This article describes how you can view and analyze the health of your AKS and non-AKS clusters with Azure Monitor for containers.
4+
services: azure-monitor
5+
documentationcenter: ''
6+
author: mgoedtel
7+
manager: carmonm
8+
editor:
9+
ms.assetid:
10+
ms.service: azure-monitor
11+
ms.topic: conceptual
12+
ms.workload: infrastructure-services
13+
ms.date: 11/14/2019
14+
ms.author: magoedte
15+
---
16+
17+
# Understand Kubernetes cluster health with Azure Monitor for containers
18+
19+
Azure Monitor for containers monitors and reports the health status of the managed infrastructure components and all nodes running on any Kubernetes cluster that it supports. This experience extends beyond the cluster health status calculated and reported on the [multi-cluster view](container-insights-analyze.md#multi-cluster-view-from-azure-monitor). Based on curated metrics, you can now understand whether one or more nodes in the cluster are resource constrained, or whether a node or pod is unavailable in a way that could impact a running application in the cluster.
20+
21+
For information about how to enable Azure Monitor for containers, see [Onboard Azure Monitor for containers](container-insights-onboard.md).
22+
23+
## Overview
24+
25+
In Azure Monitor for containers, the Health feature provides proactive health monitoring of your Kubernetes cluster to help you identify and diagnose issues. It gives you the ability to view significant issues detected. Monitors evaluating the health of your cluster run on the containerized agent in your cluster, and the health data is written to the **KubeHealth** table in your Log Analytics workspace.
26+
27+
Kubernetes cluster health is based on a number of monitoring scenarios organized by the following Kubernetes objects and abstractions:
28+
29+
- Kubernetes infrastructure - provides a rollup of the Kubernetes API server, ReplicaSets, and DaemonSets running on nodes deployed in your cluster by evaluating CPU and memory utilization, and Pod availability.
30+
31+
![Kubernetes infrastructure health rollup view](./media/container-insights-health/health-view-kube-infra-01.png)
32+
33+
- Nodes - provides a rollup of the Node pools and state of individual Nodes in each pool, by evaluating CPU and memory utilization, and a Node's status as reported by Kubernetes.
34+
35+
![Nodes health rollup view](./media/container-insights-health/health-view-nodes-01.png)
36+
37+
Currently, only the status of a virtual kubelet is supported. The health state for CPU and memory utilization of virtual kubelet nodes is reported as **Unknown**, since a signal is not received from them.
38+
39+
All monitors are shown in a hierarchical layout in the Health Hierarchy pane, where an aggregate monitor representing the Kubernetes object or abstraction (that is, Kubernetes infrastructure or Nodes) is the top-most monitor, reflecting the combined health of all dependent child monitors. The key monitoring scenarios used to derive health are:
40+
41+
* Evaluate CPU utilization from the node and container.
42+
* Evaluate memory utilization from the node and container.
43+
* Status of Pods and Nodes based on calculation of their ready state reported by Kubernetes.
44+
45+
The icons used to indicate state are as follows:
46+
47+
|Icon|Meaning|
48+
|--------|-----------|
49+
|![Green check icon indicates healthy](./media/container-insights-health/healthyicon.png)|Success, health is OK (green)|
50+
|![Yellow triangle and exclamation mark is warning](./media/container-insights-health/warningicon.png)|Warning (yellow)|
51+
|![Red button with white X indicates critical state](./media/container-insights-health/criticalicon.png)|Critical (red)|
52+
|![Grayed-out icon](./media/container-insights-health/grayicon.png)|Unknown (gray)|
53+
54+
## Monitor configuration
55+
56+
To understand the behavior and configuration of each monitor supporting the Azure Monitor for containers Health feature, see [Health monitor configuration guide](container-insights-health-monitors-config.md).
57+
58+
## Sign in to the Azure portal
59+
60+
Sign in to the [Azure portal](https://portal.azure.com).
61+
62+
## View health of an AKS or non-AKS cluster
63+
64+
Access to the Azure Monitor for containers Health feature is available directly from an AKS cluster by selecting **Insights** from the left pane in the Azure portal. Under the **Insights** section, select **Containers**.
65+
66+
To view health from a non-AKS cluster, that is, an AKS Engine cluster hosted on-premises or on Azure Stack, select **Azure Monitor** from the left pane in the Azure portal. Under the **Insights** section, select **Containers**. On the multi-cluster page, select the non-AKS cluster from the list.
67+
68+
In Azure Monitor for containers, from the **Cluster** page, select **Health**.
69+
70+
![Cluster health dashboard example](./media/container-insights-health/container-insights-health-page.png)
71+
72+
## Review cluster health
73+
74+
When the Health page opens, by default **Kubernetes Infrastructure** is selected in the **Health Aspect** grid. The grid summarizes current health rollup state of Kubernetes infrastructure and cluster nodes. Selecting either health aspect updates the results in the Health Hierarchy pane (that is, the middle-pane) and shows all child monitors in a hierarchical layout, displaying their current health state. To view more information about any dependent monitor, you can select one and a property pane automatically displays on the right side of the page.
75+
76+
![Cluster health property pane](./media/container-insights-health/health-view-property-pane.png)
77+
78+
On the property pane, you learn the following:
79+
80+
- On the **Overview** tab, it shows the current state of the monitor selected, when the monitor was last calculated, and when the last state change occurred. Additional information is shown depending on the type of monitor selected in the hierarchy.
81+
82+
If you select an aggregate monitor in the Health Hierarchy pane, under the **Overview** tab on the property pane it shows a rollup of the total number of child monitors in the hierarchy, and how many aggregate monitors are in a critical, warning, and healthy state.
83+
84+
![Health property pane Overview tab for aggregate monitor](./media/container-insights-health/health-overview-aggregate-monitor.png)
85+
86+
    If you select a unit monitor in the Health Hierarchy pane, it also shows under **Last state change** the previous samples calculated and reported by the containerized agent within the last four hours. This is based on the unit monitor's calculation, which compares several consecutive values to determine its state. For example, if you selected the *Pod ready state* unit monitor, it shows the last two samples, as controlled by the parameter *ConsecutiveSamplesForStateTransition*. For more information, see the detailed description of [unit monitors](container-insights-health-monitors-config.md#unit-monitors).
87+
88+
![Health property pane Overview tab](./media/container-insights-health/health-overview-unit-monitor.png)
89+
90+
If the time reported by **Last state change** is a day or older, it is the result of no changes in state for the monitor. However, if the last sample received for a unit monitor is more than four hours old, this likely indicates the containerized agent has not been sending data. If the agent knows that a particular resource exists, for example a Node, but it hasn't received data from the Node's CPU or memory utilization monitors (as an example), then the health state of the monitor is set to **Unknown**.
91+
92+
- On the **Config** tab, it shows the default configuration parameter settings (only for unit monitors, not aggregate monitors) and their values.
93+
- On the **Knowledge** tab, it contains information explaining the behavior of the monitor and how it evaluates for the unhealthy condition.
94+
95+
Monitoring data on this page does not refresh automatically and you need to select **Refresh** at the top of the page to see the most recent health state received from the cluster.
96+
97+
## Next steps
98+
99+
View [log query examples](container-insights-log-search.md#search-logs-to-analyze-data) to see predefined queries and examples to evaluate or customize to alert, visualize, or analyze your clusters.