Commit 9cb64a1

Update troubleshoot-node-cpu-pressure-psi.md
1 parent 09e831a commit 9cb64a1

1 file changed

+11
-11
lines changed


support/azure/azure-kubernetes/availability-performance/troubleshoot-node-cpu-pressure-psi.md

Lines changed: 11 additions & 11 deletions
@@ -1,6 +1,6 @@
 ---
 title: Troubleshoot CPU Pressure in AKS Clusters Using PSI Metrics
-description: Provides troubleshoot guidance for CPU pressure using PSI metrics in an AKS cluster.
+description: Provides troubleshoot guidance on CPU pressure using PSI metrics in an AKS cluster.
 ms.date: 05/21/2025
 ms.reviewer: aritraghosh, dafell, alvinli, v-weizhu
 ms.service: azure-kubernetes-service
@@ -21,8 +21,8 @@ The following table outlines the common symptoms of CPU pressure:
 |Symptom | Description |
 |---|---|
 |Increased application latency|Services respond slower even when CPU utilization appears moderate.|
-|Throttled containers|Containers experience delays in processing despite having CPU resources available on the node|
-|Degraded performance|Applications experience unpredictable performance variations that don't correlate with CPU usage percentages|
+|Throttled containers|Containers experience delays in processing despite having CPU resources available on the node.|
+|Degraded performance|Applications experience unpredictable performance variations that don't correlate with CPU usage percentages.|

 ## Troubleshooting checklist

@@ -45,20 +45,20 @@ Azure Monitoring Managed Prometheus provides a way to monitor PSI metrics:

 2. Navigate to the Azure Monitor workspace associated with the AKS cluster from the [Azure portal](https://portal.azure.com).

-:::image type="content" source="media/troubleshoot-node-cpu-pressure-psi/configure-azure-monitor-for-containers.png" alt-text="Screenshow that shows how to navigate to the Azure Monitor workspace." lightbox="media/troubleshoot-node-cpu-pressure-psi/configure-azure-monitor-for-containers.png":::
+:::image type="content" source="media/troubleshoot-node-cpu-pressure-psi/configure-azure-monitor-for-containers.png" alt-text="Screenshot that shows how to navigate to the Azure Monitor workspace." lightbox="media/troubleshoot-node-cpu-pressure-psi/configure-azure-monitor-for-containers.png":::

 3. Under **Monitoring**, select **Metrics**.

 4. Select **Prometheus metrics** as the data source.

 > [!NOTE]
-> The metrics need to be enabled in Azure Monitoring Managed Prometheus for it to be available. These metrics are exposed by Node Exporter or cAdvisor.
+> To use the metrics, you need to enable them in Azure Monitoring Managed Prometheus. These metrics are exposed by Node Exporter or cAdvisor.

 5. Query specific PSI metrics in Prometheus explorer:

 - For node-level CPU pressure, use the `node_pressure_cpu_waiting_seconds_total` Prometheus Query Language (PromQL).

-:::image type="content" source="media/troubleshoot-node-cpu-pressure-psi/node-level-cpu-pressure.png" alt-text="Screenshow that shows how to query node-level CPU pressure." lightbox="media/troubleshoot-node-cpu-pressure-psi/node-level-cpu-pressure.png":::
+:::image type="content" source="media/troubleshoot-node-cpu-pressure-psi/node-level-cpu-pressure.png" alt-text="Screenshot that shows how to query node-level CPU pressure." lightbox="media/troubleshoot-node-cpu-pressure-psi/node-level-cpu-pressure.png":::

 - For pod-level CPU pressure, use the `container_cpu_cfs_throttled_seconds_total` PromQL.

@@ -67,7 +67,7 @@ Azure Monitoring Managed Prometheus provides a way to monitor PSI metrics:
 `rate(node_pressure_cpu_waiting_seconds_total[5m]) * 100`

 > [!NOTE]
-> Some of the container level metrics such as `container_pressure_cpu_waiting_seconds_total` and `container_pressure_cpu_stalled_seconds_total` aren't available in AKS as they're part of the Kubelet PSI feature gate which is in alpha state. AKS begins supporting the use of the feature when it reaches beta stage.
+> Some of the container level metrics such as `container_pressure_cpu_waiting_seconds_total` and `container_pressure_cpu_stalled_seconds_total` aren't available in AKS as they're part of the Kubelet PSI feature gate that is in alpha state. AKS begins supporting the use of the feature when it reaches beta stage.

 ### [Command Line](#tab/command-line)

@@ -121,17 +121,17 @@ Review the following table to learn how to implement best practices for avoiding
 ## Key PSI metrics to monitor

 > [!NOTE]
-> If a node's CPU usage is moderate, but the containers on the node experience CFS throttling, consider increasing the resource limits or removing them and following [Linux's Completely Fair Scheduler (CFS)](https://docs.kernel.org/scheduler/sched-design-CFS.html) algorithm.
+> If a node's CPU usage is moderate but the containers on the node experience CFS throttling, increase the resource limits, or remove them and follow [Linux's Completely Fair Scheduler (CFS)](https://docs.kernel.org/scheduler/sched-design-CFS.html) algorithm.

 ### Node-level PSI metrics

-- `node_pressure_cpu_waiting_seconds_total`: Cumulative time tasks have been waiting for CPU.
+- `node_pressure_cpu_waiting_seconds_total`: Cumulative time tasks wait for CPU.
 - `node_cpu_seconds_total`: Traditional CPU utilization for comparison.

 ### Container-level PSI indicators

-- `container_cpu_cfs_throttled_periods_total`: The number of periods a container has been throttled.
-- `container_cpu_cfs_throttled_seconds_total`: Total time a container has been throttled.
+- `container_cpu_cfs_throttled_periods_total`: The number of periods a container is throttled.
+- `container_cpu_cfs_throttled_seconds_total`: Total time a container is throttled.
 - Throttling percentage: `rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) * 100`

 ## Why using PSI metrics?
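
Note on the expressions shown in the hunks above: `rate(node_pressure_cpu_waiting_seconds_total[5m]) * 100` and the throttling percentage both reduce to differences between two samples of cumulative counters. The following Python sketch mirrors that arithmetic as an illustration only. The function names and the sample values are hypothetical, and unlike PromQL's `rate()`, the sketch assumes the counters did not reset between the two samples and does no extrapolation.

```python
def cpu_pressure_percent(waiting_start, waiting_end, interval_seconds):
    """Percentage of the interval during which tasks waited for CPU.

    Approximates rate(node_pressure_cpu_waiting_seconds_total[5m]) * 100
    from two raw samples of the counter (both in seconds).
    Assumption: no counter reset between the two samples.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (waiting_end - waiting_start) / interval_seconds * 100


def throttling_percent(throttled_start, throttled_end, periods_start, periods_end):
    """Share of CFS periods in which a container was throttled, in percent.

    Both rates in the PromQL expression cover the same window, so their
    ratio equals the ratio of the raw counter deltas.
    Assumption: no counter reset between the two samples.
    """
    total_periods = periods_end - periods_start
    if total_periods <= 0:
        # No CFS periods elapsed in the window, so nothing was throttled.
        return 0.0
    return (throttled_end - throttled_start) / total_periods * 100


if __name__ == "__main__":
    # Hypothetical samples taken 300 seconds (5 minutes) apart.
    print(cpu_pressure_percent(1200.0, 1230.0, 300))
    print(throttling_percent(120, 180, 1000, 1300))
```

With these made-up samples, tasks waited 30 of 300 seconds and the container was throttled in 60 of 300 periods. Real values would come from the Prometheus queries described in the changed file.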
