Commit 70332e4

author
Daniel Chadwick
committed
OSDOCS-7590: Adding metrics dashboard about info
1 parent 7b8a9f1 commit 70332e4

15 files changed

+266
-0
lines changed

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -2420,6 +2420,8 @@ Topics:
24202420
Topics:
24212421
- Name: Adding worker nodes to single-node OpenShift clusters
24222422
File: nodes-sno-worker-nodes
2423+
- Name: Node metrics dashboard
2424+
File: nodes-dashboard-using
24232425
---
24242426
Name: Windows Container Support for OpenShift
24252427
Dir: windows_containers
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes-dashboard-using.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="nodes-dashboard-using-about_{context}"]
7+
= About the node metrics dashboard
8+
9+
The node metrics dashboard enables administrative and support team members to monitor metrics related to pod scaling, including the scaling limits used to diagnose and troubleshoot scaling issues. In particular, you can use the visual analytics displayed through the dashboard to monitor workload distributions across nodes. Insights gained from these analytics help you determine the health of your CRI-O and Kubelet system components and identify potential sources of excessive or imbalanced resource consumption and system instability.
10+
11+
The dashboard displays visual analytics widgets organized into the following categories:
12+
13+
Critical:: Includes visualizations that can help you identify node issues that could result in system instability and inefficiency
14+
Outliers:: Includes histograms that visualize processes with runtime durations that fall outside of the 95th percentile
15+
Average durations:: Helps you track changes in the time that system components take to process operations
16+
Number of operations:: Displays visualizations that help you identify changes in the number of operations being run, which in turn helps you determine the load balance and efficiency of your system
Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes-dashboard-using.adoc
4+
5+
:_content-type: PROCEDURE
6+
[id="nodes-dashboard-using-accessing_{context}"]
7+
= Accessing the node metrics dashboard
8+
9+
You can access the node metrics dashboard from the *Administrator* perspective.
10+
11+
.Procedure
12+
13+
. Expand the *Observe* menu option and select *Dashboards*.
14+
. Under the *Dashboard* filter, select *Node cluster*.
15+
16+
[NOTE]
17+
====
18+
If no data appears in the visualizations under the *Critical* category, no critical anomalies were detected. The dashboard is working as intended.
19+
====
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes-dashboard-using.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="nodes-dashboard-using-identify-critical-cpu-crio"]
7+
= Nodes with CRI-O system reserved CPU utilization > 50%
8+
9+
The *Nodes with CRI-O system reserved CPU utilization > 50%* query identifies nodes where the CRI-O system reserved CPU utilization has exceeded 50% in the last 5 minutes. The query monitors CPU resource consumption by CRI-O, your container runtime, on a per-node basis.
10+
11+
.Example default query
12+
----
13+
sum by (node) (rate(container_cpu_usage_seconds_total{id="/system.slice/crio.service"}[5m]) * 100) / sum by (node) (kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}) >= 50
14+
----
15+
16+
This query allows for quick identification of nodes where CRI-O CPU consumption is abnormally high, which could negatively impact pod performance. If this query returns a node, CRI-O is using at least half of the system reserved CPU on that node, which suggests potential issues with the container runtime, pod configuration, or resources.
17+
18+
Investigate further by checking your pod configurations and allocated resources. Make sure that they align with your system capabilities. If you still see high utilization, explore metrics panels from other categories on the dashboard to determine the state of your system components.
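The arithmetic behind this query can be sketched in Python. This is a simplified illustration, not how Prometheus evaluates the expression: it assumes you already have two readings of the CPU-seconds counter, and it ignores the extrapolation that `rate()` applies. The function name and the sample numbers are hypothetical.

```python
def reserved_cpu_utilization(cpu_seconds_start, cpu_seconds_end, window_seconds,
                             capacity_cores, allocatable_cores):
    """Return CPU usage as a percentage of system reserved CPU."""
    # Average cores consumed over the window, similar in spirit to
    # rate(container_cpu_usage_seconds_total[5m]) on a counter.
    cores_used = (cpu_seconds_end - cpu_seconds_start) / window_seconds
    # System reserved CPU is the part of capacity that is held back from pods.
    reserved_cores = capacity_cores - allocatable_cores
    if reserved_cores <= 0:
        raise ValueError("node reports no system reserved CPU")
    return cores_used * 100 / reserved_cores


# Hypothetical node: 4 cores capacity and 3.5 allocatable, so 0.5 cores reserved.
# CRI-O consumed 90 CPU-seconds over a 300-second window, or 0.3 cores.
utilization = reserved_cpu_utilization(1000.0, 1090.0, 300, 4.0, 3.5)
flagged = utilization >= 50  # same threshold as the default query
```

In this made-up case, CRI-O uses about 60% of the system reserved CPU, so the node would appear in the panel.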
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes-dashboard-using.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="nodes-dashboard-using-identify-critical-cpu-kubelet.adoc"]
7+
= Nodes with Kubelet system reserved CPU utilization > 50%
8+
9+
The *Nodes with Kubelet system reserved CPU utilization > 50%* query identifies nodes where the Kubelet is using 50% or more of the system reserved CPU, based on the rate of CPU usage in the last 5 minutes.
10+
11+
.Example default query
12+
----
13+
sum by (node) (rate(container_cpu_usage_seconds_total{id="/system.slice/kubelet.service"}[5m]) * 100) / sum by (node) (kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}) >= 50
14+
----
15+
16+
The Kubelet uses the system reserved CPU for its own operations and for running critical system services. For the node's health, it is important to ensure that the Kubelet's usage of system reserved CPU does not exceed the 50% threshold. Exceeding this limit could indicate heavy utilization or load on the Kubelet, which affects node stability and potentially the performance of the entire Kubernetes cluster.
17+
18+
If any node is displayed in this metric, the Kubelet and the system overall are under heavy load. You can reduce overload on a particular node by balancing the load across other nodes in the cluster. Check other query metrics under the *Outliers*, *Average durations*, and *Number of operations* categories to gain further insights and take necessary corrective action.
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes-dashboard-using.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="nodes-dashboard-using-identify-critical-cpu.adoc"]
7+
= Nodes with system reserved CPU utilization > 80%
8+
9+
The *Nodes with system reserved CPU utilization > 80%* query identifies nodes where the system reserved CPU utilization is 80% or more. The query calculates the rate of CPU usage by system processes in the last 5 minutes and compares that to the system reserved CPU on each node, which is the node's CPU capacity minus its allocatable CPU. If the ratio reaches or exceeds 80%, the node's result is displayed in the metric.
10+
11+
.Example default query
12+
----
13+
sum by (node) (rate(container_cpu_usage_seconds_total{id="/system.slice"}[5m]) * 100) / sum by (node) (kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}) >= 80
14+
----
15+
16+
This query indicates a critical level of system-reserved CPU usage, which can lead to resource exhaustion. High system-reserved CPU usage can result in the inability of the system processes (including the Kubelet and CRI-O) to adequately manage resources on the node. This query can indicate excessive system processes or misconfigured CPU allocation.
17+
18+
Potential corrective measures include rebalancing workloads to other nodes or increasing the CPU resources allocated to the nodes. Investigate the cause of the high system CPU utilization and review the corresponding metrics in the *Outliers*, *Average durations*, and *Number of operations* categories for additional insights into the node's behavior.
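If you run the same expression outside the console, for example through the Prometheus HTTP API, the `>= 80` comparison acts as a filter, so only nodes at or above the threshold are returned. The following Python sketch reads such a result. It assumes that the response was already fetched and decoded from JSON in the instant-vector format; the function name, node name, and value are made up for illustration.

```python
def flagged_nodes(response):
    """Map each node in an instant-vector query response to its value."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    nodes = {}
    for sample in data["result"]:
        # Each sample carries its labels and a [timestamp, "value"] pair.
        _timestamp, value = sample["value"]
        nodes[sample["metric"]["node"]] = float(value)
    return nodes


# Made-up response for illustration; the node name and value are not real data.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"node": "worker-0.example.com"},
             "value": [1700000000.0, "86.4"]},
        ],
    },
}
nodes = flagged_nodes(sample_response)
```

An empty result means that no node crossed the threshold, which matches an empty panel on the dashboard.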
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes-dashboard-using.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="nodes-dashboard-using-identify-critical-memory-crio.adoc"]
7+
= Nodes with CRI-O system reserved memory utilization > 50%
8+
9+
The *Nodes with CRI-O system reserved memory utilization > 50%* query identifies all nodes where CRI-O uses 50% or more of the system reserved memory. In this case, memory usage is defined by the resident set size (RSS), which is the portion of the CRI-O system's memory held in RAM.
10+
11+
.Example default query
12+
----
13+
sum by (node) (container_memory_rss{id="/system.slice/crio.service"}) / sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"}) * 100 >= 50
14+
----
15+
16+
This query helps you monitor the status of memory reserved for the CRI-O system on each node. High utilization could indicate a lack of available resources and potential performance issues. If CRI-O memory usage exceeds the advised limit of 50%, at least half of the system reserved memory on that node is being used by CRI-O.
17+
18+
Check memory allocation and usage and assess whether memory resources need to be shifted or increased to prevent possible node instability. You can also examine the metrics under the *Outliers*, *Average durations*, and *Number of operations* categories to gain further insights.
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes-dashboard-using.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="nodes-dashboard-using-identify-critical-memory-kubelet.adoc"]
7+
= Nodes with Kubelet system reserved memory utilization > 50%
8+
9+
The *Nodes with Kubelet system reserved memory utilization > 50%* query indicates nodes where the Kubelet's system reserved memory utilization exceeds 50%. The query examines the memory that the Kubelet process itself is consuming on a node.
10+
11+
.Example default query
12+
----
13+
sum by (node) (container_memory_rss{id="/system.slice/kubelet.service"}) / sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"}) * 100 >= 50
14+
----
15+
16+
This query helps you identify any possible memory pressure situations in your nodes that could affect the stability and efficiency of node operations. Kubelet memory utilization that consistently exceeds 50% of the system reserved memory indicates that the system reserved settings are not configured properly and that there is a high risk of the node becoming unstable.
17+
18+
If this metric is highlighted, review your configuration policy and consider adjusting the system reserved settings or the resource limits settings for the Kubelet. Additionally, if your Kubelet memory utilization consistently exceeds half of your total reserved system memory, examine metrics under the *Outliers*, *Average durations*, and *Number of operations* categories to gain further insights for more precise diagnostics.
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes-dashboard-using.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="nodes-dashboard-using-identify-critical-memory.adoc"]
7+
= Nodes with system reserved memory utilization > 80%
8+
9+
The *Nodes with system reserved memory utilization > 80%* query calculates the percentage of system reserved memory that is utilized for each node. The calculation divides the total resident set size (RSS) by the system reserved memory, which is the total memory capacity of the node minus the allocatable memory. RSS is the portion of the system's memory occupied by a process that is held in main memory (RAM). Nodes are flagged if their resulting value equals or exceeds an 80% threshold.
10+
11+
.Example default query
12+
----
13+
sum by (node) (container_memory_rss{id="/system.slice"}) / sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"}) * 100 >= 80
14+
----
15+
16+
System reserved memory is crucial for a Kubernetes node because it is used to run system daemons and Kubernetes system daemons. System reserved memory utilization that exceeds 80% indicates that the system and Kubernetes daemons are consuming too much memory and can suggest node instability that could affect the performance of running pods. Excessive memory consumption can trigger the Out-of-Memory (OOM) killer, which can terminate critical system processes to free up memory.
17+
18+
If a node is flagged by this metric, identify which system or Kubernetes processes are consuming excessive memory and take appropriate actions to mitigate the situation. These actions may include scaling back non-critical processes, optimizing program configurations to reduce memory usage, or upgrading node systems to hardware with greater memory capacity. You can also review the metrics under the *Outliers*, *Average durations*, and *Number of operations* categories to gain further insights into node performance.
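As a worked example of the calculation, the following Python sketch applies the same division to a single node. The function name and the node sizes are assumptions chosen for illustration. Unlike the CPU queries, memory RSS is a point-in-time value, so no rate over a time window is involved.

```python
def reserved_memory_percent(rss_bytes, capacity_bytes, allocatable_bytes):
    """Return RSS as a percentage of system reserved memory."""
    # System reserved memory is capacity minus allocatable, in that order.
    reserved_bytes = capacity_bytes - allocatable_bytes
    if reserved_bytes <= 0:
        raise ValueError("node reports no system reserved memory")
    return rss_bytes / reserved_bytes * 100


MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical node: 16 GiB capacity and 15 GiB allocatable, so 1 GiB reserved.
# The /system.slice cgroup holds 896 MiB in RAM.
percent = reserved_memory_percent(896 * MIB, 16 * GIB, 15 * GIB)
flagged = percent >= 80  # same threshold as the default query
```

With these numbers, system processes use 87.5% of the reserved memory, so the node would be flagged.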
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * nodes/nodes-dashboard-using.adoc
4+
5+
:_content-type: CONCEPT
6+
[id="nodes-dashboard-using-identify-critical-pulls.adoc"]
7+
= Failure rate for image pulls in the last hour
8+
9+
The *Failure rate for image pulls in the last hour* query divides the total number of failed image pulls by the sum of successful and failed image pulls to provide a ratio of failures.
10+
11+
.Example default query
12+
----
13+
rate(container_runtime_crio_image_pulls_failure_total[1h]) / (rate(container_runtime_crio_image_pulls_success_total[1h]) + rate(container_runtime_crio_image_pulls_failure_total[1h]))
14+
----
15+
16+
Understanding the failure rate of image pulls is crucial for maintaining the health of the node. A high failure rate might indicate networking issues, storage problems, misconfigurations, or other issues that could disrupt pod density and the deployment of new containers.
17+
18+
If the outcome of this query is high, investigate possible causes such as network connections, the availability of remote repositories, node storage, and the accuracy of image references. You can also review the metrics under the *Outliers*, *Average durations*, and *Number of operations* categories to gain further insights.
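The ratio can be sketched in Python as follows. This is a simplified illustration that works on raw counts for the hour instead of per-second rates; because both rates cover the same window, the ratio is the same. The function name and the sample counts are hypothetical. Note that the result is a fraction between 0 and 1, not a percentage.

```python
import math


def image_pull_failure_ratio(failures, successes):
    """Return failed pulls as a fraction of all pulls in the window."""
    total = successes + failures
    if total == 0:
        # No pulls at all: mirror floating-point 0/0 and report NaN
        # instead of raising ZeroDivisionError.
        return math.nan
    return failures / total


# Hypothetical hour: 5 failed pulls and 95 successful pulls.
ratio = image_pull_failure_ratio(5, 95)
idle = image_pull_failure_ratio(0, 0)
```

In this made-up hour, 5 of 100 pulls failed, so the ratio is about 0.05. An hour with no pulls has no meaningful ratio.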
