OSDOCS-7590: Adding metrics dashboard about info

Daniel Chadwick · Daniel Chadwick · commit 70332e4eef3a · 2023-10-27T14:13:57.000-04:00
diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml
@@ -2420,6 +2420,8 @@ Topics:
   Topics:
   - Name: Adding worker nodes to single-node OpenShift clusters
     File: nodes-sno-worker-nodes
+- Name: Node metrics dashboard
+  File: nodes-dashboard-using
 ---
 Name: Windows Container Support for OpenShift
 Dir: windows_containers
diff --git a/modules/nodes-dashboard-using-about.adoc b/modules/nodes-dashboard-using-about.adoc
@@ -0,0 +1,16 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes-dashboard-using.adoc
+
+:_content-type: CONCEPT
+[id="nodes-dashboard-using-about_{context}"]
+= About the node metrics dashboard
+
+The node metrics dashboard enables administrative and support team members to monitor metrics related to pod scaling, including scaling limits used to diagnose and troubleshoot scaling issues. In particular, you can use the visual analytics displayed through the dashboard to monitor workload distributions across nodes. Insights gained from these analytics help you determine the health of your CRI-O and Kubelet system components as well as identify potential sources of excessive or imbalanced resource consumption and system instability.
+
+The dashboard displays visual analytics widgets organized into the following categories: 
+
+Critical:: Includes visualizations that can help you identify node issues that could result in system instability and inefficiency
+Outliers:: Includes histograms that visualize processes with runtime durations that fall outside of the 95th percentile
+Average durations:: Helps you track change in the time that system components take to process operations
+Number of operations:: Displays visualizations that help you identify changes in the number of operations being run, which in turn helps you determine the load balance and efficiency of your system
diff --git a/modules/nodes-dashboard-using-accessing.adoc b/modules/nodes-dashboard-using-accessing.adoc
@@ -0,0 +1,19 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes-dashboard-using.adoc
+
+:_content-type: PROCEDURE
+[id="nodes-dashboard-using-accessing_{context}"]
+= Accessing the node metrics dashboard
+
+You can access the node metrics dashboard from the *Administrator* perspective.
+
+.Procedure
+
+. Expand the *Observe* menu option and select *Dashboards*.
+. Under the *Dashboard* filter, select *Node cluster*.
+
+[NOTE]
+====
+If no data appears in the visualizations under the *Critical* category, no critical anomalies were detected. The dashboard is working as intended.
+====
diff --git a/modules/nodes-dashboard-using-identify-critical-cpu-crio.adoc b/modules/nodes-dashboard-using-identify-critical-cpu-crio.adoc
@@ -0,0 +1,18 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes-dashboard-using.adoc
+
+:_content-type: CONCEPT
+[id="nodes-dashboard-using-identify-critical-cpu-crio"]
+= Nodes with CRI-O system reserved CPU utilization > 50%
+
+The *Nodes with CRI-O system reserved CPU utilization > 50%* query identifies nodes where the CRI-O system reserved CPU utilization has exceeded 50% in the last 5 minutes. The query monitors CPU resource consumption by CRI-O, your container runtime, on a per-node basis.
+
+.Example default query
+----
+sum by (node) (rate(container_cpu_usage_seconds_total{id="/system.slice/crio.service"}[5m]) * 100) / sum by (node) (kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}) >= 50
+----
+
+This query allows for quick identification of nodes where CRI-O consumes a large share of the system reserved CPU. If a node is displayed in this metric, CRI-O is using at least half of the CPU reserved for system components on that node, which suggests heavy load on the container runtime or insufficient system reserved CPU.
+
+Investigate further by checking your pod configurations and allocated resources. Make sure that they align with your system capabilities. If CRI-O CPU utilization remains high, explore metrics panels from other categories on the dashboard to determine the state of your system components.
diff --git a/modules/nodes-dashboard-using-identify-critical-cpu-kubelet.adoc b/modules/nodes-dashboard-using-identify-critical-cpu-kubelet.adoc
@@ -0,0 +1,18 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes-dashboard-using.adoc
+
+:_content-type: CONCEPT
+[id="nodes-dashboard-using-identify-critical-cpu-kubelet.adoc"]
+= Nodes with Kubelet system reserved CPU utilization > 50%
+
+The *Nodes with Kubelet system reserved CPU utilization > 50%* query calculates the percentage of the system reserved CPU that the Kubelet is currently using and displays nodes where that percentage is 50% or higher.
+
+.Example default query
+----
+sum by (node) (rate(container_cpu_usage_seconds_total{id="/system.slice/kubelet.service"}[5m]) * 100) / sum by (node) (kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}) >= 50
+----
+
+The Kubelet uses the system reserved CPU for its own operations and for running critical system services. For the node's health, it is important to ensure that system reserved CPU usage does not exceed the 50% threshold. Exceeding this limit could indicate heavy utilization or load on the Kubelet, which affects node stability and potentially the performance of the entire Kubernetes cluster.
+
+If any node is displayed in this metric, the Kubelet and the system overall are under heavy load. You can reduce overload on a particular node by balancing the load across other nodes in the cluster. Check other query metrics under the *Outliers*, *Average durations*, and *Number of operations* categories to gain further insights and take necessary corrective action.
diff --git a/modules/nodes-dashboard-using-identify-critical-cpu.adoc b/modules/nodes-dashboard-using-identify-critical-cpu.adoc
@@ -0,0 +1,18 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes-dashboard-using.adoc
+
+:_content-type: CONCEPT
+[id="nodes-dashboard-using-identify-critical-cpu.adoc"]
+= Nodes with system reserved CPU utilization > 80%
+
+The *Nodes with system reserved CPU utilization > 80%* query identifies nodes where the system reserved CPU utilization is 80% or higher. The query calculates the rate of CPU usage by system processes over the last 5 minutes and compares it to the CPU reserved for the system on each node, which is the node's CPU capacity minus its allocatable CPU. If the resulting percentage reaches 80%, the node is displayed in the metric.
+
+.Example default query
+----
+sum by (node) (rate(container_cpu_usage_seconds_total{id="/system.slice"}[5m]) * 100) / sum by (node) (kube_node_status_capacity{resource="cpu"} - kube_node_status_allocatable{resource="cpu"}) >= 80
+----
+
+This query indicates a critical level of system-reserved CPU usage, which can lead to resource exhaustion. High system-reserved CPU usage can result in the inability of the system processes (including the Kubelet and CRI-O) to adequately manage resources on the node. This query can indicate excessive system processes or misconfigured CPU allocation. 
+
+Potential corrective measures include rebalancing workloads to other nodes or increasing the CPU resources allocated to the nodes. Investigate the cause of the high system CPU utilization and review the corresponding metrics in the *Outliers*, *Average durations*, and *Number of operations* categories for additional insights into the node's behavior.
diff --git a/modules/nodes-dashboard-using-identify-critical-memory-crio.adoc b/modules/nodes-dashboard-using-identify-critical-memory-crio.adoc
@@ -0,0 +1,18 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes-dashboard-using.adoc
+
+:_content-type: CONCEPT
+[id="nodes-dashboard-using-identify-critical-memory-crio.adoc"]
+= Nodes with CRI-O system reserved memory utilization > 50%
+
+The *Nodes with CRI-O system reserved memory utilization > 50%* query identifies all nodes where CRI-O uses 50% or more of the system reserved memory. In this case, memory usage is defined by the resident set size (RSS), which is the portion of the CRI-O system's memory held in RAM.
+
+.Example default query
+----
+sum by (node) (container_memory_rss{id="/system.slice/crio.service"}) / sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"}) * 100 >= 50
+----
+
+This query helps you monitor the status of memory reserved for the CRI-O system on each node. High utilization could indicate a lack of available resources and potential performance issues. If a node is displayed in this metric, CRI-O is using at least half of the system reserved memory on that node, which is above the advised limit of 50%.
+
+Check memory allocation and usage and assess whether memory resources need to be shifted or increased to prevent possible node instability. You can also examine the metrics under the *Outliers*, *Average durations*, and *Number of operations* categories to gain further insights.
diff --git a/modules/nodes-dashboard-using-identify-critical-memory-kubelet.adoc b/modules/nodes-dashboard-using-identify-critical-memory-kubelet.adoc
@@ -0,0 +1,18 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes-dashboard-using.adoc
+
+:_content-type: CONCEPT
+[id="nodes-dashboard-using-identify-critical-memory-kubelet.adoc"]
+= Nodes with Kubelet system reserved memory utilization > 50%
+
+The *Nodes with Kubelet system reserved memory utilization > 50%* query indicates nodes where the Kubelet's system reserved memory utilization exceeds 50%. The query examines the memory that the Kubelet process itself is consuming on a node.
+
+.Example default query
+----
+sum by (node) (container_memory_rss{id="/system.slice/kubelet.service"}) / sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"}) * 100 >= 50
+----
+
+This query helps you identify any possible memory pressure situations in your nodes that could affect the stability and efficiency of node operations. Kubelet memory utilization that consistently exceeds 50% of the system reserved memory indicates that the system reserved settings are not configured properly and that there is a high risk of the node becoming unstable.
+
+If this metric is highlighted, review your configuration policy and consider adjusting the system reserved settings or the resource limits settings for the Kubelet. Additionally, if your Kubelet memory utilization consistently exceeds half of your total reserved system memory, examine metrics under the *Outliers*, *Average durations*, and *Number of operations* categories to gain further insights for more precise diagnostics.
diff --git a/modules/nodes-dashboard-using-identify-critical-memory.adoc b/modules/nodes-dashboard-using-identify-critical-memory.adoc
@@ -0,0 +1,18 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes-dashboard-using.adoc
+
+:_content-type: CONCEPT
+[id="nodes-dashboard-using-identify-critical-memory.adoc"]
+= Nodes with system reserved memory utilization > 80%
+
+The *Nodes with system reserved memory utilization > 80%* query calculates the percentage of system reserved memory that is utilized for each node. The calculation divides the total resident set size (RSS) by the system reserved memory, which is the total memory capacity of the node minus its allocatable memory. RSS is the portion of the system's memory occupied by a process that is held in main memory (RAM). Nodes are flagged if their resulting value equals or exceeds an 80% threshold.
+
+.Example default query
+----
+sum by (node) (container_memory_rss{id="/system.slice"}) / sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"}) * 100 >= 80
+----
+
+System reserved memory is crucial for a Kubernetes node because it is used to run system daemons and Kubernetes system daemons. System reserved memory utilization that exceeds 80% indicates that the system and Kubernetes daemons are consuming too much memory and can suggest node instability that could affect the performance of running pods. Excessive memory consumption can trigger the Out-of-Memory (OOM) killer, which can terminate critical system processes to free up memory.
+
+If a node is flagged by this metric, identify which system or Kubernetes processes are consuming excessive memory and take appropriate actions to mitigate the situation. These actions may include scaling back non-critical processes, optimizing program configurations to reduce memory usage, or upgrading node systems to hardware with greater memory capacity. You can also review the metrics under the *Outliers*, *Average durations*, and *Number of operations* categories to gain further insights into node performance.
diff --git a/modules/nodes-dashboard-using-identify-critical-pulls.adoc b/modules/nodes-dashboard-using-identify-critical-pulls.adoc
@@ -0,0 +1,18 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes-dashboard-using.adoc
+
+:_content-type: CONCEPT
+[id="nodes-dashboard-using-identify-critical-pulls.adoc"]
+= Failure rate for image pulls in the last hour
+
+The *Failure rate for image pulls in the last hour* query divides the total number of failed image pulls by the sum of successful and failed image pulls to provide a ratio of failures.
+
+.Example default query
+----
+rate(container_runtime_crio_image_pulls_failure_total[1h]) / (rate(container_runtime_crio_image_pulls_success_total[1h]) + rate(container_runtime_crio_image_pulls_failure_total[1h]))
+----
+
+Understanding the failure rate of image pulls is crucial for maintaining the health of the node. A high failure rate might indicate networking issues, storage problems, misconfigurations, or other issues that could disrupt pod density and the deployment of new containers. 
+
+If the outcome of this query is high, investigate possible causes such as network connections, the availability of remote repositories, node storage, and the accuracy of image references. You can also review the metrics under the *Outliers*, *Average durations*, and *Number of operations* categories to gain further insights.
diff --git a/modules/nodes-dashboard-using-identify-critical-top3.adoc b/modules/nodes-dashboard-using-identify-critical-top3.adoc
@@ -0,0 +1,18 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes-dashboard-using.adoc
+
+:_content-type: CONCEPT
+[id="nodes-dashboard-using-identify-critical-top3.adoc"]
+= Top 3 containers with the most OOM kills in the last day
+
+The *Top 3 containers with the most OOM kills in the last day* query fetches details regarding the top three containers that have experienced the most Out-of-Memory (OOM) kills in the previous day.
+
+.Example default query
+----
+topk(3, sum(increase(container_runtime_crio_containers_oom_count_total[1d])) by (name))
+----
+
+An OOM kill occurs when the system terminates a process because memory is running low. Frequent OOM kills can hinder the functionality of the node and even the entire Kubernetes ecosystem. Containers experiencing frequent OOM kills might be consuming more memory than they should, which causes system instability.
+
+Use this metric to identify containers that are experiencing frequent OOM kills and investigate why these containers are consuming an excessive amount of memory. Adjust the resource allocation if necessary and consider resizing the containers based on their memory usage. You can also review the metrics under the *Outliers*, *Average durations*, and *Number of operations* categories to gain further insights into the health and stability of your nodes.
diff --git a/modules/nodes-dashboard-using-identify.adoc b/modules/nodes-dashboard-using-identify.adoc
@@ -0,0 +1,18 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes-dashboard-using.adoc
+
+:_content-type: CONCEPT
+[id="nodes-dashboard-using-identify_{context}"]
+= Identify metrics for indicating optimal node resource usage
+
+The node metrics dashboard is organized into four categories: *Critical*, *Outliers*, *Average durations*, and *Number of operations*. The metrics in the *Critical* category help you determine whether node resource usage is optimal. These metrics include:
+
+* Top 3 containers with the most OOM kills in the last day
+* Failure rate for image pulls in the last hour
+* Nodes with system reserved memory utilization > 80%
+* Nodes with Kubelet system reserved memory utilization > 50%
+* Nodes with CRI-O system reserved memory utilization > 50%
+* Nodes with system reserved CPU utilization > 80%
+* Nodes with Kubelet system reserved CPU utilization > 50%
+* Nodes with CRI-O system reserved CPU utilization > 50%
diff --git a/modules/nodes-dashboard-using-queries.adoc b/modules/nodes-dashboard-using-queries.adoc
@@ -0,0 +1,16 @@
+// Module included in the following assemblies:
+//
+// * nodes/nodes-dashboard-using.adoc
+
+:_content-type: PROCEDURE
+[id="nodes-dashboard-using-queries_{context}"]
+= Customizing dashboard queries
+
+You can customize the default queries used to build the node metrics dashboard.
+
+.Procedure
+
+. Choose a metric and click *Inspect* to navigate into the data. This page displays the metric in detail, including an expanded visualization of the results of the query, the Prometheus query used to analyze the data, and the data subset used in the query.
+. Make any required changes to the query parameters. 
+. Optional: Click *Add query* to run additional queries against the data.
+. Click *Run query* to rerun the query using your specified parameters.
diff --git a/monitoring/reviewing-monitoring-dashboards.adoc b/monitoring/reviewing-monitoring-dashboards.adoc
@@ -21,6 +21,7 @@ Use the *Administrator* perspective to access dashboards for the core {product-t
 * Kubernetes network resources
 * Prometheus
 * USE method dashboards relating to cluster and node performance
+* Node performance metrics
 
 .Example dashboard in the Administrator perspective
 image::monitoring-dashboard-administrator.png[]
diff --git a/nodes/nodes-dashboard-using.adoc b/nodes/nodes-dashboard-using.adoc
@@ -0,0 +1,50 @@
+:_content-type: ASSEMBLY
+[id="nodes-dashboard-using"]
+= Node metrics dashboard
+include::_attributes/common-attributes.adoc[]
+:context: nodes-dashboard-using
+
+toc::[]
+
+The node metrics dashboard is a visual analytics dashboard that helps you identify potential pod scaling issues.
+
+// The following include statements pull in the module files that comprise
+// the assembly. Include any combination of concept, procedure, or reference
+// modules required to cover the user story. You can also include other
+// assemblies.
+
+// About the node metrics dashboard
+include::modules/nodes-dashboard-using-about.adoc[leveloffset=+1]
+
+// Accessing the node metrics dashboard
+include::modules/nodes-dashboard-using-accessing.adoc[leveloffset=+1]
+
+// Identify metrics for indicating optimal node resource usage
+include::modules/nodes-dashboard-using-identify.adoc[leveloffset=+1]
+
+// Top 3 Containers With the Most OOM Kills in the Last Day
+include::modules/nodes-dashboard-using-identify-critical-top3.adoc[leveloffset=+2]
+
+// Failure Rate for Image Pulls in the Last Hour
+include::modules/nodes-dashboard-using-identify-critical-pulls.adoc[leveloffset=+2]
+
+// Nodes with System Reserved Memory Utilization > 80%
+include::modules/nodes-dashboard-using-identify-critical-memory.adoc[leveloffset=+2]
+
+// Nodes with Kubelet System Reserved Memory Utilization > 50%
+include::modules/nodes-dashboard-using-identify-critical-memory-kubelet.adoc[leveloffset=+2]
+
+// Nodes with CRI-O System Reserved Memory Utilization > 50%
+include::modules/nodes-dashboard-using-identify-critical-memory-crio.adoc[leveloffset=+2]
+
+// Nodes with System Reserved CPU Utilization > 80%
+include::modules/nodes-dashboard-using-identify-critical-cpu.adoc[leveloffset=+2]
+
+// Nodes with Kubelet System Reserved CPU Utilization > 50%
+include::modules/nodes-dashboard-using-identify-critical-cpu-kubelet.adoc[leveloffset=+2]
+
+// Nodes with CRI-O System Reserved CPU Utilization > 50%
+include::modules/nodes-dashboard-using-identify-critical-cpu-crio.adoc[leveloffset=+2]
+
+// Customizing dashboard queries
+include::modules/nodes-dashboard-using-queries.adoc[leveloffset=+1]