Commit 72b8864

OBSDOCS-65-update-monitoring-troubleshooting-sample-code
1 parent 0847537 commit 72b8864

2 files changed: +36 -18 lines changed


modules/monitoring-determining-why-prometheus-is-consuming-disk-space.adoc

Lines changed: 34 additions & 17 deletions
@@ -13,9 +13,9 @@ Every assigned key-value pair has a unique time series. The use of many unbound
 
 You can use the following measures when Prometheus consumes a lot of disk:
 
-* *Check the number of scrape samples* that are being collected.
+* *Check the time series database (TSDB) status using the Prometheus HTTP API* for more information about which labels are creating the most time series data. Doing so requires cluster administrator privileges.
 
-* *Check the time series database (TSDB) status using the Prometheus HTTP API* for more information about which labels are creating the most time series. Doing so requires cluster administrator privileges.
+* *Check the number of scrape samples* that are being collected.
 
 * *Reduce the number of unique time series that are created* by reducing the number of unbound attributes that are assigned to user-defined metrics.
 +
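The hunk above keeps the guidance about reducing unbound attributes. As an aside, a minimal Python sketch (not part of the commit; all names are hypothetical) of why unbound label values inflate the number of unique time series Prometheus must store:

```python
# Hypothetical illustration: each distinct (metric, label value) pair is a
# separate time series, so an unbound label multiplies series counts.

def series_count(samples, label_key):
    """Count unique time series for a metric labeled with label_key."""
    return len({(s["metric"], s[label_key]) for s in samples})

# 1,000 scraped samples for one metric.
samples = [
    {"metric": "http_requests_total",
     "status_class": str(200 + (i % 3) * 100),  # bounded: 3 possible values
     "request_id": f"req-{i}"}                  # unbound: unique per request
    for i in range(1000)
]

print(series_count(samples, "status_class"))  # -> 3 series
print(series_count(samples, "request_id"))    # -> 1000 series
```

A bounded label keeps the series count constant as traffic grows; the unbound `request_id` label creates one series per request, which is the pattern this procedure helps you find.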
@@ -40,48 +40,65 @@ endif::openshift-dedicated,openshift-rosa[]
 
 . In the *Administrator* perspective, navigate to *Observe* -> *Metrics*.
 
-. Run the following Prometheus Query Language (PromQL) query in the *Expression* field. This returns the ten metrics that have the highest number of scrape samples:
+. Enter a Prometheus Query Language (PromQL) query in the *Expression* field.
+The following example queries help to identify high cardinality metrics that might result in high disk space consumption:
+
+* By running the following query, you can identify the ten jobs that have the highest number of scrape samples:
 +
-[source,terminal]
+[source,text]
+----
+topk(10, max by(namespace, job) (topk by(namespace, job) (1, scrape_samples_post_metric_relabeling)))
+----
++
+* By running the following query, you can pinpoint time series churn by identifying the ten jobs that have created the most time series data in the last hour:
++
+[source,text]
 ----
-topk(10,count by (job)({__name__=~".+"}))
+topk(10, sum by(namespace, job) (sum_over_time(scrape_series_added[1h])))
 ----
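The core operation both new queries share is PromQL's `topk()`: rank groups by a value and keep the largest N. A small offline Python sketch of that ranking step, using made-up sample counts (the namespaces, jobs, and numbers below are illustrative only):

```python
# Hypothetical data: scrape sample counts keyed by (namespace, job),
# roughly what the first query in the diff aggregates before topk().
import heapq

def topk(n, values_by_group):
    """Return the n groups with the largest values, like PromQL topk()."""
    return heapq.nlargest(n, values_by_group.items(), key=lambda kv: kv[1])

scrape_samples = {
    ("openshift-monitoring", "prometheus-k8s"): 250000,
    ("openshift-etcd", "etcd"): 180000,
    ("my-project", "my-app"): 420000,  # outlier worth investigating
}

for (namespace, job), count in topk(2, scrape_samples):
    print(namespace, job, count)
# prints my-project first (420000), then openshift-monitoring (250000)
```

In the real queries, Prometheus performs this aggregation server side; the point is that the result surfaces the few jobs responsible for most samples or series churn.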
 
-. Investigate the number of unbound label values assigned to metrics with higher than expected scrape sample counts.
-** *If the metrics relate to a user-defined project*, review the metrics key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.
+. Investigate the number of unbound label values assigned to metrics with higher than expected scrape sample counts:
+
+* *If the metrics relate to a user-defined project*, review the metrics key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.
 
-** *If the metrics relate to a core {product-title} project*, create a Red Hat support case on the link:https://access.redhat.com/[Red Hat Customer Portal].
+* *If the metrics relate to a core {product-title} project*, create a Red Hat support case on the link:https://access.redhat.com/[Red Hat Customer Portal].
 
-. Review the TSDB status using the Prometheus HTTP API by running the following commands as a
+. Review the TSDB status using the Prometheus HTTP API by following these steps when logged in as a
 ifndef::openshift-dedicated,openshift-rosa[]
 cluster administrator:
 endif::openshift-dedicated,openshift-rosa[]
 ifdef::openshift-dedicated,openshift-rosa[]
 `dedicated-admin`:
 endif::openshift-dedicated,openshift-rosa[]
 +
-[source,terminal]
-----
-$ oc login -u <username> -p <password>
-----
+.. Get the Prometheus API route URL by running the following command:
 +
 [source,terminal]
 ----
-$ host=$(oc -n openshift-monitoring get route prometheus-k8s -ojsonpath={.spec.host})
+$ HOST=$(oc -n openshift-monitoring get route prometheus-k8s -ojsonpath={.spec.host})
 ----
 +
+.. Extract an authentication token by running the following command:
++
 [source,terminal]
 ----
-$ token=$(oc whoami -t)
+$ TOKEN=$(oc whoami -t)
 ----
 +
+.. Query the TSDB status for Prometheus by running the following command:
++
 [source,terminal]
 ----
-$ curl -H "Authorization: Bearer $token" -k "https://$host/api/v1/status/tsdb"
+$ curl -H "Authorization: Bearer $TOKEN" -k "https://$HOST/api/v1/status/tsdb"
 ----
 +
 .Example output
 [source,terminal]
 ----
-"status": "success",
+"status": "success","data":{"headStats":{"numSeries":507473,
+"numLabelPairs":19832,"chunkCount":946298,"minTime":1712253600010,
+"maxTime":1712257935346},"seriesCountByMetricName":
+[{"name":"etcd_request_duration_seconds_bucket","value":51840},
+{"name":"apiserver_request_sli_duration_seconds_bucket","value":47718},
+...
 ----
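The `/api/v1/status/tsdb` response shown in the example output is plain JSON, so it can be inspected programmatically once fetched. A hedged Python sketch using an abbreviated copy of that sample payload (the numbers come from the example output above; this is not a tool shipped with the product):

```python
# Parse an abbreviated /api/v1/status/tsdb response and report which
# metric names account for the most head-block series.
import json

payload = json.loads("""
{"status": "success",
 "data": {"headStats": {"numSeries": 507473, "numLabelPairs": 19832},
          "seriesCountByMetricName": [
            {"name": "etcd_request_duration_seconds_bucket", "value": 51840},
            {"name": "apiserver_request_sli_duration_seconds_bucket", "value": 47718}]}}
""")

stats = payload["data"]["headStats"]
print("total head series:", stats["numSeries"])
for entry in payload["data"]["seriesCountByMetricName"]:
    share = entry["value"] / stats["numSeries"]
    print(f'{entry["name"]}: {entry["value"]} series ({share:.1%})')
```

Here the two bucket metrics alone account for roughly a fifth of all head series, which is the kind of signal that points you at a specific scrape target or histogram.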

monitoring/troubleshooting-monitoring-issues.adoc

Lines changed: 2 additions & 1 deletion
@@ -36,5 +36,6 @@ include::modules/monitoring-determining-why-prometheus-is-consuming-disk-space.a
 [role="_additional-resources"]
 .Additional resources
 
-* See xref:../monitoring/configuring-the-monitoring-stack.adoc#setting-scrape-sample-and-label-limits-for-user-defined-projects_configuring-the-monitoring-stack[Setting a scrape sample limit for user-defined projects] for details on how to set a scrape sample limit and create related alerting rules
+* xref:../monitoring/accessing-third-party-monitoring-apis.adoc#about-accessing-monitoring-web-service-apis_accessing-monitoring-apis-by-using-the-cli[Accessing monitoring APIs by using the CLI]
+* xref:../monitoring/configuring-the-monitoring-stack.adoc#setting-scrape-sample-and-label-limits-for-user-defined-projects_configuring-the-monitoring-stack[Setting a scrape sample limit for user-defined projects]
 * xref:../support/getting-support.adoc#support-submitting-a-case_getting-support[Submitting a support case]
