Commit 72b8864

OBSDOCS-65-update-monitoring-troubleshooting-sample-code
1 parent 0847537 commit 72b8864

2 files changed: +36 -18 lines changed


modules/monitoring-determining-why-prometheus-is-consuming-disk-space.adoc

Lines changed: 34 additions & 17 deletions
@@ -13,9 +13,9 @@ Every assigned key-value pair has a unique time series. The use of many unbound
 
 You can use the following measures when Prometheus consumes a lot of disk:
 
-* *Check the number of scrape samples* that are being collected.
+* *Check the time series database (TSDB) status using the Prometheus HTTP API* for more information about which labels are creating the most time series data. Doing so requires cluster administrator privileges.
 
-* *Check the time series database (TSDB) status using the Prometheus HTTP API* for more information about which labels are creating the most time series. Doing so requires cluster administrator privileges.
+* *Check the number of scrape samples* that are being collected.
 
 * *Reduce the number of unique time series that are created* by reducing the number of unbound attributes that are assigned to user-defined metrics.
 +
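The hunk above keeps the guidance about reducing unbound attributes. As an aside, a minimal Python sketch (not part of the commit; all names are hypothetical) of why unbound label values inflate the number of unique time series Prometheus must store:

```python
# Hypothetical illustration: each distinct (metric, label value) pair is a
# separate time series, so an unbound label multiplies series counts.

def series_count(samples, label_key):
    """Count unique time series for a metric labeled with label_key."""
    return len({(s["metric"], s[label_key]) for s in samples})

# 1,000 scraped samples for one metric.
samples = [
    {"metric": "http_requests_total",
     "status_class": str(200 + (i % 3) * 100),  # bounded: 3 possible values
     "request_id": f"req-{i}"}                  # unbound: unique per request
    for i in range(1000)
]

print(series_count(samples, "status_class"))  # -> 3 series
print(series_count(samples, "request_id"))    # -> 1000 series
```

A bounded label keeps the series count constant as traffic grows; the unbound `request_id` label creates one series per request, which is the pattern this procedure helps you find.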
@@ -40,48 +40,65 @@ endif::openshift-dedicated,openshift-rosa[]
 
 . In the *Administrator* perspective, navigate to *Observe* -> *Metrics*.
 
-. Run the following Prometheus Query Language (PromQL) query in the *Expression* field. This returns the ten metrics that have the highest number of scrape samples:
+. Enter a Prometheus Query Language (PromQL) query in the *Expression* field.
+The following example queries help to identify high cardinality metrics that might result in high disk space consumption:
+
+* By running the following query, you can identify the ten jobs that have the highest number of scrape samples:
 +
-[source,terminal]
+[source,text]
+----
+topk(10, max by(namespace, job) (topk by(namespace, job) (1, scrape_samples_post_metric_relabeling)))
+----
++
+* By running the following query, you can pinpoint time series churn by identifying the ten jobs that have created the most time series data in the last hour:
++
+[source,text]
 ----
-topk(10,count by (job)({__name__=~".+"}))
+topk(10, sum by(namespace, job) (sum_over_time(scrape_series_added[1h])))
 ----
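The core operation both new queries share is PromQL's `topk()`: rank groups by a value and keep the largest N. A small offline Python sketch of that ranking step, using made-up sample counts (the namespaces, jobs, and numbers below are illustrative only):

```python
# Hypothetical data: scrape sample counts keyed by (namespace, job),
# roughly what the first query in the diff aggregates before topk().
import heapq

def topk(n, values_by_group):
    """Return the n groups with the largest values, like PromQL topk()."""
    return heapq.nlargest(n, values_by_group.items(), key=lambda kv: kv[1])

scrape_samples = {
    ("openshift-monitoring", "prometheus-k8s"): 250000,
    ("openshift-etcd", "etcd"): 180000,
    ("my-project", "my-app"): 420000,  # outlier worth investigating
}

for (namespace, job), count in topk(2, scrape_samples):
    print(namespace, job, count)
# prints my-project first (420000), then openshift-monitoring (250000)
```

In the real queries, Prometheus performs this aggregation server side; the point is that the result surfaces the few jobs responsible for most samples or series churn.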
 
-. Investigate the number of unbound label values assigned to metrics with higher than expected scrape sample counts.
-** *If the metrics relate to a user-defined project*, review the metrics key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.
+. Investigate the number of unbound label values assigned to metrics with higher than expected scrape sample counts:
+
+* *If the metrics relate to a user-defined project*, review the metrics key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.
 
-** *If the metrics relate to a core {product-title} project*, create a Red Hat support case on the link:https://access.redhat.com/[Red Hat Customer Portal].
+* *If the metrics relate to a core {product-title} project*, create a Red Hat support case on the link:https://access.redhat.com/[Red Hat Customer Portal].
 
-. Review the TSDB status using the Prometheus HTTP API by running the following commands as a
+. Review the TSDB status using the Prometheus HTTP API by following these steps when logged in as a
 ifndef::openshift-dedicated,openshift-rosa[]
 cluster administrator:
 endif::openshift-dedicated,openshift-rosa[]
 ifdef::openshift-dedicated,openshift-rosa[]
 `dedicated-admin`:
 endif::openshift-dedicated,openshift-rosa[]
 +
-[source,terminal]
-----
-$ oc login -u <username> -p <password>
-----
+.. Get the Prometheus API route URL by running the following command:
 +
 [source,terminal]
 ----
-$ host=$(oc -n openshift-monitoring get route prometheus-k8s -ojsonpath={.spec.host})
+$ HOST=$(oc -n openshift-monitoring get route prometheus-k8s -ojsonpath={.spec.host})
 ----
 +
+.. Extract an authentication token by running the following command:
++
 [source,terminal]
 ----
-$ token=$(oc whoami -t)
+$ TOKEN=$(oc whoami -t)
 ----
 +
+.. Query the TSDB status for Prometheus by running the following command:
++
 [source,terminal]
 ----
-$ curl -H "Authorization: Bearer $token" -k "https://$host/api/v1/status/tsdb"
+$ curl -H "Authorization: Bearer $TOKEN" -k "https://$HOST/api/v1/status/tsdb"
 ----
 +
 .Example output
 [source,terminal]
 ----
-"status": "success",
+"status": "success","data":{"headStats":{"numSeries":507473,
+"numLabelPairs":19832,"chunkCount":946298,"minTime":1712253600010,
+"maxTime":1712257935346},"seriesCountByMetricName":
+[{"name":"etcd_request_duration_seconds_bucket","value":51840},
+{"name":"apiserver_request_sli_duration_seconds_bucket","value":47718},
+...
 ----
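The `/api/v1/status/tsdb` response shown in the example output is plain JSON, so it can be inspected programmatically once fetched. A hedged Python sketch using an abbreviated copy of that sample payload (the numbers come from the example output above; this is not a tool shipped with the product):

```python
# Parse an abbreviated /api/v1/status/tsdb response and report which
# metric names account for the most head-block series.
import json

payload = json.loads("""
{"status": "success",
 "data": {"headStats": {"numSeries": 507473, "numLabelPairs": 19832},
          "seriesCountByMetricName": [
            {"name": "etcd_request_duration_seconds_bucket", "value": 51840},
            {"name": "apiserver_request_sli_duration_seconds_bucket", "value": 47718}]}}
""")

stats = payload["data"]["headStats"]
print("total head series:", stats["numSeries"])
for entry in payload["data"]["seriesCountByMetricName"]:
    share = entry["value"] / stats["numSeries"]
    print(f'{entry["name"]}: {entry["value"]} series ({share:.1%})')
```

Here the two bucket metrics alone account for roughly a fifth of all head series, which is the kind of signal that points you at a specific scrape target or histogram.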

monitoring/troubleshooting-monitoring-issues.adoc

Lines changed: 2 additions & 1 deletion
@@ -36,5 +36,6 @@ include::modules/monitoring-determining-why-prometheus-is-consuming-disk-space.a
 [role="_additional-resources"]
 .Additional resources
 
-* See xref:../monitoring/configuring-the-monitoring-stack.adoc#setting-scrape-sample-and-label-limits-for-user-defined-projects_configuring-the-monitoring-stack[Setting a scrape sample limit for user-defined projects] for details on how to set a scrape sample limit and create related alerting rules
+* xref:../monitoring/accessing-third-party-monitoring-apis.adoc#about-accessing-monitoring-web-service-apis_accessing-monitoring-apis-by-using-the-cli[Accessing monitoring APIs by using the CLI]
+* xref:../monitoring/configuring-the-monitoring-stack.adoc#setting-scrape-sample-and-label-limits-for-user-defined-projects_configuring-the-monitoring-stack[Setting a scrape sample limit for user-defined projects]
 * xref:../support/getting-support.adoc#support-submitting-a-case_getting-support[Submitting a support case]
