Commit 4312632

Merge pull request #37209 from tengqm/tweak-sys-metrics
Tweak line wrapping for system-metrics page
2 parents f7689cb + 39bfd1e commit 4312632

1 file changed

+81
-33
lines changed

content/en/docs/concepts/cluster-administration/system-metrics.md

Lines changed: 81 additions & 33 deletions
Original file line number | Diff line number | Diff line change
@@ -10,7 +10,8 @@ weight: 70
1010

1111
<!-- overview -->
1212

13-
System component metrics can give a better look into what is happening inside them. Metrics are particularly useful for building dashboards and alerts.
13+
System component metrics can give a better look into what is happening inside them. Metrics are
14+
particularly useful for building dashboards and alerts.
1415

1516
Kubernetes components emit metrics in [Prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/).
1617
This format is structured plain text, designed so that people and machines can both read it.
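As a rough illustration of how machine-readable this plain-text format is, here is a minimal Python sketch that parses a few sample lines. It is not a full parser for the exposition format: it assumes no timestamps and no commas or spaces inside label values, and the function name `parse_samples` and the sample metric names are hypothetical.

```python
def parse_samples(text):
    """Parse simple Prometheus text-format lines into (name, labels, value) tuples.

    Simplifying assumptions: no timestamps, and label values contain
    no commas or spaces.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name, labels = series, {}
        if "{" in series:
            name, _, rest = series.partition("{")
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Illustrative input; these metric names are made up for the example.
EXAMPLE = """# HELP some_counter this counts things
# TYPE some_counter counter
some_counter 0
http_requests_total{method="post",code="200"} 1027
"""

if __name__ == "__main__":
    for sample in parse_samples(EXAMPLE):
        print(sample)
```

For real scraping, an established client library or Prometheus itself is a better choice than hand-rolled parsing.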
@@ -19,7 +20,8 @@ This format is structured plain text, designed so that people and machines can b
1920

2021
## Metrics in Kubernetes
2122

22-
In most cases metrics are available on the `/metrics` endpoint of the HTTP server. For components that don't expose the endpoint by default, it can be enabled using the `--bind-address` flag.
23+
In most cases metrics are available on the `/metrics` endpoint of the HTTP server. For components
24+
that don't expose the endpoint by default, it can be enabled using the `--bind-address` flag.
2325

2426
Examples of those components:
2527

@@ -29,13 +31,18 @@ Examples of those components:
2931
* {{< glossary_tooltip term_id="kube-scheduler" text="kube-scheduler" >}}
3032
* {{< glossary_tooltip term_id="kubelet" text="kubelet" >}}
3133

32-
In a production environment you may want to configure [Prometheus Server](https://prometheus.io/) or some other metrics scraper
33-
to periodically gather these metrics and make them available in some kind of time series database.
34+
In a production environment you may want to configure [Prometheus Server](https://prometheus.io/)
35+
or some other metrics scraper to periodically gather these metrics and make them available in some
36+
kind of time series database.
3437

35-
Note that {{< glossary_tooltip term_id="kubelet" text="kubelet" >}} also exposes metrics at the `/metrics/cadvisor`, `/metrics/resource` and `/metrics/probes` endpoints. Those metrics do not have the same lifecycle.
38+
Note that {{< glossary_tooltip term_id="kubelet" text="kubelet" >}} also exposes metrics at the
39+
`/metrics/cadvisor`, `/metrics/resource` and `/metrics/probes` endpoints. Those metrics do not
40+
have the same lifecycle.
41+
42+
If your cluster uses {{< glossary_tooltip term_id="rbac" text="RBAC" >}}, reading metrics requires
43+
authorization via a user, group or ServiceAccount with a ClusterRole that allows accessing
44+
`/metrics`. For example:
3645

37-
If your cluster uses {{< glossary_tooltip term_id="rbac" text="RBAC" >}}, reading metrics requires authorization via a user, group or ServiceAccount with a ClusterRole that allows accessing `/metrics`.
38-
For example:
3946
```yaml
4047
apiVersion: rbac.authorization.k8s.io/v1
4148
kind: ClusterRole
@@ -55,6 +62,7 @@ Alpha metric → Stable metric → Deprecated metric → Hidden metric → De
5562
Alpha metrics have no stability guarantees. These metrics can be modified or deleted at any time.
5663
5764
Stable metrics are guaranteed to not change. This means:
65+
5866
* A stable metric without a deprecated signature will not be deleted or renamed
5967
* A stable metric's type will not be modified
6068
@@ -79,45 +87,64 @@ For example:
7987
some_counter 0
8088
```
8189

82-
Hidden metrics are no longer published for scraping, but are still available for use. To use a hidden metric, please refer to the [Show hidden metrics](#show-hidden-metrics) section.
90+
Hidden metrics are no longer published for scraping, but are still available for use. To use a
91+
hidden metric, please refer to the [Show hidden metrics](#show-hidden-metrics) section.
8392

8493
Deleted metrics are no longer published and cannot be used.
8594

86-
8795
## Show hidden metrics
8896

89-
As described above, admins can enable hidden metrics through a command-line flag on a specific binary. This is intended as an escape hatch for admins who missed the migration of the metrics deprecated in the last release.
97+
As described above, admins can enable hidden metrics through a command-line flag on a specific
98+
binary. This is intended as an escape hatch for admins who missed the migration of the metrics
99+
deprecated in the last release.
90100

91-
The flag `show-hidden-metrics-for-version` takes a version for which you want to show metrics deprecated in that release. The version is expressed as x.y, where x is the major version and y is the minor version. The patch version is not needed even though a metric can be deprecated in a patch release, because the metrics deprecation policy runs against the minor release.
101+
The flag `show-hidden-metrics-for-version` takes a version for which you want to show metrics
102+
deprecated in that release. The version is expressed as x.y, where x is the major version and y is
103+
the minor version. The patch version is not needed even though a metric can be deprecated in a
104+
patch release, because the metrics deprecation policy runs against the minor release.
92105

93-
The flag can only take the previous minor version as its value. All metrics hidden in the previous release will be emitted if admins set `show-hidden-metrics-for-version` to the previous version. Older versions are not allowed because this would violate the metrics deprecation policy.
106+
The flag can only take the previous minor version as its value. All metrics hidden in the previous
107+
release will be emitted if admins set `show-hidden-metrics-for-version` to the previous version.
108+
Older versions are not allowed because this would violate the metrics deprecation policy.
94109

95-
Take metric `A` as an example, and assume that `A` is deprecated in 1.n. According to the metrics deprecation policy, we can reach the following conclusions:
110+
Take metric `A` as an example, and assume that `A` is deprecated in 1.n. According to the metrics
111+
deprecation policy, we can reach the following conclusions:
96112

97113
* In release `1.n`, the metric is deprecated, and it can be emitted by default.
98-
* In release `1.n+1`, the metric is hidden by default and it can be emitted by command line `show-hidden-metrics-for-version=1.n`.
114+
* In release `1.n+1`, the metric is hidden by default and it can be emitted by command line
115+
`show-hidden-metrics-for-version=1.n`.
99116
* In release `1.n+2`, the metric should be removed from the codebase. No escape hatch anymore.
100117

101-
If you're upgrading from release `1.12` to `1.13`, but still depend on a metric `A` deprecated in `1.12`, you should set hidden metrics via command line: `--show-hidden-metrics-for-version=1.12` and remember to remove this metric dependency before upgrading to `1.14`.
118+
If you're upgrading from release `1.12` to `1.13`, but still depend on a metric `A` deprecated in
119+
`1.12`, you should set hidden metrics via command line: `--show-hidden-metrics-for-version=1.12`
120+
and remember to remove this metric dependency before upgrading to `1.14`.
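The "previous minor version only" rule can be sketched in a few lines of Python. This is an illustration of the rule as described in this section, not code from Kubernetes; the helper name `hidden_metrics_flag` is hypothetical, and the flag name is the `show-hidden-metrics-for-version` flag described earlier in this section.

```python
def hidden_metrics_flag(current_version):
    """Return the escape-hatch flag for a component running `current_version`.

    Accepts strings such as "1.13", "v1.13" or "1.13.2"; the patch part is
    ignored because the deprecation policy runs against minor releases.
    """
    parts = current_version.lstrip("v").split(".")
    major, minor = int(parts[0]), int(parts[1])
    if minor == 0:
        # Crossing a major version boundary is outside this simple sketch.
        raise ValueError("no previous minor version within this major version")
    return f"--show-hidden-metrics-for-version={major}.{minor - 1}"


if __name__ == "__main__":
    # Running 1.13 and still depending on a metric deprecated in 1.12.
    print(hidden_metrics_flag("1.13"))
```

Passing the resulting flag is only a stopgap: per the timeline above, the metric is expected to be removed from the codebase one minor release later.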
102121

103122
## Disable accelerator metrics
104123

105-
The kubelet collects accelerator metrics through cAdvisor. To collect these metrics, for accelerators like NVIDIA GPUs, kubelet held an open handle on the driver. This meant that in order to perform infrastructure changes (for example, updating the driver), a cluster administrator needed to stop the kubelet agent.
124+
The kubelet collects accelerator metrics through cAdvisor. To collect these metrics, for
125+
accelerators like NVIDIA GPUs, kubelet held an open handle on the driver. This meant that in order
126+
to perform infrastructure changes (for example, updating the driver), a cluster administrator
127+
needed to stop the kubelet agent.
106128

107-
The responsibility for collecting accelerator metrics now belongs to the vendor rather than the kubelet. Vendors must provide a container that collects metrics and exposes them to the metrics service (for example, Prometheus).
129+
The responsibility for collecting accelerator metrics now belongs to the vendor rather than the
130+
kubelet. Vendors must provide a container that collects metrics and exposes them to the metrics
131+
service (for example, Prometheus).
108132

109-
The [`DisableAcceleratorUsageMetrics` feature gate](/docs/reference/command-line-tools-reference/feature-gates/) disables metrics collected by the kubelet, with a [timeline for enabling this feature by default](https://github.com/kubernetes/enhancements/tree/411e51027db842355bd489691af897afc1a41a5e/keps/sig-node/1867-disable-accelerator-usage-metrics#graduation-criteria).
133+
The [`DisableAcceleratorUsageMetrics` feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
134+
disables metrics collected by the kubelet, with a
135+
[timeline for enabling this feature by default](https://github.com/kubernetes/enhancements/tree/411e51027db842355bd489691af897afc1a41a5e/keps/sig-node/1867-disable-accelerator-usage-metrics#graduation-criteria).
110136

111137
## Component metrics
112138

113139
### kube-controller-manager metrics
114140

115-
Controller manager metrics provide important insight into the performance and health of the controller manager.
116-
These metrics include common Go language runtime metrics such as go_routine count and controller specific metrics such as
117-
etcd request latencies or Cloudprovider (AWS, GCE, OpenStack) API latencies that can be used
118-
to gauge the health of a cluster.
141+
Controller manager metrics provide important insight into the performance and health of the
142+
controller manager. These metrics include common Go language runtime metrics such as go_routine
143+
count and controller specific metrics such as etcd request latencies or Cloudprovider (AWS, GCE,
144+
OpenStack) API latencies that can be used to gauge the health of a cluster.
119145

120-
Starting from Kubernetes 1.7, detailed Cloudprovider metrics are available for storage operations for GCE, AWS, Vsphere and OpenStack.
146+
Starting from Kubernetes 1.7, detailed Cloudprovider metrics are available for storage operations
147+
for GCE, AWS, Vsphere and OpenStack.
121148
These metrics can be used to monitor health of persistent volume operations.
122149

123150
For example, for GCE these metrics are called:
@@ -136,9 +163,15 @@ cloudprovider_gce_api_request_duration_seconds { request = "list_disk"}
136163

137164
{{< feature-state for_k8s_version="v1.21" state="beta" >}}
138165

139-
The scheduler exposes optional metrics that report the requested resources and the desired limits of all running pods. These metrics can be used to build capacity planning dashboards, assess current or historical scheduling limits, quickly identify workloads that cannot schedule due to lack of resources, and compare actual usage to the pod's request.
166+
The scheduler exposes optional metrics that report the requested resources and the desired limits
167+
of all running pods. These metrics can be used to build capacity planning dashboards, assess
168+
current or historical scheduling limits, quickly identify workloads that cannot schedule due to
169+
lack of resources, and compare actual usage to the pod's request.
170+
171+
The kube-scheduler identifies the resource [requests and limits](/docs/concepts/configuration/manage-resources-containers/)
172+
configured for each Pod; when either a request or limit is non-zero, the kube-scheduler reports a
173+
metrics timeseries. The time series is labelled by:
140174

141-
The kube-scheduler identifies the resource [requests and limits](/docs/concepts/configuration/manage-resources-containers/) configured for each Pod; when either a request or limit is non-zero, the kube-scheduler reports a metrics timeseries. The time series is labelled by:
142175
- namespace
143176
- pod name
144177
- the node where the pod is scheduled or an empty string if not yet scheduled
@@ -147,32 +180,47 @@ The kube-scheduler identifies the resource [requests and limits](/docs/concepts/
147180
- the name of the resource (for example, `cpu`)
148181
- the unit of the resource if known (for example, `cores`)
149182

150-
Once a pod reaches completion (has a `restartPolicy` of `Never` or `OnFailure` and is in the `Succeeded` or `Failed` pod phase, or has been deleted and all containers have a terminated state) the series is no longer reported since the scheduler is now free to schedule other pods to run. The two metrics are called `kube_pod_resource_request` and `kube_pod_resource_limit`.
183+
Once a pod reaches completion (has a `restartPolicy` of `Never` or `OnFailure` and is in the
184+
`Succeeded` or `Failed` pod phase, or has been deleted and all containers have a terminated state)
185+
the series is no longer reported since the scheduler is now free to schedule other pods to run.
186+
The two metrics are called `kube_pod_resource_request` and `kube_pod_resource_limit`.
151187

152-
The metrics are exposed at the HTTP endpoint `/metrics/resources` and require the same authorization as the `/metrics`
153-
endpoint on the scheduler. You must use the `--show-hidden-metrics-for-version=1.20` flag to expose these alpha stability metrics.
188+
The metrics are exposed at the HTTP endpoint `/metrics/resources` and require the same
189+
authorization as the `/metrics` endpoint on the scheduler. You must use the
190+
`--show-hidden-metrics-for-version=1.20` flag to expose these alpha stability metrics.
154191

155192
## Disabling metrics
156193

157-
You can explicitly turn off metrics via the command line flag `--disabled-metrics`. This may be desired if, for example, a metric is causing a performance problem. The input is a list of disabled metrics (for example, `--disabled-metrics=metric1,metric2`).
194+
You can explicitly turn off metrics via the command line flag `--disabled-metrics`. This may be
195+
desired if, for example, a metric is causing a performance problem. The input is a list of
196+
disabled metrics (for example, `--disabled-metrics=metric1,metric2`).
158197

159198
## Metric cardinality enforcement
160199

161-
Metrics with unbounded dimensions could cause memory issues in the components they instrument. To limit resource use, you can use the `--allow-label-value` command line option to dynamically configure an allow-list of label values for a metric.
200+
Metrics with unbounded dimensions could cause memory issues in the components they instrument. To
201+
limit resource use, you can use the `--allow-label-value` command line option to dynamically
202+
configure an allow-list of label values for a metric.
162203

163204
In alpha stage, the flag can only take in a series of mappings as metric label allow-list.
164205
Each mapping is of the format `<metric_name>,<label_name>=<allowed_labels>` where
165206
`<allowed_labels>` is a comma-separated list of acceptable label values.
166207

167208
The overall format looks like:
168-
`--allow-label-value <metric_name>,<label_name>='<allow_value1>, <allow_value2>...', <metric_name2>,<label_name>='<allow_value1>, <allow_value2>...', ...`.
209+
210+
```
211+
--allow-label-value <metric_name>,<label_name>='<allow_value1>, <allow_value2>...', <metric_name2>,<label_name>='<allow_value1>, <allow_value2>...', ...
212+
```
169213

170214
Here is an example:
171-
`--allow-label-value number_count_metric,odd_number='1,3,5', number_count_metric,even_number='2,4,6', date_gauge_metric,weekend='Saturday,Sunday'`
172215

216+
```none
217+
--allow-label-value number_count_metric,odd_number='1,3,5', number_count_metric,even_number='2,4,6', date_gauge_metric,weekend='Saturday,Sunday'
218+
```
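To make the mapping format concrete, the following Python sketch splits a flag value like the example above into a nested allow-list. It only illustrates the documented format and is not how Kubernetes components parse the flag; the function name `parse_allow_list` is hypothetical, and the sketch assumes each value list is wrapped in single quotes exactly as shown.

```python
import re

# One mapping: <metric_name>,<label_name>='<value1>,<value2>...'
_MAPPING = re.compile(r"\s*([^,\s]+)\s*,\s*([^=\s]+)\s*=\s*'([^']*)'\s*,?")


def parse_allow_list(flag_value):
    """Return {metric_name: {label_name: set_of_allowed_values}}."""
    allow = {}
    for metric, label, values in _MAPPING.findall(flag_value):
        allowed = {v.strip() for v in values.split(",") if v.strip()}
        allow.setdefault(metric, {}).setdefault(label, set()).update(allowed)
    return allow


if __name__ == "__main__":
    example = (
        "number_count_metric,odd_number='1,3,5', "
        "number_count_metric,even_number='2,4,6', "
        "date_gauge_metric,weekend='Saturday,Sunday'"
    )
    for metric, labels in parse_allow_list(example).items():
        print(metric, labels)
```

The nested structure reflects that one metric can carry separate allow-lists for several labels, as `number_count_metric` does in the example.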
173219

174220
## {{% heading "whatsnext" %}}
175221

176-
* Read about the [Prometheus text format](https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#text-based-format) for metrics
222+
* Read about the [Prometheus text format](https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#text-based-format)
223+
for metrics
177224
* See the list of [stable Kubernetes metrics](https://github.com/kubernetes/kubernetes/blob/master/test/instrumentation/testdata/stable-metrics-list.yaml)
178225
* Read about the [Kubernetes deprecation policy](/docs/reference/using-api/deprecation-policy/#deprecating-a-feature-or-behavior)
226+
