Commit 5d5ed90

committed
fix docs
1 parent 8cf47d7 commit 5d5ed90

6 files changed

+754
-1
lines changed

website/content/en/docs/reference/metrics.md

Lines changed: 351 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -8,6 +8,329 @@ description: >
88
---
99
<!-- this document is generated from hack/docs/metrics_gen/main.go -->
1010
Karpenter makes several metrics available in Prometheus format to allow monitoring cluster provisioning status. These metrics are available by default at `karpenter.kube-system.svc.cluster.local:8080/metrics`; the port is configurable via the `METRICS_PORT` environment variable documented [here](../settings).
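
As a rough illustration of consuming these metrics, the sketch below parses a small piece of Prometheus text exposition output into name, label, and value tuples. The sample payload, its label values, and the `parse_metrics` helper are assumptions made for this example and are not part of Karpenter; escaped characters inside label values are not unescaped.

```python
import re

# Hypothetical sample of what the /metrics endpoint might return; the label
# values and numbers are invented for illustration only.
SAMPLE = '''\
# HELP karpenter_build_info A metric with a constant '1' value labeled by version from which karpenter was built.
# TYPE karpenter_build_info gauge
karpenter_build_info{version="1.11.0"} 1
# TYPE karpenter_cluster_state_node_count gauge
karpenter_cluster_state_node_count 12
'''

# metric_name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# label_name="label value"
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith('#'):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

Against a live cluster, the same function could be fed the body fetched from the endpoint above (for example with `urllib.request.urlopen`), assuming the service is reachable from where the script runs.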
11+
12+
### `karpenter_ignored_pod_count`
13+
Number of pods ignored during scheduling by Karpenter
14+
- Stability Level: ALPHA
15+
16+
### `karpenter_build_info`
17+
A metric with a constant '1' value labeled by version from which karpenter was built.
18+
- Stability Level: STABLE
19+
20+
## Nodeclaims Metrics
21+
22+
### `karpenter_nodeclaims_termination_duration_seconds`
23+
Duration of NodeClaim termination in seconds.
24+
- Stability Level: BETA
25+
26+
### `karpenter_nodeclaims_terminated_total`
27+
Number of nodeclaims terminated in total by Karpenter. Labeled by the owning nodepool.
28+
- Stability Level: STABLE
29+
30+
### `karpenter_nodeclaims_instance_termination_duration_seconds`
31+
Duration of CloudProvider Instance termination in seconds.
32+
- Stability Level: BETA
33+
34+
### `karpenter_nodeclaims_disrupted_total`
35+
Number of nodeclaims disrupted in total by Karpenter. Labeled by reason the nodeclaim was disrupted and the owning nodepool.
36+
- Stability Level: ALPHA
37+
38+
### `karpenter_nodeclaims_created_total`
39+
Number of nodeclaims created in total by Karpenter. Labeled by reason the nodeclaim was created and the owning nodepool.
40+
- Stability Level: STABLE
41+
42+
### `operator_nodeclaim_status_condition_transitions_total`
43+
The count of transitions of a nodeclaim, type and status. Labeled by the type, reason, and status.
44+
- Stability Level: BETA
45+
46+
### `operator_nodeclaim_status_condition_transition_seconds`
47+
The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace.
48+
- Stability Level: BETA
49+
50+
### `operator_nodeclaim_status_condition_current_status_seconds`
51+
The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason.
52+
- Stability Level: BETA
53+
54+
### `operator_nodeclaim_status_condition_count`
55+
The number of a condition for a nodeclaim, type and status. Labeled by the name, namespace, type, status, and reason.
56+
- Stability Level: BETA
57+
58+
### `operator_nodeclaim_termination_current_time_seconds`
59+
The current amount of time in seconds that a nodeclaim has been in terminating state. Labeled by name, and namespace.
60+
- Stability Level: BETA
61+
62+
### `operator_nodeclaim_termination_duration_seconds`
63+
The amount of time taken by a nodeclaim to terminate completely.
64+
- Stability Level: BETA
65+
66+
## Nodes Metrics
67+
68+
### `karpenter_nodes_total_pod_requests`
69+
Node total pod requests are the resources requested by pods bound to nodes, including the DaemonSet pods.
70+
- Stability Level: BETA
71+
72+
### `karpenter_nodes_total_pod_limits`
73+
Node total pod limits are the resources specified by pod limits, including the DaemonSet pods.
74+
- Stability Level: BETA
75+
76+
### `karpenter_nodes_total_daemon_requests`
77+
Node total daemon requests are the resource requested by DaemonSet pods bound to nodes.
78+
- Stability Level: BETA
79+
80+
### `karpenter_nodes_total_daemon_limits`
81+
Node total daemon limits are the resources specified by DaemonSet pod limits.
82+
- Stability Level: BETA
83+
84+
### `karpenter_nodes_termination_duration_seconds`
85+
The time taken between a node's deletion request and the removal of its finalizer
86+
- Stability Level: BETA
87+
88+
### `karpenter_nodes_terminated_total`
89+
Number of nodes terminated in total by Karpenter. Labeled by owning nodepool.
90+
- Stability Level: STABLE
91+
92+
### `karpenter_nodes_system_overhead`
93+
Node system daemon overhead is the resources reserved for system overhead: the difference between the node's capacity and allocatable values, as reported by the status.
94+
- Stability Level: BETA
95+
96+
### `karpenter_nodes_lifetime_duration_seconds`
97+
The lifetime duration of the nodes since creation.
98+
- Stability Level: ALPHA
99+
100+
### `karpenter_nodes_eviction_requests_total`
101+
The total number of eviction requests made by Karpenter
102+
- Stability Level: ALPHA
103+
104+
### `karpenter_nodes_drained_total`
105+
The total number of nodes drained by Karpenter
106+
- Stability Level: ALPHA
107+
108+
### `karpenter_nodes_current_lifetime_seconds`
109+
Node age in seconds
110+
- Stability Level: ALPHA
111+
112+
### `karpenter_nodes_created_total`
113+
Number of nodes created in total by Karpenter. Labeled by owning nodepool.
114+
- Stability Level: STABLE
115+
116+
### `karpenter_nodes_allocatable`
117+
Node allocatable are the resources allocatable by nodes.
118+
- Stability Level: BETA
119+
120+
### `operator_node_status_condition_transitions_total`
121+
The count of transitions of a node, type and status.
122+
- Stability Level: BETA
123+
124+
### `operator_node_status_condition_transition_seconds`
125+
The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace.
126+
- Stability Level: BETA
127+
128+
### `operator_node_status_condition_current_status_seconds`
129+
The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason.
130+
- Stability Level: BETA
131+
132+
### `operator_node_status_condition_count`
133+
The number of a condition for a node, type and status. Labeled by the name, namespace, type, status, and reason.
134+
- Stability Level: BETA
135+
136+
### `operator_node_termination_current_time_seconds`
137+
The current amount of time in seconds that a node has been in terminating state. Labeled by name, and namespace.
138+
- Stability Level: BETA
139+
140+
### `operator_node_termination_duration_seconds`
141+
The amount of time taken by a node to terminate completely.
142+
- Stability Level: BETA
143+
144+
### `operator_node_event_count`
145+
The number of events for a node.
146+
- Stability Level: BETA
147+
148+
## Pods Metrics
149+
150+
### `karpenter_pods_state`
151+
Pod state is the current state of pods. This metric can be used several ways as it is labeled by the pod name, namespace, owner, node, nodepool name, zone, architecture, capacity type, instance type and pod phase.
152+
- Stability Level: BETA
153+
154+
### `karpenter_pods_startup_duration_seconds`
155+
The time from pod creation until the pod is running.
156+
- Stability Level: STABLE
157+
158+
## Termination Metrics
159+
160+
### `operator_termination_duration_seconds`
161+
The amount of time taken by an object to terminate completely.
162+
- Stability Level: DEPRECATED
163+
164+
### `operator_termination_current_time_seconds`
165+
The current amount of time in seconds that an object has been in terminating state.
166+
- Stability Level: DEPRECATED
167+
168+
## Voluntary Disruption Metrics
169+
170+
### `karpenter_voluntary_disruption_queue_failures_total`
171+
The number of times that an enqueued disruption decision failed. Labeled by disruption method.
172+
- Stability Level: BETA
173+
174+
### `karpenter_voluntary_disruption_eligible_nodes`
175+
Number of nodes eligible for disruption by Karpenter. Labeled by disruption reason.
176+
- Stability Level: BETA
177+
178+
### `karpenter_voluntary_disruption_decisions_total`
179+
Number of disruption decisions performed. Labeled by disruption decision, reason, and consolidation type.
180+
- Stability Level: STABLE
181+
182+
### `karpenter_voluntary_disruption_decision_evaluation_duration_seconds`
183+
Duration of the disruption decision evaluation process in seconds. Labeled by method and consolidation type.
184+
- Stability Level: BETA
185+
186+
### `karpenter_voluntary_disruption_consolidation_timeouts_total`
187+
Number of times the Consolidation algorithm has reached a timeout. Labeled by consolidation type.
188+
- Stability Level: BETA
189+
190+
## Scheduler Metrics
191+
192+
### `karpenter_scheduler_scheduling_duration_seconds`
193+
Duration of scheduling simulations used for deprovisioning and provisioning in seconds.
194+
- Stability Level: STABLE
195+
196+
### `karpenter_scheduler_queue_depth`
197+
The number of pods currently waiting to be scheduled.
198+
- Stability Level: BETA
199+
200+
## Nodepools Metrics
201+
202+
### `karpenter_nodepools_usage`
203+
The amount of resources that have been provisioned for a nodepool. Labeled by nodepool name and resource type.
204+
- Stability Level: ALPHA
205+
206+
### `karpenter_nodepools_limit`
207+
Limits specified on the nodepool that restrict the quantity of resources provisioned. Labeled by nodepool name and resource type.
208+
- Stability Level: ALPHA
209+
210+
### `karpenter_nodepools_allowed_disruptions`
211+
The number of nodes for a given NodePool that can be concurrently disrupting at a point in time. Labeled by NodePool. Note that allowed disruptions can change very rapidly, as new nodes may be created and others may be deleted at any point.
212+
- Stability Level: ALPHA
213+
214+
### `operator_nodepool_status_condition_transitions_total`
215+
The count of transitions of a nodepool, type and status. Labeled by the type, reason, and status.
216+
- Stability Level: BETA
217+
218+
### `operator_nodepool_status_condition_transition_seconds`
219+
The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace.
220+
- Stability Level: BETA
221+
222+
### `operator_nodepool_status_condition_current_status_seconds`
223+
The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason.
224+
- Stability Level: BETA
225+
226+
### `operator_nodepool_status_condition_count`
227+
The number of a condition for a nodepool, type and status. Labeled by the name, namespace, type, status, and reason.
228+
- Stability Level: BETA
229+
230+
### `operator_nodepool_termination_current_time_seconds`
231+
The current amount of time in seconds that a nodepool has been in terminating state. Labeled by name, and namespace.
232+
- Stability Level: BETA
233+
234+
### `operator_nodepool_termination_duration_seconds`
235+
Duration of NodePool termination in seconds.
236+
- Stability Level: BETA
237+
238+
## EC2NodeClass Metrics
239+
240+
### `operator_ec2nodeclass_status_condition_transitions_total`
241+
The count of transitions of an ec2nodeclass, type and status. Labeled by the type, reason, and status.
242+
- Stability Level: BETA
243+
244+
### `operator_ec2nodeclass_status_condition_transition_seconds`
245+
The amount of time a condition was in a given state before transitioning. Labeled by the name of the nodeclaim, and the namespace.
246+
- Stability Level: BETA
247+
248+
### `operator_ec2nodeclass_status_condition_current_status_seconds`
249+
The current amount of time in seconds that a status condition has been in a specific state. Labeled by the name of the nodeclaim, namespace, type, status, and reason.
250+
- Stability Level: BETA
251+
252+
### `operator_ec2nodeclass_status_condition_count`
253+
The number of a condition for an ec2nodeclass, type and status. Labeled by the name, namespace, type, status, and reason.
254+
- Stability Level: BETA
255+
256+
### `operator_ec2nodeclass_termination_current_time_seconds`
257+
The current amount of time in seconds that an ec2nodeclass has been in terminating state. Labeled by name, and namespace.
258+
- Stability Level: BETA
259+
260+
### `operator_ec2nodeclass_termination_duration_seconds`
261+
Duration of ec2nodeclass termination in seconds.
262+
- Stability Level: BETA
263+
264+
## Interruption Metrics
265+
266+
### `karpenter_interruption_received_messages_total`
267+
Count of messages received from the SQS queue. Broken down by message type and whether the message was actionable.
268+
- Stability Level: STABLE
269+
270+
### `karpenter_interruption_message_queue_duration_seconds`
271+
Amount of time an interruption message is on the queue before it is processed by karpenter.
272+
- Stability Level: STABLE
273+
274+
### `karpenter_interruption_deleted_messages_total`
275+
Count of messages deleted from the SQS queue.
276+
- Stability Level: STABLE
277+
278+
## Cluster Metrics
279+
280+
### `karpenter_cluster_utilization_percent`
281+
Utilization of allocatable resources by pod requests
282+
- Stability Level: ALPHA
283+
284+
## Cluster State Metrics
285+
286+
### `karpenter_cluster_state_unsynced_time_seconds`
287+
The time for which cluster state is not synced
288+
- Stability Level: ALPHA
289+
290+
### `karpenter_cluster_state_synced`
291+
Returns 1 if cluster state is synced and 0 otherwise. Synced checks that nodeclaims and nodes that are stored in the APIServer have the same representation as Karpenter's cluster state
292+
- Stability Level: STABLE
293+
294+
### `karpenter_cluster_state_node_count`
295+
Current count of nodes in cluster state
296+
- Stability Level: STABLE
297+
298+
## Cloudprovider Metrics
299+
300+
### `karpenter_cloudprovider_instance_type_offering_price_estimate`
301+
Instance type offering estimated hourly price used when making informed decisions on node cost calculation, based on instance type, capacity type, and zone.
302+
- Stability Level: BETA
303+
304+
### `karpenter_cloudprovider_instance_type_offering_available`
305+
Instance type offering availability, based on instance type, capacity type, and zone
306+
- Stability Level: BETA
307+
308+
### `karpenter_cloudprovider_instance_type_memory_bytes`
309+
Memory, in bytes, for a given instance type.
310+
- Stability Level: BETA
311+
312+
### `karpenter_cloudprovider_instance_type_cpu_cores`
313+
vCPU cores for a given instance type.
314+
- Stability Level: BETA
315+
316+
### `karpenter_cloudprovider_errors_total`
317+
Total number of errors returned from CloudProvider calls.
318+
- Stability Level: BETA
319+
320+
### `karpenter_cloudprovider_duration_seconds`
321+
Duration of cloud provider method calls. Labeled by the controller, method name and provider.
322+
- Stability Level: BETA
323+
324+
## Cloudprovider Batcher Metrics
325+
326+
### `karpenter_cloudprovider_batcher_batch_time_seconds`
327+
Duration of the batching window per batcher
328+
- Stability Level: BETA
329+
330+
### `karpenter_cloudprovider_batcher_batch_size`
331+
Size of the request batch per batcher
332+
- Stability Level: BETA
333+
11334
## Controller Runtime Metrics
12335

13336
### `controller_runtime_terminal_reconcile_errors_total`
@@ -72,6 +395,34 @@ Current depth of workqueue by workqueue and priority
72395
Total number of adds handled by workqueue
73396
- Stability Level: STABLE
74397

398+
## Status Condition Metrics
399+
400+
### `operator_status_condition_transitions_total`
401+
The count of transitions of a given object, type and status.
402+
- Stability Level: DEPRECATED
403+
404+
### `operator_status_condition_transition_seconds`
405+
The amount of time a condition was in a given state before transitioning. e.g. Alarm := P99(Updated=False) > 5 minutes
406+
- Stability Level: DEPRECATED
407+
408+
### `operator_status_condition_current_status_seconds`
409+
The current amount of time in seconds that a status condition has been in a specific state. Alarm := P99(Updated=Unknown) > 5 minutes
410+
- Stability Level: DEPRECATED
411+
412+
### `operator_status_condition_count`
413+
The number of a condition for a given object, type and status. e.g. Alarm := Available=False > 0
414+
- Stability Level: DEPRECATED
415+
416+
## Client Go Metrics
417+
418+
### `client_go_request_total`
419+
Number of HTTP requests, partitioned by status code and method.
420+
- Stability Level: STABLE
421+
422+
### `client_go_request_duration_seconds`
423+
Request latency in seconds. Broken down by verb, group, version, kind, and subresource.
424+
- Stability Level: STABLE
425+
75426
## AWS SDK Go Metrics
76427

77428
### `aws_sdk_go_request_total`

website/content/en/docs/upgrading/upgrade-guide.md

Lines changed: 17 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -86,6 +86,23 @@ If you get the error `invalid ownership metadata; label validation error:` while
8686
WHEN CREATING A NEW SECTION OF THE UPGRADE GUIDANCE FOR NEWER VERSIONS, ENSURE THAT YOU COPY THE BETA API ALERT SECTION FROM THE LAST RELEASE TO PROPERLY WARN USERS OF THE RISK OF UPGRADING WITHOUT GOING TO 0.32.x FIRST
8787
-->
8888

89+
### Upgrading to `1.11.0`+
90+
91+
{{% alert title="Warning" color="warning" %}}
92+
Karpenter `1.1.0` drops the support for `v1beta1` APIs.
93+
**Do not** upgrade to `1.1.0`+ without following the [Migration Guide]({{<ref "../../v1.0/upgrading/v1-migration.md#before-upgrading-to-v110">}}).
94+
{{% /alert %}}
95+
96+
* In the [getting started guide's cloudformation template]({{<ref "../../docs/reference/cloudformation/">}}),
97+
there are new changes to IAM permissions in the Karpenter controller role for supporting placement groups:
98+
- `ec2:DescribePlacementGroups` action in [AllowRegionalReadActions]({{<ref "../../docs/reference/cloudformation/#allowregionalreadactions">}})
99+
- `arn:${AWS::Partition}:ec2:${AWS::Region}:*:placement-group/*` resource in [AllowScopedEC2InstanceAccessActions]({{<ref "../../docs/reference/cloudformation/#allowscopedec2instanceaccessactions">}})
100+
If you are using placement groups, you will need to update your Karpenter controller role.
101+
102+
Full Changelog:
103+
* https://github.com/aws/karpenter-provider-aws/releases/tag/v1.11.0
104+
* https://github.com/kubernetes-sigs/karpenter/releases/tag/v1.11.0
105+
89106
### Upgrading to `1.10.0`+
90107

91108
{{% alert title="Warning" color="warning" %}}
