Commit 45bb99e

Add Pressure Stall Information (PSI) metrics
1 parent 553948e commit 45bb99e

5 files changed, 357 additions and 0 deletions

.chloggen/2995-psi.yaml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+change_type: enhancement
+
+component: system
+
+note: "Add Linux PSI (Pressure Stall Information) metrics `system.linux.psi.pressure` and `system.linux.psi.total_time` for measuring resource contention."
+
+issues: [2995]
+
+subtext: |
+  PSI metrics track CPU, memory, and I/O resource pressure by measuring the percentage of time tasks are stalled.
+  These metrics help with workload sizing, detecting productivity losses, and dynamic system management.

docs/registry/attributes/system.md

Lines changed: 42 additions & 0 deletions
@@ -7,6 +7,7 @@
77
- [Filesystem Attributes](#filesystem-attributes)
88
- [System Memory Attributes](#system-memory-attributes)
99
- [System Paging Attributes](#system-paging-attributes)
10+
- [System PSI (Pressure Stall Information) Attributes](#system-psi-pressure-stall-information-attributes)
1011
- [Deprecated System Attributes](#deprecated-system-attributes)
1112

1213
## General System Attributes
@@ -117,6 +118,47 @@ Describes System Memory Paging attributes
117118
| `free` | free | ![Development](https://img.shields.io/badge/-development-blue) |
118119
| `used` | used | ![Development](https://img.shields.io/badge/-development-blue) |
119120

121+
## System PSI (Pressure Stall Information) Attributes
122+
123+
Describes Linux Pressure Stall Information attributes
124+
125+
**Attributes:**
126+
127+
| Key | Stability | Value Type | Description | Example Values |
128+
|---|---|---|---|---|
129+
| <a id="system-psi-resource" href="#system-psi-resource">`system.psi.resource`</a> | ![Development](https://img.shields.io/badge/-development-blue) | string | The PSI resource being measured [2] | `cpu`; `memory`; `io` |
130+
| <a id="system-psi-stall-type" href="#system-psi-stall-type">`system.psi.stall_type`</a> | ![Development](https://img.shields.io/badge/-development-blue) | string | The PSI stall type [3] | `some`; `full` |
131+
| <a id="system-psi-window" href="#system-psi-window">`system.psi.window`</a> | ![Development](https://img.shields.io/badge/-development-blue) | string | The time window for PSI pressure calculation [4] | `10s`; `60s`; `300s` |
132+
133+
**[2] `system.psi.resource`:** Linux PSI (Pressure Stall Information) measures resource pressure for CPU, memory, and I/O. See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).
134+
135+
**[3] `system.psi.stall_type`:** PSI distinguishes between "some" stall (at least some tasks stalled) and "full" stall (all non-idle tasks stalled simultaneously).
136+
137+
**[4] `system.psi.window`:** PSI tracks pressure as percentages over 10-second, 60-second, and 300-second windows. This attribute identifies which time window the metric represents.
138+
139+
---
140+
141+
`system.psi.resource` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.
142+
143+
| Value | Description | Stability |
144+
|---|---|---|
145+
| `cpu` | CPU resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
146+
| `io` | I/O resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
147+
| `memory` | Memory resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
148+
149+
---
150+
151+
`system.psi.stall_type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.
152+
153+
| Value | Description | Stability |
154+
|---|---|---|
155+
| `full` | All non-idle tasks are stalled on the resource simultaneously [5] | ![Development](https://img.shields.io/badge/-development-blue) |
156+
| `some` | At least some tasks are stalled on the resource [6] | ![Development](https://img.shields.io/badge/-development-blue) |
157+
158+
**[5]:** The "full" line indicates the share of time in which all non-idle tasks are stalled on a given resource simultaneously. This represents a state where actual CPU cycles are going to waste and the workload is thrashing. CPU full is undefined at the system level and is set to zero for backward compatibility (available since Linux 5.13).
159+
160+
**[6]:** The "some" line indicates the share of time in which at least some tasks are stalled on a given resource.
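A rough way to see how many time series these attributes can produce is to enumerate the well-known values. The sketch below assumes only the well-known values listed above are emitted, and that `system.psi.window` applies to `system.linux.psi.pressure` but not to `system.linux.psi.total_time`, as the metric definitions in this commit state. The function names are hypothetical and not part of the conventions.

```python
from itertools import product

# Well-known values taken from the attribute tables above.
RESOURCES = ("cpu", "io", "memory")
STALL_TYPES = ("full", "some")
WINDOWS = ("10s", "60s", "300s")


def pressure_attribute_sets():
    """Attribute sets for system.linux.psi.pressure (resource x stall type x window)."""
    return [
        {
            "system.psi.resource": resource,
            "system.psi.stall_type": stall_type,
            "system.psi.window": window,
        }
        for resource, stall_type, window in product(RESOURCES, STALL_TYPES, WINDOWS)
    ]


def total_time_attribute_sets():
    """Attribute sets for system.linux.psi.total_time (no window attribute)."""
    return [
        {"system.psi.resource": resource, "system.psi.stall_type": stall_type}
        for resource, stall_type in product(RESOURCES, STALL_TYPES)
    ]
```

Under these assumptions a host reports at most 18 pressure series and 6 total-time series.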
161+
120162
## Deprecated System Attributes
121163

122164
Deprecated system attributes.

docs/system/system-metrics.md

Lines changed: 165 additions & 0 deletions
@@ -60,6 +60,9 @@ Resource attributes related to a host, SHOULD be reported under the `host.*` nam
6060
- [`system.{os}.` - OS Specific System Metrics](#systemos---os-specific-system-metrics)
6161
- [Metric: `system.linux.memory.available`](#metric-systemlinuxmemoryavailable)
6262
- [Metric: `system.linux.memory.slab.usage`](#metric-systemlinuxmemoryslabusage)
63+
- [Linux PSI (Pressure Stall Information) metrics](#linux-psi-pressure-stall-information-metrics)
64+
- [Metric: `system.linux.psi.pressure`](#metric-systemlinuxpsipressure)
65+
- [Metric: `system.linux.psi.total_time`](#metric-systemlinuxpsitotal_time)
6366

6467
<!-- tocstop -->
6568

@@ -1291,3 +1294,165 @@ See also the [Slab allocator](https://blogs.oracle.com/linux/post/understanding-
12911294
<!-- prettier-ignore-end -->
12921295
<!-- END AUTOGENERATED TEXT -->
12931296
<!-- endsemconv -->
1297+
1298+
## Linux PSI (Pressure Stall Information) metrics
1299+
1300+
**Description:** Linux Pressure Stall Information (PSI) metrics captured under the namespace `system.linux.psi`.
1301+
1302+
PSI is a Linux kernel feature (available since kernel 4.20) that identifies and
1303+
quantifies resource contention. It measures the time impact that resource
1304+
crunches have on workloads by tracking the percentage of time tasks are stalled
1305+
waiting for CPU, memory, or I/O resources.
1306+
1307+
PSI helps in:
1308+
1309+
- Sizing workloads to hardware or provisioning hardware according to workload demand
1310+
- Detecting productivity losses caused by resource scarcity
1311+
- Dynamic system management (load shedding, job migration, strategic pausing)
1312+
- Maximizing hardware utilization without sacrificing workload health
1313+
1314+
For more details, see the [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).
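As an illustration of where these numbers come from, the kernel exposes PSI through the files `/proc/pressure/cpu`, `/proc/pressure/memory`, and `/proc/pressure/io`, each containing lines of the form `some avg10=0.00 avg60=0.00 avg300=0.00 total=0`. The following is a minimal parsing sketch, not part of this commit; the names `parse_psi` and `read_psi` are hypothetical.

```python
import re
from typing import Dict

# One line of a /proc/pressure/<resource> file, for example:
#   some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
_PSI_LINE = re.compile(
    r"^(?P<stall>some|full)\s+"
    r"avg10=(?P<avg10>\d+(?:\.\d+)?)\s+"
    r"avg60=(?P<avg60>\d+(?:\.\d+)?)\s+"
    r"avg300=(?P<avg300>\d+(?:\.\d+)?)\s+"
    r"total=(?P<total>\d+)\s*$"
)


def parse_psi(text: str) -> Dict[str, dict]:
    """Parse the contents of one PSI file into {stall_type: fields}.

    Lines that do not match the expected format are skipped.
    """
    result: Dict[str, dict] = {}
    for line in text.splitlines():
        match = _PSI_LINE.match(line.strip())
        if match is None:
            continue
        result[match["stall"]] = {
            "avg10": float(match["avg10"]),    # percent over the last 10 s
            "avg60": float(match["avg60"]),    # percent over the last 60 s
            "avg300": float(match["avg300"]),  # percent over the last 300 s
            "total": int(match["total"]),      # cumulative stall time in microseconds
        }
    return result


def read_psi(resource: str, root: str = "/proc/pressure") -> Dict[str, dict]:
    """Read and parse PSI data for 'cpu', 'memory', or 'io' (Linux only)."""
    with open(f"{root}/{resource}", encoding="ascii") as handle:
        return parse_psi(handle.read())
```

On a kernel without PSI support the files are absent, so `read_psi` raises `FileNotFoundError`; a real collector would likely treat that as "metric not available" rather than as an error.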
1315+
1316+
### Metric: `system.linux.psi.pressure`
1317+
1318+
This metric is [recommended][MetricRecommended].
1319+
1320+
<!-- semconv metric.system.linux.psi.pressure -->
1321+
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
1322+
<!-- see templates/registry/markdown/snippet.md.j2 -->
1323+
<!-- prettier-ignore-start -->
1324+
<!-- markdownlint-capture -->
1325+
<!-- markdownlint-disable -->
1326+
1327+
| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
1328+
| -------- | --------------- | ----------- | -------------- | --------- | ------ |
1329+
| `system.linux.psi.pressure` | Gauge | `1` | Linux Pressure Stall Information (PSI) metric measuring resource contention as percentage of time. [1] | ![Development](https://img.shields.io/badge/-development-blue) | [`host`](/docs/registry/entities/host.md#host) |
1330+
1331+
**[1]:** PSI (Pressure Stall Information) identifies and quantifies resource contention.
1332+
The metric represents the percentage of time that tasks were stalled on a given resource
1333+
over the specified time window.
1334+
1335+
The "some" stall type indicates at least some tasks are stalled on the resource.
1336+
The "full" stall type indicates all non-idle tasks are stalled simultaneously, representing
1337+
a more severe state where the system is thrashing and CPU cycles are wasted.
1338+
1339+
PSI is available on Linux systems with kernel 4.20 or later and requires CONFIG_PSI=y.
1340+
CPU "full" stall is reported as zero at the system level for backward compatibility (available since 5.13).
1341+
1342+
Values are percentages in the range [0, 100]. The ratios are tracked over 10-second, 60-second,
1343+
and 300-second windows.
1344+
1345+
See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html) and
1346+
[/proc/pressure/*](https://man7.org/linux/man-pages/man5/proc.5.html) files.
1347+
1348+
**Attributes:**
1349+
1350+
| Key | Stability | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Value Type | Description | Example Values |
1351+
|---|---|---|---|---|---|
1352+
| [`system.psi.resource`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The resource experiencing pressure (cpu, memory, or io) [1] | `cpu`; `memory`; `io` |
1353+
| [`system.psi.stall_type`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The stall type (some or full) [2] | `some`; `full` |
1354+
| [`system.psi.window`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The time window over which pressure is calculated [3] | `10s`; `60s`; `300s` |
1355+
1356+
**[1] `system.psi.resource`:** Linux PSI (Pressure Stall Information) measures resource pressure for CPU, memory, and I/O. See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).
1357+
1358+
**[2] `system.psi.stall_type`:** PSI distinguishes between "some" stall (at least some tasks stalled) and "full" stall (all non-idle tasks stalled simultaneously).
1359+
1360+
**[3] `system.psi.window`:** Typically one of: 10s, 60s, or 300s
1361+
1362+
---
1363+
1364+
`system.psi.resource` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.
1365+
1366+
| Value | Description | Stability |
1367+
|---|---|---|
1368+
| `cpu` | CPU resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
1369+
| `io` | I/O resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
1370+
| `memory` | Memory resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
1371+
1372+
---
1373+
1374+
`system.psi.stall_type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.
1375+
1376+
| Value | Description | Stability |
1377+
|---|---|---|
1378+
| `full` | All non-idle tasks are stalled on the resource simultaneously [4] | ![Development](https://img.shields.io/badge/-development-blue) |
1379+
| `some` | At least some tasks are stalled on the resource [5] | ![Development](https://img.shields.io/badge/-development-blue) |
1380+
1381+
**[4]:** The "full" line indicates the share of time in which all non-idle tasks are stalled on a given resource simultaneously. This represents a state where actual CPU cycles are going to waste and the workload is thrashing. CPU full is undefined at the system level and is set to zero for backward compatibility (available since Linux 5.13).
1382+
1383+
**[5]:** The "some" line indicates the share of time in which at least some tasks are stalled on a given resource.
1384+
1385+
<!-- markdownlint-restore -->
1386+
<!-- prettier-ignore-end -->
1387+
<!-- END AUTOGENERATED TEXT -->
1388+
<!-- endsemconv -->
1389+
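To make the attribute model concrete, here is a hedged sketch of how already-parsed PSI values for one resource could be turned into `system.linux.psi.pressure` data points. It assumes the input is a plain dictionary keyed by stall type with `avg10`, `avg60`, and `avg300` fields, and it keeps values as percentages in [0, 100] as the metric note states. The helper name `pressure_points` and the tuple layout are illustrative only, not an OpenTelemetry API.

```python
from typing import Dict, List, Tuple

# Mapping from the kernel's field names to system.psi.window values.
WINDOWS = {"avg10": "10s", "avg60": "60s", "avg300": "300s"}

Point = Tuple[float, Dict[str, str]]


def pressure_points(resource: str, parsed: Dict[str, Dict[str, float]]) -> List[Point]:
    """Build (value, attributes) pairs for system.linux.psi.pressure.

    `parsed` maps a stall type ("some" or "full") to its avg10/avg60/avg300
    percentages. Stall types or windows missing from the input are skipped.
    """
    points: List[Point] = []
    for stall_type in ("some", "full"):
        fields = parsed.get(stall_type)
        if fields is None:
            continue
        for field, window in WINDOWS.items():
            if field not in fields:
                continue
            attributes = {
                "system.psi.resource": resource,
                "system.psi.stall_type": stall_type,
                "system.psi.window": window,
            }
            points.append((float(fields[field]), attributes))
    return points
```

Each returned pair carries all three required attributes, so one resource with both stall types yields six gauge points per collection.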
1390+
### Metric: `system.linux.psi.total_time`
1391+
1392+
This metric is [recommended][MetricRecommended].
1393+
1394+
<!-- semconv metric.system.linux.psi.total_time -->
1395+
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
1396+
<!-- see templates/registry/markdown/snippet.md.j2 -->
1397+
<!-- prettier-ignore-start -->
1398+
<!-- markdownlint-capture -->
1399+
<!-- markdownlint-disable -->
1400+
1401+
| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
1402+
| -------- | --------------- | ----------- | -------------- | --------- | ------ |
1403+
| `system.linux.psi.total_time` | Counter | `us` | Linux Pressure Stall Information (PSI) total cumulative stall time. [1] | ![Development](https://img.shields.io/badge/-development-blue) | [`host`](/docs/registry/entities/host.md#host) |
1404+
1405+
**[1]:** This metric tracks the total absolute stall time in microseconds since system boot.
1406+
Unlike the percentage-based `system.linux.psi.pressure` metric, this allows detection
1407+
of latency spikes that wouldn't necessarily make a noticeable impact on time averages.
1408+
It also enables calculating average trends over custom time frames.
1409+
1410+
The "some" stall type indicates at least some tasks are stalled on the resource.
1411+
The "full" stall type indicates all non-idle tasks are stalled simultaneously.
1412+
1413+
PSI is available on Linux systems with kernel 4.20 or later and requires CONFIG_PSI=y.
1414+
CPU "full" stall is reported as zero at the system level for backward compatibility (available since 5.13).
1415+
1416+
This is a monotonically increasing counter that resets on system reboot.
1417+
1418+
See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html) and
1419+
[/proc/pressure/*](https://man7.org/linux/man-pages/man5/proc.5.html) files.
1420+
1421+
**Attributes:**
1422+
1423+
| Key | Stability | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Value Type | Description | Example Values |
1424+
|---|---|---|---|---|---|
1425+
| [`system.psi.resource`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The resource experiencing pressure (cpu, memory, or io) [1] | `cpu`; `memory`; `io` |
1426+
| [`system.psi.stall_type`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The stall type (some or full) [2] | `some`; `full` |
1427+
1428+
**[1] `system.psi.resource`:** Linux PSI (Pressure Stall Information) measures resource pressure for CPU, memory, and I/O. See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).
1429+
1430+
**[2] `system.psi.stall_type`:** PSI distinguishes between "some" stall (at least some tasks stalled) and "full" stall (all non-idle tasks stalled simultaneously).
1431+
1432+
---
1433+
1434+
`system.psi.resource` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.
1435+
1436+
| Value | Description | Stability |
1437+
|---|---|---|
1438+
| `cpu` | CPU resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
1439+
| `io` | I/O resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
1440+
| `memory` | Memory resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
1441+
1442+
---
1443+
1444+
`system.psi.stall_type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.
1445+
1446+
| Value | Description | Stability |
1447+
|---|---|---|
1448+
| `full` | All non-idle tasks are stalled on the resource simultaneously [3] | ![Development](https://img.shields.io/badge/-development-blue) |
1449+
| `some` | At least some tasks are stalled on the resource [4] | ![Development](https://img.shields.io/badge/-development-blue) |
1450+
1451+
**[3]:** The "full" line indicates the share of time in which all non-idle tasks are stalled on a given resource simultaneously. This represents a state where actual CPU cycles are going to waste and the workload is thrashing. CPU full is undefined at the system level and is set to zero for backward compatibility (available since Linux 5.13).
1452+
1453+
**[4]:** The "some" line indicates the share of time in which at least some tasks are stalled on a given resource.
1454+
1455+
<!-- markdownlint-restore -->
1456+
<!-- prettier-ignore-end -->
1457+
<!-- END AUTOGENERATED TEXT -->
1458+
<!-- endsemconv -->
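The note above says the cumulative counter enables averages over custom time frames. One way to do that, sketched here under the assumption that two samples of `system.linux.psi.total_time` (in microseconds) and the wall-clock time between them are available, is to divide the counter delta by the elapsed time. The function name `stall_percent` is hypothetical.

```python
from typing import Optional


def stall_percent(
    total_start_us: int, total_end_us: int, elapsed_seconds: float
) -> Optional[float]:
    """Percentage of the interval spent stalled, from two counter samples.

    Returns None when the counter went backwards, which is treated here as a
    reset (for example a reboot between the two samples).
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta_us = total_end_us - total_start_us
    if delta_us < 0:
        return None
    elapsed_us = elapsed_seconds * 1_000_000
    # Clamp to 100 to absorb small timing skew between the two readings.
    return min(100.0, 100.0 * delta_us / elapsed_us)
```

For example, a counter that advances by 500,000 microseconds over a 5 second interval corresponds to 10% stall time for that interval, regardless of the kernel's fixed 10, 60, and 300 second windows.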

model/system/metrics.yaml

Lines changed: 78 additions & 0 deletions
@@ -563,3 +563,81 @@ groups:
       - ref: linux.memory.slab.state
     entity_associations:
       - host
+
+  # system.linux.psi.* metrics
+  - id: metric.system.linux.psi.pressure
+    type: metric
+    metric_name: system.linux.psi.pressure
+    annotations:
+      code_generation:
+        metric_value_type: double
+    stability: development
+    brief: "Linux Pressure Stall Information (PSI) metric measuring resource contention as percentage of time."
+    note: |
+      PSI (Pressure Stall Information) identifies and quantifies resource contention.
+      The metric represents the percentage of time that tasks were stalled on a given resource
+      over the specified time window.
+
+      The "some" stall type indicates at least some tasks are stalled on the resource.
+      The "full" stall type indicates all non-idle tasks are stalled simultaneously, representing
+      a more severe state where the system is thrashing and CPU cycles are wasted.
+
+      PSI is available on Linux systems with kernel 4.20 or later and requires CONFIG_PSI=y.
+      CPU "full" stall is reported as zero at the system level for backward compatibility (available since 5.13).
+
+      Values are percentages in the range [0, 100]. The ratios are tracked over 10-second, 60-second,
+      and 300-second windows.
+
+      See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html) and
+      [/proc/pressure/*](https://man7.org/linux/man-pages/man5/proc.5.html) files.
+    instrument: gauge
+    unit: "1"
+    attributes:
+      - ref: system.psi.resource
+        requirement_level: required
+        brief: "The resource experiencing pressure (cpu, memory, or io)"
+      - ref: system.psi.stall_type
+        requirement_level: required
+        brief: "The stall type (some or full)"
+      - ref: system.psi.window
+        requirement_level: required
+        brief: "The time window over which pressure is calculated"
+        note: "Typically one of: 10s, 60s, or 300s"
+    entity_associations:
+      - host
+
+  - id: metric.system.linux.psi.total_time
+    type: metric
+    metric_name: system.linux.psi.total_time
+    annotations:
+      code_generation:
+        metric_value_type: int
+    stability: development
+    brief: "Linux Pressure Stall Information (PSI) total cumulative stall time."
+    note: |
+      This metric tracks the total absolute stall time in microseconds since system boot.
+      Unlike the percentage-based `system.linux.psi.pressure` metric, this allows detection
+      of latency spikes that wouldn't necessarily make a noticeable impact on time averages.
+      It also enables calculating average trends over custom time frames.
+
+      The "some" stall type indicates at least some tasks are stalled on the resource.
+      The "full" stall type indicates all non-idle tasks are stalled simultaneously.
+
+      PSI is available on Linux systems with kernel 4.20 or later and requires CONFIG_PSI=y.
+      CPU "full" stall is reported as zero at the system level for backward compatibility (available since 5.13).
+
+      This is a monotonically increasing counter that resets on system reboot.
+
+      See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html) and
+      [/proc/pressure/*](https://man7.org/linux/man-pages/man5/proc.5.html) files.
+    instrument: counter
+    unit: "us"
+    attributes:
+      - ref: system.psi.resource
+        requirement_level: required
+        brief: "The resource experiencing pressure (cpu, memory, or io)"
+      - ref: system.psi.stall_type
+        requirement_level: required
+        brief: "The stall type (some or full)"
+    entity_associations:
+      - host
