11 changes: 11 additions & 0 deletions .chloggen/2995-psi.yaml
@@ -0,0 +1,11 @@
change_type: enhancement

component: system

note: "Add Linux PSI (Pressure Stall Information) metrics `system.linux.psi.pressure` and `system.linux.psi.total_time` for measuring resource contention."

issues: [2995]

subtext: |
  PSI metrics track CPU, memory, and I/O resource pressure by measuring the percentage of time tasks are stalled.
  These metrics help with workload sizing, detection of productivity losses, and dynamic system management.
44 changes: 42 additions & 2 deletions docs/registry/attributes/system.md
@@ -5,6 +5,7 @@

- [General System Attributes](#general-system-attributes)
- [Filesystem Attributes](#filesystem-attributes)
- [System PSI (Pressure Stall Information) Attributes](#system-psi-pressure-stall-information-attributes)
- [System Memory Attributes](#system-memory-attributes)
- [System Paging Attributes](#system-paging-attributes)
- [Deprecated System Attributes](#deprecated-system-attributes)
@@ -55,6 +56,45 @@ Describes Filesystem attributes
| `ntfs` | ntfs | ![Development](https://img.shields.io/badge/-development-blue) |
| `refs` | refs | ![Development](https://img.shields.io/badge/-development-blue) |

## System PSI (Pressure Stall Information) Attributes

Describes Linux Pressure Stall Information attributes

**Attributes:**

| Key | Stability | Value Type | Description | Example Values |
| --- | --- | --- | --- | --- |
| <a id="system-linux-psi-resource" href="#system-linux-psi-resource">`system.linux.psi.resource`</a> | ![Development](https://img.shields.io/badge/-development-blue) | string | The resource experiencing pressure [1] | `cpu`; `memory`; `io` |
| <a id="system-linux-psi-stall-type" href="#system-linux-psi-stall-type">`system.linux.psi.stall_type`</a> | ![Development](https://img.shields.io/badge/-development-blue) | string | The PSI stall type | `some`; `full` |
| <a id="system-linux-psi-window" href="#system-linux-psi-window">`system.linux.psi.window`</a> | ![Development](https://img.shields.io/badge/-development-blue) | int | The length, in seconds, of the time window over which pressure is calculated. [2] | `10`; `60`; `300` |

**[1] `system.linux.psi.resource`:** Linux PSI (Pressure Stall Information) measures resource pressure for CPU, memory, and I/O. See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).

**[2] `system.linux.psi.window`:** PSI tracks pressure as percentages over 10-second, 60-second, and 300-second windows (the kernel's `avg10`, `avg60`, and `avg300` fields). This attribute identifies which of these windows the metric represents.

---

`system.linux.psi.resource` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| `cpu` | CPU resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
| `io` | I/O resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
| `memory` | Memory resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |

---

`system.linux.psi.stall_type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| `full` | All non-idle tasks are stalled on the resource simultaneously [3] | ![Development](https://img.shields.io/badge/-development-blue) |
| `some` | At least some tasks are stalled on the resource [4] | ![Development](https://img.shields.io/badge/-development-blue) |

**[3]:** The "full" line indicates the share of time in which all non-idle tasks are stalled on a given resource simultaneously. In this state actual CPU cycles are going to waste, and a workload that spends extended time in this state is considered to be thrashing. CPU "full" is undefined at the system level; it has been reported since Linux 5.13 and is set to zero for backward compatibility.

**[4]:** The "some" line indicates the share of time in which at least some tasks are stalled on a given resource.

## System Memory Attributes

Describes System Memory attributes
@@ -84,9 +124,9 @@ Describes System Memory attributes
| `buffers` | buffers | ![Development](https://img.shields.io/badge/-development-blue) |
| `cached` | cached | ![Development](https://img.shields.io/badge/-development-blue) |
| `free` | free | ![Development](https://img.shields.io/badge/-development-blue) |
| `used` | Actual used virtual memory in bytes. [1] | ![Development](https://img.shields.io/badge/-development-blue) |
| `used` | Actual used virtual memory in bytes. [5] | ![Development](https://img.shields.io/badge/-development-blue) |

**[1]:** Calculation based on the operating system metrics. On Linux, this corresponds to "MemTotal - MemAvailable" from /proc/meminfo, which more accurately reflects memory in active use by applications compared to older formulas based on free, cached, and buffers. If MemAvailable is not available, a fallback to those older formulas may be used.
**[5]:** Calculation based on the operating system metrics. On Linux, this corresponds to "MemTotal - MemAvailable" from /proc/meminfo, which more accurately reflects memory in active use by applications compared to older formulas based on free, cached, and buffers. If MemAvailable is not available, a fallback to those older formulas may be used.

## System Paging Attributes

148 changes: 148 additions & 0 deletions docs/system/system-metrics.md
@@ -60,6 +60,9 @@ Resource attributes related to a host, SHOULD be reported under the `host.*` nam
- [`system.memory.{os}.` - OS Specific System Memory Metrics](#systemmemoryos---os-specific-system-memory-metrics)
- [Metric: `system.memory.linux.available`](#metric-systemmemorylinuxavailable)
- [Metric: `system.memory.linux.slab.usage`](#metric-systemmemorylinuxslabusage)
- [Linux PSI (Pressure Stall Information) metrics](#linux-psi-pressure-stall-information-metrics)
- [Metric: `system.linux.psi.pressure`](#metric-systemlinuxpsipressure)
- [Metric: `system.linux.psi.total_time`](#metric-systemlinuxpsitotal_time)

<!-- tocstop -->

@@ -1198,3 +1201,148 @@ See also the [Slab allocator](https://blogs.oracle.com/linux/post/understanding-
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->

## Linux PSI (Pressure Stall Information) metrics

**Description:** Linux Pressure Stall Information (PSI) metrics captured under the namespace `system.linux.psi`.

PSI is a Linux kernel feature (available since kernel 4.20) that identifies and
quantifies resource contention. It measures the time impact that resource
crunches have on workloads by tracking the percentage of time tasks are stalled
waiting for CPU, memory, or I/O resources.

PSI helps with:

- Sizing workloads to hardware or provisioning hardware according to workload demand
- Detecting productivity losses caused by resource scarcity
- Dynamic system management (load shedding, job migration, strategic pausing)
- Maximizing hardware utilization without sacrificing workload health

For more details, see the [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).
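The kernel exposes these values in `/proc/pressure/cpu`, `/proc/pressure/memory`, and `/proc/pressure/io`, one line per stall type, for example `some avg10=0.00 avg60=0.00 avg300=0.00 total=0`. The `avgN` fields are percentages and `total` is the cumulative stall time in microseconds. As a minimal sketch of how a collector might read one of these files, the hypothetical `parse_psi` helper below splits each line into its stall type and `key=value` fields; the sample input is made-up illustrative text, not a real measurement.

```python
from typing import Dict


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse the contents of a /proc/pressure/<resource> file (hypothetical helper).

    Each non-empty line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    avgN values are percentages; total is cumulative stall time in microseconds.
    """
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        stall_type, pairs = fields[0], fields[1:]
        values: Dict[str, float] = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # Keep the microsecond counter as an exact integer.
            values[key] = int(raw) if key == "total" else float(raw)
        result[stall_type] = values
    return result


# Illustrative input only; real files are read from /proc/pressure/<resource>.
SAMPLE = """some avg10=1.25 avg60=0.80 avg300=0.40 total=123456
full avg10=0.50 avg60=0.30 avg300=0.10 total=45678
"""

if __name__ == "__main__":
    parsed = parse_psi(SAMPLE)
    print(parsed["some"]["avg10"])
    print(parsed["full"]["total"])
```

On a real host the input would come from something like `open("/proc/pressure/memory").read()`; the file is absent when the kernel was built without PSI support.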

### Metric: `system.linux.psi.pressure`

This metric is [recommended][MetricRecommended].

<!-- semconv metric.system.linux.psi.pressure -->
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
<!-- see templates/registry/markdown/snippet.md.j2 -->
<!-- prettier-ignore-start -->

| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
| -------- | --------------- | ----------- | -------------- | --------- | ------ |
| `system.linux.psi.pressure` | Gauge | `1` | Linux Pressure Stall Information (PSI) metric measuring resource contention as a percentage of time. [1] | ![Development](https://img.shields.io/badge/-development-blue) | [`host`](/docs/registry/entities/host.md#host) |

**[1]:** PSI (Pressure Stall Information) identifies and quantifies resource contention.
The metric represents the percentage of time that tasks were stalled on a given resource
over the specified time window.

PSI is available on Linux kernel 4.20 or later built with `CONFIG_PSI=y`.
At the system level, CPU "full" stall is undefined; since Linux 5.13 the kernel reports it as zero for backward compatibility.

The ratios are tracked over 10-second, 60-second, and 300-second windows.

See the [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).

**Attributes:**

| Key | Stability | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Value Type | Description | Example Values |
| --- | --- | --- | --- | --- | --- |
| [`system.linux.psi.resource`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The resource experiencing pressure [1] | `cpu`; `memory`; `io` |
| [`system.linux.psi.stall_type`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The PSI stall type | `some`; `full` |
| [`system.linux.psi.window`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | int | The length, in seconds, of the time window over which pressure is calculated. [2] | `10`; `60`; `300` |

**[1] `system.linux.psi.resource`:** Linux PSI (Pressure Stall Information) measures resource pressure for CPU, memory, and I/O. See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).

**[2] `system.linux.psi.window`:** PSI tracks pressure as percentages over 10-second, 60-second, and 300-second windows (the kernel's `avg10`, `avg60`, and `avg300` fields). This attribute identifies which of these windows the metric represents.

---

`system.linux.psi.resource` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| `cpu` | CPU resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
| `io` | I/O resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
| `memory` | Memory resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |

---

`system.linux.psi.stall_type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| `full` | All non-idle tasks are stalled on the resource simultaneously [3] | ![Development](https://img.shields.io/badge/-development-blue) |
| `some` | At least some tasks are stalled on the resource [4] | ![Development](https://img.shields.io/badge/-development-blue) |

**[3]:** The "full" line indicates the share of time in which all non-idle tasks are stalled on a given resource simultaneously. In this state actual CPU cycles are going to waste, and a workload that spends extended time in this state is considered to be thrashing. CPU "full" is undefined at the system level; it has been reported since Linux 5.13 and is set to zero for backward compatibility.

**[4]:** The "some" line indicates the share of time in which at least some tasks are stalled on a given resource.

<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->
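To illustrate how the attributes above could map onto the kernel's fields, here is a minimal sketch that turns already-parsed PSI values into `(value, attributes)` pairs for `system.linux.psi.pressure`, one per stall type and window. The `pressure_points` function and `WINDOW_FIELDS` table are hypothetical. It also assumes, since the metric's unit is `1`, that the kernel's percentages are divided by 100 to yield a ratio between 0 and 1; treat that conversion as an assumption, not as the convention's stated rule.

```python
from typing import Dict, List, Tuple

# Kernel field name -> value of the system.linux.psi.window attribute (seconds).
WINDOW_FIELDS = {"avg10": 10, "avg60": 60, "avg300": 300}


def pressure_points(
    resource: str, parsed: Dict[str, Dict[str, float]]
) -> List[Tuple[float, Dict[str, object]]]:
    """Build (value, attributes) pairs for system.linux.psi.pressure (hypothetical).

    Assumption: kernel percentages are divided by 100 so values fit unit "1".
    The "total" field is ignored here; it feeds system.linux.psi.total_time.
    """
    points: List[Tuple[float, Dict[str, object]]] = []
    for stall_type, values in parsed.items():
        for field, window in WINDOW_FIELDS.items():
            if field not in values:
                continue  # tolerate missing fields instead of failing
            attrs: Dict[str, object] = {
                "system.linux.psi.resource": resource,
                "system.linux.psi.stall_type": stall_type,
                "system.linux.psi.window": window,
            }
            points.append((values[field] / 100.0, attrs))
    return points


# Illustrative, already-parsed contents of /proc/pressure/memory (not real data).
parsed_memory = {
    "some": {"avg10": 1.25, "avg60": 0.80, "avg300": 0.40, "total": 123456},
    "full": {"avg10": 0.50, "avg60": 0.30, "avg300": 0.10, "total": 45678},
}

for value, attrs in pressure_points("memory", parsed_memory):
    print(f"{value:.4f}", attrs)
```

Each resource file therefore yields up to six data points (two stall types times three windows), which matches the three required attributes on this metric.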

### Metric: `system.linux.psi.total_time`

This metric is [recommended][MetricRecommended].

<!-- semconv metric.system.linux.psi.total_time -->
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
<!-- see templates/registry/markdown/snippet.md.j2 -->
<!-- prettier-ignore-start -->

| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
| -------- | --------------- | ----------- | -------------- | --------- | ------ |
| `system.linux.psi.total_time` | Counter | `s` | Linux Pressure Stall Information (PSI) total cumulative stall time. [1] | ![Development](https://img.shields.io/badge/-development-blue) | [`host`](/docs/registry/entities/host.md#host) |

**[1]:** This metric tracks the total absolute stall time since system boot.
Unlike the percentage-based `system.linux.psi.pressure` metric, it allows detection
of latency spikes that would not necessarily make a noticeable dent in the time averages.
It also allows averaging trends over custom time frames.

PSI is available on Linux kernel 4.20 or later built with `CONFIG_PSI=y`.
At the system level, CPU "full" stall is undefined; since Linux 5.13 the kernel reports it as zero for backward compatibility.

This is a monotonically increasing counter that resets on system reboot.

Linux exposes this value in microseconds; following the OpenTelemetry guidelines for durations,
this metric is reported in seconds.

See the [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).

**Attributes:**

| Key | Stability | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Value Type | Description | Example Values |
| --- | --- | --- | --- | --- | --- |
| [`system.linux.psi.resource`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The resource experiencing pressure [1] | `cpu`; `memory`; `io` |
| [`system.linux.psi.stall_type`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The PSI stall type | `some`; `full` |

**[1] `system.linux.psi.resource`:** Linux PSI (Pressure Stall Information) measures resource pressure for CPU, memory, and I/O. See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).

---

`system.linux.psi.resource` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| `cpu` | CPU resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
| `io` | I/O resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
| `memory` | Memory resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |

---

`system.linux.psi.stall_type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| --- | --- | --- |
| `full` | All non-idle tasks are stalled on the resource simultaneously [2] | ![Development](https://img.shields.io/badge/-development-blue) |
| `some` | At least some tasks are stalled on the resource [3] | ![Development](https://img.shields.io/badge/-development-blue) |

**[2]:** The "full" line indicates the share of time in which all non-idle tasks are stalled on a given resource simultaneously. In this state actual CPU cycles are going to waste, and a workload that spends extended time in this state is considered to be thrashing. CPU "full" is undefined at the system level; it has been reported since Linux 5.13 and is set to zero for backward compatibility.

**[3]:** The "some" line indicates the share of time in which at least some tasks are stalled on a given resource.

<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->
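The `system.linux.psi.total_time` note mentions two derived uses: converting the kernel's microsecond counter to seconds, and averaging stall time over a custom time frame. A minimal sketch of both follows; the function names are hypothetical. The average is the stall time accumulated between two samples divided by the wall-clock time between them, so over a 10-second gap it should roughly track the kernel's own `avg10` ratio. Because the counter resets on reboot, a decreasing value is treated as a reset rather than a valid delta.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def total_time_seconds(total_us: int) -> float:
    """Convert the kernel's cumulative 'total' field (microseconds) to seconds."""
    return total_us / MICROSECONDS_PER_SECOND


def average_stall_ratio(prev_total_us: int, curr_total_us: int, elapsed_s: float) -> float:
    """Share of time stalled between two samples of the same counter (hypothetical).

    Returns a ratio between 0 and 1: stall seconds accumulated between the
    samples divided by the wall-clock seconds that elapsed between them.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_us = curr_total_us - prev_total_us
    if delta_us < 0:
        # The counter went backwards, e.g. after a reboot; no valid delta exists.
        raise ValueError("counter reset detected")
    return total_time_seconds(delta_us) / elapsed_s


if __name__ == "__main__":
    # Two illustrative samples taken 15 seconds apart (not real measurements):
    # 1.5 s of accumulated stall over 15 s of wall-clock time.
    print(total_time_seconds(3_500_000))
    print(average_stall_ratio(2_000_000, 3_500_000, 15.0))
```

Raising on a reset keeps the sketch simple; a real collector would more likely start a new cumulative series, as the OpenTelemetry metrics data model allows for counter resets.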
65 changes: 65 additions & 0 deletions model/system/metrics.yaml
@@ -563,3 +563,68 @@ groups:
      - ref: system.memory.linux.slab.state
    entity_associations:
      - host

  # system.linux.psi.* metrics
  - id: metric.system.linux.psi.pressure
    type: metric
    metric_name: system.linux.psi.pressure
    annotations:
      code_generation:
        metric_value_type: double
    stability: development
    brief: "Linux Pressure Stall Information (PSI) metric measuring resource contention as a percentage of time."
    note: |
      PSI (Pressure Stall Information) identifies and quantifies resource contention.
      The metric represents the percentage of time that tasks were stalled on a given resource
      over the specified time window.

      PSI is available on Linux kernel 4.20 or later built with `CONFIG_PSI=y`.
      At the system level, CPU "full" stall is undefined; since Linux 5.13 the kernel reports it as zero for backward compatibility.

      The ratios are tracked over 10-second, 60-second, and 300-second windows.

      See the [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).
    instrument: gauge
    unit: "1"
    attributes:
      - ref: system.linux.psi.resource
        requirement_level: required
      - ref: system.linux.psi.stall_type
        requirement_level: required
      - ref: system.linux.psi.window
        requirement_level: required
    entity_associations:
      - host

  - id: metric.system.linux.psi.total_time
    type: metric
    metric_name: system.linux.psi.total_time
    annotations:
      code_generation:
        metric_value_type: double
    stability: development
    brief: "Linux Pressure Stall Information (PSI) total cumulative stall time."
    note: |
      This metric tracks the total absolute stall time since system boot.
      Unlike the percentage-based `system.linux.psi.pressure` metric, it allows detection
      of latency spikes that would not necessarily make a noticeable dent in the time averages.
      It also allows averaging trends over custom time frames.

      PSI is available on Linux kernel 4.20 or later built with `CONFIG_PSI=y`.
      At the system level, CPU "full" stall is undefined; since Linux 5.13 the kernel reports it as zero for backward compatibility.

      This is a monotonically increasing counter that resets on system reboot.

      Linux exposes this value in microseconds; following the OpenTelemetry guidelines for durations,
      this metric is reported in seconds.

      See the [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).
    instrument: counter
    unit: "s"
    attributes:
      - ref: system.linux.psi.resource
        requirement_level: required
      - ref: system.linux.psi.stall_type
        requirement_level: required
    entity_associations:
      - host