11 changes: 11 additions & 0 deletions .chloggen/2995-psi.yaml
@@ -0,0 +1,11 @@
change_type: enhancement

component: system

note: "Add Linux PSI (Pressure Stall Information) metrics `system.linux.psi.pressure` and `system.linux.psi.total_time` for measuring resource contention."

issues: [2995]

subtext: |
  PSI metrics track CPU, memory, and I/O resource pressure by measuring the percentage of time tasks are stalled.
  These metrics help with workload sizing, detecting productivity losses, and dynamic system management.
40 changes: 40 additions & 0 deletions docs/registry/attributes/system.md
@@ -7,6 +7,7 @@
- [Filesystem Attributes](#filesystem-attributes)
- [System Memory Attributes](#system-memory-attributes)
- [System Paging Attributes](#system-paging-attributes)
- [System PSI (Pressure Stall Information) Attributes](#system-psi-pressure-stall-information-attributes)
- [Deprecated System Attributes](#deprecated-system-attributes)

## General System Attributes
Expand Down Expand Up @@ -127,6 +128,45 @@ Describes System Memory Paging attributes
| `free` | free | ![Development](https://img.shields.io/badge/-development-blue) |
| `used` | used | ![Development](https://img.shields.io/badge/-development-blue) |

## System PSI (Pressure Stall Information) Attributes

Describes Linux Pressure Stall Information attributes

**Attributes:**

| Key | Stability | Value Type | Description | Example Values |
|---|---|---|---|---|
| <a id="system-psi-resource" href="#system-psi-resource">`system.psi.resource`</a> | ![Development](https://img.shields.io/badge/-development-blue) | string | The resource experiencing pressure [2] | `cpu`; `memory`; `io` |
| <a id="system-psi-stall-type" href="#system-psi-stall-type">`system.psi.stall_type`</a> | ![Development](https://img.shields.io/badge/-development-blue) | string | The PSI stall type | `some`; `full` |
| <a id="system-psi-window" href="#system-psi-window">`system.psi.window`</a> | ![Development](https://img.shields.io/badge/-development-blue) | string | The time window over which pressure is calculated [3] | `10s`; `60s`; `300s` |

**[2] `system.psi.resource`:** Linux PSI (Pressure Stall Information) measures resource pressure for CPU, memory, and I/O. See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).

**[3] `system.psi.window`:** PSI tracks pressure as percentages over 10-second, 60-second, and 300-second windows. This attribute identifies which time window the metric represents.

---

`system.psi.resource` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `cpu` | CPU resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
| `io` | I/O resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
| `memory` | Memory resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |

---

`system.psi.stall_type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `full` | All non-idle tasks are stalled on the resource simultaneously [4] | ![Development](https://img.shields.io/badge/-development-blue) |
| `some` | At least some tasks are stalled on the resource [5] | ![Development](https://img.shields.io/badge/-development-blue) |

**[4]:** The "full" line indicates the share of time in which all non-idle tasks are stalled on a given resource simultaneously. This represents a state where actual CPU cycles are going to waste and the workload is thrashing. CPU full is undefined at the system level, but it has been reported since Linux 5.13, so it is set to zero for backward compatibility.

**[5]:** The "some" line indicates the share of time in which at least some tasks are stalled on a given resource.

## Deprecated System Attributes

Deprecated system attributes.
154 changes: 154 additions & 0 deletions docs/system/system-metrics.md
@@ -60,6 +60,9 @@ Resource attributes related to a host, SHOULD be reported under the `host.*` nam
- [`system.memory.{os}.` - OS Specific System Memory Metrics](#systemmemoryos---os-specific-system-memory-metrics)
- [Metric: `system.memory.linux.available`](#metric-systemmemorylinuxavailable)
- [Metric: `system.memory.linux.slab.usage`](#metric-systemmemorylinuxslabusage)
- [Linux PSI (Pressure Stall Information) metrics](#linux-psi-pressure-stall-information-metrics)
- [Metric: `system.linux.psi.pressure`](#metric-systemlinuxpsipressure)
- [Metric: `system.linux.psi.total_time`](#metric-systemlinuxpsitotal_time)

<!-- tocstop -->

@@ -1291,3 +1294,154 @@ See also the [Slab allocator](https://blogs.oracle.com/linux/post/understanding-
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->

## Linux PSI (Pressure Stall Information) metrics

**Description:** Linux Pressure Stall Information (PSI) metrics captured under the namespace `system.linux.psi`.

PSI is a Linux kernel feature (available since kernel 4.20) that identifies and
quantifies resource contention. It measures the time impact that resource
crunches have on workloads by tracking the percentage of time tasks are stalled
waiting for CPU, memory, or I/O resources.

PSI helps in:

- Sizing workloads to hardware or provisioning hardware according to workload demand
- Detecting productivity losses caused by resource scarcity
- Dynamic system management (load shedding, job migration, strategic pausing)
- Maximizing hardware utilization without sacrificing workload health

For more details, see the [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).
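
As a rough illustration of the raw data behind these metrics: the kernel documentation linked above describes each `/proc/pressure/<resource>` file as a `some` line and a `full` line, each carrying `avg10`, `avg60`, `avg300`, and `total` fields. The Python sketch below parses that text. The function name `parse_psi` and the sample values are hypothetical and not part of this convention.

```python
from typing import Dict

# Hypothetical sample in the layout the kernel documentation describes
# for /proc/pressure/cpu, /proc/pressure/memory and /proc/pressure/io.
SAMPLE = """\
some avg10=1.50 avg60=0.75 avg300=0.10 total=123456
full avg10=0.25 avg60=0.05 avg300=0.00 total=7890
"""


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse the contents of one PSI file into {stall_type: {field: value}}."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        stall_type, fields = parts[0], parts[1:]
        values: Dict[str, float] = {}
        for field in fields:
            # Each field looks like "avg10=1.50" or "total=123456".
            key, _, raw = field.partition("=")
            values[key] = float(raw)
        result[stall_type] = values
    return result


if __name__ == "__main__":
    print(parse_psi(SAMPLE))
```

On a Linux host with PSI enabled, the same function could be fed the contents of, for example, `/proc/pressure/memory`. The `avg*` values are kept exactly as the kernel reports them, and `total` is still in microseconds at this stage.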

### Metric: `system.linux.psi.pressure`

This metric is [recommended][MetricRecommended].

<!-- semconv metric.system.linux.psi.pressure -->
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
<!-- see templates/registry/markdown/snippet.md.j2 -->
<!-- prettier-ignore-start -->
<!-- markdownlint-capture -->
<!-- markdownlint-disable -->

| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
| -------- | --------------- | ----------- | -------------- | --------- | ------ |
| `system.linux.psi.pressure` | Gauge | `1` | Linux Pressure Stall Information (PSI) metric measuring resource contention as percentage of time. [1] | ![Development](https://img.shields.io/badge/-development-blue) | [`host`](/docs/registry/entities/host.md#host) |

**[1]:** PSI (Pressure Stall Information) identifies and quantifies resource contention.
The metric represents the percentage of time that tasks were stalled on a given resource
over the specified time window.

PSI is available on Linux systems with kernel 4.20 or later and requires CONFIG_PSI=y.
CPU "full" stall is undefined at the system level; the kernel has reported it since Linux 5.13 and sets it to zero for backward compatibility.

The ratios are tracked over 10-second, 60-second and 300-second windows.

See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html)

**Attributes:**

| Key | Stability | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Value Type | Description | Example Values |
|---|---|---|---|---|---|
| [`system.psi.resource`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The resource experiencing pressure [1] | `cpu`; `memory`; `io` |
| [`system.psi.stall_type`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The PSI stall type | `some`; `full` |
| [`system.psi.window`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The time window over which pressure is calculated [2] | `10s`; `60s`; `300s` |

**[1] `system.psi.resource`:** Linux PSI (Pressure Stall Information) measures resource pressure for CPU, memory, and I/O. See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).

**[2] `system.psi.window`:** PSI tracks pressure as percentages over 10-second, 60-second, and 300-second windows. This attribute identifies which time window the metric represents.

---

`system.psi.resource` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `cpu` | CPU resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
| `io` | I/O resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
| `memory` | Memory resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |

---

`system.psi.stall_type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `full` | All non-idle tasks are stalled on the resource simultaneously [3] | ![Development](https://img.shields.io/badge/-development-blue) |
| `some` | At least some tasks are stalled on the resource [4] | ![Development](https://img.shields.io/badge/-development-blue) |

**[3]:** The "full" line indicates the share of time in which all non-idle tasks are stalled on a given resource simultaneously. This represents a state where actual CPU cycles are going to waste and the workload is thrashing. CPU full is undefined at the system level, but it has been reported since Linux 5.13, so it is set to zero for backward compatibility.

**[4]:** The "some" line indicates the share of time in which at least some tasks are stalled on a given resource.

<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->
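
To make the attribute combinations concrete, the sketch below expands already-parsed PSI values for one resource into one `system.linux.psi.pressure` data point per stall type and window. It assumes the kernel fields `avg10`, `avg60`, and `avg300` map to the `system.psi.window` values `10s`, `60s`, and `300s`. The helper name `pressure_points` and the sample numbers are hypothetical, and the values are passed through as the kernel reports them, since this sketch does not decide whether any rescaling is needed for the unit `1`.

```python
from typing import Dict, List, Tuple

# Assumed mapping from kernel field name to the `system.psi.window` attribute value.
WINDOWS = {"avg10": "10s", "avg60": "60s", "avg300": "300s"}

# Hypothetical parsed content of one PSI file: {stall_type: {field: value}}.
EXAMPLE = {
    "some": {"avg10": 1.5, "avg60": 0.75, "avg300": 0.1, "total": 123456.0},
    "full": {"avg10": 0.25, "avg60": 0.05, "avg300": 0.0, "total": 7890.0},
}


def pressure_points(
    resource: str, parsed: Dict[str, Dict[str, float]]
) -> List[Tuple[float, Dict[str, str]]]:
    """Return (value, attributes) pairs, one per stall type and window."""
    points: List[Tuple[float, Dict[str, str]]] = []
    for stall_type, fields in parsed.items():
        for field, window in WINDOWS.items():
            if field not in fields:
                continue  # tolerate missing fields instead of failing
            attributes = {
                "system.psi.resource": resource,
                "system.psi.stall_type": stall_type,
                "system.psi.window": window,
            }
            points.append((fields[field], attributes))
    return points


if __name__ == "__main__":
    for value, attributes in pressure_points("memory", EXAMPLE):
        print(value, attributes)
```

With both stall types present, one resource yields six data points (two stall types times three windows). The `total` field is ignored here because it belongs to `system.linux.psi.total_time`.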

### Metric: `system.linux.psi.total_time`

This metric is [recommended][MetricRecommended].

<!-- semconv metric.system.linux.psi.total_time -->
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
<!-- see templates/registry/markdown/snippet.md.j2 -->
<!-- prettier-ignore-start -->
<!-- markdownlint-capture -->
<!-- markdownlint-disable -->

| Name | Instrument Type | Unit (UCUM) | Description | Stability | Entity Associations |
| -------- | --------------- | ----------- | -------------- | --------- | ------ |
| `system.linux.psi.total_time` | Counter | `s` | Linux Pressure Stall Information (PSI) total cumulative stall time. [1] | ![Development](https://img.shields.io/badge/-development-blue) | [`host`](/docs/registry/entities/host.md#host) |

**[1]:** This metric tracks the total absolute stall time since system boot.
Unlike the percentage-based `system.linux.psi.pressure` metric, this allows detection
of latency spikes that wouldn't necessarily make a noticeable impact on time averages.
It also enables calculating average trends over custom time frames.

PSI is available on Linux systems with kernel 4.20 or later and requires CONFIG_PSI=y.
CPU "full" stall is undefined at the system level; the kernel has reported it since Linux 5.13 and sets it to zero for backward compatibility.

This is a monotonically increasing counter that resets on system reboot.

Linux exposes this metric in microseconds. Following OpenTelemetry guidelines for measuring durations,
this metric uses seconds.

See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html)

**Attributes:**

| Key | Stability | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Value Type | Description | Example Values |
|---|---|---|---|---|---|
| [`system.psi.resource`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The resource experiencing pressure [1] | `cpu`; `memory`; `io` |
| [`system.psi.stall_type`](/docs/registry/attributes/system.md) | ![Development](https://img.shields.io/badge/-development-blue) | `Required` | string | The PSI stall type | `some`; `full` |

**[1] `system.psi.resource`:** Linux PSI (Pressure Stall Information) measures resource pressure for CPU, memory, and I/O. See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).

---

`system.psi.resource` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `cpu` | CPU resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
| `io` | I/O resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |
| `memory` | Memory resource pressure | ![Development](https://img.shields.io/badge/-development-blue) |

---

`system.psi.stall_type` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `full` | All non-idle tasks are stalled on the resource simultaneously [2] | ![Development](https://img.shields.io/badge/-development-blue) |
| `some` | At least some tasks are stalled on the resource [3] | ![Development](https://img.shields.io/badge/-development-blue) |

**[2]:** The "full" line indicates the share of time in which all non-idle tasks are stalled on a given resource simultaneously. This represents a state where actual CPU cycles are going to waste and the workload is thrashing. CPU full is undefined at the system level, but it has been reported since Linux 5.13, so it is set to zero for backward compatibility.

**[3]:** The "some" line indicates the share of time in which at least some tasks are stalled on a given resource.

<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->
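
The note on `system.linux.psi.total_time` mentions two things that a short sketch can illustrate: converting the kernel's microsecond counter to seconds, and deriving an average stall share over a custom time frame from two counter samples. The function names below are hypothetical, and treating a decreasing counter as a reboot is an assumption of this sketch rather than something the convention prescribes.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def total_to_seconds(total_us: int) -> float:
    """Convert the kernel's cumulative stall time (microseconds) to seconds."""
    return total_us / MICROSECONDS_PER_SECOND


def stall_share(prev_total_s: float, curr_total_s: float, elapsed_s: float) -> float:
    """Average fraction of the elapsed interval that was spent stalled.

    Both totals are cumulative stall times in seconds, sampled
    `elapsed_s` seconds apart.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta = curr_total_s - prev_total_s
    if delta < 0:
        # Assumption: the counter restarted (for example after a reboot),
        # so only the time accumulated since the restart is counted.
        delta = curr_total_s
    return delta / elapsed_s


if __name__ == "__main__":
    before = total_to_seconds(1_000_000)
    after = total_to_seconds(2_500_000)
    print(stall_share(before, after, elapsed_s=30.0))
```

Because the interval is chosen by the caller, this kind of calculation can surface short stalls that the fixed 10-second, 60-second, and 300-second averages would smooth out.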
65 changes: 65 additions & 0 deletions model/system/metrics.yaml
@@ -563,3 +563,68 @@ groups:
      - ref: system.memory.linux.slab.state
    entity_associations:
      - host

  # system.linux.psi.* metrics
  - id: metric.system.linux.psi.pressure
    type: metric
    metric_name: system.linux.psi.pressure
    annotations:
      code_generation:
        metric_value_type: double
    stability: development
    brief: "Linux Pressure Stall Information (PSI) metric measuring resource contention as percentage of time."
    note: |
      PSI (Pressure Stall Information) identifies and quantifies resource contention.
      The metric represents the percentage of time that tasks were stalled on a given resource
      over the specified time window.

      PSI is available on Linux systems with kernel 4.20 or later and requires CONFIG_PSI=y.
      CPU "full" stall is undefined at the system level; the kernel has reported it since Linux 5.13 and sets it to zero for backward compatibility.

      The ratios are tracked over 10-second, 60-second and 300-second windows.

      See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html)
    instrument: gauge
    unit: "1"
    attributes:
      - ref: system.psi.resource
        requirement_level: required
      - ref: system.psi.stall_type
        requirement_level: required
      - ref: system.psi.window
        requirement_level: required
    entity_associations:
      - host

  - id: metric.system.linux.psi.total_time
    type: metric
    metric_name: system.linux.psi.total_time
    annotations:
      code_generation:
        metric_value_type: double
    stability: development
    brief: "Linux Pressure Stall Information (PSI) total cumulative stall time."
    note: |
      This metric tracks the total absolute stall time since system boot.
      Unlike the percentage-based `system.linux.psi.pressure` metric, this allows detection
      of latency spikes that wouldn't necessarily make a noticeable impact on time averages.
      It also enables calculating average trends over custom time frames.

      PSI is available on Linux systems with kernel 4.20 or later and requires CONFIG_PSI=y.
      CPU "full" stall is undefined at the system level; the kernel has reported it since Linux 5.13 and sets it to zero for backward compatibility.

      This is a monotonically increasing counter that resets on system reboot.

      Linux exposes this metric in microseconds. Following OpenTelemetry guidelines for measuring durations,
      this metric uses seconds.

      See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html)
    instrument: counter
    unit: "s"
    attributes:
      - ref: system.psi.resource
        requirement_level: required
      - ref: system.psi.stall_type
        requirement_level: required
    entity_associations:
      - host
58 changes: 58 additions & 0 deletions model/system/registry.yaml
@@ -155,3 +155,61 @@ groups:
        stability: development
        brief: "The filesystem mount path"
        examples: ["/mnt/data"]
  # system.psi.* attribute group
  - id: registry.system.psi
    type: attribute_group
    display_name: System PSI (Pressure Stall Information) Attributes
    brief: "Describes Linux Pressure Stall Information attributes"
    attributes:
      - id: system.psi.resource
        type:
          members:
            - id: cpu
              value: 'cpu'
              stability: development
              brief: "CPU resource pressure"
            - id: memory
              value: 'memory'
              stability: development
              brief: "Memory resource pressure"
            - id: io
              value: 'io'
              stability: development
              brief: "I/O resource pressure"
        stability: development
        brief: "The resource experiencing pressure"
        examples: ["cpu", "memory", "io"]
        note: >
          Linux PSI (Pressure Stall Information) measures resource pressure for CPU, memory, and I/O.
          See [Linux kernel PSI documentation](https://docs.kernel.org/accounting/psi.html).
      - id: system.psi.stall_type
        type:
          members:
            - id: some
              value: 'some'
              stability: development
              brief: "At least some tasks are stalled on the resource"
              note: >
                The "some" line indicates the share of time in which at least some
                tasks are stalled on a given resource.
            - id: full
              value: 'full'
              stability: development
              brief: "All non-idle tasks are stalled on the resource simultaneously"
              note: >
                The "full" line indicates the share of time in which all non-idle
                tasks are stalled on a given resource simultaneously. This represents
                a state where actual CPU cycles are going to waste and the workload
                is thrashing. CPU full is undefined at the system level, but it has
                been reported since Linux 5.13, so it is set to zero for backward
                compatibility.
        stability: development
        brief: "The PSI stall type"
        examples: ["some", "full"]
      - id: system.psi.window
        type: string
        stability: development
        brief: "The time window over which pressure is calculated"
        examples: ["10s", "60s", "300s"]
        note: >
          PSI tracks pressure as percentages over 10-second, 60-second, and 300-second windows.
          This attribute identifies which time window the metric represents.