Original file line number Diff line number Diff line change
@@ -11,6 +11,14 @@ products:

Learn about key host metrics displayed in the Infrastructure UI:

* [Elastic System integration host metrics](#ecs-host-metrics)
* [OpenTelemetry host metrics](#open-telemetry-host-metrics)


## Elastic System integration host metrics [ecs-host-metrics]

Refer to the following sections for host metrics and field calculation formulas for the Elastic System integration data:

* [Hosts](#key-metrics-hosts)
* [CPU usage](#key-metrics-cpu)
* [Memory](#key-metrics-memory)
@@ -19,15 +27,14 @@ Learn about key host metrics displayed in the Infrastructure UI:
* [Disk](#observability-host-metrics-disk-metrics)
* [Legacy](#legacy-metrics)


## Hosts metrics [key-metrics-hosts]
### Hosts metrics [key-metrics-hosts]

| Metric | Description |
| --- | --- |
| **Hosts** | Number of hosts returned by your search criteria.<br><br>**Field Calculation**: `count(system.cpu.cores)`<br> |
| **Hosts** | Number of hosts returned by your search criteria.<br><br>**Field Calculation**: `unique_count(host.name)`<br> |


## CPU usage metrics [key-metrics-cpu]
### CPU usage metrics [key-metrics-cpu]

| Metric | Description |
| --- | --- |
@@ -45,7 +52,7 @@ Learn about key host metrics displayed in the Infrastructure UI:
| **Normalized Load** | 1 minute load average normalized by the number of CPU cores.<br><br>Load average gives an indication of the number of threads that are runnable (either busy running on CPU, waiting to run, or waiting for a blocking IO operation to complete).<br><br>100% means the 1 minute load average is equal to the number of CPU cores of the host.<br><br>For example, on a host with 32 CPU cores, if the 1 minute load average is 32, the value reported here is 100%. If the 1 minute load average is 48, the value reported here is 150%.<br><br>**Field Calculation**: `average(system.load.1) / max(system.load.cores)`<br> |

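The normalized load arithmetic described in the table can be sketched as follows. This is a minimal illustration, not the Infrastructure UI implementation: the function name `normalized_load` and its inputs are hypothetical, and the samples stand in for `system.load.1` values collected over the selected time range.

```python
def normalized_load(load_1m_samples, cpu_cores):
    """Average the 1 minute load samples and divide by the core count.

    Mirrors the shape of average(system.load.1) / max(system.load.cores).
    Both arguments are illustrative stand-ins for the indexed fields.
    """
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    average_load = sum(load_1m_samples) / len(load_1m_samples)
    return average_load / cpu_cores


# A host with 32 cores and a load of 32 is fully loaded; a load of 48 is 150%.
print(f"{normalized_load([32.0], 32):.0%}")
print(f"{normalized_load([48.0], 32):.0%}")
```

A value above 100% means more threads are runnable than there are cores to run them.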

## Memory metrics [key-metrics-memory]
### Memory metrics [key-metrics-memory]

| Metric | Description |
| --- | --- |
@@ -57,22 +64,22 @@ Learn about key host metrics displayed in the Infrastructure UI:
| **Memory Used** | Main memory usage excluding page cache.<br><br>**Field Calculation**: `average(system.memory.actual.used.bytes)`<br> |


## Log metrics [key-metrics-log]
### Log metrics [key-metrics-log]

| Metric | Description |
| --- | --- |
| **Log Rate** | Derivative of the cumulative sum of the document count scaled to a 1 second rate. This metric relies on the same indices as the logs.<br><br>**Field Calculation**: `cumulative_sum(doc_count)`<br> |

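To make the "derivative of the cumulative sum, scaled to a 1 second rate" wording concrete, here is a small sketch. It assumes evenly sized time buckets and uses a hypothetical helper, `log_rate_per_second`; it is not the query that the UI runs.

```python
from itertools import accumulate


def log_rate_per_second(doc_counts, bucket_seconds):
    """Turn per-bucket document counts into a per-second rate.

    Steps: cumulative sum of the counts, then the difference between
    consecutive cumulative values, then a division by the bucket length.
    The first bucket has no predecessor, so it produces no rate.
    """
    cumulative = list(accumulate(doc_counts))
    return [
        (current - previous) / bucket_seconds
        for previous, current in zip(cumulative, cumulative[1:])
    ]


# Three 60 second buckets containing 60, 120, and 30 log documents.
rates = log_rate_per_second([60, 120, 30], bucket_seconds=60)
print(rates)
```

With evenly sized buckets, each rate is simply that bucket's document count divided by the bucket length.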

## Network metrics [key-metrics-network]
### Network metrics [key-metrics-network]

| Metric | Description |
| --- | --- |
| **Network Inbound (RX)** | Number of bytes that have been received per second on the public interfaces of the hosts.<br><br>**Field Calculation**: `sum(host.network.ingress.bytes) * 8 / 1000`<br><br>For legacy metric calculations, refer to [Legacy metrics](#legacy-metrics).<br> |
| **Network Outbound (TX)** | Number of bytes that have been sent per second on the public interfaces of the hosts.<br><br>**Field Calculation**: `sum(host.network.egress.bytes) * 8 / 1000`<br><br>For legacy metric calculations, refer to [Legacy metrics](#legacy-metrics).<br> |


## Disk metrics [observability-host-metrics-disk-metrics]
### Disk metrics [observability-host-metrics-disk-metrics]

| Metric | Description |
| --- | --- |
@@ -84,8 +91,7 @@ Learn about key host metrics displayed in the Infrastructure UI:
| **Disk Write IOPS** | Average count of write operations from the device per second.<br><br>**Field Calculation**: `counter_rate(max(system.diskio.write.count), kql='system.diskio.write.count: *')`<br> |
| **Disk Write Throughput** | Average number of bytes written from the device per second.<br><br>**Field Calculation**: `counter_rate(max(system.diskio.write.bytes), kql='system.diskio.write.bytes: *')`<br> |


## Legacy metrics [legacy-metrics]
### Legacy metrics [legacy-metrics]

Over time, we may change the formula used to calculate a specific metric. To avoid affecting your existing rules, instead of changing the actual metric definition, we create a new metric and refer to the old one as "legacy."

@@ -96,3 +102,75 @@ The UI and any new rules you create will use the new metric definition. However,
| **CPU Usage (legacy)** | Percentage of CPU time spent in states other than Idle and IOWait, normalized by the number of CPU cores. This includes both time spent on user space and kernel space. 100% means all CPUs of the host are busy.<br><br>**Field Calculation**: `(average(system.cpu.user.pct) + average(system.cpu.system.pct)) / max(system.cpu.cores)`<br> |
| **Network Inbound (RX) (legacy)** | Number of bytes that have been received per second on the public interfaces of the hosts.<br><br>**Field Calculation**: `average(host.network.ingress.bytes) * 8 / (max(metricset.period, kql='host.network.ingress.bytes: *') / 1000)`<br> |
| **Network Outbound (TX) (legacy)** | Number of bytes that have been sent per second on the public interfaces of the hosts.<br><br>**Field Calculation**: `average(host.network.egress.bytes) * 8 / (max(metricset.period, kql='host.network.egress.bytes: *') / 1000)`<br> |

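The legacy network formulas divide by `metricset.period`, which this sketch assumes is expressed in milliseconds (hence the `/ 1000`). Under that assumption the result is a bits-per-second figure. The function name and sample values below are hypothetical.

```python
def legacy_network_rate(byte_samples, period_ms):
    """Sketch of average(bytes) * 8 / (max(metricset.period) / 1000).

    byte_samples: bytes reported per collection period (illustrative).
    period_ms: collection period, assumed to be in milliseconds.
    """
    if period_ms <= 0:
        raise ValueError("period_ms must be positive")
    average_bytes = sum(byte_samples) / len(byte_samples)
    bits = average_bytes * 8
    period_seconds = period_ms / 1000
    return bits / period_seconds


# 125,000 bytes per 10 second period is 1,000,000 bits over 10 seconds.
print(legacy_network_rate([125_000], period_ms=10_000))
```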
## OpenTelemetry host metrics [open-telemetry-host-metrics]

Refer to the following sections for host metrics and field calculation formulas for OpenTelemetry data:

* [Hosts](#otel-metrics-hosts)
* [CPU usage](#otel-metrics-cpu)
* [Memory](#otel-metrics-memory)
* [Log](#otel-metrics-log)
* [Network](#otel-metrics-network)
* [Disk](#otel-metrics-disk)

### OpenTelemetry hosts metrics [otel-metrics-hosts]

| Metric | Description |
| --- | --- |
| **Hosts** | Number of hosts returned by your search criteria.<br><br>**Field Calculation**: `unique_count(host.name)`<br> |

### OpenTelemetry CPU usage metrics [otel-metrics-cpu]

| Metric | Description |
| --- | --- |
| **CPU Usage (%)** | Average of percentage of CPU time spent in states other than Idle and IOWait, normalized by the number of CPU cores. Includes both time spent on user space and kernel space. 100% means all CPUs of the host are busy.<br><br>**Field Calculation**: `1-(average(metrics.system.cpu.utilization,kql='state: idle') + average(metrics.system.cpu.utilization,kql='state: wait'))`<br> |
| **CPU Usage - iowait (%)** | The percentage of CPU time spent in wait (on disk).<br><br>**Field Calculation**: `average(metrics.system.cpu.utilization,kql='state: wait') / max(metrics.system.cpu.logical.count)`<br> |
| **CPU Usage - irq (%)** | The percentage of CPU time spent servicing and handling hardware interrupts.<br><br>**Field Calculation**: `average(metrics.system.cpu.utilization,kql='state: interrupt') / max(metrics.system.cpu.logical.count)`<br> |
| **CPU Usage - nice (%)** | The percentage of CPU time spent on low-priority processes.<br><br>**Field Calculation**: `average(metrics.system.cpu.utilization,kql='state: nice') / max(metrics.system.cpu.logical.count)`<br> |
| **CPU Usage - softirq (%)** | The percentage of CPU time spent servicing and handling software interrupts.<br><br>**Field Calculation**: `average(metrics.system.cpu.utilization,kql='state: softirq') / max(metrics.system.cpu.logical.count)`<br> |
| **CPU Usage - steal (%)** | The percentage of CPU time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another processor. Available only on Unix.<br><br>**Field Calculation**: `average(metrics.system.cpu.utilization,kql='state: steal') / max(metrics.system.cpu.logical.count)`<br> |
| **CPU Usage - system (%)** | The percentage of CPU time spent in kernel space.<br><br>**Field Calculation**: `average(metrics.system.cpu.utilization,kql='state: system') / max(metrics.system.cpu.logical.count)`<br> |
| **CPU Usage - user (%)** | The percentage of CPU time spent in user space. On multi-core systems, you can have percentages that are greater than 100%. For example, if 3 cores are at 60% use, then the system.cpu.user.pct will be 180%.<br><br>**Field Calculation**: `average(metrics.system.cpu.utilization,kql='state: user') / max(metrics.system.cpu.logical.count)`<br> |
| **Load (1m)** | 1 minute load average.<br><br>Load average gives an indication of the number of threads that are runnable (either busy running on CPU, waiting to run, or waiting for a blocking IO operation to complete).<br><br>**Field Calculation**: `average(metrics.system.cpu.load_average.1m)`<br> |
| **Load (5m)** | 5 minute load average.<br><br>Load average gives an indication of the number of threads that are runnable (either busy running on CPU, waiting to run, or waiting for a blocking IO operation to complete).<br><br>**Field Calculation**: `average(metrics.system.cpu.load_average.5m)`<br> |
| **Load (15m)** | 15 minute load average.<br><br>Load average gives an indication of the number of threads that are runnable (either busy running on CPU, waiting to run, or waiting for a blocking IO operation to complete).<br><br>**Field Calculation**: `average(metrics.system.cpu.load_average.15m)`<br> |
| **Normalized Load** | 1 minute load average normalized by the number of CPU cores.<br><br>Load average gives an indication of the number of threads that are runnable (either busy running on CPU, waiting to run, or waiting for a blocking IO operation to complete).<br><br>100% means the 1 minute load average is equal to the number of CPU cores of the host.<br><br>For example, on a host with 32 CPU cores, if the 1 minute load average is 32, the value reported here is 100%. If the 1 minute load average is 48, the value reported here is 150%.<br><br>**Field Calculation**: `average(metrics.system.cpu.load_average.1m) / max(metrics.system.cpu.logical.count)`<br> |

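The OpenTelemetry **CPU Usage (%)** formula subtracts the idle and wait shares from 1. A minimal sketch of that arithmetic follows. The dictionary layout and the function name `cpu_usage` are assumptions made for illustration; real data arrives as `metrics.system.cpu.utilization` documents filtered by `state`.

```python
from statistics import mean


def cpu_usage(utilization_by_state):
    """Sketch of 1 - (average(idle utilization) + average(wait utilization)).

    utilization_by_state maps a CPU state name to a list of utilization
    samples in the range 0..1 (a hypothetical in-memory layout).
    """
    idle = mean(utilization_by_state["idle"])
    wait = mean(utilization_by_state["wait"])
    return 1 - (idle + wait)


samples = {
    "idle": [0.70, 0.60],
    "wait": [0.05, 0.05],
    "user": [0.20, 0.30],
    "system": [0.05, 0.05],
}
print(f"{cpu_usage(samples):.0%}")
```

Everything that is neither idle nor waiting on IO counts as usage, so the `user` and `system` samples do not need to be read directly.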
### OpenTelemetry memory metrics [otel-metrics-memory]

| Metric | Description |
| --- | --- |
| **Memory Cache** | Memory (page) cache.<br><br>**Field Calculation**: `average(metrics.system.memory.usage, kql='state: cached') / average(metrics.system.memory.usage, kql='state: slab_reclaimable') + average(metrics.system.memory.usage, kql='state: slab_unreclaimable')`<br> |
| **Memory Free** | Total available memory.<br><br>**Field Calculation**: `(max(metrics.system.memory.usage, kql='state: free') + max(metrics.system.memory.usage, kql='state: cached')) - (average(metrics.system.memory.usage, kql='state: slab_unreclaimable') + average(metrics.system.memory.usage, kql='state: slab_reclaimable'))`<br> |
| **Memory Free (excluding cache)** | Total available memory excluding the page cache.<br><br>**Field Calculation**: `average(metrics.system.memory.usage, kql='state: free')`<br> |
| **Memory Total** | Total memory capacity.<br><br>**Field Calculation**: `avg(system.memory.total)`<br> |
| **Memory Usage (%)** | Percentage of main memory usage excluding page cache.<br><br>This includes resident memory for all processes plus memory used by the kernel structures and code apart from the page cache.<br><br>A high level indicates a situation of memory saturation for the host. For example, 100% means the main memory is entirely filled with memory that can’t be reclaimed, except by swapping out.<br><br>**Field Calculation**: `average(system.memory.utilization, kql='state: used') + average(system.memory.utilization, kql='state: buffered') + average(system.memory.utilization, kql='state: slab_reclaimable') + average(system.memory.utilization, kql='state: slab_unreclaimable')`<br> |
| **Memory Used** | Main memory usage excluding page cache.<br><br>**Field Calculation**: `average(metrics.system.memory.usage, kql='state: used') + average(metrics.system.memory.usage, kql='state: buffered') + average(metrics.system.memory.usage, kql='state: slab_reclaimable') + average(metrics.system.memory.usage, kql='state: slab_unreclaimable')`<br> |

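The **Memory Used** formula adds the per-state averages for `used`, `buffered`, `slab_reclaimable`, and `slab_unreclaimable`. The sketch below shows that sum on hypothetical sample data; the dictionary layout and the helper name `memory_used` are illustrative only.

```python
from statistics import mean

# States that the Memory Used formula adds together.
USED_STATES = ("used", "buffered", "slab_reclaimable", "slab_unreclaimable")


def memory_used(samples_by_state):
    """Sum the average usage of every state counted as used memory.

    samples_by_state maps a memory state name to a list of usage samples
    in bytes (a hypothetical in-memory layout).
    """
    return sum(mean(samples_by_state[state]) for state in USED_STATES)


samples = {
    "used": [4000, 6000],
    "buffered": [500, 500],
    "slab_reclaimable": [200, 200],
    "slab_unreclaimable": [100, 100],
    "free": [3000, 1000],  # not part of the sum
}
print(memory_used(samples))
```

The **Memory Usage (%)** formula has the same shape, but reads utilization ratios instead of byte counts.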
### OpenTelemetry log metrics [otel-metrics-log]

| Metric | Description |
| --- | --- |
| **Log Rate** | Derivative of the cumulative sum of the document count scaled to a 1 second rate. This metric relies on the same indices as the logs.<br><br>**Field Calculation**: `cumulative_sum(doc_count)`<br> |

### OpenTelemetry network metrics [otel-metrics-network]

| Metric | Description |
| --- | --- |
| **Network Inbound (RX)** | Number of bytes that have been received per second on the public interfaces of the hosts.<br><br>**Field Calculation**: `8 * counter_rate(max(metrics.system.network.io, kql='direction: receive'))`<br> |
| **Network Outbound (TX)** | Number of bytes that have been sent per second on the public interfaces of the hosts.<br><br>**Field Calculation**: `8 * counter_rate(max(metrics.system.network.io, kql='direction: transmit'))`<br> |

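The network formulas apply `counter_rate` to a monotonically increasing byte counter and multiply by 8. The sketch below illustrates the idea under stated assumptions: buckets are evenly sized, a drop in the counter is treated as a reset with the new value taken as the increase, and the result is scaled per second. The helper names are hypothetical and this is not the Lens implementation.

```python
def counter_rate(bucket_max_values, bucket_seconds):
    """Per-second increase of a cumulative counter between buckets.

    bucket_max_values holds the maximum counter value seen in each bucket.
    Assumption: if the counter goes down, it was reset, and the new value
    is used as the increase for that bucket.
    """
    rates = []
    for previous, current in zip(bucket_max_values, bucket_max_values[1:]):
        increase = current - previous if current >= previous else current
        rates.append(increase / bucket_seconds)
    return rates


def network_bits_per_second(bucket_max_bytes, bucket_seconds):
    """Sketch of 8 * counter_rate(max(metrics.system.network.io))."""
    return [8 * rate for rate in counter_rate(bucket_max_bytes, bucket_seconds)]


# The counter grows from 1000 to 4000 bytes, then resets and reaches 500.
print(network_bits_per_second([1000, 4000, 500], bucket_seconds=10))
```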
### OpenTelemetry disk metrics [otel-metrics-disk]

| Metric | Description |
| --- | --- |
| **Disk Latency** | Time spent to service disk requests.<br><br>**Field Calculation**: `average(system.diskio.read.time + system.diskio.write.time) / (system.diskio.read.count + system.diskio.write.count)`<br> |
| **Disk Read IOPS** | Average count of read operations from the device per second.<br><br>**Field Calculation**: `counter_rate(max(system.disk.operations, kql='attributes.direction: read'))`<br> |
| **Disk Read Throughput** | Average number of bytes read from the device per second.<br><br>**Field Calculation**: `counter_rate(max(system.disk.io, kql='attributes.direction: read'))`<br> |
| **Disk Usage - Available (%)** | Percentage of disk space available.<br><br>**Field Calculation**: `average(system.filesystem.usage, kql='state: free')`<br> |
| **Disk Usage - Used (%)** | Percentage of disk space used. <br><br>**Field Calculation**: `1 - sum(metrics.system.filesystem.usage, kql='state: free') / sum(metrics.system.filesystem.usage)`<br> |
| **Disk Write IOPS** | Average count of write operations from the device per second.<br><br>**Field Calculation**: `counter_rate(max(system.disk.operations, kql='attributes.direction: write'))`<br> |
| **Disk Write Throughput** | Average number of bytes written from the device per second.<br><br>**Field Calculation**: `counter_rate(max(system.disk.io, kql='attributes.direction: write'))`<br> |

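The **Disk Usage - Used (%)** formula can be read as "one minus the free share of all reported filesystem bytes." A minimal sketch, assuming each data point is a `(state, bytes)` pair; the function name and the sample states are illustrative.

```python
def disk_used_fraction(points):
    """Sketch of 1 - sum(usage where state is free) / sum(usage).

    points is a list of (state, bytes) tuples, a hypothetical stand-in
    for metrics.system.filesystem.usage documents.
    """
    total = sum(value for _, value in points)
    if total == 0:
        raise ValueError("no filesystem usage reported")
    free = sum(value for state, value in points if state == "free")
    return 1 - free / total


points = [("used", 65), ("free", 25), ("reserved", 10)]
print(f"{disk_used_fraction(points):.0%}")
```

Because the denominator includes every state, reserved space counts as used in this reading of the formula.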

Original file line number Diff line number Diff line change
@@ -31,20 +31,22 @@ When you select **Create inventory alert**, the parameters you configured on the

::::



## Inventory conditions [inventory-conditions]

Conditions for each rule can be applied to specific metrics relating to the inventory type you select. You can choose the aggregation type and the metric. By including a warning threshold value, you can be alerted on multiple threshold values based on severity scores. When creating the rule, you can still get notified if no data is returned for the specific metric or if the rule fails to query {{es}}.

In this example, Kubernetes Pods is the selected inventory type. The conditions state that you will receive a critical alert for any pods within the `ingress-nginx` namespace with a memory usage of 95% or above and a warning alert if memory usage is 90% or above. The chart shows the results of applying the rule to the last 20 minutes of data. Note that the chart time range is 20 times the value of the look-back window specified in the `FOR THE LAST` field.
:::{note}
{applies_to}`{stack: "ga 9.2", serverless: "ga"}`
Most inventory types respect the default data collection method (for example, [Elastic System integration](integration-docs://reference/system/index.md)). For the `Hosts` inventory type, however, you can use the **Schema** dropdown menu to explicitly target host data collected using **OpenTelemetry** or the **Elastic System Integration**.
:::

In the following example, Kubernetes Pods is the selected inventory type. The conditions state that you will receive a critical alert for any pods within the `ingress-nginx` namespace with a memory usage of 95% or above and a warning alert if memory usage is 90% or above. The chart shows the results of applying the rule to the last 20 minutes of data. Note that the chart time range is 20 times the value of the look-back window specified in the `FOR THE LAST` field.

:::{image} /solutions/images/serverless-inventory-alert.png
:alt: Inventory rule
:screenshot:
:::

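The threshold logic in the example above can be sketched as follows. This is only an illustration of how a warning and a critical threshold combine; the function names and return values are hypothetical and do not reflect how the rule is evaluated internally.

```python
from datetime import timedelta


def classify(memory_usage, warning=0.90, critical=0.95):
    """Return the most severe threshold that the value meets.

    Thresholds follow the example: 95% or above is critical,
    90% or above is a warning.
    """
    if memory_usage >= critical:
        return "critical"
    if memory_usage >= warning:
        return "warning"
    return "ok"


def chart_time_range(look_back):
    """The chart covers 20 times the look-back window."""
    return 20 * look_back


print(classify(0.96), classify(0.92), classify(0.50))
print(chart_time_range(timedelta(minutes=1)))
```

A 1 minute look-back window therefore produces the 20 minute chart described in the example.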

## Add actions [action-types-infrastructure]

You can extend your rules with actions that interact with third-party systems, write to logs or indices, or send user notifications. You can add an action to a rule at any time. You can create rules without adding actions, and you can also define multiple actions for a single rule.