Commit 608fd69

pipeline: inputs: create gpu-metrics docs. Contributes to #2080 (#2112)
* create pipeline: inputs: gpu-metrics

  Signed-off-by: Maciej Wal <[email protected]>

* fix vale linting issues

  Signed-off-by: Maciej Wal <[email protected]>
1 parent 3335697 commit 608fd69

1 file changed

+157
-0
lines changed

pipeline/inputs/gpu-metrics.md

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
# GPU metrics

The **gpu_metrics** input plugin collects graphics processing unit (GPU) performance metrics from graphics cards on Linux systems. It provides real-time monitoring of GPU utilization, memory usage (VRAM), clock frequencies, power consumption, temperature, and fan speeds.

The plugin reads metrics directly from the Linux sysfs filesystem (`/sys/class/drm/`) without requiring external tools or libraries. Currently, **only AMD GPUs are supported**, through the `amdgpu` kernel driver. NVIDIA and Intel GPUs aren't supported.

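As a rough illustration of what reading from sysfs involves, the following sketch prints a raw attribute for one card. The helper name `read_gpu_attr` is hypothetical and not part of Fluent Bit. The attribute names in the usage note (`gpu_busy_percent`, `mem_info_vram_used`) are files the `amdgpu` driver exposes, but this page doesn't confirm that the plugin reads exactly these files.

```bash
# Hypothetical helper (not part of Fluent Bit): print one sysfs attribute
# for a DRM card, or print nothing if the file doesn't exist.
#   $1: sysfs root (plays the same role as the plugin's path_sysfs option)
#   $2: card index
#   $3: attribute file name (assumed amdgpu names, such as gpu_busy_percent)
read_gpu_attr() {
  local file="$1/class/drm/card$2/device/$3"
  if [ -r "$file" ]; then
    cat "$file"
  fi
}
```

For example, on a machine with an `amdgpu` card, `read_gpu_attr /sys 0 gpu_busy_percent` would print the current utilization, and `read_gpu_attr /sys 0 mem_info_vram_used` would print the VRAM in use.
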
## Metrics collected

The plugin collects the following metrics for each detected GPU:

| Key                       | Description                                                                                                                              |
|---------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| `gpu_utilization_percent` | GPU core utilization as a percentage (0-100). Indicates how busy the GPU is processing workloads.                                        |
| `gpu_memory_used_bytes`   | Amount of video RAM (VRAM) currently in use, measured in bytes.                                                                          |
| `gpu_memory_total_bytes`  | Total video RAM (VRAM) capacity available on the GPU, measured in bytes.                                                                 |
| `gpu_clock_mhz`           | Current GPU clock frequency in MHz. This metric has multiple instances with different type labels (see [Clock metrics](#clock-metrics)). |
| `gpu_power_watts`         | Current power consumption in watts. Can be disabled by setting `enable_power` to `false`.                                                |
| `gpu_temperature_celsius` | GPU die temperature in degrees Celsius. Can be disabled by setting `enable_temperature` to `false`.                                      |
| `gpu_fan_speed_rpm`       | Fan rotation speed in revolutions per minute (RPM).                                                                                      |
| `gpu_fan_pwm_percent`     | Fan PWM duty cycle as a percentage (0-100). Indicates fan intensity.                                                                     |

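The Linux hwmon interface reports sensor values in fixed raw units: power in microwatts, temperature in millidegrees Celsius, and PWM duty as a value from 0 to 255. The following sketch shows how such raw values map to the units in the table above. The function names are hypothetical, and whether the plugin reads these hwmon files or rounds the same way is an assumption.

```bash
# Hypothetical conversions from raw hwmon units to the units in the table above.
# Shell integer arithmetic truncates; the plugin's rounding may differ.

# Microwatts (for example, hwmon power1_average) to watts.
uw_to_watts() { echo $(( $1 / 1000000 )); }

# Millidegrees Celsius (for example, hwmon temp1_input) to degrees Celsius.
mc_to_celsius() { echo $(( $1 / 1000 )); }

# PWM duty value 0-255 (for example, hwmon pwm1) to a percentage 0-100.
pwm_to_percent() { echo $(( $1 * 100 / 255 )); }
```
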
### Clock metrics

The `gpu_clock_mhz` metric is reported separately for three clock domains:

| Type       | Description                          |
|------------|--------------------------------------|
| `graphics` | GPU core/shader clock frequency.     |
| `memory`   | VRAM clock frequency.                |
| `soc`      | System-on-chip clock frequency.      |

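The `amdgpu` driver lists the available frequency levels for each clock domain in files such as `pp_dpm_sclk` (graphics), `pp_dpm_mclk` (memory), and `pp_dpm_socclk` (SoC), and marks the active level with an asterisk. The following sketch extracts the active frequency from one of these files. The function name is hypothetical, and the assumption that the plugin derives `gpu_clock_mhz` from these files isn't confirmed by this page.

```bash
# Hypothetical helper: print the active clock level, in MHz, from an
# amdgpu pp_dpm_* file. Lines look like "1: 1200Mhz *", where the
# asterisk marks the level currently in use.
#   $1: path to the pp_dpm_* file
active_clock_mhz() {
  awk '/\*/ { sub(/[Mm][Hh][Zz]/, "", $2); print $2; exit }' "$1"
}
```

For example, `active_clock_mhz /sys/class/drm/card0/device/pp_dpm_sclk` would print the current graphics clock on a system where that file exists.
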
## Configuration parameters

The plugin supports the following configuration parameters:

| Key                  | Description                                                                                                                   | Default   |
|----------------------|-------------------------------------------------------------------------------------------------------------------------------|-----------|
| `scrape_interval`    | Interval in seconds between metric collection cycles.                                                                         | `5`       |
| `path_sysfs`         | Path to the sysfs root directory. Typically used for testing or non-standard systems.                                         | `/sys`    |
| `cards_include`      | Pattern specifying which GPU cards to monitor. Supports wildcards (`*`), ranges (`0-3`), and comma-separated lists (`0,2,4`). | `*`       |
| `cards_exclude`      | Pattern specifying which GPU cards to exclude from monitoring. Uses the same syntax as `cards_include`.                       | _none_    |
| `enable_power`       | Enable collection of power consumption metrics (`gpu_power_watts`).                                                           | `true`    |
| `enable_temperature` | Enable collection of temperature metrics (`gpu_temperature_celsius`).                                                         | `true`    |

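To make the pattern syntax for `cards_include` and `cards_exclude` concrete, the following sketch re-implements it in shell. It's an illustration only, not the plugin's code: the function names are hypothetical, and the rule that exclusion takes precedence over inclusion is an assumption that this page doesn't state.

```bash
# Hypothetical re-implementation of the documented pattern syntax.
# Returns 0 if card index $1 matches pattern $2, where the pattern is
# "*", a range such as "0-3", a list such as "0,2,4", or a mix of these.
card_matches() {
  local card="$1" rest="$2," token
  while [ -n "$rest" ]; do
    token="${rest%%,*}"   # next comma-separated token
    rest="${rest#*,}"     # remainder after that token
    case "$token" in
      '*') return 0 ;;
      *-*) # inclusive range, for example 0-3
        if [ "$card" -ge "${token%-*}" ] && [ "$card" -le "${token#*-}" ]; then
          return 0
        fi ;;
      *)
        if [ "$card" = "$token" ]; then
          return 0
        fi ;;
    esac
  done
  return 1
}

# A card is monitored when it matches the include pattern ($2) and doesn't
# match the exclude pattern ($3). Assumption: exclusion wins over inclusion.
card_selected() {
  card_matches "$1" "$2" && ! card_matches "$1" "$3"
}
```

With the defaults (`*` for inclusion and an empty exclusion pattern), `card_selected 1 "*" ""` succeeds for every card index, while `card_selected 0 "*" "0"` fails because card 0 is excluded.
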
## GPU detection

The GPU metrics plugin automatically scans for supported **AMD GPUs** that use the `amdgpu` kernel driver. GPUs using legacy drivers are ignored.

To check whether your AMD GPU will be detected, run:

```bash
lspci | grep -i vga | grep -i amd
```

Example output:

```text
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX/7900 GRE/7900M] (rev ce)
73:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Granite Ridge [Radeon Graphics] (rev c5)
```

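The `lspci` output shows the hardware but not which kernel driver is bound to it. Because detection depends on the `amdgpu` driver, it can help to check the `driver` symlink that sysfs exposes for each device. The following sketch does that; the function name `card_driver` is hypothetical, and the check is a diagnostic aid, not a description of the plugin's own detection logic.

```bash
# Hypothetical helper: print the name of the kernel driver bound to a card,
# or print nothing if the card has no driver symlink.
#   $1: sysfs root (usually /sys)
#   $2: card index
card_driver() {
  local link="$1/class/drm/card$2/device/driver"
  if [ -L "$link" ]; then
    # The symlink points at the driver directory; its last path
    # component is the driver name.
    basename "$(readlink "$link")"
  fi
}
```

On a supported system, `card_driver /sys 0` would be expected to print `amdgpu`.
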
### Multiple GPU systems

In systems with multiple GPUs, the GPU metrics plugin detects all AMD cards by default. You can control which GPUs to monitor with the `cards_include` and `cards_exclude` parameters.

To list the GPUs in your system, run the following command:

```bash
ls /sys/class/drm/card*/device/vendor
```

Example output:

```text
/sys/class/drm/card0/device/vendor
/sys/class/drm/card1/device/vendor
```

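The `ls` command lists the vendor files but not their contents. Each file holds the PCI vendor ID of the card, and AMD devices use `0x1002`. The following sketch prints each card name next to its vendor ID so you can tell which indexes belong to AMD cards. The function name `list_card_vendors` is hypothetical.

```bash
# Hypothetical helper: print "<card> <vendor id>" for every DRM card.
#   $1: sysfs root (usually /sys)
list_card_vendors() {
  local f card
  for f in "$1"/class/drm/card*/device/vendor; do
    # Skip the unexpanded glob when no cards exist.
    [ -r "$f" ] || continue
    card="${f%/device/vendor}"
    echo "${card##*/} $(cat "$f")"
  done
}
```

Running `list_card_vendors /sys` prints one line per card, and lines ending in `0x1002` identify AMD cards.
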
## Getting started

To get GPU metrics from your system, you can run the plugin either from the command line or through a configuration file.

### Command line

Run the following command from the command line:

```bash
fluent-bit -i gpu_metrics -o stdout
```

Example output:

```text
2025-10-25T20:36:55.236905093Z gpu_utilization_percent{card="1",vendor="amd"} = 2
2025-10-25T20:36:55.237853918Z gpu_utilization_percent{card="0",vendor="amd"} = 0
2025-10-25T20:36:55.236905093Z gpu_memory_used_bytes{card="1",vendor="amd"} = 1580118016
2025-10-25T20:36:55.237853918Z gpu_memory_used_bytes{card="0",vendor="amd"} = 26083328
2025-10-25T20:36:55.236905093Z gpu_memory_total_bytes{card="1",vendor="amd"} = 17163091968
2025-10-25T20:36:55.237853918Z gpu_memory_total_bytes{card="0",vendor="amd"} = 2147483648
2025-10-25T20:36:55.236905093Z gpu_clock_mhz{card="1",vendor="amd",type="graphics"} = 45
2025-10-25T20:36:55.236905093Z gpu_clock_mhz{card="1",vendor="amd",type="memory"} = 96
2025-10-25T20:36:55.236905093Z gpu_clock_mhz{card="1",vendor="amd",type="soc"} = 500
2025-10-25T20:36:55.237853918Z gpu_clock_mhz{card="0",vendor="amd",type="graphics"} = 600
2025-10-25T20:36:55.237853918Z gpu_clock_mhz{card="0",vendor="amd",type="memory"} = 2800
2025-10-25T20:36:55.237853918Z gpu_clock_mhz{card="0",vendor="amd",type="soc"} = 1200
2025-10-25T20:36:55.236905093Z gpu_power_watts{card="1",vendor="amd"} = 28
2025-10-25T20:36:55.236905093Z gpu_temperature_celsius{card="1",vendor="amd"} = 28
2025-10-25T20:36:55.237853918Z gpu_temperature_celsius{card="0",vendor="amd"} = 39
2025-10-25T20:36:55.236905093Z gpu_fan_speed_rpm{card="1",vendor="amd"} = 0
2025-10-25T20:36:55.236905093Z gpu_fan_pwm_percent{card="1",vendor="amd"} = 0
```

### Configuration file

In your main configuration file, append the following:

{% tabs %}
{% tab title="fluent-bit.yaml" %}

```yaml
pipeline:
  inputs:
    - name: gpu_metrics
      scrape_interval: 2
      path_sysfs: /sys
      cards_include: "1"
      cards_exclude: "0"
      enable_power: true
      enable_temperature: true

  outputs:
    - name: stdout
      match: '*'
```

{% endtab %}
{% tab title="fluent-bit.conf" %}

```text
[INPUT]
  Name               gpu_metrics
  scrape_interval    2
  path_sysfs         /sys
  cards_include      1
  cards_exclude      0
  enable_power       true
  enable_temperature true

[OUTPUT]
  Name  stdout
  Match *
```

{% endtab %}
{% endtabs %}
