| 1 | +# GPU metrics |
| 2 | + |
| 3 | +The **gpu_metrics** input plugin collects graphics processing unit (GPU) performance metrics from graphics cards on Linux systems. It provides real-time monitoring of GPU utilization, memory usage (VRAM), clock frequencies, power consumption, temperature, and fan speeds. |
| 4 | + |
| 5 | +The plugin reads metrics directly from the Linux sysfs filesystem (`/sys/class/drm/`) without requiring external tools or libraries. Currently, **only AMD GPUs are supported** through the amdgpu kernel driver. NVIDIA and Intel GPUs aren't supported at this time. |
| 6 | + |
| 7 | +## Metrics collected |
| 8 | + |
| 9 | +The plugin collects the following metrics for each detected GPU: |
| 10 | + |
| 11 | +| Key | Description | |
| 12 | +|---------------------------|------------------------------------------------------------------------------------------------------------------------------------------| |
| 13 | +| `gpu_utilization_percent` | GPU core utilization as a percentage (0-100). Indicates how busy the GPU is processing workloads. | |
| 14 | +| `gpu_memory_used_bytes` | Amount of video RAM (VRAM) currently in use, measured in bytes. | |
| 15 | +| `gpu_memory_total_bytes` | Total video RAM (VRAM) capacity available on the GPU, measured in bytes. | |
| 16 | +| `gpu_clock_mhz` | Current GPU clock frequency in MHz. This metric has multiple instances with different type labels (see [Clock metrics](#clock-metrics)). | |
| 17 | +| `gpu_power_watts` | Current power consumption in watts. To disable, set `enable_power` to `false`. | |
| 18 | +| `gpu_temperature_celsius` | GPU die temperature in degrees Celsius. To disable, set `enable_temperature` to `false`. | |
| 19 | +| `gpu_fan_speed_rpm` | Fan rotation speed in revolutions per minute (RPM). | |
| 20 | +| `gpu_fan_pwm_percent` | Fan PWM duty cycle as a percentage (0-100). Indicates fan intensity. | |
| 21 | + |
| 22 | +### Clock metrics |
| 23 | + |
| 24 | +The `gpu_clock_mhz` metric is reported separately for three clock domains: |
| 25 | + |
| 26 | +| Type | Description | |
| 27 | +|------------|--------------------------------------| |
| 28 | +| `graphics` | GPU core/shader clock frequency. | |
| 29 | +| `memory` | VRAM clock frequency. | |
| 30 | +| `soc` | System-on-chip clock frequency. | |
| 31 | + |
| 32 | +## Configuration parameters |
| 33 | + |
| 34 | +The plugin supports the following configuration parameters: |
| 35 | + |
| 36 | +| Key | Description | Default | |
| 37 | +|----------------------|-------------------------------------------------------------------------------------------------------------------------|-----------| |
| 38 | +| `scrape_interval` | Interval in seconds between metric collection cycles. | `5` | |
| 39 | +| `path_sysfs` | Path to the sysfs root directory. Typically used for testing or non-standard systems. | `/sys` | |
| 40 | +| `cards_include` | Pattern specifying which GPU cards to monitor. Supports wildcards (`*`), ranges (`0-3`), and comma-separated lists (`0,2,4`). | `*` | |
| 41 | +| `cards_exclude` | Pattern specifying which GPU cards to exclude from monitoring. Uses the same syntax as `cards_include`. | _none_ | |
| 42 | +| `enable_power` | Enable collection of power consumption metrics (`gpu_power_watts`). | `true` | |
| 43 | +| `enable_temperature` | Enable collection of temperature metrics (`gpu_temperature_celsius`). | `true` | |
| 44 | + |
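To make the pattern syntax concrete, the following sketch expands a `cards_include`-style pattern into a list of card IDs. The `expand_cards` function and its `max_card` argument are hypothetical names used only for illustration. The sketch mirrors the documented syntax and isn't the plugin's actual parser.

```bash
# Hypothetical helper: expand a pattern such as "0-2,4" into card IDs.
# Usage: expand_cards PATTERN [MAX_CARD]
# MAX_CARD (default 7) is an assumed upper bound used only for "*".
expand_cards() {
  printf '%s\n' "$1" | tr ',' '\n' | while read -r part; do
    case $part in
      '*') seq 0 "${2:-7}" ;;                    # wildcard: all cards up to MAX_CARD
      *-*) seq "${part%-*}" "${part#*-}" ;;      # inclusive range "start-end"
      *)   printf '%s\n' "$part" ;;              # single card ID
    esac
  done | xargs                                   # join the IDs on one line
}

expand_cards "0-2,4"   # prints: 0 1 2 4
```

Under this reading, setting `cards_include` to `0-2,4` would select cards 0, 1, 2, and 4, and any IDs matched by `cards_exclude` would then be removed from that set.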
| 45 | +## GPU detection |
| 46 | + |
| 47 | +The GPU metrics plugin automatically scans for supported **AMD GPUs** that use the `amdgpu` kernel driver. GPUs using legacy drivers are ignored. |
| 48 | + |
| 49 | +To check whether your AMD GPU will be detected, run: |
| 50 | + |
| 51 | +```bash |
| 52 | +lspci | grep -i vga | grep -i amd |
| 53 | +``` |
| 54 | + |
| 55 | +Example output: |
| 56 | + |
| 57 | +```text |
| 58 | +03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX/7900 GRE/7900M] (rev ce) |
| 59 | +73:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Granite Ridge [Radeon Graphics] (rev c5) |
| 60 | +``` |
| 61 | + |
| 62 | +### Multiple GPU systems |
| 63 | + |
| 64 | +In systems with multiple GPUs, the GPU metrics plugin detects all AMD cards by default. You can control which GPUs to monitor with the `cards_include` and `cards_exclude` parameters. |
| 65 | + |
| 66 | +To list the GPUs in your system, run the following command: |
| 67 | + |
| 68 | +```bash |
| 69 | +ls /sys/class/drm/card*/device/vendor |
| 70 | +``` |
| 71 | + |
| 72 | +Example output: |
| 73 | + |
| 74 | +```text |
| 75 | +/sys/class/drm/card0/device/vendor |
| 76 | +/sys/class/drm/card1/device/vendor |
| 77 | +``` |
| 78 | + |
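The listing above shows every card, regardless of vendor or driver. As a rough sketch of how to narrow it down by hand, the following function prints only the cards whose PCI vendor ID is `0x1002` (AMD) and whose `device/driver` symlink points at `amdgpu`. The function name `list_amdgpu_cards` is hypothetical, and the plugin's own detection logic might differ.

```bash
# Hypothetical helper: print cards that look like AMD GPUs bound to amdgpu.
# Usage: list_amdgpu_cards [SYSFS_ROOT]   (SYSFS_ROOT defaults to /sys)
list_amdgpu_cards() {
  sysfs_root=${1:-/sys}
  for vendor_file in "$sysfs_root"/class/drm/card*/device/vendor; do
    [ -r "$vendor_file" ] || continue          # skip when the glob matched nothing
    device_dir=${vendor_file%/vendor}
    card=${device_dir%/device}
    card=${card##*/}                           # for example "card0"
    vendor=$(cat "$vendor_file")               # 0x1002 is AMD's PCI vendor ID
    # The driver entry is a symlink; its target's name is the bound driver.
    driver=$(basename "$(readlink "$device_dir/driver" 2>/dev/null)")
    if [ "$vendor" = "0x1002" ] && [ "$driver" = "amdgpu" ]; then
      echo "$card"
    fi
  done
}
```

Calling `list_amdgpu_cards` with no arguments reads from `/sys`. Passing a different root mirrors what the `path_sysfs` parameter does for the plugin.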
| 79 | +## Getting started |
| 80 | + |
| 81 | +To get GPU metrics from your system, you can run the plugin from either the command line or a configuration file. |
| 82 | + |
| 83 | +### Command line |
| 84 | + |
| 85 | +Run the following command from the command line: |
| 86 | + |
| 87 | +```bash |
| 88 | +fluent-bit -i gpu_metrics -o stdout |
| 89 | +``` |
| 90 | + |
| 91 | +Example output: |
| 92 | + |
| 93 | +```text |
| 94 | +2025-10-25T20:36:55.236905093Z gpu_utilization_percent{card="1",vendor="amd"} = 2 |
| 95 | +2025-10-25T20:36:55.237853918Z gpu_utilization_percent{card="0",vendor="amd"} = 0 |
| 96 | +2025-10-25T20:36:55.236905093Z gpu_memory_used_bytes{card="1",vendor="amd"} = 1580118016 |
| 97 | +2025-10-25T20:36:55.237853918Z gpu_memory_used_bytes{card="0",vendor="amd"} = 26083328 |
| 98 | +2025-10-25T20:36:55.236905093Z gpu_memory_total_bytes{card="1",vendor="amd"} = 17163091968 |
| 99 | +2025-10-25T20:36:55.237853918Z gpu_memory_total_bytes{card="0",vendor="amd"} = 2147483648 |
| 100 | +2025-10-25T20:36:55.236905093Z gpu_clock_mhz{card="1",vendor="amd",type="graphics"} = 45 |
| 101 | +2025-10-25T20:36:55.236905093Z gpu_clock_mhz{card="1",vendor="amd",type="memory"} = 96 |
| 102 | +2025-10-25T20:36:55.236905093Z gpu_clock_mhz{card="1",vendor="amd",type="soc"} = 500 |
| 103 | +2025-10-25T20:36:55.237853918Z gpu_clock_mhz{card="0",vendor="amd",type="graphics"} = 600 |
| 104 | +2025-10-25T20:36:55.237853918Z gpu_clock_mhz{card="0",vendor="amd",type="memory"} = 2800 |
| 105 | +2025-10-25T20:36:55.237853918Z gpu_clock_mhz{card="0",vendor="amd",type="soc"} = 1200 |
| 106 | +2025-10-25T20:36:55.236905093Z gpu_power_watts{card="1",vendor="amd"} = 28 |
| 107 | +2025-10-25T20:36:55.236905093Z gpu_temperature_celsius{card="1",vendor="amd"} = 28 |
| 108 | +2025-10-25T20:36:55.237853918Z gpu_temperature_celsius{card="0",vendor="amd"} = 39 |
| 109 | +2025-10-25T20:36:55.236905093Z gpu_fan_speed_rpm{card="1",vendor="amd"} = 0 |
| 110 | +2025-10-25T20:36:55.236905093Z gpu_fan_pwm_percent{card="1",vendor="amd"} = 0 |
| 111 | +``` |
| 112 | + |
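The plugin reports VRAM as two absolute values rather than a ratio. If you want a usage percentage, one possible approach is to divide `gpu_memory_used_bytes` by `gpu_memory_total_bytes`. The `vram_percent` function is a hypothetical helper for checking the arithmetic by hand and isn't part of Fluent Bit.

```bash
# Hypothetical helper: VRAM usage as a percentage with one decimal place.
# Usage: vram_percent USED_BYTES TOTAL_BYTES
vram_percent() {
  awk -v used="$1" -v total="$2" 'BEGIN {
    if (total > 0) printf "%.1f\n", used / total * 100   # avoid dividing by zero
  }'
}

# Values for card 1 in the sample output
vram_percent 1580118016 17163091968   # prints: 9.2
```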
| 113 | +### Configuration file |
| 114 | + |
| 115 | +Append the following to your main configuration file: |
| 116 | + |
| 117 | +{% tabs %} |
| 118 | +{% tab title="fluent-bit.yaml" %} |
| 119 | + |
| 120 | +```yaml |
| 121 | +pipeline: |
| 122 | + inputs: |
| 123 | + - name: gpu_metrics |
| 124 | + scrape_interval: 2 |
| 125 | + path_sysfs: /sys |
| 126 | + cards_include: "1" |
| 127 | + cards_exclude: "0" |
| 128 | + enable_power: true |
| 129 | + enable_temperature: true |
| 130 | + |
| 131 | + outputs: |
| 132 | + - name: stdout |
| 133 | + match: '*' |
| 134 | +``` |
| 135 | + |
| 136 | +{% endtab %} |
| 137 | +{% tab title="fluent-bit.conf" %} |
| 138 | + |
| 139 | +```text |
| 140 | +[INPUT] |
| 141 | + Name gpu_metrics |
| 142 | + scrape_interval 2 |
| 143 | + path_sysfs /sys |
| 144 | + cards_include 1 |
| 145 | + cards_exclude 0 |
| 146 | + enable_power true |
| 147 | + enable_temperature true |
| 148 | + |
| 149 | +[OUTPUT] |
| 150 | + Name stdout |
| 151 | + Match * |
| 152 | +``` |
| 153 | + |
| 154 | +{% endtab %} |
| 155 | +{% endtabs %} |
| 156 | + |
| 157 | + |