|
1 | 1 | # Metrics |
2 | 2 |
|
| 3 | +`dstack` automatically tracks essential metrics, which you can access via the CLI and UI. |
| 4 | +You can also configure the `dstack` server to export metrics to Prometheus, which is required to access advanced metrics such as those from DCGM. |
| 5 | + |
| 6 | +## UI |
| 7 | + |
| 8 | +To access metrics via the UI, open the page of the corresponding run or job and switch to the `Metrics` tab: |
| 9 | + |
| 10 | +{ width=800 } |
| 11 | + |
| 12 | +This tab displays key CPU, memory, and GPU metrics collected during the last hour of the run or job. |
| 13 | + |
| 14 | +## CLI |
| 15 | + |
| 16 | +As an alternative to the UI, you can track essential metrics in real time via the CLI. |
| 17 | +The `dstack metrics` command displays the most recently tracked CPU, memory, and GPU metrics. |
| 18 | + |
| 19 | +<div class="termy"> |
| 20 | + |
| 21 | +```shell |
| 22 | +dstack metrics gentle-mayfly-1 |
| 23 | + |
| 24 | + NAME STATUS CPU MEMORY GPU |
| 25 | + gentle-mayfly-1 done 0% 16.27GB/2000GB gpu=0 mem=72.48GB/80GB util=0% |
| 26 | + gpu=1 mem=64.99GB/80GB util=0% |
| 27 | + gpu=2 mem=580MB/80GB util=0% |
| 28 | + gpu=3 mem=4MB/80GB util=0% |
| 29 | + gpu=4 mem=4MB/80GB util=0% |
| 30 | + gpu=5 mem=4MB/80GB util=0% |
| 31 | + gpu=6 mem=4MB/80GB util=0% |
| 32 | + gpu=7 mem=292MB/80GB util=0% |
| 33 | +``` |
| 34 | + |
| 35 | +</div> |
| 36 | + |
3 | 37 | ## Prometheus |
4 | 38 |
|
5 | | -To collect and export fleet and run metrics to Prometheus, enable the |
6 | | -`DSTACK_ENABLE_PROMETHEUS_METRICS` environment variable and configure Prometheus to fetch metrics from |
| 39 | +To enable exporting metrics to Prometheus, set the |
| 40 | +`DSTACK_ENABLE_PROMETHEUS_METRICS` environment variable and configure Prometheus to scrape metrics from |
7 | 41 | `<dstack server URL>/metrics`. |
8 | 42 |
|
| 43 | +Beyond the essential metrics available via the CLI and UI, `dstack` exports further metrics to Prometheus, including fleet, run, job, and DCGM metrics. |
| 44 | + |
9 | 45 | ??? info "NVIDIA DCGM" |
10 | 46 | NVIDIA DCGM metrics are automatically collected for `aws`, `azure`, `gcp`, and `oci` backends, |
11 | 47 | as well as for [SSH fleets](../concepts/fleets.md#ssh). |
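
The Prometheus setup described in this diff could be sketched roughly as follows. This is a minimal illustration, not the project's documented procedure: the value `1` for `DSTACK_ENABLE_PROMETHEUS_METRICS`, the `localhost:3000` server address, the `dstack` job name, and the `prometheus.yml` file name are all assumptions to adapt to your deployment.

```shell
# Assumption: any non-empty value enables the exporter; set this in the
# environment of the dstack server process before starting it.
export DSTACK_ENABLE_PROMETHEUS_METRICS=1

# Assumption: host:port where the dstack server is reachable.
DSTACK_SERVER_HOST="localhost:3000"

# Write a minimal Prometheus config with a single scrape job that
# fetches metrics from <dstack server URL>/metrics.
cat > prometheus.yml <<EOF
scrape_configs:
  - job_name: dstack            # hypothetical job name
    metrics_path: /metrics      # path named in the docs above
    static_configs:
      - targets: ["${DSTACK_SERVER_HOST}"]
EOF
```

Prometheus can then be started with `--config.file=prometheus.yml`. If the server is exposed over HTTPS, the scrape job would also need `scheme: https`.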
|