Commit 0e50bcf

[Blog] Built-in UI for monitoring basic GPU metrics (#2470)
1 parent e23783a commit 0e50bcf

5 files changed: +70 -7 lines changed

Lines changed: 5 additions & 3 deletions
@@ -1,16 +1,16 @@
 ---
-title: "Monitoring basic GPU metrics via dstack stats"
+title: "Monitoring basic GPU metrics via CLI"
 date: 2024-10-22
 description: "dstack introduces a new CLI command (and API) for monitoring container metrics, incl. GPU usage for NVIDIA, AMD, and other accelerators."
-slug: dstack-stats
+slug: dstack-metrics
 image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-stats-v2.png?raw=true
 categories:
 - AMD
 - NVIDIA
 - Monitoring
 ---
 
-# Monitoring basic GPU metrics via dstack stats
+# Monitoring basic GPU metrics via CLI
 
 ## How it works { style="display:none"}
 
@@ -22,6 +22,8 @@ for monitoring container metrics, including GPU usage for `NVIDIA`, `AMD`, and o
 
 <!-- more -->
 
+> Note: the `dstack stats` command has been renamed to `dstack metrics`. The old name is still supported but deprecated.
+
 The command is similar to `kubectl top` (in terms of semantics) and `docker stats` (in terms of the CLI interface). The key
 difference is that `dstack stats` includes GPU VRAM usage and GPU utilization percentage.
 
docs/blog/posts/metrics-ui.md

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+---
+title: "Built-in UI for monitoring basic GPU metrics"
+date: 2025-04-03
+description: "TBA"
+slug: metrics-ui
+image: https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-metrics-ui-v2-min.png?raw=true
+categories:
+- Monitoring
+- AMD
+- NVIDIA
+---
+
+# Built-in UI for monitoring basic GPU metrics
+
+AI workloads generate vast amounts of metrics, making it essential to have efficient monitoring tools. While our recent
+update introduced the ability to export available metrics to Prometheus for maximum flexibility, there are times when
+users need to quickly access essential metrics without the need to switch to an external tool.
+
+<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-metrics-ui-v2-min.png?raw=true" width="630"/>
+
+Previously, we introduced a [CLI command](dstack-metrics.md) that allows users to view basic GPU metrics for both NVIDIA
+and AMD hardware. Now, with this latest update, we’re excited to announce the addition of a built-in dashboard within
+the `dstack` control plane.
+
+<!-- more -->
+
+The new feature provides an easy-to-use interface for tracking the most essential GPU metrics
+directly from the control plane, streamlining the real-time monitoring process without needing any additional tools.
+
+<img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-metrics-ui-dashboard.png?raw=true" width="800">
+
+Additionally, we’ve renamed the CLI command previously known as `dstack stats` to `dstack metrics` for consistency.
+
+<div class="termy">
+
+```shell
+$ dstack metrics nccl-tests -w
+NAME        CPU  MEMORY            GPU
+nccl-tests  81%  2754MB/1638400MB  #0 100740MB/144384MB 100% Util
+                                   #1 100740MB/144384MB 100% Util
+                                   #2 100740MB/144384MB 99% Util
+                                   #3 100740MB/144384MB 99% Util
+                                   #4 100740MB/144384MB 99% Util
+                                   #5 100740MB/144384MB 99% Util
+                                   #6 100740MB/144384MB 99% Util
+                                   #7 100740MB/144384MB 100% Util
+```
+
+</div>
+
+By default, both the control plane and CLI show metrics from the last hour, which is particularly useful for debugging
+workloads.
+
+For persistent storage and long-term access to metrics, we still recommend setting up Prometheus to fetch
+metrics from `dstack`.
+
+!!! info "What's next?"
+    1. See the [Monitoring](../../docs/guides/monitoring.md) guide
+    2. Check [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
+    3. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}
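
The `dstack metrics` table shown in the post above is plain text, so a script can post-process it. Below is a minimal Python sketch, assuming each GPU row follows the layout in the sample (`#<index> <used>MB/<total>MB <util>% Util`). `GpuSample` and `parse_gpu_rows` are hypothetical helper names, not part of `dstack`, and the real CLI output may differ between versions.

```python
import re
from typing import List, NamedTuple

# Assumed row layout, taken from the sample output in the post:
#   #0 100740MB/144384MB 100% Util
GPU_ROW = re.compile(
    r"#(?P<index>\d+)\s+(?P<used>\d+)MB/(?P<total>\d+)MB\s+(?P<util>\d+)%\s+Util"
)


class GpuSample(NamedTuple):
    """One GPU row from the table (hypothetical helper type)."""

    index: int
    used_mb: int
    total_mb: int
    util_pct: int

    @property
    def memory_pct(self) -> float:
        # Share of VRAM in use, as a percentage.
        return 100.0 * self.used_mb / self.total_mb


def parse_gpu_rows(text: str) -> List[GpuSample]:
    """Extract every GPU row found in the given CLI output."""
    return [
        GpuSample(int(m["index"]), int(m["used"]), int(m["total"]), int(m["util"]))
        for m in GPU_ROW.finditer(text)
    ]


SAMPLE = """
NAME        CPU  MEMORY            GPU
nccl-tests  81%  2754MB/1638400MB  #0 100740MB/144384MB 100% Util
                                   #1 100740MB/144384MB 100% Util
                                   #2 100740MB/144384MB 99% Util
"""

if __name__ == "__main__":
    for gpu in parse_gpu_rows(SAMPLE):
        print(f"GPU {gpu.index}: {gpu.memory_pct:.1f}% VRAM, {gpu.util_pct}% util")
```

The pattern requires a leading `#<index>`, so the host memory column (`2754MB/1638400MB`) is not mistaken for a GPU row.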

docs/blog/posts/prometheus.md

Lines changed: 2 additions & 2 deletions
@@ -17,7 +17,7 @@ Effective AI infrastructure management requires full visibility into compute per
 detailed insights into container- and GPU-level performance, while managers rely on cost metrics to track resource usage
 across projects.
 
-While `dstack` provides key metrics through its UI and [`dstack metrics`](dstack-stats.md) CLI, teams often need more granular data and prefer
+While `dstack` provides key metrics through its UI and [`dstack metrics`](dstack-metrics.md) CLI, teams often need more granular data and prefer
 using their own monitoring tools. To support this, we’ve introduced a new endpoint that allows real-time exporting all collected
 metrics—covering fleets and runs—directly to Prometheus.
 
@@ -57,7 +57,7 @@ For a full list of available metrics and labels, check out the [Monitoring](../.
 
 ??? info "AMD"
     AMD device metrics are not yet collected for any backends. This support will be available soon. For now, AMD metrics are
-    only accessible through the UI and the [`dstack metrics`](dstack-stats.md) CLI.
+    only accessible through the UI and the [`dstack metrics`](dstack-metrics.md) CLI.
 
 !!! info "What's next?"
     1. See the [Monitoring](../../docs/guides/monitoring.md) guide

docs/docs/guides/protips.md

Lines changed: 1 addition & 1 deletion
@@ -312,7 +312,7 @@ The GPU vendor is indicated by one of the following case-insensitive values:
 
 While `dstack` allows the use of any third-party monitoring tools (e.g., Weights and Biases), you can also
 monitor container metrics such as CPU, memory, and GPU usage using the [built-in
-`dstack metrics` CLI command](../../blog/posts/dstack-stats.md) or the corresponding API.
+`dstack metrics` CLI command](../../blog/posts/dstack-metrics.md) or the corresponding API.
 
 ## Service quotas
 
mkdocs.yml

Lines changed: 2 additions & 1 deletion
@@ -127,10 +127,11 @@ plugins:
 'backends.md': 'partners.md'
 'developers.md': 'community.md'
 'blog/ambassador-program.md': 'blog/archive/ambassador-program.md'
-'blog/monitoring-gpu-usage.md': 'blog/posts/dstack-stats.md'
+'blog/monitoring-gpu-usage.md': 'blog/posts/dstack-metrics.md'
 'blog/inactive-dev-environments-auto-shutdown.md': 'blog/posts/inactivity-duration.md'
 'blog/data-centers-and-private-clouds.md': 'blog/posts/gpu-blocks-and-proxy-jump.md'
 'blog/distributed-training-with-aws-efa.md': 'blog/posts/efa.md'
+'blog/dstack-stats.md': 'blog/posts/dstack-metrics.md'
 - typeset
 - gen-files:
 scripts: # always relative to mkdocs.yml
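
Renaming a page means retargeting older redirect entries that pointed at the old path, as this hunk does for `blog/monitoring-gpu-usage.md`. Below is a minimal Python sketch of a check for entries whose target is itself redirected elsewhere. It assumes the redirect map has already been loaded into a plain dict; `find_chained_redirects` is a hypothetical helper, not part of MkDocs or its redirects plugin.

```python
from typing import Dict, List, Tuple


def find_chained_redirects(redirects: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (source, target) pairs whose target is also a redirect source."""
    return [(src, dst) for src, dst in redirects.items() if dst in redirects]


# Entries copied from the hunk above (post-commit state).
REDIRECTS = {
    "blog/monitoring-gpu-usage.md": "blog/posts/dstack-metrics.md",
    "blog/inactive-dev-environments-auto-shutdown.md": "blog/posts/inactivity-duration.md",
    "blog/dstack-stats.md": "blog/posts/dstack-metrics.md",
}

if __name__ == "__main__":
    for src, dst in find_chained_redirects(REDIRECTS):
        print(f"{src} -> {dst} is redirected again; consider pointing it at the final page")
```

An empty result means every listed source points directly at a page that is not itself redirected.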
