
Commit 9f757c0

MrAnayDongre, pre-commit-ci[bot], Borda, and SkafteNicki authored
docs: clarify DeviceStatsMonitor logged metrics (#20895)
* DOC: Clarify DeviceStatsMonitor logged metrics (#20807)
* update
* use nested list
* Apply suggestions from code review

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka B <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Nicki Skafte <[email protected]>
1 parent 29e8ce4 commit 9f757c0

1 file changed, +61 -0 lines changed


src/lightning/pytorch/callbacks/device_stats_monitor.py

Lines changed: 61 additions & 0 deletions
@@ -34,6 +34,67 @@ class DeviceStatsMonitor(Callback):
     r"""Automatically monitors and logs device stats during training, validation and testing stages.
     ``DeviceStatsMonitor`` is a special callback as it requires a ``logger`` to be passed as an argument to the ``Trainer``.
 
+    **Logged Metrics**
+
+    Logs device statistics with keys prefixed as ``DeviceStatsMonitor.{hook_name}/{base_metric_name}``.
+    The actual metrics depend on the active accelerator and the ``cpu_stats`` flag. Below is an overview of the
+    available metrics and their meaning.
+
+    - CPU (via ``psutil``)
+
+        - ``cpu_percent`` — System-wide CPU utilization (%)
+        - ``cpu_vm_percent`` — System-wide virtual memory (RAM) utilization (%)
+        - ``cpu_swap_percent`` — System-wide swap memory utilization (%)
+
+    - CUDA GPU (via ``torch.cuda.memory_stats``)
+
+        Logs memory statistics from the PyTorch caching allocator (all in bytes).
+        GPU compute utilization is not logged by default.
+
+        - General Memory Usage:
+
+            - ``allocated_bytes.all.current`` — Current allocated GPU memory
+            - ``allocated_bytes.all.peak`` — Peak allocated GPU memory
+            - ``reserved_bytes.all.current`` — Current reserved GPU memory (allocated + cached)
+            - ``reserved_bytes.all.peak`` — Peak reserved GPU memory
+            - ``active_bytes.all.current`` — Current GPU memory in active use
+            - ``active_bytes.all.peak`` — Peak GPU memory in active use
+            - ``inactive_split_bytes.all.current`` — Memory in inactive, splittable blocks
+
+        - Allocator Pool Statistics (for ``small_pool`` and ``large_pool``):
+
+            - ``allocated_bytes.{pool_type}.current`` / ``allocated_bytes.{pool_type}.peak``
+            - ``reserved_bytes.{pool_type}.current`` / ``reserved_bytes.{pool_type}.peak``
+            - ``active_bytes.{pool_type}.current`` / ``active_bytes.{pool_type}.peak``
+
+        - Allocator Events:
+
+            - ``num_ooms`` — Cumulative out-of-memory errors
+            - ``num_alloc_retries`` — Number of allocation retries
+            - ``num_device_alloc`` — Number of device allocations
+            - ``num_device_free`` — Number of device deallocations
+
+        For a full list of CUDA memory stats, see the
+        `PyTorch documentation <https://docs.pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html>`_.
+
+    - TPU (via ``torch_xla``)
+
+        - *Memory Metrics* (per device, e.g., ``xla:0``):
+
+            - ``memory.free.xla:0`` — Free HBM memory (MB)
+            - ``memory.used.xla:0`` — Used HBM memory (MB)
+            - ``memory.percent.xla:0`` — Percentage of HBM memory used (%)
+
+        - *XLA Operation Counters*:
+
+            - ``CachedCompile.xla``
+            - ``CreateXlaTensor.xla``
+            - ``DeviceDataCacheMiss.xla``
+            - ``UncachedCompile.xla``
+            - ``xla::add.xla``, ``xla::addmm.xla``, etc.
+
+        These counters can be retrieved using: ``torch_xla.debug.metrics.counter_names()``
+
     Args:
         cpu_stats: if ``None``, it will log CPU stats only if the accelerator is CPU.
             If ``True``, it will log CPU stats regardless of the accelerator.
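
The key format documented in the hunk above, ``DeviceStatsMonitor.{hook_name}/{base_metric_name}``, can be illustrated with a small standalone sketch. The helper name `prefix_device_stats`, the hook name `on_train_batch_end`, and the sample readings are assumptions chosen for illustration; they are not taken from the Lightning source, and the real callback may build its keys differently.

```python
from typing import Dict


def prefix_device_stats(stats: Dict[str, float], hook_name: str) -> Dict[str, float]:
    """Return a copy of ``stats`` whose keys follow the documented logged-metric format.

    Hypothetical helper: it only mirrors the key pattern described in the docstring.
    """
    return {f"DeviceStatsMonitor.{hook_name}/{name}": value for name, value in stats.items()}


# Hypothetical CPU readings, named after the psutil-based metrics listed in the docstring.
cpu_stats = {"cpu_percent": 12.5, "cpu_vm_percent": 48.0, "cpu_swap_percent": 0.0}

# "on_train_batch_end" is an assumed example hook name.
logged = prefix_device_stats(cpu_stats, hook_name="on_train_batch_end")

for key, value in logged.items():
    print(f"{key} = {value}")
```

Searching a logger's output for the ``DeviceStatsMonitor.`` prefix is then enough to separate these metrics from user-logged ones.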

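The ``{pool_type}`` placeholder in the allocator pool entries expands to ``small_pool`` and ``large_pool``. The sketch below enumerates the resulting key names and filters a flat stats dictionary down to them. The function names and the sample dictionary are hypothetical; the sample only imitates the shape of the mapping returned by ``torch.cuda.memory_stats()`` and contains made-up values.

```python
from itertools import product
from typing import Dict, List

# Name components taken from the docstring's "Allocator Pool Statistics" entries.
STAT_GROUPS = ("allocated_bytes", "reserved_bytes", "active_bytes")
POOL_TYPES = ("small_pool", "large_pool")
FIELDS = ("current", "peak")


def pool_metric_names() -> List[str]:
    """Expand ``{group}.{pool_type}.{field}`` for every documented combination."""
    return [f"{group}.{pool}.{field}" for group, pool, field in product(STAT_GROUPS, POOL_TYPES, FIELDS)]


def select_pool_stats(memory_stats: Dict[str, int]) -> Dict[str, int]:
    """Keep only the per-pool entries from a flat memory-stats mapping."""
    wanted = set(pool_metric_names())
    return {key: value for key, value in memory_stats.items() if key in wanted}


# Hypothetical excerpt with made-up byte counts.
sample_stats = {
    "allocated_bytes.all.current": 1024,
    "allocated_bytes.small_pool.current": 256,
    "reserved_bytes.large_pool.peak": 2048,
    "num_ooms": 0,
}

pool_stats = select_pool_stats(sample_stats)
print(pool_stats)
```

On a CUDA machine, the same filter could be applied to the real output of ``torch.cuda.memory_stats()`` instead of the sample dictionary.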