@@ -34,6 +34,67 @@ class DeviceStatsMonitor(Callback):
r"""Automatically monitors and logs device stats during training, validation and testing stage.
``DeviceStatsMonitor`` is a special callback as it requires a ``logger`` to be passed as an argument to the ``Trainer``.

+ **Logged Metrics**
+
+ Logs device statistics with keys of the form ``DeviceStatsMonitor.{hook_name}/{base_metric_name}``.
+ The available metrics depend on the active accelerator and the ``cpu_stats`` flag. The overview below
+ lists the possible metrics and their meanings; a sketch of where these values come from follows the list.
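+
+ A minimal sketch of enabling the callback (assuming the ``lightning.pytorch`` import path; the
+ trainer arguments shown are illustrative):
+
+ .. code-block:: python
+
+     from lightning.pytorch import Trainer
+     from lightning.pytorch.callbacks import DeviceStatsMonitor
+
+     # A logger must be configured; the Trainer's default logger is enough.
+     device_stats = DeviceStatsMonitor(cpu_stats=True)
+     trainer = Trainer(accelerator="auto", callbacks=[device_stats])
+
+ On a CPU run this logs keys such as ``DeviceStatsMonitor.on_train_batch_start/cpu_percent``.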
+
+ - CPU (via ``psutil``)
+
+   - ``cpu_percent`` — System-wide CPU utilization (%)
+   - ``cpu_vm_percent`` — System-wide virtual memory (RAM) utilization (%)
+   - ``cpu_swap_percent`` — System-wide swap memory utilization (%)
+
+ - CUDA GPU (via ``torch.cuda.memory_stats``)
+
+   Logs memory statistics from the PyTorch CUDA caching allocator (all values in bytes).
+   GPU compute utilization is not logged by default.
+
+   - General Memory Usage:
+
+     - ``allocated_bytes.all.current`` — Current allocated GPU memory
+     - ``allocated_bytes.all.peak`` — Peak allocated GPU memory
+     - ``reserved_bytes.all.current`` — Current reserved GPU memory (allocated + cached)
+     - ``reserved_bytes.all.peak`` — Peak reserved GPU memory
+     - ``active_bytes.all.current`` — Current GPU memory in active use
+     - ``active_bytes.all.peak`` — Peak GPU memory in active use
+     - ``inactive_split_bytes.all.current`` — Memory in inactive, splittable blocks
+
+   - Allocator Pool Statistics (for ``small_pool`` and ``large_pool``):
+
+     - ``allocated_bytes.{pool_type}.current`` / ``allocated_bytes.{pool_type}.peak``
+     - ``reserved_bytes.{pool_type}.current`` / ``reserved_bytes.{pool_type}.peak``
+     - ``active_bytes.{pool_type}.current`` / ``active_bytes.{pool_type}.peak``
+
+   - Allocator Events:
+
+     - ``num_ooms`` — Cumulative number of out-of-memory errors
+     - ``num_alloc_retries`` — Number of failed allocations retried after a cache flush
+     - ``num_device_alloc`` — Number of device allocations
+     - ``num_device_free`` — Number of device deallocations
+
+   For a full list of CUDA memory stats, see the
+   `torch.cuda.memory_stats documentation <https://docs.pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html>`_.
+
+ - TPU (via ``torch_xla``)
+
+   - *Memory Metrics* (per device, e.g., ``xla:0``):
+
+     - ``memory.free.xla:0`` — Free HBM memory (MB)
+     - ``memory.used.xla:0`` — Used HBM memory (MB)
+     - ``memory.percent.xla:0`` — Percentage of HBM memory used (%)
+
+   - *XLA Operation Counters*:
+
+     - ``CachedCompile.xla``
+     - ``CreateXlaTensor.xla``
+     - ``DeviceDataCacheMiss.xla``
+     - ``UncachedCompile.xla``
+     - ``xla::add.xla``, ``xla::addmm.xla``, etc.
+
+   The full list of counter names can be retrieved with ``torch_xla.debug.metrics.counter_names()``.
+
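+ The sketch below illustrates where these numbers come from (it mirrors the stat sources named above,
+ not the callback's exact implementation):
+
+ .. code-block:: python
+
+     import psutil
+     import torch
+
+     # CPU stats are system-wide readings from psutil
+     psutil.cpu_percent()              # -> cpu_percent
+     psutil.virtual_memory().percent   # -> cpu_vm_percent
+     psutil.swap_memory().percent      # -> cpu_swap_percent
+
+     # CUDA stats are the caching-allocator counters listed above
+     if torch.cuda.is_available():
+         stats = torch.cuda.memory_stats()
+         stats["allocated_bytes.all.current"]
+         stats["num_ooms"]
+
+     # TPU counters require a TPU environment with torch_xla installed
+     # import torch_xla.debug.metrics as met
+     # met.counter_names()
+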
Args:
cpu_stats: if ``None``, it will log CPU stats only if the accelerator is CPU.
If ``True``, it will log CPU stats regardless of the accelerator.