
Commit 6f04f9c

Merge branch 'master' into fsdp-grad-clip-by-norm

2 parents 395c7fd + 663b6ce
File tree: 6 files changed, +100 -38 lines

.github/workflows/ci-tests-fabric.yml

Lines changed: 34 additions & 33 deletions
@@ -62,49 +62,57 @@ jobs:
     env:
       PACKAGE_NAME: ${{ matrix.config.pkg-name }}
       FREEZE_REQUIREMENTS: ${{ ! (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/heads/release/')) }}
-      PYPI_CACHE_DIR: "_pip-wheels"
       TORCH_URL_STABLE: "https://download.pytorch.org/whl/cpu/"
       TORCH_URL_TEST: "https://download.pytorch.org/whl/test/cpu/"
       # TODO: Remove this - Enable running MPS tests on this platform
       DISABLE_MPS: ${{ matrix.os == 'macOS-14' && '1' || '0' }}
     steps:
       - uses: actions/checkout@v5

-      - name: Set up Python ${{ matrix.config.python-version }}
-        uses: actions/setup-python@v5
+      - name: Install uv and set Python version
+        uses: astral-sh/setup-uv@v6
         with:
           python-version: ${{ matrix.config.python-version || '3.9' }}
+          # TODO: Avoid activating environment like this
+          # see: https://github.com/astral-sh/setup-uv/tree/v6/?tab=readme-ov-file#activate-environment
+          activate-environment: true
+          enable-cache: true

-      - name: basic setup
-        run: pip install -q -r .actions/requirements.txt
+      - name: Basic setup
+        run: uv pip install -q -r .actions/requirements.txt
+
+      - name: Append Env. vars for Linux
+        if: ${{ runner.os == 'Linux' }}
+        run: echo "GLOO_SOCKET_IFNAME=eth0" >> $GITHUB_ENV
+
+      - name: Append Env. vars for MacOS
+        if: ${{ runner.os == 'macOS' }}
+        run: echo "GLOO_SOCKET_IFNAME=lo0" >> $GITHUB_ENV
+
+      - name: Append Env. vars for Windows
+        if: ${{ runner.os == 'windows' }}
+        run: |
+          # Avoid issue on Windows with PyTorch 2.4: "RuntimeError: use_libuv was requested but PyTorch was build without libuv support"
+          echo "USE_LIBUV=0" >> $GITHUB_ENV

       - name: Set min. dependencies
         if: ${{ matrix.config.requires == 'oldest' }}
         run: |
           cd requirements/fabric
-          pip install -U "lightning-utilities[cli]"
+          uv pip install -U "lightning-utilities[cli]"
           python -m lightning_utilities.cli requirements set-oldest --req_files "['base.txt', 'strategies.txt', 'test.txt']"
-          pip install "cython<3.0" wheel
-          pip install "pyyaml==5.4" --no-build-isolation
+          uv pip install "cython<3.0" wheel
+          uv pip install "pyyaml==5.4" --no-build-isolation

       - name: Adjust PyTorch versions in requirements files
         if: ${{ matrix.config.requires != 'oldest' }}
         run: |
-          pip install -q -r requirements/ci.txt
+          uv pip install -q -r requirements/ci.txt
           python -m wget https://raw.githubusercontent.com/Lightning-AI/utilities/main/scripts/adjust-torch-versions.py
           for fpath in `ls requirements/**/*.txt`; do \
             python ./adjust-torch-versions.py $fpath ${{ matrix.config.pytorch-version }}; \
           done

-      - name: pip wheels cache
-        uses: actions/cache/restore@v4
-        with:
-          path: ${{ env.PYPI_CACHE_DIR }}
-          key: pypi_wheels
-      - run: |
-          mkdir -p $PYPI_CACHE_DIR
-          ls -lh $PYPI_CACHE_DIR
-
       - name: Expand Env. variables
         run: |
           # Switch PyTorch URL between stable and test/future
@@ -113,25 +121,15 @@ jobs:
           python -c "print('COVERAGE_SCOPE=' + str('lightning' if '${{matrix.config.pkg-name}}' == 'lightning' else 'lightning_fabric'))" >> $GITHUB_ENV
           # if you install mono-package set dependency only for this subpackage
           python -c "print('EXTRA_PREFIX=' + str('' if '${{matrix.config.pkg-name}}' != 'lightning' else 'fabric-'))" >> $GITHUB_ENV
-      - name: Append Env. vars for MacOS
-        if: ${{ runner.os == 'macOS' }}
-        run: |
-          # trying to avoid "gloo" issue with SIGABRT
-          echo "GLOO_SOCKET_IFNAME=lo0" >> $GITHUB_ENV
-      - name: Append Env. vars for Windows
-        if: ${{ runner.os == 'windows' }}
-        run: |
-          # Avoid issue on Windows with PyTorch 2.4: "RuntimeError: use_libuv was requested but PyTorch was build without libuv support"
-          echo "USE_LIBUV=0" >> $GITHUB_ENV

       - name: Install package & dependencies
         timeout-minutes: 20
         run: |
-          pip install -e ".[${EXTRA_PREFIX}test,${EXTRA_PREFIX}strategies]" \
-            -U --upgrade-strategy=eager --prefer-binary \
-            --extra-index-url="${TORCH_URL}" \
-            --find-links="${PYPI_CACHE_DIR}"
-          pip list
+          uv pip install ".[${EXTRA_PREFIX}test,${EXTRA_PREFIX}strategies]" \
+            --upgrade \
+            --find-links="${TORCH_URL}"
+          uv pip list
+
       - name: Dump handy wheels
         if: github.event_name == 'push' && github.ref == 'refs/heads/master'
         continue-on-error: true
@@ -179,6 +177,9 @@ jobs:
           name: CPU-coverage
           fail_ci_if_error: false

+      - name: Minimize uv cache
+        run: uv cache prune --ci
+
   fabric-cpu-guardian:
     runs-on: ubuntu-latest
     needs: fabric-cpu
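For reference, the mono-package logic encoded by the `python -c` one-liners in the "Expand Env. variables" step can be sketched in plain Python (the function names here are hypothetical; the expressions are copied from the workflow):

```python
def extra_prefix(pkg_name: str) -> str:
    # Mirrors the workflow expression: when installing the mono-package
    # "lightning", extras are namespaced as "fabric-test" / "fabric-strategies";
    # standalone subpackages use the bare extras names.
    return "" if pkg_name != "lightning" else "fabric-"


def coverage_scope(pkg_name: str) -> str:
    # Mirrors the COVERAGE_SCOPE expression from the same step.
    return "lightning" if pkg_name == "lightning" else "lightning_fabric"


print(extra_prefix("lightning"))   # fabric-
print(coverage_scope("fabric"))    # lightning_fabric
```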

.github/workflows/ci-tests-pytorch.yml

Lines changed: 1 addition & 0 deletions
@@ -89,6 +89,7 @@ jobs:
       - name: Append Env. vars for Linux
         if: ${{ runner.os == 'Linux' }}
         run: echo "GLOO_SOCKET_IFNAME=eth0" >> $GITHUB_ENV
+
       - name: Append Env. vars for MacOS
         if: ${{ runner.os == 'macOS' }}
         run: echo "GLOO_SOCKET_IFNAME=lo0" >> $GITHUB_ENV

docs/source-fabric/guide/index.rst

Lines changed: 1 addition & 1 deletion
@@ -78,7 +78,7 @@ Build your own Trainer
    <div class="row">

 .. displayitem::
-   :header: Organize your model code with with LightningModule
+   :header: Organize your model code with LightningModule
    :description: Organize your code in a LightningModule and use it with Fabric
    :button_link: lightning_module.html
    :col_css: col-md-4

docs/source-fabric/levels/intermediate.rst

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ Intermediate skills
    <div class="row">

 .. displayitem::
-   :header: Organize your model code with with LightningModule
+   :header: Organize your model code with LightningModule
    :description: Organize your code in a LightningModule and use it with Fabric
    :button_link: ../guide/lightning_module.html
    :col_css: col-md-4

src/lightning/pytorch/callbacks/device_stats_monitor.py

Lines changed: 61 additions & 0 deletions
@@ -34,6 +34,67 @@ class DeviceStatsMonitor(Callback):
     r"""Automatically monitors and logs device stats during training, validation and testing stage.
     ``DeviceStatsMonitor`` is a special callback as it requires a ``logger`` to passed as argument to the ``Trainer``.

+    **Logged Metrics**
+
+    Logs device statistics with keys prefixed as ``DeviceStatsMonitor.{hook_name}/{base_metric_name}``.
+    The actual metrics depend on the active accelerator and the ``cpu_stats`` flag. Below are an overview of the
+    possible available metrics and their meaning.
+
+    - CPU (via ``psutil``)
+
+      - ``cpu_percent`` — System-wide CPU utilization (%)
+      - ``cpu_vm_percent`` — System-wide virtual memory (RAM) utilization (%)
+      - ``cpu_swap_percent`` — System-wide swap memory utilization (%)
+
+    - CUDA GPU (via ``torch.cuda.memory_stats``)
+
+      Logs memory statistics from PyTorch caching allocator (all in bytes).
+      GPU compute utilization is not logged by default.
+
+      - General Memory Usage:
+
+        - ``allocated_bytes.all.current`` — Current allocated GPU memory
+        - ``allocated_bytes.all.peak`` — Peak allocated GPU memory
+        - ``reserved_bytes.all.current`` — Current reserved GPU memory (allocated + cached)
+        - ``reserved_bytes.all.peak`` — Peak reserved GPU memory
+        - ``active_bytes.all.current`` — Current GPU memory in active use
+        - ``active_bytes.all.peak`` — Peak GPU memory in active use
+        - ``inactive_split_bytes.all.current`` — Memory in inactive, splittable blocks
+
+      - *Allocator Pool Statistics* (for ``small_pool`` and ``large_pool``):
+
+        - ``allocated_bytes.{pool_type}.current`` / ``allocated_bytes.{pool_type}.peak``
+        - ``reserved_bytes.{pool_type}.current`` / ``reserved_bytes.{pool_type}.peak``
+        - ``active_bytes.{pool_type}.current`` / ``active_bytes.{pool_type}.peak``
+
+      - Allocator Events:
+
+        - ``num_ooms`` — Cumulative out-of-memory errors
+        - ``num_alloc_retries`` — Number of allocation retries
+        - ``num_device_alloc`` — Number of device allocations
+        - ``num_device_free`` — Number of device deallocations
+
+      For a full list of CUDA memory stats, see the
+      `PyTorch documentation <https://docs.pytorch.org/docs/stable//generated/torch.cuda.device_memory_used.html>`_.
+
+    - TPU (via ``torch_xla``)
+
+      - *Memory Metrics* (per device, e.g., ``xla:0``):
+
+        - ``memory.free.xla:0`` — Free HBM memory (MB)
+        - ``memory.used.xla:0`` — Used HBM memory (MB)
+        - ``memory.percent.xla:0`` — Percentage of HBM memory used (%)
+
+      - *XLA Operation Counters*:
+
+        - ``CachedCompile.xla``
+        - ``CreateXlaTensor.xla``
+        - ``DeviceDataCacheMiss.xla``
+        - ``UncachedCompile.xla``
+        - ``xla::add.xla``, ``xla::addmm.xla``, etc.
+
+      These counters can be retrieved using: ``torch_xla.debug.metrics.counter_names()``
+
     Args:
         cpu_stats: if ``None``, it will log CPU stats only if the accelerator is CPU.
             If ``True``, it will log CPU stats regardless of the accelerator.
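The documented key scheme can be illustrated with a small stand-alone sketch (the helper below is hypothetical and not the callback's internal code; only the ``DeviceStatsMonitor.{hook_name}/{base_metric_name}`` format is taken from the docstring above):

```python
def prefixed_stats(device_stats: dict, hook_name: str) -> dict:
    # Apply the documented key format:
    #   DeviceStatsMonitor.{hook_name}/{base_metric_name}
    return {
        f"DeviceStatsMonitor.{hook_name}/{name}": value
        for name, value in device_stats.items()
    }


stats = {"cpu_percent": 57.0, "cpu_vm_percent": 62.5}
logged = prefixed_stats(stats, "on_train_batch_start")
# logged["DeviceStatsMonitor.on_train_batch_start/cpu_percent"] == 57.0
```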

src/lightning/pytorch/strategies/deepspeed.py

Lines changed: 2 additions & 3 deletions
@@ -125,14 +125,13 @@ def __init__(
         exclude_frozen_parameters: bool = False,
     ) -> None:
         """Provides capabilities to run training using the DeepSpeed library, with training optimizations for large
-        billion parameter models. `For more information: https://pytorch-
-        lightning.readthedocs.io/en/stable/advanced/model_parallel.html#deepspeed`.
+        billion parameter models. *For more information:* :ref:`deepspeed_advanced`.

         .. warning:: This is an :ref:`experimental <versioning:Experimental API>` feature.

         Defaults have been set to enable ZeRO-Offload and some have been taken from the link below.
         These defaults have been set generally, but may require tuning for optimum performance based on your model size.
-        `For more information: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training`.
+        *For more information:* https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training.

         Arguments:
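For orientation, a minimal ZeRO-Offload setup of the kind these defaults enable could be expressed in a standalone DeepSpeed config roughly as below (a sketch using keys from the public DeepSpeed config schema linked above; the strategy's actual generated config may differ):

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    }
  }
}
```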

0 commit comments