
Commit ca88f81

awaelchli and carmocca authored
Deprecate auto_select_gpus (#16147)
Co-authored-by: Carlos Mocholí <[email protected]>
1 parent acdb145 commit ca88f81

15 files changed (+193, -56 lines)

docs/source-pytorch/accelerators/gpu_basic.rst

Lines changed: 21 additions & 6 deletions
@@ -88,10 +88,25 @@ The table below lists examples of possible input formats and how they are interpreted
 | "-1" | str | [0, 1, 2, ...] | all available GPUs |
 +------------------+-----------+---------------------+---------------------------------+
 
-.. note::
 
-    When specifying number of ``devices`` as an integer ``devices=k``, setting the trainer flag
-    ``auto_select_gpus=True`` will automatically help you find ``k`` GPUs that are not
-    occupied by other processes. This is especially useful when GPUs are configured
-    to be in "exclusive mode", such that only one process at a time can access them.
-    For more details see the :doc:`trainer guide <../common/trainer>`.
+Find usable CUDA devices
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+If you want to run several experiments at the same time on your machine, for example for a hyperparameter sweep, then you can
+use the following utility function to pick GPU indices that are "accessible", without having to change your code every time.
+
+.. code-block:: python
+
+    from lightning.pytorch.accelerators import find_usable_cuda_devices
+
+    # Find two GPUs on the system that are not already occupied
+    trainer = Trainer(accelerator="cuda", devices=find_usable_cuda_devices(2))
+
+    from lightning.lite.accelerators import find_usable_cuda_devices
+
+    # Works with LightningLite too
+    lite = LightningLite(accelerator="cuda", devices=find_usable_cuda_devices(2))
+
+
+This is especially useful when GPUs are configured to be in "exclusive compute mode", such that only one process at a time is allowed access to the device.
+This special mode is often enabled on server GPUs or systems shared among multiple users.
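The probing strategy behind `find_usable_cuda_devices` (try to allocate a tensor on each device, keep the indices where the allocation succeeds) can be sketched without any GPU present. This is an illustrative stand-in, not Lightning's API: the `probe` callable below plays the role of the `torch.tensor(0, device=...)` check, raising `RuntimeError` for a busy device.

```python
from typing import Callable, List


def find_usable_devices(num_devices: int, visible: List[int],
                        probe: Callable[[int], None]) -> List[int]:
    """Return the first `num_devices` indices for which `probe` succeeds.

    `probe(idx)` should raise RuntimeError if device `idx` is occupied,
    mirroring how moving a tensor to a busy "exclusive mode" GPU fails.
    """
    usable, busy = [], []
    for idx in visible:
        try:
            probe(idx)
        except RuntimeError:
            busy.append(idx)
            continue
        usable.append(idx)
        if len(usable) == num_devices:
            break  # exit early once enough devices are found, like the real implementation
    if len(usable) != num_devices:
        raise RuntimeError(f"Only {len(usable)} devices usable; {busy} are occupied.")
    return usable


# Simulate a 4-GPU machine where devices 0 and 2 are taken by other processes
def fake_probe(idx: int) -> None:
    if idx in (0, 2):
        raise RuntimeError("CUDA error: device busy")


print(find_usable_devices(2, [0, 1, 2, 3], fake_probe))  # → [1, 3]
```

Note the race-condition caveat from the commit still applies to any probe-then-use scheme: two processes can both see a device as free before either occupies it.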

docs/source-pytorch/common/trainer.rst

Lines changed: 0 additions & 28 deletions
@@ -314,34 +314,6 @@ before any training.
     # call tune to find the batch size
     trainer.tune(model)
 
-auto_select_gpus
-^^^^^^^^^^^^^^^^
-
-.. raw:: html
-
-    <video width="50%" max-width="400px" controls
-    poster="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/thumb/auto_select+_gpus.jpg"
-    src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/auto_select_gpus.mp4"></video>
-
-|
-
-If enabled and ``devices`` is an integer, pick available GPUs automatically.
-This is especially useful when GPUs are configured to be in "exclusive mode",
-such that only one process at a time can access them.
-
-Example::
-
-    # no auto selection (picks first 2 GPUs on system, may fail if other process is occupying)
-    trainer = Trainer(accelerator="gpu", devices=2, auto_select_gpus=False)
-
-    # enable auto selection (will find two available GPUs on system)
-    trainer = Trainer(accelerator="gpu", devices=2, auto_select_gpus=True)
-
-    # specifies all GPUs regardless of its availability
-    Trainer(accelerator="gpu", devices=-1, auto_select_gpus=False)
-
-    # specifies all available GPUs (if only one GPU is not occupied, uses one gpu)
-    Trainer(accelerator="gpu", devices=-1, auto_select_gpus=True)
 
 auto_lr_find
 ^^^^^^^^^^^^

src/lightning_lite/CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -23,6 +23,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added support for Fully Sharded Data Parallel (FSDP) training in Lightning Lite ([#14967](https://github.com/Lightning-AI/lightning/issues/14967))
 
 
+- Added `lightning_lite.accelerators.find_usable_cuda_devices` utility function ([#16147](https://github.com/PyTorchLightning/pytorch-lightning/pull/16147))
+
+
 ### Changed
 
 - The `LightningLite.run()` method is no longer abstract ([#14992](https://github.com/Lightning-AI/lightning/issues/14992))

src/lightning_lite/accelerators/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@
 from lightning_lite.accelerators.accelerator import Accelerator  # noqa: F401
 from lightning_lite.accelerators.cpu import CPUAccelerator  # noqa: F401
 from lightning_lite.accelerators.cuda import CUDAAccelerator  # noqa: F401
+from lightning_lite.accelerators.cuda import find_usable_cuda_devices  # noqa: F401
 from lightning_lite.accelerators.mps import MPSAccelerator  # noqa: F401
 from lightning_lite.accelerators.registry import _AcceleratorRegistry, call_register_accelerators
 from lightning_lite.accelerators.tpu import TPUAccelerator  # noqa: F401

src/lightning_lite/accelerators/cuda.py

Lines changed: 55 additions & 3 deletions
@@ -78,10 +78,62 @@ def register_accelerators(cls, accelerator_registry: Dict) -> None:
     )
 
 
-def _get_all_available_cuda_gpus() -> List[int]:
+def find_usable_cuda_devices(num_devices: int = -1) -> List[int]:
+    """Returns a list of all available and usable CUDA GPU devices.
+
+    A GPU is considered usable if we can successfully move a tensor to the device, and this is what this function
+    tests for each GPU on the system until the target number of usable devices is found.
+
+    A subset of GPUs on the system might be used by other processes, and if the GPU is configured to operate in
+    'exclusive' mode (configurable by the admin), then only one process is allowed to occupy it.
+
+    Args:
+        num_devices: The number of devices you want to request. By default, this function will return as many as there
+            are usable CUDA GPU devices available.
+
+    Warning:
+        If multiple processes call this function at the same time, there can be race conditions in the case where
+        both processes determine that the device is unoccupied, leading to one of them crashing later on.
     """
-    Returns:
-        A list of all available CUDA GPUs
+    visible_devices = _get_all_visible_cuda_devices()
+    if not visible_devices:
+        raise ValueError(
+            f"You requested to find {num_devices} devices but there are no visible CUDA devices on this machine."
+        )
+    if num_devices > len(visible_devices):
+        raise ValueError(
+            f"You requested to find {num_devices} devices but this machine only has {len(visible_devices)} GPUs."
+        )
+
+    available_devices = []
+    unavailable_devices = []
+
+    for gpu_idx in visible_devices:
+        try:
+            torch.tensor(0, device=torch.device("cuda", gpu_idx))
+        except RuntimeError:
+            unavailable_devices.append(gpu_idx)
+            continue
+
+        available_devices.append(gpu_idx)
+        if len(available_devices) == num_devices:
+            # exit early if we found the right number of GPUs
+            break
+
+    if len(available_devices) != num_devices:
+        raise RuntimeError(
+            f"You requested to find {num_devices} devices but only {len(available_devices)} are currently available."
+            f" The devices {unavailable_devices} are occupied by other processes and can't be used at the moment."
+        )
+    return available_devices
+
+
+def _get_all_visible_cuda_devices() -> List[int]:
+    """Returns a list of all visible CUDA GPU devices.
+
+    Devices masked by the environment variable ``CUDA_VISIBLE_DEVICES`` won't be returned here. For example, assume you
+    have 8 physical GPUs. If ``CUDA_VISIBLE_DEVICES="1,3,6"``, then this function will return the list ``[0, 1, 2]``
+    because these are the three visible GPUs after applying the mask ``CUDA_VISIBLE_DEVICES``.
     """
     return list(range(num_cuda_devices()))
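The renumbering described in the `_get_all_visible_cuda_devices` docstring (physical GPUs 1, 3, 6 become logical indices 0, 1, 2) is behavior of the CUDA runtime itself, not of Lightning. A small illustration of that mapping; `visible_indices` is a hypothetical helper written here for demonstration only:

```python
from typing import List


def visible_indices(mask: str) -> List[int]:
    """Logical device indices after applying a CUDA_VISIBLE_DEVICES-style mask.

    The CUDA runtime renumbers the devices that survive the mask starting
    from 0, so code always addresses devices as 0..n-1 regardless of which
    physical GPUs they map to.
    """
    physical = [int(d) for d in mask.split(",") if d.strip()]
    return list(range(len(physical)))


print(visible_indices("1,3,6"))  # → [0, 1, 2], matching the docstring example
```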

src/lightning_lite/utilities/device_parser.py

Lines changed: 1 addition & 1 deletion
@@ -160,7 +160,7 @@ def _get_all_available_gpus(include_cuda: bool = False, include_mps: bool = False)
     Returns:
         A list of all available GPUs
     """
-    cuda_gpus = accelerators.cuda._get_all_available_cuda_gpus() if include_cuda else []
+    cuda_gpus = accelerators.cuda._get_all_visible_cuda_devices() if include_cuda else []
     mps_gpus = accelerators.mps._get_all_available_mps_gpus() if include_mps else []
     return cuda_gpus + mps_gpus

src/pytorch_lightning/CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -76,6 +76,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Deprecated `pytorch_lightning.profiler` in favor of `pytorch_lightning.profilers` ([#16059](https://github.com/PyTorchLightning/pytorch-lightning/pull/16059))
 
 
+- Deprecated `Trainer(auto_select_gpus=...)` in favor of `pytorch_lightning.accelerators.find_usable_cuda_devices` ([#16147](https://github.com/PyTorchLightning/pytorch-lightning/pull/16147))
+
+
+- Deprecated `pytorch_lightning.tuner.auto_gpu_select.{pick_single_gpu,pick_multiple_gpus}` in favor of `pytorch_lightning.accelerators.find_usable_cuda_devices` ([#16147](https://github.com/PyTorchLightning/pytorch-lightning/pull/16147))
+
+
 - `nvidia/apex` deprecation ([#16039](https://github.com/PyTorchLightning/pytorch-lightning/pull/16039))
   * Deprecated `pytorch_lightning.plugins.NativeMixedPrecisionPlugin` in favor of `pytorch_lightning.plugins.MixedPrecisionPlugin`
   * Deprecated the `LightningModule.optimizer_step(using_native_amp=...)` argument

src/pytorch_lightning/accelerators/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+from lightning_lite.accelerators.cuda import find_usable_cuda_devices  # noqa: F401
 from lightning_lite.accelerators.registry import _AcceleratorRegistry, call_register_accelerators
 from pytorch_lightning.accelerators.accelerator import Accelerator  # noqa: F401
 from pytorch_lightning.accelerators.cpu import CPUAccelerator  # noqa: F401

src/pytorch_lightning/trainer/connectors/accelerator_connector.py

Lines changed: 12 additions & 3 deletions
@@ -107,7 +107,7 @@ def __init__(
         benchmark: Optional[bool] = None,
         replace_sampler_ddp: bool = True,
         deterministic: Optional[Union[bool, _LITERAL_WARN]] = False,
-        auto_select_gpus: bool = False,
+        auto_select_gpus: Optional[bool] = None,  # TODO: Remove in v1.10.0
         num_processes: Optional[int] = None,  # deprecated
         tpu_cores: Optional[Union[List[int], str, int]] = None,  # deprecated
         ipus: Optional[int] = None,  # deprecated
@@ -177,7 +177,7 @@ def __init__(
         self.checkpoint_io: Optional[CheckpointIO] = None
         self._amp_type_flag: Optional[str] = None  # TODO: Remove in v1.10.0
         self._amp_level_flag: Optional[str] = amp_level  # TODO: Remove in v1.10.0
-        self._auto_select_gpus: bool = auto_select_gpus
+        self._auto_select_gpus: Optional[bool] = auto_select_gpus
 
         self._check_config_and_set_final_flags(
             strategy=strategy,
@@ -558,8 +558,17 @@ def _set_devices_flag_if_auto_passed(self) -> None:
             self._devices_flag = self.accelerator.auto_device_count()
 
     def _set_devices_flag_if_auto_select_gpus_passed(self) -> None:
+        if self._auto_select_gpus is not None:
+            rank_zero_deprecation(
+                "The Trainer argument `auto_select_gpus` has been deprecated in v1.9.0 and will be removed in v1.10.0."
+                " Please use the function `pytorch_lightning.accelerators.find_usable_cuda_devices` instead."
+            )
         if self._auto_select_gpus and isinstance(self._gpus, int) and isinstance(self.accelerator, CUDAAccelerator):
-            self._devices_flag = pick_multiple_gpus(self._gpus)
+            self._devices_flag = pick_multiple_gpus(
+                self._gpus,
+                # we already show a deprecation message when user sets Trainer(auto_select_gpus=...)
+                _show_deprecation=False,
+            )
             log.info(f"Auto select gpus: {self._devices_flag}")
 
     def _choose_and_init_cluster_environment(self) -> ClusterEnvironment:
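The connector change above uses a common deprecation idiom: the default moves from `False` to `None`, so any explicitly passed value, including `False`, is distinguishable from "argument not given" and triggers the warning exactly once. A minimal standalone sketch of that pattern; the names are illustrative, not Lightning's actual internals:

```python
import warnings
from typing import Optional


def make_trainer(auto_select_gpus: Optional[bool] = None) -> bool:
    """Sentinel-based deprecation: None means the argument was not passed."""
    if auto_select_gpus is not None:
        # Any explicit value, even the old default False, means the caller
        # used the deprecated flag, so we warn.
        warnings.warn(
            "`auto_select_gpus` is deprecated; use `find_usable_cuda_devices` instead.",
            DeprecationWarning,
        )
    # Fall back to the old default behavior when the flag was not given.
    return bool(auto_select_gpus)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    make_trainer(auto_select_gpus=False)  # still warns: False is not None
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

Had the default stayed `False`, passing `auto_select_gpus=False` explicitly would be indistinguishable from omitting it, and the warning could not be targeted at actual users of the flag.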

src/pytorch_lightning/trainer/trainer.py

Lines changed: 5 additions & 1 deletion
@@ -123,7 +123,7 @@ def __init__(
         num_processes: Optional[int] = None,  # TODO: Remove in 2.0
         devices: Optional[Union[List[int], str, int]] = None,
         gpus: Optional[Union[List[int], str, int]] = None,  # TODO: Remove in 2.0
-        auto_select_gpus: bool = False,
+        auto_select_gpus: Optional[bool] = None,  # TODO: Remove in 2.0
         tpu_cores: Optional[Union[List[int], str, int]] = None,  # TODO: Remove in 2.0
         ipus: Optional[int] = None,  # TODO: Remove in 2.0
         enable_progress_bar: bool = True,
@@ -210,6 +210,10 @@ def __init__(
                 that only one process at a time can access them.
                 Default: ``False``.
 
+                .. deprecated:: v1.9
+                    ``auto_select_gpus`` has been deprecated in v1.9.0 and will be removed in v1.10.0.
+                    Please use the function :func:`~lightning_lite.accelerators.cuda.find_usable_cuda_devices` instead.
+
             benchmark: The value (``True`` or ``False``) to set ``torch.backends.cudnn.benchmark`` to.
                 The value for ``torch.backends.cudnn.benchmark`` set in the current session will be used
                 (``False`` if not manually set). If :paramref:`~pytorch_lightning.trainer.Trainer.deterministic` is set
