
Proposal: GPU-safe utilities for Lightning (based on Universal CUDA Tools) #20782

@Tunahanyrd


Description & Motivation

Hi Lightning team,

I'm opening this feature request based on a suggestion from the PyTorch core community (see links below). I recently proposed a small but modular utility set called Universal CUDA Tools that helps simplify and safeguard GPU execution in PyTorch workflows.

Although it was initially proposed for torch core (torch.utils / torch.cuda), a core contributor (ptrblck from NVIDIA) suggested that it might be a better fit for high-level libraries such as PyTorch Lightning.

References:


What It Offers

This toolset wraps common device and memory management logic into reusable components, including:

  • Safe device selection and fallback (GPU → CPU)
  • AMP context management
  • Automatic .to(device) wrapping and tensor conversion
  • OOM retry handling with optional fallback
  • Batch size test utilities
  • Decorators and context managers for device safety

These tools are designed to minimize boilerplate in long-running training loops, especially on low-memory GPUs.
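
As a rough illustration of the first bullet, the device-selection piece could be as small as the helper below (pick_device is a hypothetical name for this sketch, not the repo's actual API):

import torch

def pick_device(preferred: str = "cuda") -> torch.device:
    # Fall back to CPU when the preferred CUDA device is not available.
    if preferred.startswith("cuda") and torch.cuda.is_available():
        return torch.device(preferred)
    return torch.device("cpu")

The other bullets layer retry, AMP, and tensor-conversion logic on top of this kind of check.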


Compatibility with Lightning

Even though the original codebase targets plain PyTorch, the components are modular and could be selectively adapted into Lightning utilities.

For example:

@cuda(device="cuda", retry=1, auto_tensorize=True)
def training_step(x):
    ...

Or more directly:

with DeviceContext("cuda", amp=True, fallback="cpu"):
    outputs = model(inputs)
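
Under the hood, the retry-and-fallback behaviour could be something as small as the following (a rough sketch under my own assumptions, not the exact implementation from the repo):

import functools
import torch

def cuda_retry(retries: int = 1):
    # Hypothetical helper: re-run a step after a CUDA OOM, up to `retries` extra times.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except RuntimeError as err:
                    # Only swallow CUDA OOM errors, and only while retries remain.
                    if "out of memory" not in str(err) or attempt == retries:
                        raise
                    torch.cuda.empty_cache()  # release cached blocks before retrying
        return wrapper
    return decorator

Checking the RuntimeError message keeps the sketch compatible with older PyTorch releases; newer versions also expose torch.cuda.OutOfMemoryError, which could be caught directly.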

Background

I’m not a native English speaker and I’m not deeply experienced with Lightning. This proposal emerged from real-world frustration during GPU-based training on limited resources, and I used AI tools (ChatGPT) to help design the API and documentation.

My goal is to open the idea for discussion and see if any part of it aligns with Lightning’s philosophy.

Thanks a lot for your time!

Pitch

This proposal introduces a modular set of GPU-safe utilities inspired by real-world training frustrations. It focuses on simplifying repetitive device-management logic in PyTorch and may fit naturally into Lightning's abstraction layer.

The core components can wrap functions or entire training steps (a sketch follows the list below) to:

  • auto-manage .to(device) logic
  • handle CUDA OOM exceptions
  • enable AMP easily
  • fallback to CPU if needed
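
A minimal sketch of such a wrapper as a context manager (device_context is illustrative only, not the repo's actual DeviceContext):

import contextlib
import torch

@contextlib.contextmanager
def device_context(preferred: str = "cuda", amp: bool = True):
    # Pick the requested device if available, otherwise fall back to CPU,
    # and enable autocast only when actually running on CUDA.
    device = torch.device(preferred if torch.cuda.is_available() else "cpu")
    with torch.autocast(device_type=device.type, enabled=amp and device.type == "cuda"):
        yield device

Wrapping a forward pass in this context covers the AMP and fallback items above; the caller only has to move inputs to the yielded device.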

Alternatives

The current options are to call .to(device) by hand, wrap steps in try/except blocks to catch OOM errors, and enable AMP manually. These approaches are repetitive, error-prone, and hard to reuse. Lightning already abstracts many of these concerns, but utility-level tools could simplify training step definitions even further.
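
For comparison, the manual pattern this replaces usually ends up looking something like this (a sketch with a toy model, not code from the repo):

import torch
from torch import nn

model = nn.Linear(32, 2)
inputs = torch.randn(8, 32)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

try:
    # Enable mixed precision only on the GPU path.
    with torch.autocast(device_type=device.type, enabled=device.type == "cuda"):
        outputs = model(inputs.to(device))
except RuntimeError as err:
    if "out of memory" in str(err):
        # Manual fallback: move everything back to CPU and retry once.
        model = model.cpu()
        outputs = model(inputs.cpu())
    else:
        raise

Every training script repeats some variant of this block, which is exactly the boilerplate the utilities above try to absorb.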

Additional context

This utility set was initially proposed for PyTorch core but was redirected here based on feedback:

Docs: https://www.funeralcs.com/posts/cuda_tools_dokuman/
Repo: https://github.com/Tunahanyrd/universal-cuda-tools

The proposal originated from real training bottlenecks, and the code was designed with help from AI tools to be lightweight, composable, and optional.
