Description
🐛 Bug
When I first encountered the bug, it surfaced at step 13 of the evaluation loop. This was surprising because training had been running smoothly, yet I received a CUDA out-of-memory error during evaluation. I then printed the memory allocation between evaluation batches and saw that the allocated memory was increasing slightly.
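The per-batch check described above could be sketched roughly as follows. This is a minimal sketch and not the code from the reproduction script: `MemoryGrowthTracker` and its `read_bytes` parameter are hypothetical names introduced for illustration. In a real run, `read_bytes` would presumably be `torch.cuda.memory_allocated`, which returns the bytes currently occupied by tensors on the device; the reader is passed in so the helper also runs on a machine without a GPU.

```python
from typing import Callable, List, Tuple


class MemoryGrowthTracker:
    """Record one memory reading per batch and report growth since the first."""

    def __init__(self, read_bytes: Callable[[], int]) -> None:
        # read_bytes is a hypothetical hook; pass torch.cuda.memory_allocated
        # to track GPU tensor memory in a real evaluation loop.
        self.read_bytes = read_bytes
        self.readings: List[Tuple[int, int]] = []

    def record(self, batch_idx: int) -> int:
        """Store the current reading and return the growth since the first one."""
        current = self.read_bytes()
        self.readings.append((batch_idx, current))
        return current - self.readings[0][1]


if __name__ == "__main__":
    # Stand-in reader that replays the two "memory allocated" values
    # reported in this issue (batch idx 0 and batch idx 13).
    fake_values = iter([15077376, 15086080])
    tracker = MemoryGrowthTracker(lambda: next(fake_values))
    for idx in (0, 13):
        print(f"batch {idx}: +{tracker.record(idx)} bytes since first reading")
```

Calling `record` at the end of every validation step makes a steady upward drift easy to spot, since a healthy evaluation loop should report a growth that stays flat after the first few batches.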
Please reproduce using the BoringModel
To Reproduce
Run the above file with python val_mem_leak.py and observe the printed output. I am sharing a segment here to illustrate the slow increase in memory allocation. Compare the Allocated memory rows between batch idx 0 and 13.
While this script does not trigger an OOM, the leak eventually will if left unfixed. In my experiment, which used a much bigger model, the evaluation run hit an OOM.
memory summary: |===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 14724 KB | 14761 KB | 22139 KB | 7415 KB |
| from large pool | 13550 KB | 13592 KB | 20938 KB | 7388 KB |
| from small pool | 1174 KB | 1192 KB | 1201 KB | 27 KB |
|---------------------------------------------------------------------------|
| Active memory | 14724 KB | 14761 KB | 22139 KB | 7415 KB |
| from large pool | 13550 KB | 13592 KB | 20938 KB | 7388 KB |
| from small pool | 1174 KB | 1192 KB | 1201 KB | 27 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 22528 KB | 22528 KB | 22528 KB | 0 B |
| from large pool | 20480 KB | 20480 KB | 20480 KB | 0 B |
| from small pool | 2048 KB | 2048 KB | 2048 KB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 7804 KB | 20075 KB | 31969 KB | 24165 KB |
| from large pool | 6930 KB | 18412 KB | 25800 KB | 18870 KB |
| from small pool | 874 KB | 2047 KB | 6168 KB | 5294 KB |
|---------------------------------------------------------------------------|
| Allocations | 54 | 56 | 72 | 18 |
| from large pool | 4 | 4 | 5 | 1 |
| from small pool | 50 | 52 | 67 | 17 |
|---------------------------------------------------------------------------|
| Active allocs | 54 | 56 | 72 | 18 |
| from large pool | 4 | 4 | 5 | 1 |
| from small pool | 50 | 52 | 67 | 17 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 2 | 2 | 2 | 0 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 1 | 1 | 1 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 3 | 3 | 9 | 6 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 2 | 2 | 8 | 6 |
|===========================================================================|
memory reserved: 23068672
memory allocated: 15077376
validation steps...0
memory summary: |===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 14732 KB | 14761 KB | 22297 KB | 7564 KB |
| from large pool | 13550 KB | 13592 KB | 20938 KB | 7388 KB |
| from small pool | 1182 KB | 1194 KB | 1358 KB | 176 KB |
|---------------------------------------------------------------------------|
| Active memory | 14732 KB | 14761 KB | 22297 KB | 7564 KB |
| from large pool | 13550 KB | 13592 KB | 20938 KB | 7388 KB |
| from small pool | 1182 KB | 1194 KB | 1358 KB | 176 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 22528 KB | 22528 KB | 22528 KB | 0 B |
| from large pool | 20480 KB | 20480 KB | 20480 KB | 0 B |
| from small pool | 2048 KB | 2048 KB | 2048 KB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 7795 KB | 20075 KB | 32118 KB | 24322 KB |
| from large pool | 6930 KB | 18412 KB | 25800 KB | 18870 KB |
| from small pool | 865 KB | 2047 KB | 6317 KB | 5452 KB |
|---------------------------------------------------------------------------|
| Allocations | 71 | 73 | 196 | 125 |
| from large pool | 4 | 4 | 5 | 1 |
| from small pool | 67 | 69 | 191 | 124 |
|---------------------------------------------------------------------------|
| Active allocs | 71 | 73 | 196 | 125 |
| from large pool | 4 | 4 | 5 | 1 |
| from small pool | 67 | 69 | 191 | 124 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 2 | 2 | 2 | 0 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 1 | 1 | 1 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 3 | 5 | 69 | 66 |
| from large pool | 1 | 1 | 1 | 0 |
| from small pool | 2 | 4 | 68 | 66 |
|===========================================================================|
memory reserved: 23068672
memory allocated: 15086080
validation steps...13
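To put a number on the drift in the excerpt above, the two "memory allocated" lines can be compared directly. The following is a minimal sketch under stated assumptions: the helper names `allocated_by_step` and `growth_per_step` are hypothetical, and the parser assumes the log layout shown here, where each "memory allocated" line is followed by the "validation steps..." line it belongs to.

```python
import re
from typing import List, Tuple

ALLOCATED = re.compile(r"memory allocated:\s*(\d+)")
STEP = re.compile(r"validation steps\.\.\.(\d+)")


def allocated_by_step(log_text: str) -> List[Tuple[int, int]]:
    """Pair each 'memory allocated' value with the step number printed after it."""
    pairs: List[Tuple[int, int]] = []
    pending = None  # last allocated value not yet matched to a step
    for line in log_text.splitlines():
        match = ALLOCATED.search(line)
        if match:
            pending = int(match.group(1))
            continue
        match = STEP.search(line)
        if match and pending is not None:
            pairs.append((int(match.group(1)), pending))
            pending = None
    return pairs


def growth_per_step(pairs: List[Tuple[int, int]]) -> float:
    """Average byte growth per step between the first and last reading."""
    if len(pairs) < 2 or pairs[0][0] == pairs[-1][0]:
        raise ValueError("need readings from at least two distinct steps")
    (first_step, first_bytes), (last_step, last_bytes) = pairs[0], pairs[-1]
    return (last_bytes - first_bytes) / (last_step - first_step)


if __name__ == "__main__":
    excerpt = (
        "memory allocated: 15077376\n"
        "validation steps...0\n"
        "memory allocated: 15086080\n"
        "validation steps...13\n"
    )
    pairs = allocated_by_step(excerpt)
    print(f"{growth_per_step(pairs):.1f} bytes per validation step")
```

For the two readings in this report, the difference is 15086080 - 15077376 = 8704 bytes over 13 steps, or roughly 670 bytes per validation step. That is small for the BoringModel, but consistent with an eventual OOM on a much larger model.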
Expected behavior
Memory allocation should remain constant across validation batches.
Environment
PyTorch version: 1.8.1+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100
Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-lightning==1.4.0.dev0
[pip3] torch==1.8.1
[pip3] torchaudio==0.8.1
[pip3] torchmetrics==0.4.1
[pip3] torchtext==0.5.0
[pip3] torchvision==0.9.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.1.243 h6bb024c_0
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py38he904b0f_0
[conda] mkl_fft 1.2.0 py38h23d657b_0
[conda] mkl_random 1.1.1 py38h0573a6f_0
[conda] numpy 1.19.4 pypi_0 pypi
[conda] numpy-base 1.19.2 py38hfa32c7d_0
[conda] pytorch-lightning 1.4.0.dev0 pypi_0 pypi
[conda] torch 1.8.1 pypi_0 pypi
[conda] torchaudio 0.8.1 pypi_0 pypi
[conda] torchmetrics 0.3.2 pypi_0 pypi
[conda] torchtext 0.5.0 pypi_0 pypi
[conda] torchvision 0.9.1 pypi_0 pypi