evaluation_loop memory leak

## 🐛 Bug

I first hit this bug at step 13 of the evaluation loop. This was surprising because training ran smoothly, yet evaluation failed with a CUDA out-of-memory error. I then printed the memory allocation between evaluation batches and saw that allocated memory increased slightly after each batch.
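The per-batch measurement described above can be sketched as follows, assuming the readings come from `torch.cuda.memory_allocated()`. `BatchMemoryLog` and its `probe` argument are hypothetical helpers, not part of PyTorch or Lightning. In a real run, `record(batch_idx)` would be called between evaluation batches, for example from a callback's `on_validation_batch_end` hook; the `probe` parameter only exists so the logic can be exercised without a GPU.

```python
from typing import Callable, List, Optional, Tuple


class BatchMemoryLog:
    """Record the bytes currently allocated after each evaluation batch.

    `probe` is any zero-argument callable returning allocated bytes. If omitted,
    torch.cuda.memory_allocated is used (imported lazily, so it needs torch with CUDA).
    """

    def __init__(self, probe: Optional[Callable[[], int]] = None) -> None:
        if probe is None:
            import torch  # lazy import: only needed when no probe is injected

            probe = torch.cuda.memory_allocated
        self._probe = probe
        self.readings: List[Tuple[int, int]] = []  # (batch_idx, allocated_bytes)

    def record(self, batch_idx: int) -> int:
        """Take one reading, store it with its batch index, and return it."""
        allocated = int(self._probe())
        self.readings.append((batch_idx, allocated))
        return allocated

    def deltas(self) -> List[int]:
        """Change in allocated bytes between consecutive readings."""
        return [b[1] - a[1] for a, b in zip(self.readings, self.readings[1:])]


if __name__ == "__main__":
    # Stand-in probe replaying the two readings printed in this issue (batch 0 and 13).
    fake = iter([15077376, 15086080])
    log = BatchMemoryLog(probe=lambda: next(fake))
    log.record(0)
    log.record(13)
    print(log.deltas())  # [8704]
```

A log whose deltas stay positive across many batches, as in the readings above, is the pattern this issue describes.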

## Please reproduce using the BoringModel
[BoringModel](https://gist.github.com/ytsheng/0de406bc09f5e3fb5804ed2a9d8ee1d2)

### To Reproduce

Run the file above with `python val_mem_leak.py` and watch the printed output. Below are the memory summaries printed at validation batch 0 and batch 13; compare the Allocated memory rows. Current usage grows from 14724 KB to 14732 KB (15077376 to 15086080 bytes), and the number of active allocations grows from 54 to 71.
This small example does not run out of memory, but if the leak is left unfixed the growth will eventually cause an OOM. My actual experiment used a much bigger model, and its evaluation run did go out of memory.
```
memory summary: |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   14724 KB |   14761 KB |   22139 KB |    7415 KB |
|       from large pool |   13550 KB |   13592 KB |   20938 KB |    7388 KB |
|       from small pool |    1174 KB |    1192 KB |    1201 KB |      27 KB |
|---------------------------------------------------------------------------|
| Active memory         |   14724 KB |   14761 KB |   22139 KB |    7415 KB |
|       from large pool |   13550 KB |   13592 KB |   20938 KB |    7388 KB |
|       from small pool |    1174 KB |    1192 KB |    1201 KB |      27 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   22528 KB |   22528 KB |   22528 KB |       0 B  |
|       from large pool |   20480 KB |   20480 KB |   20480 KB |       0 B  |
|       from small pool |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |    7804 KB |   20075 KB |   31969 KB |   24165 KB |
|       from large pool |    6930 KB |   18412 KB |   25800 KB |   18870 KB |
|       from small pool |     874 KB |    2047 KB |    6168 KB |    5294 KB |
|---------------------------------------------------------------------------|
| Allocations           |      54    |      56    |      72    |      18    |
|       from large pool |       4    |       4    |       5    |       1    |
|       from small pool |      50    |      52    |      67    |      17    |
|---------------------------------------------------------------------------|
| Active allocs         |      54    |      56    |      72    |      18    |
|       from large pool |       4    |       4    |       5    |       1    |
|       from small pool |      50    |      52    |      67    |      17    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       2    |       2    |       2    |       0    |
|       from large pool |       1    |       1    |       1    |       0    |
|       from small pool |       1    |       1    |       1    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       3    |       3    |       9    |       6    |
|       from large pool |       1    |       1    |       1    |       0    |
|       from small pool |       2    |       2    |       8    |       6    |
|===========================================================================|

memory reserved: 23068672
memory allocated: 15077376
validation steps...0
```




```
memory summary: |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   14732 KB |   14761 KB |   22297 KB |    7564 KB |
|       from large pool |   13550 KB |   13592 KB |   20938 KB |    7388 KB |
|       from small pool |    1182 KB |    1194 KB |    1358 KB |     176 KB |
|---------------------------------------------------------------------------|
| Active memory         |   14732 KB |   14761 KB |   22297 KB |    7564 KB |
|       from large pool |   13550 KB |   13592 KB |   20938 KB |    7388 KB |
|       from small pool |    1182 KB |    1194 KB |    1358 KB |     176 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   22528 KB |   22528 KB |   22528 KB |       0 B  |
|       from large pool |   20480 KB |   20480 KB |   20480 KB |       0 B  |
|       from small pool |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |    7795 KB |   20075 KB |   32118 KB |   24322 KB |
|       from large pool |    6930 KB |   18412 KB |   25800 KB |   18870 KB |
|       from small pool |     865 KB |    2047 KB |    6317 KB |    5452 KB |
|---------------------------------------------------------------------------|
| Allocations           |      71    |      73    |     196    |     125    |
|       from large pool |       4    |       4    |       5    |       1    |
|       from small pool |      67    |      69    |     191    |     124    |
|---------------------------------------------------------------------------|
| Active allocs         |      71    |      73    |     196    |     125    |
|       from large pool |       4    |       4    |       5    |       1    |
|       from small pool |      67    |      69    |     191    |     124    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       2    |       2    |       2    |       0    |
|       from large pool |       1    |       1    |       1    |       0    |
|       from small pool |       1    |       1    |       1    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       3    |       5    |      69    |      66    |
|       from large pool |       1    |       1    |       1    |       0    |
|       from small pool |       2    |       4    |      68    |      66    |
|===========================================================================|

memory reserved: 23068672
memory allocated: 15086080
validation steps...13
```
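To compare snapshots like the two above without reading the tables by eye, a small parser can pull the Cur Usage column out of `torch.cuda.memory_summary()` text and diff it. This is a sketch that assumes the row layout shown in this issue (PyTorch 1.8.1); other versions may format the table differently, and `current_usage` and `diff_usage` are hypothetical helpers.

```python
import re
from typing import Dict

# Unit suffixes as they appear in the summary table (binary multiples assumed).
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4, "PB": 1024**5}

# Matches "| <metric> | <cur usage> [unit] |" and ignores the remaining columns.
_ROW = re.compile(
    r"^\|\s*(?P<metric>[A-Za-z][A-Za-z \-]*?)\s*\|\s*(?P<value>\d+)\s*(?P<unit>[KMGTP]?B)?\s*\|"
)


def current_usage(summary: str) -> Dict[str, int]:
    """Map each top-level metric row of a memory summary to its Cur Usage column."""
    usage: Dict[str, int] = {}
    for line in summary.splitlines():
        match = _ROW.match(line.strip())
        if match is None or match.group("metric").startswith("from "):
            continue  # skip borders, headers, and the large/small pool breakdown rows
        unit = match.group("unit")
        usage[match.group("metric")] = int(match.group("value")) * (_UNITS[unit] if unit else 1)
    return usage


def diff_usage(before: str, after: str) -> Dict[str, int]:
    """Per-metric change in Cur Usage between two summaries."""
    old, new = current_usage(before), current_usage(after)
    return {metric: new[metric] - old[metric] for metric in old if metric in new}


if __name__ == "__main__":
    # Two rows copied from the batch-0 and batch-13 summaries in this issue.
    before = (
        "| Allocated memory      |   14724 KB |   14761 KB |   22139 KB |    7415 KB |\n"
        "| Allocations           |      54    |      56    |      72    |      18    |"
    )
    after = (
        "| Allocated memory      |   14732 KB |   14761 KB |   22297 KB |    7564 KB |\n"
        "| Allocations           |      71    |      73    |     196    |     125    |"
    )
    print(diff_usage(before, after))  # {'Allocated memory': 8192, 'Allocations': 17}
```

On these rows it reports an 8192-byte increase in Allocated memory and 17 more allocations. The byte figure is coarser than the 8704-byte difference between the two `memory allocated` lines because the table only shows whole KB.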

### Expected behavior

Allocated memory should stay constant from one evaluation batch to the next instead of growing.
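The expected behavior can be phrased as a regression check over per-batch readings of allocated bytes. `allocation_is_stable`, `warmup`, and `tolerance_bytes` are hypothetical; skipping a warm-up batch by default is an assumption (the first batches may legitimately allocate buffers once), not something established in this report.

```python
from typing import Sequence


def allocation_is_stable(readings: Sequence[int], warmup: int = 1, tolerance_bytes: int = 0) -> bool:
    """Return True if every post-warm-up reading stays within tolerance of the first one.

    `readings` are allocated-byte values taken between evaluation batches.
    `warmup` is the number of leading readings to ignore.
    """
    steady = list(readings[warmup:])
    if len(steady) < 2:
        return True  # nothing to compare against
    baseline = steady[0]
    return all(abs(value - baseline) <= tolerance_bytes for value in steady[1:])


if __name__ == "__main__":
    # Readings printed in this issue at batch 0 and batch 13.
    issue_readings = [15077376, 15086080]
    print(allocation_is_stable(issue_readings, warmup=0))  # False: grew by 8704 bytes
```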

### Environment
```
PyTorch version: 1.8.1+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100

Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-lightning==1.4.0.dev0
[pip3] torch==1.8.1
[pip3] torchaudio==0.8.1
[pip3] torchmetrics==0.4.1
[pip3] torchtext==0.5.0
[pip3] torchvision==0.9.1
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.1.243             h6bb024c_0
[conda] mkl                       2020.2                      256
[conda] mkl-service               2.3.0            py38he904b0f_0
[conda] mkl_fft                   1.2.0            py38h23d657b_0
[conda] mkl_random                1.1.1            py38h0573a6f_0
[conda] numpy                     1.19.4                   pypi_0    pypi
[conda] numpy-base                1.19.2           py38hfa32c7d_0
[conda] pytorch-lightning         1.4.0.dev0               pypi_0    pypi
[conda] torch                     1.8.1                    pypi_0    pypi
[conda] torchaudio                0.8.1                    pypi_0    pypi
[conda] torchmetrics              0.3.2                    pypi_0    pypi
[conda] torchtext                 0.5.0                    pypi_0    pypi
[conda] torchvision               0.9.1                    pypi_0    pypi
```

evaluation_loop memory leak #8453
