CUDA OOM during validation of first epoch #10959
-
Hi all, my model validation code (see below) appears to leak memory, which leads to a rapid increase in GPU memory usage and, eventually, to an OOM error just before the validation loop completes (about 90% of the way through). CUDA memory usage hovers around 8-9 GB during training, then climbs rapidly to ca. 15+ GB during validation, hitting the memory limit of my GPU card. What am I doing wrong here?
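The original validation code is not shown here; as a minimal sketch (hypothetical model and metric names), a `validation_step` along these lines reproduces the pattern described in the reply below, where the batch is returned and therefore retained on the GPU:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(128, 10)  # hypothetical model

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.net(x), y)
        self.log("val_loss", loss)
        # Returning the batch keeps a reference to these CUDA tensors until
        # the end of the validation epoch, so GPU memory grows batch by batch.
        return batch

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```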
Decreasing (or increasing) the validation batch size doesn't make the problem go away. Any thoughts?
Later edit: Skipping the validation loop entirely (one way to do this is sketched at the end of this post) gets rid of the OOM error (the trainer makes it past the 1st epoch). Also, I am running in mixed precision (although I suspect precision doesn't have much to do with this issue?). Thank you!
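For reference, since the original snippet is not shown, here is a minimal sketch of one assumed way to skip the validation loop in Lightning, using the `limit_val_batches` Trainer flag:

```python
import pytorch_lightning as pl

# Disabling validation entirely: with limit_val_batches=0 the Trainer runs
# no validation batches at all, which is one assumed way to "skip" the loop.
trainer = pl.Trainer(
    max_epochs=10,
    precision=16,         # mixed precision, as mentioned above
    limit_val_batches=0,  # run zero validation batches
)
```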
-
Dear @mishooax,
You are returning the batch from the validation_step, which means it is stored. As it is currently on the GPU, after X batches you would get an OOM.
Unless you need the batch on epoch end, I would recommend not returning anything from the validation_step.
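A minimal sketch of the suggested change (hypothetical model and metric names): log the scalar with `self.log` and return nothing, so no CUDA tensors are accumulated between validation batches.

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(128, 10)  # hypothetical model

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.net(x), y)
        # Log the scalar and return nothing; Lightning then has no per-batch
        # outputs to accumulate on the GPU over the validation epoch.
        self.log("val_loss", loss, prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

If per-batch outputs really are needed at epoch end, a common workaround is to return detached CPU copies instead, e.g. `return {"val_loss": loss.detach().cpu()}`, so only host memory grows rather than GPU memory.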