CUDA OOM during evaluation after Epoch 0 on A100 80GB (RFDETRSegSmall Seg) — OOM at lwdetr.py:774 res_i['masks'] = masks_i > 0.0 (alloc 2.32 GiB) even with run_test=False / eval=False #646

@Aasim-ComputerVision

Search before asking

  • I have searched the RF-DETR issues and found no similar bug report.

Bug

Training of RFDETRSegSmall completes Epoch 0 (2512 iterations) successfully. An evaluation phase, logged with the prefix "Test:", then starts and crashes with:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB

The stack trace points to mask postprocessing:

rfdetr/models/lwdetr.py, line 774: res_i['masks'] = masks_i > 0.0
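As a rough sanity check on the size of that allocation, the sketch below estimates how large the boolean tensor produced by `masks_i > 0.0` would be. It relies on the fact that a PyTorch bool tensor stores one byte per element. The shape used here (300 masks at 2160 x 3840) is purely hypothetical and was picked only because it lands near the reported figure; the issue does not state the real mask shape, and the helper name is made up for illustration.

```python
GIB = 1024 ** 3  # bytes per GiB, the unit used in the PyTorch OOM message


def bool_mask_bytes(num_masks: int, height: int, width: int) -> int:
    """Bytes needed for a boolean tensor of shape (num_masks, height, width).

    A bool tensor uses one byte per element, so the size equals the element count.
    """
    return num_masks * height * width


# Hypothetical shape: the real values are NOT given in the report.
size = bool_mask_bytes(num_masks=300, height=2160, width=3840)
print(f"{size / GIB:.2f} GiB")  # -> 2.32 GiB
```

If the real shape is similar, the thresholded output alone costs this much per image, on top of the float masks it is computed from, so the allocation grows linearly with both the number of masks kept and the image resolution.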

Despite passing run_test=False and (from printed args) eval=False, the evaluation still runs at the end of the epoch and triggers the OOM.

Environment

GPU / Driver

From nvidia-smi:

GPU: NVIDIA A100-SXM4-80GB

Driver: 565.57.01

CUDA: 12.7

nvidia-smi snapshot (from the run) showed:

Memory usage: 72415 MiB / 81920 MiB

GPU-Util: 0%

Processes table: empty (this looked odd because memory was still reported as used)

Timestamp in that output:

Fri Feb 6 21:09:49 2026
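The headroom implied by that snapshot can be worked out directly. The sketch below parses the memory field exactly as quoted above and reports what was left; the function name is hypothetical, and the snapshot may not have been taken at the moment of the failure, so this is only an approximate reading.

```python
import re


def parse_memory_field(field: str) -> tuple[int, int]:
    """Return (used_mib, total_mib) from an nvidia-smi style 'X MiB / Y MiB' string."""
    match = re.search(r"(\d+)\s*MiB\s*/\s*(\d+)\s*MiB", field)
    if match is None:
        raise ValueError(f"unrecognised memory field: {field!r}")
    return int(match.group(1)), int(match.group(2))


used_mib, total_mib = parse_memory_field("72415 MiB / 81920 MiB")
free_mib = total_mib - used_mib

print(f"free: {free_mib} MiB ({free_mib / 1024:.2f} GiB)")   # -> free: 9505 MiB (9.28 GiB)
print(f"used: {100 * used_mib / total_mib:.1f}% of device")  # -> used: 88.4% of device
```

At snapshot time roughly 9.3 GiB was still outside the 72415 MiB in use, which is more than the 2.32 GiB request, so the snapshot alone does not explain the failure; memory held at the instant of the crash was presumably higher.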

Python

From stack paths:

Python 3.11 (/usr/local/lib/python3.11/dist-packages/...)

CPU / RAM

From top snapshot:

load average: 33.87, 37.98, 49.11

CPU: 13.6 us, 1.4 sy, 85.0 id (lots of idle CPU overall)

RAM: 1031853 MiB total, 67125 MiB free, 860969 MiB buff/cache

Swap: 0

Multiple pt_main_thread processes were active; one showed very high CPU usage (2523%), while the others sat at roughly 19–22%.

Minimal Reproducible Example

model.train(
    dataset_dir=data_dir,
    output_dir=out_dir,
    epochs=20,
    batch_size=16,
    grad_accum_steps=2,
    amp=True,
    run_test=False,
    fp16_eval=True,
    eval_max_dets=100,
    do_benchmark=False,
    num_workers=12,
    early_stopping=False,
    tensorboard=True,
)

Additional

Also printed during training:

Grad accum steps: 2

Total batch size: 32

LENGTH OF DATA LOADER: 2512
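These logged values are consistent with the arguments in the reproduction call, as the short check below shows. The dataset-size estimates are an assumption: the report does not say whether each data loader step yields `batch_size` or `batch_size * grad_accum_steps` images, so both readings are listed.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int) -> int:
    """Images contributing to one optimizer step under gradient accumulation."""
    return batch_size * grad_accum_steps


batch_size = 16         # from the model.train(...) call
grad_accum_steps = 2    # from the model.train(...) call
loader_length = 2512    # "LENGTH OF DATA LOADER" in the log

total = effective_batch_size(batch_size, grad_accum_steps)
print(total)  # -> 32, matching "Total batch size: 32"

# Two possible readings of the loader length (which one applies is an assumption):
images_if_step_is_micro_batch = loader_length * batch_size  # 40192
images_if_step_is_total_batch = loader_length * total       # 80384
print(images_if_step_is_micro_batch, images_if_step_is_total_batch)
```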

Are you willing to submit a PR?

  • Yes, I'd like to help by submitting a PR!
