Description
Search before asking
- I have searched the RF-DETR issues and found no similar bug report.
Bug
Hello!
I'm about to start a new training session for my custom model, but I've hit a problem with the latest version (1.2.1) that did not occur with the older 1.1.0. Here are my parameters; note that I'm experimenting on CPU and system memory for now, since I want to be sure everything works before renting a server:
model.train(
    dataset_dir=data_dir,
    device='cpu',
    num_workers=1,
    epochs=num_epochs,
    batch_size=1,
    grad_accum_steps=16,
    lr=learn,
    output_dir=output_path,
    resolution=1008,
    weight_decay=1e-8,
    checkpoint_interval=cp_interval,
    resume=cp_resume,
    early_stopping=True,
    early_stopping_patience=15,
    num_queries=300,
    num_select=300,
    dec_layers=6,
)
I have no problem training with these parameters in the old version (I managed to start an epoch, even though it was very slow on CPU), but the latest version raises the following error: RuntimeError: Inference tensors cannot be saved for backward. To work around you can make a clone to get a normal tensor and use it in autograd.
I don't know where it comes from, and the full error output doesn't really help me figure out how to fix it. Any idea what's happening?
Thank you for reading!
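For what it's worth, the message itself comes from PyTorch's inference-mode machinery: a tensor created under `torch.inference_mode()` cannot be saved by autograd for the backward pass, and cloning it is the workaround the message suggests. A minimal standalone sketch (plain PyTorch, not RF-DETR; the tensor names here are illustrative only) that reproduces the same RuntimeError and shows the clone fix:

```python
import torch

# An "inference tensor": created under torch.inference_mode()
with torch.inference_mode():
    w = torch.ones(3)

x = torch.ones(3, requires_grad=True)

# Multiplying saves `w` for backward, which is forbidden for inference tensors
try:
    (x * w).sum().backward()
except RuntimeError as e:
    print(e)  # Inference tensors cannot be saved for backward. ...

# Workaround suggested by the message: clone to get a normal tensor
w_normal = w.clone()
(x * w_normal).sum().backward()
print(x.grad)  # tensor([1., 1., 1.])
```

So somewhere in 1.2.1's training path, a tensor produced under `inference_mode` (e.g. from loading pretrained weights or a frozen forward pass) is presumably being reused in a graph that requires grad; that would explain why 1.1.0 worked and 1.2.1 does not.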
Environment
RF-DETR: 1.1.0 / 1.2.1
OS: Zorin OS 17.3 Core (based on Ubuntu 22.04 LTS)
Python: 3.10
PyTorch: 2.6.0
CPU: Intel Core i5-11400H
Minimal Reproducible Example
# Run this with rf-detr 1.1.0 and then with 1.2.1 to see the difference
import argparse
import os

import torch
from rfdetr import RFDETRLarge

# Argument parsing reconstructed from the attributes used below
parser = argparse.ArgumentParser()
parser.add_argument('--path')
parser.add_argument('--output')
parser.add_argument('--epochs', type=int)
parser.add_argument('--learning_rate', type=float)
parser.add_argument('--checkpoint_interval', type=int)
parser.add_argument('--resume')
args = parser.parse_args()

device = torch.device('cpu')
data_dir = os.path.join(args.path)
output_path = os.path.join(args.output)
model = RFDETRLarge(pretrained=True)
num_epochs = args.epochs
learn = args.learning_rate
cp_interval = args.checkpoint_interval
cp_resume = args.resume  # not defined in the original snippet; assumed to come from args

model.train(
    dataset_dir=data_dir,
    device='cpu',
    num_workers=1,
    epochs=num_epochs,
    batch_size=1,
    grad_accum_steps=16,
    lr=learn,
    output_dir=output_path,
    resolution=1008,
    weight_decay=1e-8,
    checkpoint_interval=cp_interval,
    resume=cp_resume,
    early_stopping=True,
    early_stopping_patience=15,
    num_queries=300,
    num_select=300,
    dec_layers=6,
)
Additional
No response
Are you willing to submit a PR?
- Yes, I'd like to help by submitting a PR!