RF-DETR 1.1.0 vs 1.2.1: same train parameters don't work after updating #301

@MateoPetitet

Description

Search before asking

  • I have searched the RF-DETR issues and found no similar bug report.

Bug

Hello!

I'm about to start a new training run for my custom model, but I run into a problem with the latest version (1.2.1) that I didn't have with the older 1.1.0. Here are my parameters; note that I'm running on CPU and system RAM for now, as I want to be sure everything works before renting a server:

model.train(dataset_dir=data_dir,
            device='cpu',
            num_workers=1,
            epochs=num_epochs, 
            batch_size=1, 
            grad_accum_steps=16,
            lr=learn, 
            output_dir=output_path, 
            resolution=1008, 
            weight_decay=1e-8, 
            checkpoint_interval=cp_interval, 
            resume=cp_resume,
            early_stopping=True,
            early_stopping_patience=15,
            num_queries=300, 
            num_select=300, 
            dec_layers=6
            )

I don't have any problem training with these parameters on the old version (I managed to start an epoch, even if it was very slow on CPU), but the latest version raises the following error: RuntimeError: Inference tensors cannot be saved for backward. To work around you can make a clone to get a normal tensor and use it in autograd.
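For what it's worth, from the PyTorch docs this error appears when a tensor created under torch.inference_mode() is later used in an operation that autograd needs to save for the backward pass. Here is a minimal sketch of just the PyTorch behavior (nothing RF-DETR-specific; the names are made up for illustration):

import torch

# 'w' is created under inference_mode, so it is an inference tensor
# that autograd is not allowed to save for backward.
with torch.inference_mode():
    w = torch.randn(3, 3)

x = torch.randn(3, requires_grad=True)
try:
    y = x @ w  # autograd tries to save 'w' for backward and fails here
    y.sum().backward()
except RuntimeError as e:
    print(e)  # "Inference tensors cannot be saved for backward. ..."

# The workaround the message suggests: clone to get a normal tensor.
y = x @ w.clone()
y.sum().backward()  # works

If that is what's happening, my guess is that somewhere in 1.2.1 a weight or buffer gets created or loaded under inference mode and is then reused during training, but I haven't been able to pinpoint where.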

I don't know where it comes from, and the full error output doesn't really help me figure out how to fix it. Any idea what's happening?

Thank you for reading!

Environment

RF-DETR : 1.1.0 / 1.2.1
OS : ZorinOS 17.3 Core (equivalent to Ubuntu 22.04 LTS)
Python : 3.10
PyTorch : 2.6.0
CPU : Intel Core i5-11400H

Minimal Reproducible Example

# run this with rf-detr 1.1.0, then with 1.2.1, to see the difference

import argparse
import os

import torch
from rfdetr import RFDETRLarge

# minimal argument parsing so the snippet is self-contained
parser = argparse.ArgumentParser()
parser.add_argument('--path')
parser.add_argument('--output')
parser.add_argument('--epochs', type=int)
parser.add_argument('--learning_rate', type=float)
parser.add_argument('--checkpoint_interval', type=int)
args = parser.parse_args()

device = torch.device('cpu')
data_dir = os.path.join(args.path)
output_path = os.path.join(args.output)
model = RFDETRLarge(pretrained=True)
num_epochs = args.epochs
learn = args.learning_rate
cp_interval = args.checkpoint_interval
cp_resume = None  # None = start fresh; set to a checkpoint path to resume
model.train(dataset_dir=data_dir,
            device='cpu',
            num_workers=1,
            epochs=num_epochs,
            batch_size=1,
            grad_accum_steps=16,
            lr=learn,
            output_dir=output_path,
            resolution=1008,
            weight_decay=1e-8,
            checkpoint_interval=cp_interval,
            resume=cp_resume,
            early_stopping=True,
            early_stopping_patience=15,
            num_queries=300,
            num_select=300,
            dec_layers=6
            )
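To compare the two versions, the script can be run unchanged in two fresh virtual environments, e.g. pip install rfdetr==1.1.0 in one and pip install rfdetr==1.2.1 in the other, against the same dataset.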

Additional

No response

Are you willing to submit a PR?

  • Yes, I'd like to help by submitting a PR!
