About the reproduction of the experiment #55
Description
Hello,
I have a question regarding experimental reproducibility. After uncommenting the random-seed section in train.py and training on a single GPU, I observed inconsistent training-loss trajectories and varying model performance across runs, possibly due to non-determinism in the DataLoader configuration:
dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True, num_workers=8)
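For context, this is the kind of full seeding I have been using to try to pin down the shuffle order and worker RNG state. It is only a sketch based on general PyTorch practice, not code from this repo; the helper names `seed_everything`, `seed_worker`, and `make_loader` are my own:

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_everything(seed: int = 42) -> None:
    # Seed every RNG a typical PyTorch training loop touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                    # seeds CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable kernel autotuning

def seed_worker(worker_id: int) -> None:
    # Derive a per-worker seed; without this, NumPy/random state inside
    # DataLoader workers is not reproducible across runs.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

def make_loader(dataset, batch_size: int, num_workers: int = 8, seed: int = 42):
    g = torch.Generator()
    g.manual_seed(seed)  # fixes the shuffle order across runs
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers, worker_init_fn=seed_worker,
                      generator=g)

# Sanity check: two loaders built with the same seed yield the same first batch.
seed_everything(42)
ds = TensorDataset(torch.arange(100).float().unsqueeze(1))
b1 = next(iter(make_loader(ds, 16, num_workers=0)))[0]
b2 = next(iter(make_loader(ds, 16, num_workers=0)))[0]
assert torch.equal(b1, b2)
```

Even with all of this set, I understand some CUDA ops remain non-deterministic, so I am unsure how much run-to-run variance is expected.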
I trained and tested the model on the Kvasir-SEG dataset for polyp segmentation; the highest test performance I achieved was mDice 0.925, slightly below the 0.928 reported in the paper.
To ensure fair comparison:
Should I conduct multiple training runs and report the average performance?
Alternatively, is selecting the best-performing run acceptable?
Would performing only a single training/evaluation run be considered valid?
Apologies if these questions are basic – I'm new to this area. I'd appreciate your guidance!