Description
Dear @Algolzw,
While training a Refusion (pixel-space) model, the validation PSNR remained stable (~45.5 dB) through 620 k iterations, but then collapsed: ~25 dB at 625 k, ~15 dB at 630 k, and finally ~7.5 dB at 640 k.
The training loss continued normally, with no explosion and no NaN values. The learning rate decayed smoothly (≈2e-6 → 1.3e-6).
No data or configuration changes were made during this period.
PSNR was logged with TensorBoard (see screenshot).
The sudden PSNR collapse appeared after 620 k iterations.
Expected behavior
PSNR should remain near 45 dB or improve gradually.
A drop of more than 30 dB without any visible loss explosion looks abnormal.
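For reference, the dB values above follow the standard PSNR definition; a minimal pure-Python sketch of the computation, assuming an 8-bit image range (MAX = 255):

```python
import math
from typing import Sequence

def psnr(gt: Sequence[float], pred: Sequence[float], max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two equal-length pixel sequences."""
    mse = sum((g - p) ** 2 for g, p in zip(gt, pred)) / len(gt)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)
```

The ~45.5 dB plateau corresponds to an RMS error of roughly 1.35 gray levels, while ~7.5 dB implies an RMS error of roughly 108 gray levels, i.e. output close to random noise.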
Training Configuration (from refusion.yml)
```yaml
#### general settings
name: refusion_
use_tb_logger: true
model: denoising
distortion: inpainting
gpu_ids: [1]

sde:
  max_sigma: 50
  T: 100
  schedule: cosine # linear, cosine
  eps: 0.005

degradation: # for some synthetic dataset that only have GTs
  # for denoising
  sigma: 25
  noise_type: G # Gaussian noise: G
  # for super-resolution
  scale: 4

#### datasets
datasets:
  train:
    optimizer: Lion # Adam, AdamW, Lion
    name: Train_Dataset
    mode: LQGT
    dataroot_GT: /_exp/pipeline_1/DA_CLIP/sync_sampled_15/train/uncompleted_p1/GT
    dataroot_LQ: /_exp/pipeline_1/DA_CLIP/sync_sampled_15/train/uncompleted_p1/LQ
    use_shuffle: true
    n_workers: 4 # per GPU
    batch_size: 2
    GT_size: 640
    LR_size: 640
    use_flip: true
    use_rot: true
    color: RGB
  val:
    name: Val_Dataset
    mode: LQGT
    dataroot_GT: /_exp/pipeline_1/DA_CLIP/sync_sampled_15/val/uncompleted_p1/GT
    dataroot_LQ: /_exp/pipeline_1/DA_CLIP/sync_sampled_15/val/uncompleted_p1/LQ

#### network structures
network_G:
  which_model_G: ConditionalNAFNet
  setting:
    width: 64
    enc_blk_nums: [1, 1, 1, 28]
    middle_blk_num: 1
    dec_blk_nums: [1, 1, 1, 1]

#### path
path:
  pretrain_model_G: ~
  strict_load: true
  resume_state: ~

#### training settings: learning rate scheme, loss
train:
  optimizer: Lion # Adam, AdamW, Lion
  lr_G: !!float 4e-5
  lr_scheme: TrueCosineAnnealingLR
  beta1: 0.9
  beta2: 0.99
  niter: 700000
  warmup_iter: -1 # no warm up
  lr_steps: [200000, 400000, 600000]
  lr_gamma: 0.5
  eta_min: !!float 1e-7

  # criterion
  is_weighted: False
  loss_type: l1
  weight: 1.0

  manual_seed: 0
  val_freq: !!float 5e3

#### logger
logger:
  print_freq: 100
  save_checkpoint_freq: !!float 5e3
```
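Since checkpoints are saved every 5 k iterations, one recovery path would be resuming from the last healthy weights (the 620 k checkpoint still validated at ~45 dB). A hedged sketch of the `path:` section, assuming this codebase's usual `{iter}_G.pth` / `{iter}.state` checkpoint naming and a hypothetical experiments directory:

```yaml
#### path
path:
  # Hypothetical paths; point these at your actual experiment folder.
  pretrain_model_G: ./experiments/refusion_/models/620000_G.pth
  strict_load: true
  resume_state: ./experiments/refusion_/training_state/620000.state # restores optimizer and scheduler
```

Resuming via `resume_state` replays the optimizer and scheduler state; loading only `pretrain_model_G` restarts them fresh, which may itself sidestep a corrupted Lion momentum buffer if that is the cause.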
Validation Logs (Excerpt)
```
25-10-23 17:23:03.598 - INFO: <epoch:229/293, iter: 550,000, psnr: 45.356515
25-10-23 18:29:21.778 - INFO: <epoch:231/293, iter: 555,000, psnr: 45.337205
25-10-23 19:35:32.266 - INFO: <epoch:233/293, iter: 560,000, psnr: 45.421531
25-10-23 20:59:12.874 - INFO: <epoch:235/293, iter: 565,000, psnr: 45.361757
25-10-23 22:06:53.939 - INFO: <epoch:237/293, iter: 570,000, psnr: 45.281748
25-10-23 23:13:37.318 - INFO: <epoch:239/293, iter: 575,000, psnr: 45.495289
25-10-24 00:19:58.177 - INFO: <epoch:241/293, iter: 580,000, psnr: 45.286052
25-10-24 01:26:12.903 - INFO: <epoch:243/293, iter: 585,000, psnr: 45.528597
25-10-24 02:32:25.522 - INFO: <epoch:245/293, iter: 590,000, psnr: 45.606946
25-10-24 03:38:52.197 - INFO: <epoch:247/293, iter: 595,000, psnr: 45.541850
25-10-24 04:45:21.845 - INFO: <epoch:249/293, iter: 600,000, psnr: 45.545629
25-10-24 05:51:46.698 - INFO: <epoch:252/293, iter: 605,000, psnr: 45.443754
25-10-24 06:58:18.580 - INFO: <epoch:254/293, iter: 610,000, psnr: 43.920094
25-10-24 08:04:49.358 - INFO: <epoch:256/293, iter: 615,000, psnr: 45.494661
25-10-24 09:17:24.039 - INFO: <epoch:258/293, iter: 620,000, psnr: 45.378881
25-10-24 10:46:11.008 - INFO: <epoch:260/293, iter: 625,000, psnr: 25.064299
25-10-24 11:58:14.105 - INFO: <epoch:262/293, iter: 630,000, psnr: 15.447938
25-10-24 13:04:24.516 - INFO: <epoch:264/293, iter: 635,000, psnr: 14.300622
25-10-24 14:10:52.891 - INFO: <epoch:266/293, iter: 640,000, psnr: 7.484680
```
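To pinpoint the first collapsed validation point programmatically, the log can be scanned; a small sketch, assuming the `iter: N, psnr: X` line format shown above:

```python
import re
from typing import Optional, Tuple

LINE_RE = re.compile(r"iter:\s*([\d,]+),\s*psnr:\s*([\d.]+)")

def find_collapse(log_text: str, drop_db: float = 5.0) -> Optional[Tuple[int, float]]:
    """Return (iteration, psnr) of the first validation point whose PSNR
    falls more than `drop_db` below the previous point, or None."""
    prev = None
    for m in LINE_RE.finditer(log_text):
        iteration = int(m.group(1).replace(",", ""))
        value = float(m.group(2))
        if prev is not None and prev - value > drop_db:
            return iteration, value
        prev = value
    return None
```

Run on the excerpt above with the default 5 dB threshold, this flags iteration 625 000 (~25.06 dB); the smaller 610 k dip (43.9 dB) stays under the threshold.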
When I ran the test.py script on the inpainting task, the generated images were almost pure noise; before the collapse, the inpainted images were very close to perfect. I have attached sample images.
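Because the loss itself showed no NaNs, the corruption may live in the saved generator weights or the optimizer state; in PyTorch one would `torch.load` the 640 k checkpoint and scan each tensor for NaN/Inf. A framework-free sketch of that scan over a state-dict-like mapping (the helper is hypothetical, not part of the codebase):

```python
import math
from typing import Dict, List

def scan_state_dict(state: Dict[str, List[float]]) -> List[str]:
    """Return the names of parameters that contain NaN or Inf values."""
    bad = []
    for name, values in state.items():
        if any(math.isnan(v) or math.isinf(v) for v in values):
            bad.append(name)
    return bad
```

If the scan of a late checkpoint comes back clean, the problem is more likely in the sampling/validation path than in the weights themselves.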
Attachments:
Test images (noisy)
refusion.yml
train_refusion_251024-141914.log
val_refusion_251022-143244.log
