
Issue: Sudden PSNR Drop after 600k Iterations during Refusion Training #145

@jay3chauhan-del

Description

Dear @Algolzw ,

While training a Refusion (pixel-space) model, the validation PSNR remained stable at ~45.5 dB up to 600k iterations, but then dropped suddenly: to ~31 dB at 610k, ~22 dB at 615k, and finally ~7 dB at 620k.
The training loss continued normally, with no explosion and no NaN values, and the learning rate decayed smoothly (≈2e-6 → 1.3e-6).
No data or configuration changes were made during this period.
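As a sanity check, the quoted learning-rate values are consistent with the TrueCosineAnnealingLR settings in the config below (lr_G=4e-5, eta_min=1e-7, niter=700000), assuming the standard PyTorch cosine-annealing formula; this is a quick verification sketch, not the repo's actual scheduler code:

```python
import math

def cosine_lr(iteration, lr_max=4e-5, eta_min=1e-7, t_max=700_000):
    """Cosine-annealing schedule in the standard PyTorch CosineAnnealingLR form."""
    return eta_min + 0.5 * (lr_max - eta_min) * (1 + math.cos(math.pi * iteration / t_max))

# Learning rate around the collapse region
for it in (600_000, 625_000, 640_000):
    print(f"iter {it}: lr = {cosine_lr(it):.3e}")
```

This yields roughly 2.1e-6 at 600k decaying to about 1.2e-6 by 625k, matching the reported ≈2e-6 → 1.3e-6 range, so the scheduler itself appears to be behaving as configured.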

PSNR was logged with TensorBoard (see screenshot); the sudden collapse begins just after the 600k iteration mark.

Expected behavior
PSNR should remain near 45 dB or improve gradually. A drop of more than 30 dB without any visible loss explosion looks abnormal.

Training Configuration (from refusion.yml)

#### general settings
name: refusion_
use_tb_logger: true
model: denoising
distortion: inpainting
gpu_ids: [1]

sde:
  max_sigma: 50
  T: 100
  schedule: cosine # linear, cosine
  eps: 0.005

degradation: # for some synthetic dataset that only have GTs
  # for denoising
  sigma: 25
  noise_type: G # Gaussian noise: G

  # for super-resolution
  scale: 4
  
#### datasets
datasets:
  train:
    optimizer: Lion # Adam, AdamW, Lion
    name: Train_Dataset
    mode: LQGT
    dataroot_GT: /_exp/pipeline_1/DA_CLIP/sync_sampled_15/train/uncompleted_p1/GT
    dataroot_LQ: /_exp/pipeline_1/DA_CLIP/sync_sampled_15/train/uncompleted_p1/LQ

    use_shuffle: true
    n_workers: 4  # per GPU
    batch_size: 2
    GT_size: 640
    LR_size: 640
    use_flip: true
    use_rot: true
    color: RGB
  val:
    name: Val_Dataset
    mode: LQGT
    dataroot_GT: /_exp/pipeline_1/DA_CLIP/sync_sampled_15/val/uncompleted_p1/GT
    dataroot_LQ: /_exp/pipeline_1/DA_CLIP/sync_sampled_15/val/uncompleted_p1/LQ


#### network structures
network_G:
  which_model_G: ConditionalNAFNet
  setting:
    width: 64
    enc_blk_nums: [1, 1, 1, 28]
    middle_blk_num: 1
    dec_blk_nums: [1, 1, 1, 1]

#### path
path:
  pretrain_model_G: ~
  strict_load: true
  resume_state: ~

#### training settings: learning rate scheme, loss
train:
  optimizer: Lion # Adam, AdamW, Lion
  lr_G: !!float 4e-5
  lr_scheme: TrueCosineAnnealingLR
  beta1: 0.9
  beta2: 0.99
  niter: 700000
  warmup_iter: -1  # no warm up
  lr_steps: [200000, 400000, 600000]
  lr_gamma: 0.5
  eta_min: !!float 1e-7

  # criterion
  is_weighted: False
  loss_type: l1
  weight: 1.0

  manual_seed: 0
  val_freq: !!float 5e3

#### logger
logger:
  print_freq: 100
  save_checkpoint_freq: !!float 5e3
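Since save_checkpoint_freq is 5e3, a training state from just before the collapse (e.g. the 600k checkpoint) should exist. If useful, training could be resumed from it by pointing resume_state at the saved state file; the path below is hypothetical and depends on where the framework actually writes training states:

```yaml
#### path (hypothetical resume example)
path:
  pretrain_model_G: ~
  strict_load: true
  # assumed location; adjust to wherever the experiment's training states are saved
  resume_state: ./experiments/refusion_/training_state/600000.state
```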

Validation Logs (Excerpt)

25-10-23 17:23:03.598 - INFO: <epoch:229/293, iter: 550,000, psnr: 45.356515
25-10-23 18:29:21.778 - INFO: <epoch:231/293, iter: 555,000, psnr: 45.337205
25-10-23 19:35:32.266 - INFO: <epoch:233/293, iter: 560,000, psnr: 45.421531
25-10-23 20:59:12.874 - INFO: <epoch:235/293, iter: 565,000, psnr: 45.361757
25-10-23 22:06:53.939 - INFO: <epoch:237/293, iter: 570,000, psnr: 45.281748
25-10-23 23:13:37.318 - INFO: <epoch:239/293, iter: 575,000, psnr: 45.495289
25-10-24 00:19:58.177 - INFO: <epoch:241/293, iter: 580,000, psnr: 45.286052
25-10-24 01:26:12.903 - INFO: <epoch:243/293, iter: 585,000, psnr: 45.528597
25-10-24 02:32:25.522 - INFO: <epoch:245/293, iter: 590,000, psnr: 45.606946
25-10-24 03:38:52.197 - INFO: <epoch:247/293, iter: 595,000, psnr: 45.541850
25-10-24 04:45:21.845 - INFO: <epoch:249/293, iter: 600,000, psnr: 45.545629
25-10-24 05:51:46.698 - INFO: <epoch:252/293, iter: 605,000, psnr: 45.443754
25-10-24 06:58:18.580 - INFO: <epoch:254/293, iter: 610,000, psnr: 43.920094
25-10-24 08:04:49.358 - INFO: <epoch:256/293, iter: 615,000, psnr: 45.494661
25-10-24 09:17:24.039 - INFO: <epoch:258/293, iter: 620,000, psnr: 45.378881
25-10-24 10:46:11.008 - INFO: <epoch:260/293, iter: 625,000, psnr: 25.064299
25-10-24 11:58:14.105 - INFO: <epoch:262/293, iter: 630,000, psnr: 15.447938
25-10-24 13:04:24.516 - INFO: <epoch:264/293, iter: 635,000, psnr: 14.300622
25-10-24 14:10:52.891 - INFO: <epoch:266/293, iter: 640,000, psnr: 7.484680

When I ran the test.py script on the inpainting task, the generated images were almost pure noise, whereas before 600k iterations the generated (inpainted) images were very close to perfect. I have attached these images as samples.
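To put the logged numbers in perspective, PSNR maps directly to per-pixel RMSE for images normalized to [0, 1]; this is the standard PSNR definition, not code from the repo:

```python
import math

def rmse_from_psnr(psnr_db, max_val=1.0):
    """Invert PSNR = 10 * log10(max_val^2 / MSE) to recover per-pixel RMSE."""
    mse = max_val ** 2 / (10 ** (psnr_db / 10))
    return math.sqrt(mse)

print(f"45 dB -> RMSE ~ {rmse_from_psnr(45):.4f}")  # roughly 1.4/255 per pixel
print(f" 7 dB -> RMSE ~ {rmse_from_psnr(7):.4f}")   # roughly 114/255 per pixel
```

An RMSE near 0.45 on a [0, 1] scale means the output is essentially uncorrelated with the ground truth, which is consistent with the "almost noise" test images.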


Attachments:
Test images (noisy)
refusion.yml
train_refusion_251024-141914.log
val_refusion_251022-143244.log

