
Issue: Sudden PSNR Drop after 600k Iterations during Refusion Training #145

@jay3chauhan-del

Description

Dear @Algolzw ,

While training a Refusion (pixel-space) model, the validation PSNR remained stable at ~45.5 dB up to 600k iterations, but then dropped suddenly: to ~31 dB at 610k, ~22 dB at 615k, and finally ~7 dB at 620k.
The training loss continued normally, with no explosion and no NaN values, and the learning rate decayed smoothly (≈2e-6 → 1.3e-6).
No data or configuration changes were made during this period.
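As a sanity check, the quoted learning-rate values are consistent with the TrueCosineAnnealingLR settings in the config below (lr_G=4e-5, eta_min=1e-7, niter=700000), assuming the standard PyTorch cosine-annealing formula; this is a quick verification sketch, not the repo's actual scheduler code:

```python
import math

def cosine_lr(iteration, lr_max=4e-5, eta_min=1e-7, t_max=700_000):
    """Cosine-annealing schedule in the standard PyTorch CosineAnnealingLR form."""
    return eta_min + 0.5 * (lr_max - eta_min) * (1 + math.cos(math.pi * iteration / t_max))

# Learning rate around the collapse region
for it in (600_000, 625_000, 640_000):
    print(f"iter {it}: lr = {cosine_lr(it):.3e}")
```

This yields roughly 2.1e-6 at 600k decaying to about 1.2e-6 by 625k, matching the reported ≈2e-6 → 1.3e-6 range, so the scheduler itself appears to be behaving as configured.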

PSNR was logged with TensorBoard (see screenshot); the sudden collapse begins just after the 600k iteration mark.

Expected behavior
PSNR should remain near 45 dB or improve gradually. A drop of more than 30 dB without any visible loss explosion looks abnormal.

Training Configuration (from refusion.yml)

#### general settings
name: refusion_
use_tb_logger: true
model: denoising
distortion: inpainting
gpu_ids: [1]

sde:
  max_sigma: 50
  T: 100
  schedule: cosine # linear, cosine
  eps: 0.005

degradation: # for some synthetic dataset that only have GTs
  # for denoising
  sigma: 25
  noise_type: G # Gaussian noise: G

  # for super-resolution
  scale: 4
  
#### datasets
datasets:
  train:
    optimizer: Lion # Adam, AdamW, Lion
    name: Train_Dataset
    mode: LQGT
    dataroot_GT: /_exp/pipeline_1/DA_CLIP/sync_sampled_15/train/uncompleted_p1/GT
    dataroot_LQ: /_exp/pipeline_1/DA_CLIP/sync_sampled_15/train/uncompleted_p1/LQ

    use_shuffle: true
    n_workers: 4  # per GPU
    batch_size: 2
    GT_size: 640
    LR_size: 640
    use_flip: true
    use_rot: true
    color: RGB
  val:
    name: Val_Dataset
    mode: LQGT
    dataroot_GT: /_exp/pipeline_1/DA_CLIP/sync_sampled_15/val/uncompleted_p1/GT
    dataroot_LQ: /_exp/pipeline_1/DA_CLIP/sync_sampled_15/val/uncompleted_p1/LQ


#### network structures
network_G:
  which_model_G: ConditionalNAFNet
  setting:
    width: 64
    enc_blk_nums: [1, 1, 1, 28]
    middle_blk_num: 1
    dec_blk_nums: [1, 1, 1, 1]

#### path
path:
  pretrain_model_G: ~
  strict_load: true
  resume_state: ~

#### training settings: learning rate scheme, loss
train:
  optimizer: Lion # Adam, AdamW, Lion
  lr_G: !!float 4e-5
  lr_scheme: TrueCosineAnnealingLR
  beta1: 0.9
  beta2: 0.99
  niter: 700000
  warmup_iter: -1  # no warm up
  lr_steps: [200000, 400000, 600000]
  lr_gamma: 0.5
  eta_min: !!float 1e-7

  # criterion
  is_weighted: False
  loss_type: l1
  weight: 1.0

  manual_seed: 0
  val_freq: !!float 5e3

#### logger
logger:
  print_freq: 100
  save_checkpoint_freq: !!float 5e3
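Since save_checkpoint_freq is 5e3, a training state from just before the collapse (e.g. the 600k checkpoint) should exist. If useful, training could be resumed from it by pointing resume_state at the saved state file; the path below is hypothetical and depends on where the framework actually writes training states:

```yaml
#### path (hypothetical resume example)
path:
  pretrain_model_G: ~
  strict_load: true
  # assumed location; adjust to wherever the experiment's training states are saved
  resume_state: ./experiments/refusion_/training_state/600000.state
```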

Validation Logs (Excerpt)

25-10-23 17:23:03.598 - INFO: <epoch:229/293, iter: 550,000, psnr: 45.356515
25-10-23 18:29:21.778 - INFO: <epoch:231/293, iter: 555,000, psnr: 45.337205
25-10-23 19:35:32.266 - INFO: <epoch:233/293, iter: 560,000, psnr: 45.421531
25-10-23 20:59:12.874 - INFO: <epoch:235/293, iter: 565,000, psnr: 45.361757
25-10-23 22:06:53.939 - INFO: <epoch:237/293, iter: 570,000, psnr: 45.281748
25-10-23 23:13:37.318 - INFO: <epoch:239/293, iter: 575,000, psnr: 45.495289
25-10-24 00:19:58.177 - INFO: <epoch:241/293, iter: 580,000, psnr: 45.286052
25-10-24 01:26:12.903 - INFO: <epoch:243/293, iter: 585,000, psnr: 45.528597
25-10-24 02:32:25.522 - INFO: <epoch:245/293, iter: 590,000, psnr: 45.606946
25-10-24 03:38:52.197 - INFO: <epoch:247/293, iter: 595,000, psnr: 45.541850
25-10-24 04:45:21.845 - INFO: <epoch:249/293, iter: 600,000, psnr: 45.545629
25-10-24 05:51:46.698 - INFO: <epoch:252/293, iter: 605,000, psnr: 45.443754
25-10-24 06:58:18.580 - INFO: <epoch:254/293, iter: 610,000, psnr: 43.920094
25-10-24 08:04:49.358 - INFO: <epoch:256/293, iter: 615,000, psnr: 45.494661
25-10-24 09:17:24.039 - INFO: <epoch:258/293, iter: 620,000, psnr: 45.378881
25-10-24 10:46:11.008 - INFO: <epoch:260/293, iter: 625,000, psnr: 25.064299
25-10-24 11:58:14.105 - INFO: <epoch:262/293, iter: 630,000, psnr: 15.447938
25-10-24 13:04:24.516 - INFO: <epoch:264/293, iter: 635,000, psnr: 14.300622
25-10-24 14:10:52.891 - INFO: <epoch:266/293, iter: 640,000, psnr: 7.484680

When I ran the test.py script on the inpainting task, the generated images were almost pure noise, whereas before 600k iterations the generated (inpainted) images were very close to perfect. I have attached these images as samples.
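To put the logged numbers in perspective, PSNR maps directly to per-pixel RMSE for images normalized to [0, 1]; this is the standard PSNR definition, not code from the repo:

```python
import math

def rmse_from_psnr(psnr_db, max_val=1.0):
    """Invert PSNR = 10 * log10(max_val^2 / MSE) to recover per-pixel RMSE."""
    mse = max_val ** 2 / (10 ** (psnr_db / 10))
    return math.sqrt(mse)

print(f"45 dB -> RMSE ~ {rmse_from_psnr(45):.4f}")  # roughly 1.4/255 per pixel
print(f" 7 dB -> RMSE ~ {rmse_from_psnr(7):.4f}")   # roughly 114/255 per pixel
```

An RMSE near 0.45 on a [0, 1] scale means the output is essentially uncorrelated with the ground truth, which is consistent with the "almost noise" test images.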


Attachments:
Test images (noisy)
refusion.yml
train_refusion_251024-141914.log
val_refusion_251022-143244.log

