Loss increases sharply after the first epoch #287

My training loss surges after the first epoch. I tried adjusting `lr` and `momentum_teacher`, but the surge always appears, although the iteration at which it occurs differs between runs. My dataset contains 40 million images.

torchrun $DISTRIBUTED_ARGS main_dino.py \
    --nproc_per_node $NPUS_PER_NODE \
    --node_rank $NODE_RANK \
    --arch vit_large \
    --patch_size 16 \
    --out_dim 65536 \
    --norm_last_layer True \
    --momentum_teacher 0.994 \
    --use_bn_in_head False \
    --warmup_teacher_temp 0.04 \
    --teacher_temp 0.04 \
    --warmup_teacher_temp_epochs 6 \
    --use_fp16 True \
    --weight_decay 0.04 \
    --weight_decay_end 0.4 \
    --clip_grad 3.0 \
    --batch_size_per_gpu 64 \
    --epochs 300 \
    --freeze_last_layer 1 \
    --lr 0.00001 \
    --warmup_epochs 5 \
    --min_lr 1e-6 \
    --optimizer adamw \
    --drop_path_rate 0.1 \
    --global_crops_scale 0.4 1.0 \
    --local_crops_number 8 \
    --local_crops_scale 0.05 0.4 \
    --restart_strict \
    --data_path ${DATA_DIR} \
    --output_dir ${CKPT_DIR} \
    --saveckp_freq 1 \
    --seed 10086 \
    --num_workers 16 \
    --dist_url env://
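
For anyone reading the flags above, the following is a minimal sketch of which per-epoch settings change at the first epoch boundary. It assumes the script keeps the behavior of the reference DINO `main_dino.py` (last-layer gradients dropped while `epoch < freeze_last_layer`, teacher temperature warmed up linearly per epoch, `lr` scaled by total batch size / 256). The `--restart_strict` flag suggests a modified script, so these rules may not hold here. The helper names and `WORLD_SIZE = 8` are hypothetical; the issue does not state the number of devices.

```python
# Sketch only: mirrors how the reference DINO main_dino.py is understood to
# treat these flags. A modified training script may behave differently.

def last_layer_frozen(epoch: int, freeze_last_layer: int) -> bool:
    """True while the head's last layer gets no gradient updates (0-based epoch)."""
    return epoch < freeze_last_layer

def teacher_temp(epoch: int, warmup_temp: float, final_temp: float,
                 warmup_epochs: int) -> float:
    """Linear per-epoch warmup from warmup_temp to final_temp, then constant."""
    if epoch >= warmup_epochs:
        return final_temp
    if warmup_epochs == 1:
        return warmup_temp
    step = (final_temp - warmup_temp) / (warmup_epochs - 1)
    return warmup_temp + step * epoch

def scaled_peak_lr(lr: float, batch_size_per_gpu: int, world_size: int) -> float:
    """Linear scaling rule: lr * total_batch_size / 256."""
    return lr * batch_size_per_gpu * world_size / 256.0

def summarize(num_epochs: int) -> list:
    """Collect the per-epoch settings implied by the flags in this issue."""
    rows = []
    for epoch in range(num_epochs):
        rows.append({
            "epoch": epoch,
            "last_layer_frozen": last_layer_frozen(epoch, freeze_last_layer=1),
            "teacher_temp": teacher_temp(epoch, 0.04, 0.04, 6),
        })
    return rows

WORLD_SIZE = 8  # hypothetical: the device count is not given in the issue

for row in summarize(3):
    print(row)
print("peak lr after warmup:", scaled_peak_lr(1e-5, 64, WORLD_SIZE))
```

Under these assumptions, `--freeze_last_layer 1` means the last layer starts receiving updates exactly when the second epoch begins, so it may be worth checking whether the surge lines up with that boundary. The teacher temperature stays at 0.04 throughout, because the warmup start and end values are equal.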

![Image](https://github.com/user-attachments/assets/52d53b4e-727f-4e8c-be0d-f77ed847b9c8)


After changing `lr` to 0.0005 and `momentum_teacher` to 0.997, with the dataset reduced to 40M × 0.125 = 5M images:
![Image](https://github.com/user-attachments/assets/70cd4650-cc1d-4a8a-8668-d5f5e1b4158a)

attachments/assets/d4bfddbc-9079-4cea-b8ce-98f32c8a1e65)
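
Since the surge appears at a different iteration in each run, it can help to convert an iteration index from the plots into an (epoch, step) pair. The sketch below assumes the data loader follows the reference DINO setup (a `DistributedSampler` that pads each rank to `ceil(N / world_size)` samples, and `drop_last=True`). The function names, `WORLD_SIZE = 8`, and the iteration value 10000 are hypothetical and not taken from the runs above.

```python
import math

def iters_per_epoch(num_images: int, batch_size_per_gpu: int, world_size: int) -> int:
    """Iterations per epoch with a padded distributed sampler and drop_last=True."""
    per_rank = math.ceil(num_images / world_size)  # samples seen by each rank
    return per_rank // batch_size_per_gpu          # incomplete last batch dropped

def locate(global_iter: int, ipe: int) -> tuple:
    """Map a 0-based global iteration to (epoch, step within that epoch)."""
    return divmod(global_iter, ipe)

WORLD_SIZE = 8  # hypothetical: the device count is not given in the issue

full = iters_per_epoch(40_000_000, 64, WORLD_SIZE)
subset = iters_per_epoch(5_000_000, 64, WORLD_SIZE)
print("iterations per epoch (40M images):", full)
print("iterations per epoch (5M images):", subset)
print("iteration 10000 on the 5M subset ->", locate(10_000, subset))
```

If the surge consistently starts within a few hundred steps of an epoch boundary, that points toward something that changes per epoch rather than toward a particular batch of data.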
