2 changes: 0 additions & 2 deletions train/t0/tr11f-6B3-ml-t0.slurm
@@ -69,7 +69,6 @@ SEQ_LEN=2048
SAVE_INTERVAL=500

TRAIN_SAMPLES=6_400_000 # 13e9 / 2048
LR_WARMUP_SAMPLES=640_000 # 10% - TODO: T0 paper says nothing about warmup

# T0 paper:
# "...we use a learning rate of 1e-3..."
@@ -80,7 +79,6 @@ OPTIMIZER_ARGS=" \
--adam-eps 1e-8 \
--lr 1e-3 \
--lr-decay-style constant \
--lr-warmup-samples $LR_WARMUP_SAMPLES \

We used Adafactor ... so technically I don't know which parameters matter (we typically used a decay argument, and I don't know how that translates to the Adam optimizer).

--clip-grad 1.0 \
--weight-decay 1e-1 \
"
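For reference, the arithmetic behind `TRAIN_SAMPLES` and the effect of the warmup flag can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `tokens_to_samples` and `lr_at` are hypothetical helper names, the linear ramp is an assumed warmup shape, and none of this is the training framework's actual scheduler code.

```python
# Illustrative sketch only: helper names are hypothetical, and the linear
# warmup shape is an assumption, not the framework's scheduler.

SEQ_LEN = 2048
TRAIN_TOKENS = 13e9  # token budget from the script comment "13e9 / 2048"


def tokens_to_samples(tokens, seq_len):
    """Number of whole sequences of length seq_len that fit in a token budget."""
    return int(tokens // seq_len)


def lr_at(consumed_samples, peak_lr=1e-3, warmup_samples=0):
    """Constant learning rate with an optional linear warmup counted in samples."""
    if warmup_samples > 0 and consumed_samples < warmup_samples:
        return peak_lr * consumed_samples / warmup_samples
    return peak_lr


# Exact quotient; the script rounds this up to TRAIN_SAMPLES=6_400_000.
exact_samples = tokens_to_samples(TRAIN_TOKENS, SEQ_LEN)

# With the removed 640_000-sample warmup, the LR would start at zero and ramp up.
lr_start_with_warmup = lr_at(0, warmup_samples=640_000)

# Without warmup (this change), the LR is 1e-3 from the first sample.
lr_start_without_warmup = lr_at(0)
```

With `--lr-decay-style constant` and no warmup, the schedule in this sketch reduces to a flat `1e-3` for the whole run.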