Support finetune arg -opt SGD (or sgd). Llama 3.2-1B-F32 result:
observed 11 GB GPU RAM (45 sec/epoch) using SGD instead of
19 GB (55 sec/epoch) using AdamW.
(Getting the right learning rate for SGD is trickier than for AdamW -
too high and you overshoot and oscillate, too low and you waste compute
approaching convergence slowly.)
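The memory difference comes from optimizer state: AdamW keeps first- and second-moment tensors (m, v) the size of the parameters, while plain SGD with decoupled weight decay touches only the parameters and their gradients. A minimal sketch of the SGD step - illustrative names only, not the actual ggml kernels:

```c
#include <stddef.h>

// Illustrative only - not the actual ggml update kernel.
// SGD with decoupled weight decay needs no per-parameter state beyond the gradient.
static void sgd_step(float *x, const float *g, size_t n, float alpha, float wd) {
    const float keep = 1.0f - alpha * wd;   // decay factor; can be cached per epoch
    for (size_t i = 0; i < n; ++i) {
        x[i] = x[i] * keep - alpha * g[i];
    }
}
```

Not allocating the parameter-sized m and v tensors is what accounts for most of the 19 GB vs 11 GB gap reported above.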
Quickly reaches 99%+ train accuracy on a tiny Wikipedia training set
(~58% token accuracy on held-out eval, which is reasonable).
Note: objective loss is not directly comparable between AdamW and SGD -
check perplexity or accuracy instead, or consider relative improvement
toward convergence (see the perplexity sketch after the sample output below).
train: ... loss=0.00231±0.00032 acc=99.99±0.01% t=00:00:05
val: ... loss=3.91926±nan acc=58.40±2.18%
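One way to compare runs across optimizers, as noted above, is perplexity rather than raw loss. Assuming the reported loss is mean per-token cross-entropy in nats, the conversion is direct:

```c
#include <math.h>

// Perplexity from mean per-token cross-entropy loss (assumed to be in nats).
// e.g. the val loss of ~3.92 above would correspond to a perplexity of
// roughly exp(3.92) ~ 50, under that assumption.
static double perplexity_from_mean_loss(double mean_ce_loss) {
    return exp(mean_ce_loss);
}
```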
On more training data (500 lines), there is additional catastrophic forgetting before train
accuracy reaches 99.9%:
train: data=0000140/0000140 loss=0.02752±0.00094 acc=99.78±0.02% t=00:00:45
val: data=0000008/0000008 loss=4.16029±0.23384 acc=46.61±0.78%
New finetune args: -wd 1e-9 to enable weight decay in SGD or AdamW,
and -epochs N to set the max number of epochs (default 2, as before).
Cache (1 - wd*alpha) in the 'adamw' opt struct.
Cache the computed per-epoch optimizer opts
(formerly they were computed twice per epoch).
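A rough sketch of the caching idea; the struct and field names here are assumptions for illustration, not ggml's actual types. The point is that (1 - wd*alpha) depends only on the epoch's learning rate and weight decay, so it can be computed once when the epoch's optimizer options are fetched rather than inside every tensor update:

```c
// Illustrative sketch; struct/field names are assumptions, not ggml's actual API.
struct opt_epoch_params {
    float alpha;   // learning rate for this epoch
    float wd;      // weight decay
    float keep;    // cached (1 - wd * alpha), reused by every tensor update
};

static struct opt_epoch_params make_epoch_params(float alpha, float wd) {
    struct opt_epoch_params p = { alpha, wd, 1.0f - wd * alpha };
    return p;
}
```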
Add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating the
m, v tensors. Make ggml_opt_init aware of the optimization method.
Since optimizer memory is pre-allocated, the ggml_opt_get_optimizer_params
callback could probably switch between SGD and AdamW from one epoch to the next,
but it would need to use AdamW for the first epoch (unconfirmed - no arg
to set such a policy yet).
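A hedged sketch of what such a per-epoch policy could look like if an arg were ever added for it; the callback shape and names are hypothetical, not a confirmed ggml-opt interface, and whether the pre-allocated AdamW state can simply sit idle while SGD runs is the unconfirmed part noted above:

```c
// Hypothetical policy sketch - no such arg exists yet, and these names are
// illustrative rather than the actual ggml-opt interface.
enum opt_method { OPT_ADAMW, OPT_SGD };

struct epoch_policy_ctx {
    int epoch;   // incremented by the training loop each epoch
};

static enum opt_method choose_optimizer(void * userdata) {
    const struct epoch_policy_ctx * ctx = userdata;
    // AdamW for the first epoch, then switch to SGD for the remaining epochs.
    return ctx->epoch == 0 ? OPT_ADAMW : OPT_SGD;
}
```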