support new finetune arg -opt SGD (or sgd). Llama 3.2-1B-F32 results:
observed ~11 GB of GPU RAM when using SGD instead of ~20 GB with AdamW
easily/quickly reaches 99%+ train accuracy on a tiny Wikipedia training set
(~56% token accuracy on held-out eval, which is reasonable)
note: the objective loss is not directly comparable between AdamW and SGD;
check perplexity or accuracy, or consider relative improvements, to judge
convergence
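
for reference, a minimal sketch of how perplexity relates to the loss
(the function name is illustrative, not an existing helper):

```c
#include <math.h>

// perplexity is exp of the mean per-token negative log-likelihood (cross-entropy
// in nats); unlike the raw training objective, it is comparable across optimizers.
static double perplexity_from_mean_nll(double mean_nll) {
    return exp(mean_nll);
}
```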
new finetune args: -wd 1e-5 enables weight decay for SGD or AdamW, and
-epochs N sets the maximum number of epochs (default 2, as before)
cache (1 - wd*alpha) in the 'adamw' opt struct
cache the computed per-epoch optimizer opts
(formerly they were computed twice per epoch)
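
a minimal sketch of the cached factor, assuming decoupled weight decay where
each step first scales the parameter by (1 - wd*alpha); the struct and
function names are illustrative, not the actual ggml types:

```c
// decoupled weight decay: each step scales the parameter by keep = 1 - wd*alpha
// before applying the gradient-based update. computing keep once when the
// per-epoch optimizer params are fetched avoids redoing it for every step.
struct opt_pars_sketch {
    float alpha; // learning rate
    float wd;    // weight decay coefficient (0 disables it)
    float keep;  // cached 1 - wd*alpha
};

static void opt_pars_sketch_set(struct opt_pars_sketch * p, float alpha, float wd) {
    p->alpha = alpha;
    p->wd    = wd;
    p->keep  = 1.0f - wd*alpha;
}

// per-parameter update shared by SGD and the weight-decay part of AdamW:
//   x = keep*x - alpha*step_direction
static inline float opt_apply_update(const struct opt_pars_sketch * p, float x, float step_direction) {
    return p->keep*x - p->alpha*step_direction;
}
```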
add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating the
m, v moment tensors. make ggml_opt_init aware of the optimization method
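
a plain-C sketch (not the actual ggml kernels) of why SGD needs no extra
optimizer tensors, while AdamW keeps first/second moments m, v per parameter:

```c
#include <math.h>
#include <stdint.h>

// AdamW step for one tensor: needs persistent moment buffers m, v of the same
// size as the parameters, which is where the extra GPU RAM goes.
static void adamw_step(float * x, const float * g, float * m, float * v, int64_t n,
                       float alpha, float beta1, float beta2, float eps, float keep) {
    for (int64_t i = 0; i < n; ++i) {
        m[i] = beta1*m[i] + (1.0f - beta1)*g[i];
        v[i] = beta2*v[i] + (1.0f - beta2)*g[i]*g[i];
        // bias correction omitted for brevity
        x[i] = keep*x[i] - alpha*m[i]/(sqrtf(v[i]) + eps);
    }
}

// SGD step with decoupled weight decay: no extra state tensors at all.
static void sgd_step(float * x, const float * g, int64_t n, float alpha, float keep) {
    for (int64_t i = 0; i < n; ++i) {
        x[i] = keep*x[i] - alpha*g[i];
    }
}
```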
since optimizer memory is pre-allocated, the ggml_opt_get_optimizer_params
callback would probably be able to switch between SGD and AdamW with each
epoch, but it would need to use AdamW for the first epoch (unconfirmed -
there is no arg to set such a policy yet)
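
a hedged sketch of what such a per-epoch policy could look like as an
optimizer-params callback; every enum/struct/field name below is hypothetical,
not the real ggml-opt declarations:

```c
// hypothetical callback in the shape of ggml's optimizer-params callback:
// it receives a userdata pointer and returns the params for the next epoch.
enum opt_kind_sketch { OPT_SKETCH_ADAMW, OPT_SKETCH_SGD };

struct opt_epoch_state {
    int epoch; // incremented by the training loop after each epoch
};

struct opt_params_sketch {
    enum opt_kind_sketch kind;
    float alpha; // learning rate
    float wd;    // weight decay
};

static struct opt_params_sketch get_opt_params_per_epoch(void * userdata) {
    const struct opt_epoch_state * st = (const struct opt_epoch_state *) userdata;

    // epoch 0: AdamW, so the pre-allocated m, v tensors are initialized/used once
    struct opt_params_sketch p = { OPT_SKETCH_ADAMW, 1e-4f, 1e-5f };
    if (st->epoch > 0) {
        // later epochs: switch to SGD; the pre-allocated optimizer memory is
        // simply left unused, so no reallocation is needed
        p.kind  = OPT_SKETCH_SGD;
        p.alpha = 1e-3f;
    }
    return p;
}
```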