Support a new finetune arg -opt SGD (or sgd). Result with llama 3.2-1b-F32:
observed 11 GB GPU RAM (45 sec/epoch) using SGD instead of
19 GB (55 sec/epoch) using AdamW.
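(Rough consistency check, not a measurement: AdamW keeps two extra F32 moment tensors, m and v, per trained parameter, i.e. about 8 bytes/param, which for a ~1.2B-parameter model is on the order of 9-10 GB of optimizer state - the same ballpark as the ~8 GB difference observed above.)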
(Getting the right learning rate for SGD is trickier than for AdamW:
too high and you overshoot and oscillate; too low and you waste compute
slowly approaching convergence.)
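For intuition, a minimal generic sketch of the two update rules (illustrative only, not the ggml implementation): SGD's step is alpha times the raw gradient, so alpha has to be tuned to the gradient scale, while AdamW normalizes by a running estimate of the gradient magnitude, which makes its effective step size much less sensitive to alpha.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain SGD: the step is alpha * g, so alpha must match the gradient scale.
void sgd_step(std::vector<float> & w, const std::vector<float> & g, float alpha) {
    for (size_t i = 0; i < w.size(); ++i) {
        w[i] -= alpha * g[i];
    }
}

// AdamW (weight decay omitted): the gradient is normalized by sqrt(v), an
// estimate of its own magnitude, so alpha acts more like an absolute step
// size and is easier to choose. m and v are the extra per-parameter state
// that SGD avoids allocating.
void adamw_step(std::vector<float> & w, const std::vector<float> & g,
                std::vector<float> & m, std::vector<float> & v,
                float alpha, float beta1, float beta2, float eps, int64_t t) {
    for (size_t i = 0; i < w.size(); ++i) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];
        v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];
        const float mhat = m[i] / (1.0f - std::pow(beta1, (float) t));
        const float vhat = v[i] / (1.0f - std::pow(beta2, (float) t));
        w[i] -= alpha * mhat / (std::sqrt(vhat) + eps);
    }
}
```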
Both SGD and AdamW quickly reach 99%+ train accuracy.
Note: the objective loss may not be directly comparable between AdamW and SGD -
check perplexity or accuracy, or compare relative improvements,
when judging convergence.
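(Assuming the reported loss is mean per-token cross-entropy, perplexity is just exp(loss), e.g. exp(0.00231) ≈ 1.002 for the train run below, and that carries the same meaning under either optimizer.)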
Also note that a logical batch size larger than the physical batch (i.e. gradient
accumulation) seems unsupported in the optimization path: the logical batch is
limited to the physical one, unlike in perplexity (ppx), and is also limited to ctx-size.
Training quality/convergence could be improved
by implementing it, at the cost of some memory, but you can make that up
by using a much smaller physical batch for a net memory savings.
Presumably it's the physical batch that should be limited to ctx-size?
See llama_context::opt_epoch. A generic sketch of the idea follows.
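For reference, a generic sketch of gradient accumulation (hypothetical code, not the llama.cpp/ggml API): gradients from several small physical batches are averaged before a single optimizer step, so the optimizer sees a logical batch that many times larger while peak activation memory stays that of one physical batch; the extra cost is one gradient accumulation buffer.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch: one logical batch = several physical (micro) batches.
// compute_grad(b) stands in for a forward+backward pass over physical batch b
// and returns per-parameter gradients.
void sgd_step_with_accumulation(
        std::vector<float> & params,
        const std::vector<int> & micro_batches, // physical batches making up one logical batch
        const std::function<std::vector<float>(int)> & compute_grad,
        float alpha) {
    std::vector<float> grad_accum(params.size(), 0.0f);
    for (int b : micro_batches) {
        const std::vector<float> g = compute_grad(b); // activations for batch b can be freed afterwards
        for (size_t i = 0; i < g.size(); ++i) {
            grad_accum[i] += g[i] / (float) micro_batches.size(); // average over the logical batch
        }
    }
    // a single optimizer update per logical batch (plain SGD here)
    for (size_t i = 0; i < params.size(); ++i) {
        params[i] -= alpha * grad_accum[i];
    }
}
```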
New finetune args: -wd 1e-9 enables weight decay in SGD or AdamW,
and -epochs N sets the maximum number of epochs (default 2, as before).
Cache (1 - wd*alpha) in the 'adamw' opt struct -
no noticeable perf benefit.
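(For context, a generic sketch of that factor with decoupled weight decay, not the ggml code: hoisting the multiplier out of the element loop still leaves the same per-element arithmetic, which is consistent with there being no measurable win.)

```cpp
#include <cstddef>

// Decoupled weight decay folded into a single "keep" factor (illustrative sketch).
void sgd_wd_step(float * w, const float * g, size_t n, float alpha, float wd) {
    const float keep = 1.0f - alpha * wd;  // computed once per step instead of per element
    for (size_t i = 0; i < n; ++i) {
        w[i] = keep * w[i] - alpha * g[i]; // per-element cost is unchanged by the caching
    }
}
```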
Cache the computed per-epoch optimizer opts
(formerly they were computed twice per epoch).
Add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml - it avoids allocating the
m, v moment tensors. Make ggml_opt_init aware of the optimization method.
Since optimizer memory is pre-allocated, the ggml_opt_get_optimizer_params
callback could probably switch between SGD and AdamW from one epoch to the next,
but it would need to use AdamW for the first epoch (unconfirmed - there is no arg
to set such a policy yet).
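Conceptually, a hypothetical per-epoch policy might look like the sketch below (illustrative only - this is not the real ggml_opt_get_optimizer_params signature or parameter struct):

```cpp
// Hypothetical per-epoch hyperparameter selection; the real callback in
// ggml-opt has its own struct and signature.
enum class Optimizer { ADAMW, SGD };

struct OptChoice {
    Optimizer opt;
    float     alpha; // learning rate
    float     wd;    // weight decay
};

// Use AdamW for epoch 0 (so its m/v state exists), then drop to plain SGD.
OptChoice opt_choice_for_epoch(int epoch) {
    OptChoice c;
    c.opt   = (epoch == 0) ? Optimizer::ADAMW : Optimizer::SGD;
    c.alpha = (epoch == 0) ? 1e-4f : 1e-3f;  // SGD typically wants a different LR scale
    c.wd    = 1e-9f;
    return c;
}
```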
Training on 100 lines of wikipedia:
```
train: ... loss=0.00231±0.00032 acc=99.99±0.01% t=00:00:05
val: ... loss=3.91926±nan acc=58.40±2.18%
```
On more training data (500 lines), there is additional catastrophic forgetting
before train accuracy reaches 99.9%:
```
train: data=0000140/0000140 loss=0.02611±0.00077 acc=99.82±0.02% t=00:00:45
val: data=0000008/0000008 loss=4.11112±0.22526 acc=46.36±0.78%
```
Increasing batch + ctx sizes to 1536 (double what fits in memory for
AdamW) gets apparently better validation, but that could be an artifact
of continuing training from the previous weights, i.e. which data lands
in train vs. val probably depends on the batch size. Also amusing: it runs
faster due to the larger batch, even though the larger context should be slower?:
```
train: data=0000045/0000045 loss=0.01722±0.00103 acc=99.90±0.01% t=00:00:40
val: data=0000003/0000003 loss=1.96829±1.09488 acc=72.44±0.66%
```