add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating the
adamw first/second-moment (m, v) tensors.
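for context, a sketch of why this saves memory: adamw keeps persistent
per-parameter moment tensors m and v (two extra floats per trained
parameter) across steps, while plain sgd updates weights from the
gradient alone. illustrative code only, not the actual ggml kernels:

  #include <math.h>
  #include <stdint.h>

  // adamw: needs persistent per-parameter state m, v across steps.
  // beta1h/beta2h are the usual 1/(1 - beta^t) bias-correction factors.
  static void adamw_step(float * w, const float * g, float * m, float * v,
                         int64_t n, float alpha, float beta1, float beta2,
                         float eps, float wd, float beta1h, float beta2h) {
      for (int64_t i = 0; i < n; ++i) {
          m[i] = beta1*m[i] + (1.0f - beta1)*g[i];
          v[i] = beta2*v[i] + (1.0f - beta2)*g[i]*g[i];
          const float mh = m[i]*beta1h;
          const float vh = sqrtf(v[i]*beta2h) + eps;
          w[i] = w[i]*(1.0f - alpha*wd) - alpha*mh/vh;  // decoupled wd
      }
  }

  // sgd: no m, v - the whole update is a function of w and g.
  static void sgd_step(float * w, const float * g, int64_t n,
                       float alpha, float wd) {
      for (int64_t i = 0; i < n; ++i) {
          w[i] = w[i]*(1.0f - alpha*wd) - alpha*g[i];
      }
  }

that's roughly 8 bytes of optimizer state saved per trained parameter in
F32, roughly consistent with the ~8gb difference observed below (the sgd
run also used a larger batch).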
support new finetune.cpp arg -opt SGD (or sgd); the default remains adamw as before.
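example invocation (the -opt/-lr/-lr-half/-wd/-epochs flags are from this
change; the binary name, paths, and remaining flags are the usual
finetune/common args and are illustrative):

  llama-finetune -m model.gguf -f train.txt -ngl 99 -c 512 \
      -opt sgd -lr 1e-4 -lr-half 10 -wd 1e-9 -epochs 2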
llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch) with SGD
vs 19gb (55 sec/epoch) with adamw.
using a larger batch/context of 1728 (possible due to the memory savings),
finetune (SGD) on 500 lines of wikipedia reaches:
train: data=0000039/0000039 loss=0.01601±0.00086 acc=99.73±0.02% t=00:00:40
val: data=0000003/0000003 loss=1.99405±1.09012 acc=72.18±0.66% t=00:00:01
with the same gpu memory, adamw can only fit a 512 batch/context,
reaching (on 100 wikipedia lines, quickly memorized exactly):
train: ... loss=0.00231±0.00032 acc=99.99±0.01% t=00:00:05
val: ... loss=3.91926±nan acc=58.40±2.18%
note: when finetuning long enough (or with a high enough -lr),
validation accuracy eventually drops ('catastrophic forgetting').
the -lr-half (halflife) option is useful for SGD to avoid oscillation
or very slow underdamped learning, and makes setting -lr more forgiving.
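as I read it, -lr-half h halves the learning rate every h units of
training, i.e. simple exponential decay (whether the unit is epochs or
steps is my assumption):

  #include <math.h>

  // lr(t) = lr0 * 0.5^(t/h), where h is the -lr-half value
  static float lr_at(float lr0, float t, float h) {
      return lr0 * exp2f(-t/h);
  }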
note: the objective loss may not be directly comparable between adamw
and sgd - check perplexity (exp of the mean cross-entropy loss) or
accuracy, or compare relative improvements, to judge convergence.
also, note that a logical batch size larger than the physical batch
(gradient accumulation) seems unsupported for optimization - the batch
is limited to the physical batch, unlike in perplexity, and is also
limited to ctx-size. training quality/convergence could be improved by
implementing it, at the cost of some memory, but you can make that up
by using a much smaller physical batch for a net memory savings.
presumably it's only the physical batch that should be limited to
ctx-size? see llama_context::opt_epoch (opt_period > 1 may already be
implemented and would give multiples of the physical batch - added an
option for this and, oddly, didn't see increased gpu memory usage in
nvidia-smi when > 1). a sketch of the accumulation loop is below.
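for reference, a minimal sketch of gradient accumulation over opt_period
physical batches; the types and function names here are stand-ins, not
the actual llama.cpp API:

  #include <stdint.h>

  struct model;   // hypothetical stand-ins for the real types
  struct batch;

  void forward_backward(model * m, const batch * b); // sums into grad buffers
  void optimizer_step  (model * m);                  // adamw or sgd update
  void zero_grads      (model * m);

  // a logical batch = opt_period physical batches: gradients accumulate
  // over opt_period passes, then one optimizer step is applied.
  void train_epoch(model * m, const batch * batches,
                   int64_t n_batches, int64_t opt_period) {
      for (int64_t ib = 0; ib < n_batches; ++ib) {
          forward_backward(m, &batches[ib]);
          if ((ib + 1) % opt_period == 0) {
              optimizer_step(m);
              zero_grads(m);
          }
      }
  }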
new finetune args: -wd 1e-9 enables weight decay in sgd or adamw, and
-epochs N sets the max epochs (default 2 as before).
caching (1 - wd*alpha) in the 'adamw' opt struct gave no noticeable
perf benefit, so it's disabled there (it is still done for the new SGD,
though).
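the caching is just hoisting the constant decay factor out of the
per-element loop into the optimizer state (struct/field names here are
illustrative):

  // computed once when alpha/wd are set, reused every step, instead of
  // evaluating 1 - wd*alpha on each update:
  struct sgd_state { float keep; };   // hypothetical
  static void sgd_state_init(sgd_state * s, float alpha, float wd) {
      s->keep = 1.0f - wd*alpha;      // then: w[i] = w[i]*s->keep - alpha*g[i]
  }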
since optimizer memory is pre-allocated, the ggml_opt_get_optimizer_params
callback could probably switch between SGD and AdamW from epoch to epoch,
though it would need to use adamw for the first (unconfirmed - there is
no cmdline arg to set such a policy yet).
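a hypothetical sketch of such a per-epoch policy via the params callback;
ggml_opt_get_default_optimizer_params is the existing default, but the
p.optimizer field and the GGML_OPT_OPTIMIZER_ADAMW enum value are my
assumption (only GGML_OPT_OPTIMIZER_SGD is confirmed by this change):

  // first epoch adamw (so m, v are populated), cheaper sgd afterwards
  static struct ggml_opt_optimizer_params
  opt_pars_by_epoch(void * userdata) {
      const int epoch = *(const int *) userdata;
      struct ggml_opt_optimizer_params p =
          ggml_opt_get_default_optimizer_params(NULL);
      p.optimizer = epoch == 0 ? GGML_OPT_OPTIMIZER_ADAMW  // hypothetical field
                               : GGML_OPT_OPTIMIZER_SGD;
      return p;
  }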
test-opt checks adamw as before and now also sgd (except for a few
checks that just need expected values collected and added); the bar on
the 'regression' test is set lower for sgd (weight decay is enabled,
but a 1st-order method generally converges slower than a 2nd-order one).