This got me thinking, and after discussing it with LLMs, here's my takeaway. If we could do ALL of this:
1. Store Adam's state
2. Store `best_loss` and `best_model` for every epoch
3. Make `ShuffleDataLoader` deterministic
4. Make `CosineAnnealingLR` independent of the dataset size somehow
Then this would be doable.
1 and 2 are trivial. 4 isn't straightforwardly doable, because cosine annealing requires `t_max` (the total number of steps, which depends on the dataset size), but we could replace cosine annealing with exponential decay. That would probably hurt log loss a little bit, but it would eliminate the reliance on `t_max`.
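To make the trade-off concrete, here is a plain-Python sketch of the two schedules (function names and hyperparameter values are made up; the real optimizer uses PyTorch's scheduler classes):

```python
import math

def cosine_annealing_lr(step, lr_max, lr_min, t_max):
    # Needs t_max (the total number of steps, which depends on the
    # dataset size) to know where it is in the schedule.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / t_max))

def exponential_decay_lr(step, lr_max, gamma):
    # Depends only on the current step count, never on the dataset size.
    return lr_max * gamma ** step
```

The cosine schedule is a function of `step / t_max`, so it breaks as soon as `t_max` isn't known up front; the exponential one only ever looks at the current step.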
That leaves 3, and I have no idea how hard that is.
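For 3, one common approach is to derive the shuffle order from a fixed base seed plus the epoch number, so the order is a pure function of the epoch (a sketch with made-up names; `ShuffleDataLoader` itself isn't shown):

```python
import random

def deterministic_shuffle(items, epoch, base_seed=42):
    # Seeding a dedicated RNG from (base_seed, epoch) makes the order
    # reproducible across runs, so a resumed run replays the same batches.
    rng = random.Random(base_seed * 1_000_003 + epoch)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled
```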
I think this is worth pursuing. If we don't have to do optimization from zero every time and instead only need to run optimization on new reviews, this could dramatically speed optimization up.
EDIT: I completely forgot about the recency weighting.
- Make recency weighting independent of the dataset size somehow
No idea how this could possibly be done. It's literally defined in terms of the span between the first review and the last review, so we're kinda screwed.
EDIT 2: We could do something like this:
w = w_min + (1 - w_min)⋅(1-e^(-λ⋅t)),
where t is the number of days since the oldest review. The oldest review gets t=0, a review 10 days after it gets t=10, and a review one year after it gets t=365. Now adding a new review doesn't change the weights assigned to all other reviews.
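In code, each review's weight is then a pure function of its own age relative to the oldest review (a sketch; the `w_min` and λ values are made up for illustration):

```python
import math

def recency_weight(t_days, w_min=0.5, lam=0.01):
    # t_days = days since the oldest review (which gets t = 0).
    # Because the weight depends only on t_days, appending new reviews
    # never changes the weights already assigned to older ones.
    return w_min + (1 - w_min) * (1 - math.exp(-lam * t_days))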
The shape you get as a result is pretty different from what is optimal for recency weighting, though. And if λ is chosen to work well on average, then for users with very old collections most weights will be very close to 1, defeating the purpose of recency weighting.
EDIT 3: I guess we can kinda monkey patch it with another parameter. But it's ugly as sin.
w = w_min + (1 - w_min)⋅(1-e^(-λ⋅max(t - t_max, 0)))
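The same sketch with the extra offset parameter (again with made-up values; here `t_max` is the offset from the formula above, not the LR schedule's `t_max`):

```python
import math

def recency_weight_offset(t_days, t_max, w_min=0.5, lam=0.01):
    # Reviews with t <= t_max all sit at the floor weight w_min;
    # only reviews more than t_max days after the oldest one ramp
    # toward 1, regardless of how old the collection gets.
    return w_min + (1 - w_min) * (1 - math.exp(-lam * max(t_days - t_max, 0.0)))
```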
EDIT 4: I realized there is another issue: replacing cosine annealing with exponential decay solves the problem of relying on the dataset size, but introduces a new one: the learning rate for new reviews will tend towards 0, meaning that new reviews will have less and less impact on the parameters.
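To see the scale of the problem, with a made-up initial LR and decay factor:

```python
lr0, gamma = 4e-2, 0.99

# After 10,000 cumulative optimization steps, gamma**step has decayed
# to roughly e^(-100), so the effective LR is numerically negligible
# and gradients from brand-new reviews barely move the parameters.
lr_after_many_steps = lr0 * gamma ** 10_000
```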