This got me thinking, and after discussing it with LLMs, here's my takeaway. If we could do ALL of this:
1. Store Adam's state
2. Store `best_loss` and `best_model` for every epoch
3. Make `ShuffleDataLoader` deterministic
4. Make `CosineAnnealingLR` independent of the dataset size somehow
Then this would be doable.
1 and 2 are trivial. 4 isn't straightforwardly doable, because cosine annealing requires `t_max` (the total number of steps, which depends on the dataset size), but we could replace cosine annealing with exponential decay. That would probably hurt log loss a little bit, but it would eliminate the reliance on `t_max`.
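To make the trade-off concrete, here is a plain-Python sketch of the two schedules (function names and hyperparameter values are made up; the real optimizer uses PyTorch's scheduler classes):

```python
import math

def cosine_annealing_lr(step, lr_max, lr_min, t_max):
    # Needs t_max (the total number of steps, which depends on the
    # dataset size) to know where it is in the schedule.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / t_max))

def exponential_decay_lr(step, lr_max, gamma):
    # Depends only on the current step count, never on the dataset size.
    return lr_max * gamma ** step
```

The cosine schedule is a function of `step / t_max`, so it breaks as soon as `t_max` isn't known up front; the exponential one only ever looks at the current step.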
That leaves 3, and I have no idea how hard that is.
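For 3, one common approach is to derive the shuffle order from a fixed base seed plus the epoch number, so the order is a pure function of the epoch (a sketch with made-up names; `ShuffleDataLoader` itself isn't shown):

```python
import random

def deterministic_shuffle(items, epoch, base_seed=42):
    # Seeding a dedicated RNG from (base_seed, epoch) makes the order
    # reproducible across runs, so a resumed run replays the same batches.
    rng = random.Random(base_seed * 1_000_003 + epoch)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled
```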
I think this is worth pursuing. If we don't have to do optimization from zero every time and instead only need to run optimization on new reviews, this could dramatically speed optimization up.
EDIT: I completely forgot about the recency weighting.
- Make recency weighting independent of the dataset size somehow
No idea how this could possibly be done. It's literally defined in terms of the span between the first review and the last review, so we're kinda screwed.
EDIT 2: We could do something like this:
w = w_min + (1 - w_min)⋅(1-e^(-λ⋅t)),
where t is the number of days since the oldest review. The oldest review gets t=0, a review 10 days after it gets t=10, and a review one year after it gets t=365. Now adding a new review doesn't change the weights assigned to all other reviews.
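In code, each review's weight is then a pure function of its own age relative to the oldest review (a sketch; the `w_min` and λ values are made up for illustration):

```python
import math

def recency_weight(t_days, w_min=0.5, lam=0.01):
    # t_days = days since the oldest review (which gets t = 0).
    # Because the weight depends only on t_days, appending new reviews
    # never changes the weights already assigned to older ones.
    return w_min + (1 - w_min) * (1 - math.exp(-lam * t_days))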
The shape you get as a result is pretty different from what is optimal for recency weighting, though. And if λ is chosen to work well on average, then for users with very old collections most weights will be very close to 1, defeating the purpose of recency weighting.
EDIT 3: I guess we can kinda monkey patch it with another parameter. But it's ugly as sin.
w = w_min + (1 - w_min)⋅(1-e^(-λ⋅max(t - t_max, 0)))
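The same sketch with the extra offset parameter (again with made-up values; here `t_max` is the offset from the formula above, not the LR schedule's `t_max`):

```python
import math

def recency_weight_offset(t_days, t_max, w_min=0.5, lam=0.01):
    # Reviews with t <= t_max all sit at the floor weight w_min;
    # only reviews more than t_max days after the oldest one ramp
    # toward 1, regardless of how old the collection gets.
    return w_min + (1 - w_min) * (1 - math.exp(-lam * max(t_days - t_max, 0.0)))
```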
EDIT 4: I realized there is another issue: replacing cosine annealing with exponential decay solves the problem of relying on the dataset size, but introduces a new one: the learning rate for new reviews will tend towards 0, meaning that new reviews will have less and less impact on the parameters.
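To see the scale of the problem, with a made-up initial LR and decay factor:

```python
lr0, gamma = 4e-2, 0.99

# After 10,000 cumulative optimization steps, gamma**step has decayed
# to roughly e^(-100), so the effective LR is numerically negligible
# and gradients from brand-new reviews barely move the parameters.
lr_after_many_steps = lr0 * gamma ** 10_000
```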