
Make it possible to resume optimization #391

@Expertium

Description

[Image]

This got me thinking, and after discussing it with LLMs, here's my takeaway. If we could do ALL of this:

  1. Store Adam's state

  2. Store best_loss and best_model for every epoch

  3. Make ShuffleDataLoader deterministic

  4. Make CosineAnnealingLR independent of the dataset size somehow

Then this would be doable.
1 and 2 are trivial. 4 isn't straightforwardly doable because cosine annealing requires T_max (the total number of steps, which depends on the dataset size), but we could replace cosine annealing with exponential decay, which would probably hurt log loss a little but would eliminate the reliance on T_max.
That leaves 3, and I have no idea how hard that is.
I think this is worth pursuing. If we didn't have to run optimization from scratch every time and instead only had to run it on new reviews, this could dramatically speed up optimization.
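For concreteness, here's a minimal PyTorch sketch of what 1-4 could look like. This is not the optimizer's actual code: it uses a plain DataLoader with a seeded generator in place of ShuffleDataLoader, swaps CosineAnnealingLR for ExponentialLR as discussed above, and the names (`save_checkpoint`, `make_loader`, the checkpoint path) are made up.

```python
import torch

CKPT_PATH = "optimization_checkpoint.pth"  # hypothetical location

def save_checkpoint(model, optimizer, scheduler, epoch, best_loss, best_model_state):
    torch.save({
        "epoch": epoch,
        "model": model.state_dict(),
        "adam": optimizer.state_dict(),       # 1. Adam's state (moments, step counts)
        "scheduler": scheduler.state_dict(),
        "best_loss": best_loss,               # 2. best loss seen so far...
        "best_model": best_model_state,       # ...and the parameters that achieved it
    }, CKPT_PATH)

def load_checkpoint(model, optimizer, scheduler):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["adam"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"], ckpt["best_loss"], ckpt["best_model"]

# 3. Deterministic shuffling: seeding a generator per epoch makes the batch
#    order reproducible, so a resumed run sees the same order it would have
#    seen in a single uninterrupted run.
def make_loader(dataset, epoch, batch_size=512):
    g = torch.Generator()
    g.manual_seed(42 + epoch)
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                       shuffle=True, generator=g)

# 4. Exponential decay instead of cosine annealing: no T_max argument,
#    so the schedule doesn't depend on the dataset size.
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)
```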

EDIT: I completely forgot about the recency weighting.

  5. Make recency weighting independent of the dataset size somehow

No idea how this could possibly be done. It's literally defined in terms of the time span between the first review and the last review, so we're kinda screwed.

EDIT 2: We could do something like this:

w = w_min + (1 - w_min)⋅(1 - e^(-λ⋅t)),

where t is the number of days since the oldest review. So the oldest review has t=0, a review 10 days after it gets t=10, and a review one year after it gets t=365. Now adding a new review doesn't change the weights assigned to all other reviews.
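A quick numerical check of that claim (the w_min and λ values here are made up):

```python
import math

w_min, lam = 0.5, 0.01  # hypothetical values

def weight(t_days):
    # w = w_min + (1 - w_min)·(1 - e^(-λ·t)), with t measured from the oldest review
    return w_min + (1 - w_min) * (1 - math.exp(-lam * t_days))

print([round(weight(t), 3) for t in (0, 10, 365)])  # [0.5, 0.548, 0.987]
# Appending a review at, say, t = 400 later doesn't change the values above,
# because each weight depends only on that review's own t.
print(round(weight(400), 3))  # 0.991
```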

[Image: plot of the weight function]

The shape you get as a result is pretty different from what is optimal for recency weighting, though. And if λ is chosen to work well on average, then for users with very old collections most weights will be very close to 1, defeating the purpose of recency weighting.

EDIT 3: I guess we can kinda monkey patch it with another parameter. But it's ugly as sin.

w = w_min + (1 - w_min)⋅(1 - e^(-λ⋅max(t - t_max, 0)))

[Image: plot of the weight function with the extra parameter]
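The same sketch with the extra parameter (values are again made up; note that this t_max is the new delay parameter from the formula above, not CosineAnnealingLR's T_max):

```python
import math

w_min, lam, t_max = 0.5, 0.01, 365  # hypothetical values

def weight_delayed(t_days):
    # w = w_min + (1 - w_min)·(1 - e^(-λ·max(t - t_max, 0)))
    return w_min + (1 - w_min) * (1 - math.exp(-lam * max(t_days - t_max, 0)))

# Everything within t_max days of the oldest review stays at w_min,
# and the ramp-up only starts after that.
print([round(weight_delayed(t), 3) for t in (0, 365, 465, 730)])  # [0.5, 0.5, 0.816, 0.987]
```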

EDIT 4: I realized there is another issue. If we replace cosine annealing with exponential decay, that removes the dependence on the dataset size, but it introduces a new problem: the LR will keep decaying towards 0, so new reviews will have less and less impact on the parameters.
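A made-up numerical illustration of how fast the LR vanishes under exponential decay (lr_n = lr_0⋅γ^n):

```python
lr0, gamma = 4e-2, 0.99  # hypothetical starting LR and decay factor

for n in (0, 100, 1000, 5000):
    print(n, lr0 * gamma ** n)
# 0     0.04
# 100   ≈ 1.5e-2
# 1000  ≈ 1.7e-6
# 5000  ≈ 6e-24   -> reviews added this late barely move the parameters
```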
