Checkpointing: Continue model training at epoch x after saving intermediate model #36

@lukashaenjes

Hi, first of all, many thanks for this outstanding package.

I have a question concerning model checkpointing: I have a fairly large corpus (~70M words) and train word embeddings on it with embed_wordspace for 10 epochs. The job runs on a remote server and can take up to 2 days to finish all 10 epochs.

As a fault-tolerance measure, I figured it would be a good idea to checkpoint the model after every epoch, so that if something crashes I can load the last saved epoch and continue training from there. For this, I set saveEveryEpoch = TRUE. Since I only want to keep the last successful epoch, I leave saveTempModel = FALSE.

My question now is: How can I continue training from this checkpoint after something went wrong? I tried to pass initModel = "wordspace.bin" in the existing embed_wordspace call, which gives:

Start to load a trained starspace model.
STARSPACE-2017-2
Model loaded.

But then training proceeds with the parameters specified in the call to embed_wordspace, starting at epoch 1 and seemingly ignoring the loaded model. Also, when I read in the intermediate wordspace.bin.tsv, I am left with the default parameters, not the ones I passed to the function. For instance, x$args$param$epoch gives 5 (the default), although I originally passed epoch = 10:

x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
x$args$param$epoch
#> [1] 5

Could this be the cause of the problem?

Am I approaching this correctly? What would be an alternative way to achieve my desired goal? I'm thinking of something similar to the ModelCheckpoint functionality in TensorFlow.
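
To make the goal concrete, here is a minimal base-R sketch of the resume logic I have in mind. It is not part of the package: train_with_checkpoints, train_one_epoch and the ".epoch.rds" sidecar file are hypothetical names I made up for illustration. The idea is to train one epoch per call, record the last completed epoch on disk, and skip completed epochs when the script is restarted.

```r
# Resume-aware training loop (illustrative sketch, not package functionality).
# `train_one_epoch` is a hypothetical user-supplied function(epoch, init_model)
# that trains a single epoch and writes the model to `model_file`.
train_with_checkpoints <- function(train_one_epoch, total_epochs,
                                   model_file = "wordspace.bin",
                                   state_file = paste0(model_file, ".epoch.rds")) {
  # Last epoch that finished successfully (0 when starting fresh)
  done <- if (file.exists(state_file)) readRDS(state_file) else 0L
  for (epoch in seq_len(total_epochs)) {
    if (epoch <= done) next  # already completed in an earlier run
    # From epoch 2 onwards, start from the previously saved model file
    init <- if (epoch > 1L && file.exists(model_file)) model_file else NULL
    train_one_epoch(epoch, init)
    # Record progress only after the epoch has succeeded
    saveRDS(epoch, state_file)
    done <- epoch
  }
  invisible(done)
}
```

In my case, train_one_epoch would wrap a call to embed_wordspace with epoch = 1 and initModel = init. That only helps if initModel really continues from the loaded weights, which is exactly the part I am unsure about. I also assume that any per-run learning-rate schedule would restart on every call, so the result may not be identical to a single uninterrupted 10-epoch run.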

Many thanks in advance!
