Hi, first of all, many thanks for this outstanding package.
I have a question concerning model checkpointing: I have a fairly large corpus (~70M words) and train word embeddings on it (with embed_wordspace) for 10 epochs. I run this on a remote server, and it can take up to 2 days for all 10 epochs to finish.
As a fault-tolerance measure, I figured it would be a good idea to checkpoint the model after every epoch, so that if something crashes I can load the last saved epoch and continue training from there. For this I set saveEveryEpoch = TRUE; since I only want to keep the last successful epoch, I leave saveTempModel = FALSE.
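For reference, here is a minimal sketch of my training call (texts is a placeholder for my corpus; everything else is left at its default):

library(ruimtehol)
# sketch of the training setup; texts stands in for the 70M-word corpus
model <- embed_wordspace(texts,
                         model = "wordspace.bin",
                         epoch = 10,
                         saveEveryEpoch = TRUE,  # write a checkpoint after each epoch
                         saveTempModel = FALSE)  # overwrite it, keeping only the latest epoch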
My question now is: how can I continue training from this checkpoint after something goes wrong? I tried passing initModel = "wordspace.bin" to the existing embed_wordspace call, roughly like this (again a sketch; texts is a placeholder):
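model <- embed_wordspace(texts,
                         model = "wordspace.bin",
                         epoch = 10,
                         initModel = "wordspace.bin",  # resume from the last checkpoint
                         saveEveryEpoch = TRUE,
                         saveTempModel = FALSE)

This gives: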
Start to load a trained starspace model.
STARSPACE-2017-2
Model loaded.
But training then restarts with the parameters specified in the overall call to embed_wordspace, starting at epoch 1 and seemingly ignoring the loaded model. Also, when I read in the intermediate wordspace.bin.tsv, I am left with the default parameters, not the ones I passed to the function. For instance, x$args$param$epoch gives 5 (the StarSpace default), while I originally passed epoch = 10:
library(ruimtehol)
# inspect the checkpoint written after the last completed epoch
x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
x$args$param$epoch
#> [1] 5
Could this be the cause of the problem?
Am I approaching this correctly? If not, what would be an alternative way to achieve my goal? I'm thinking of something similar to the ModelCheckpoint callback in Keras/TensorFlow.
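One workaround I can imagine is driving the epochs manually from R, along the lines of the untested sketch below. It assumes that initModel really does warm-start training from the binary checkpoint, which is exactly what does not seem to happen above.

library(ruimtehol)
# untested sketch: run StarSpace one epoch at a time, checkpointing in between;
# texts is a placeholder for the corpus
for (i in seq_len(10)) {
  if (file.exists("wordspace.bin")) {
    # subsequent epochs: warm-start from the previous checkpoint
    model <- embed_wordspace(texts, model = "wordspace.bin",
                             epoch = 1, initModel = "wordspace.bin")
  } else {
    # first epoch: train from scratch
    model <- embed_wordspace(texts, model = "wordspace.bin", epoch = 1)
  }
}

After a crash, restarting the script would then pick up from the last written wordspace.bin, similar to what ModelCheckpoint enables.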
Many thanks in advance!