Checkpointing: Continue model training at epoch x after saving intermediate model

Hi, first of all, many thanks for this outstanding package.

I have a question concerning model checkpointing: I have a fairly large corpus (~70M words) and train word embeddings on it with `embed_wordspace` for 10 epochs. The job runs on a remote server, and all 10 epochs can take up to 2 days to finish.

As a fault-tolerance measure, I figured it would be a good idea to checkpoint the model after every epoch so that, if something crashes, I can load the last saved epoch and continue training from there. For this, I set `saveEveryEpoch = TRUE`. Since I only want to keep the last successful epoch, I leave `saveTempModel = FALSE`.

My question now is: How can I continue training from this checkpoint after something went wrong? I tried to pass `initModel = "wordspace.bin"` in the existing `embed_wordspace` call, which gives:

```
Start to load a trained starspace model.
STARSPACE-2017-2
Model loaded.
```

However, training then runs with the parameters specified in the overall call to `embed_wordspace`, starting again at epoch 1 and seemingly ignoring the passed model. Also, when I read in the intermediate `wordspace.bin.tsv`, I am left with the default parameters, not the ones I passed to the function. For instance, `x$args$param$epoch` gives `5` (the default), although I originally passed `epoch = 10`:

```
x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
x$args$param$epoch
#> [1] 5
```

Could this be the cause of the problem?

Am I approaching this correctly? What would be an alternative way to achieve my desired goal? I'm thinking of something similar to the [ModelCheckpoint](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint) functionality in TensorFlow. 

Many thanks in advance!
