Description
Line 223 in cf32146:

```python
self.returnn_config.post_config["model"] = os.path.join(self.out_model_dir.get_path(), "epoch")
```
Once you move the job directory to a new location, this absolute path therefore breaks.
More annoyingly, RETURNN silently and recursively creates missing directories for `model`, so the job does not crash but keeps running without errors. If you move a job with existing checkpoints, training will not find the old checkpoints and will restart from scratch. It will also overwrite the learning-rates file, so afterwards the old checkpoints can no longer really be used (if you care about having a matching, correct learning-rates file), and the file ends up with mixed values from the old and the new training run.
I'm not sure whether we consider the use of absolute paths here a bug. We could fix it by using a relative path. There are probably a number of similar issues here, and likely in other jobs as well.
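A minimal sketch of the relative-path idea, assuming RETURNN resolves relative paths against the directory it runs in (the helper name `relative_model_path` and the directory arguments are hypothetical, for illustration only):

```python
import os

def relative_model_path(out_model_dir: str, run_dir: str) -> str:
    # Express the checkpoint prefix relative to the directory RETURNN
    # runs in, so that moving the whole job directory keeps the path valid.
    return os.path.join(os.path.relpath(out_model_dir, run_dir), "epoch")

# If the job runs in /work/job and writes models to /work/job/output/models,
# the stored path becomes "output/models/epoch" and survives a move of /work/job.
print(relative_model_path("/work/job/output/models", "/work/job"))
```

The trade-off is that the config then only works when started from the job directory, whereas the absolute path works from anywhere but breaks on a move.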