ReturnnTrainingJob: model in config is an absolute path #495

Description

@albertz

self.returnn_config.post_config["model"] = os.path.join(self.out_model_dir.get_path(), "epoch")

Once you move the job dir to a new location, this path breaks.

More annoyingly, RETURNN silently and recursively creates non-existing directories for model, so it will not crash but will keep running without any error. If you move a job with existing checkpoints, RETURNN will not find the old checkpoints and will start training again from scratch. It will also overwrite the learning rates file, so afterwards the old checkpoints can not really be used anymore (if you care about having a corresponding correct learning rates file), and the learning rates file will contain mixed values from the old and new training runs.

Should we consider it a bug that we use absolute paths here? We could fix this by using a relative path. There might be a number of similar issues in this job, and probably in other jobs as well.
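A minimal sketch of the relative-path fix, assuming RETURNN is started with the job directory as its current working directory (that cwd assumption is mine, not stated in this issue):

# Assumption: RETURNN runs with the job dir as cwd, so a path relative
# to it stays valid even after the job dir is moved to a new location.
model_dir = self.out_model_dir.get_path()
self.returnn_config.post_config["model"] = os.path.join(os.path.relpath(model_dir), "epoch")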
