ReturnnTrainingJob: model in config is an absolute path #495

Description

@albertz

self.returnn_config.post_config["model"] = os.path.join(self.out_model_dir.get_path(), "epoch")

Once you move the job dir to a new location, this path breaks.

More annoyingly, RETURNN silently and recursively creates non-existing directories for model, so it will not crash but will keep running without any error. If you move a job with existing checkpoints, RETURNN will not find the old checkpoints and will start training again from scratch. It will also overwrite the learning rates file, so afterwards the old checkpoints can not really be used anymore (if you care about having a corresponding correct learning rates file), and the learning rates file will contain mixed values from the old and new training runs.

Should we consider it a bug that we use absolute paths here? We could fix this by using a relative path. There might be a number of similar issues in this job, and probably in other jobs as well.
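A minimal sketch of the relative-path fix, assuming RETURNN is started with the job directory as its current working directory (that cwd assumption is mine, not stated in this issue):

# Assumption: RETURNN runs with the job dir as cwd, so a path relative
# to it stays valid even after the job dir is moved to a new location.
model_dir = self.out_model_dir.get_path()
self.returnn_config.post_config["model"] = os.path.join(os.path.relpath(model_dir), "epoch")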
