ReturnnTrainingJob with multiple processes (distributed training) sets use_horovod also for Torch #461

@albertz

First, note that our horovod_num_processes is actually not only for Horovod but for any kind of distributed training in general (that is a separate issue; we should rename it: #456).

In create_returnn_config, we do this:

        if horovod_num_processes is not None:
            config["use_horovod"] = True

This is a problem, because RETURNN then assumes the TF backend in several places (logging, dataset). I just pushed a commit to RETURNN (rwth-i6/returnn@9c72180) to work around this issue, so it might be solved now (needs more testing). However, I think this is still not quite correct in general.

Note that in principle, PyTorch could also use Horovod; Horovod has support for PyTorch. This would probably be configured via the torch_distributed setting, but it is currently not supported.
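
For reference, a minimal sketch of what Horovod's PyTorch API looks like (this is plain Horovod usage, not anything that RETURNN or torch_distributed currently wires up; the model and optimizer are placeholders):

```python
# Minimal sketch of Horovod with PyTorch (plain Horovod API, not wired up in RETURNN).
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # one GPU per process

model = torch.nn.Linear(10, 2).cuda()  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers,
# and make sure all workers start from the same state.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```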

Also note that TensorFlow supports other ways of doing distributed training as well, and we partly support those, although they are not well tested; we usually use Horovod.
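
As an illustration of what TF-native (non-Horovod) distributed training looks like, here is plain TensorFlow multi-worker usage; how RETURNN exposes this is a separate question, and this snippet is not RETURNN config code:

```python
# Plain TensorFlow multi-worker training, independent of Horovod.
# The worker topology is read from the TF_CONFIG environment variable.
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(2)])  # placeholder model
    model.compile(optimizer="sgd", loss="mse")
```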

I'm not sure how to solve this now. Maybe ReturnnTrainingJob should not always set this? But that would break all existing hashes.
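
One possible direction (just a sketch; the backend check and the torch_distributed mapping are assumptions, not existing code, and the hash-compatibility question is left open) would be to only set use_horovod for the TF backend and map the process count to torch_distributed otherwise:

```python
# Hypothetical sketch for create_returnn_config: only set use_horovod for the TF backend.
if horovod_num_processes is not None:
    backend = config.get("backend", "tensorflow")  # assumed default
    if backend.startswith("torch"):
        # PyTorch: use RETURNN's torch_distributed mechanism instead of Horovod.
        config.setdefault("torch_distributed", {})
    else:
        # TF backend: keep the old behavior (which also keeps existing hashes unchanged).
        config["use_horovod"] = True
```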
