```python
assert not args.model_parallel.fp16, \
    "Expert parallelism is not supported with fp16 training."
```
Compared to the case where ep=1, the only difference when ep>1 is that it introduces an additional all-to-all communication operation. I'm a bit confused about why this setup does not support fp16 training.
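To make my understanding of that extra step concrete, here is a minimal single-process sketch (my own illustration, not code from the repo) of the all-to-all exchange that ep>1 adds: each of the ep ranks holds one chunk of tokens per expert, and the all-to-all swaps chunk j of rank i with chunk i of rank j, so every rank ends up with exactly the tokens routed to its local expert.

```python
def all_to_all(chunks_per_rank):
    """Simulate an all-to-all among ep ranks.

    chunks_per_rank[i][j] = the tokens rank i wants to send to rank j.
    Returns received, where received[i] = the chunks rank i collects,
    one from each source rank (i.e. the transpose of the send layout).
    """
    ep = len(chunks_per_rank)
    return [[chunks_per_rank[src][dst] for src in range(ep)]
            for dst in range(ep)]

# ep = 2: rank 0 routes tokens "a" to expert 0 and "b" to expert 1;
# rank 1 routes "c" to expert 0 and "d" to expert 1.
sent = [["a", "b"], ["c", "d"]]
received = all_to_all(sent)
print(received)  # -> [['a', 'c'], ['b', 'd']]
```

After the exchange, rank 0 holds all tokens for expert 0 and rank 1 holds all tokens for expert 1. My question is why this exchange (which in a real run would carry fp16 activations) would be incompatible with fp16 training.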