```python
assert not args.model_parallel.fp16, \
    "Expert parallelism is not supported with fp16 training."
```
Compared to the case where ep=1, the only difference when ep>1 is that it introduces an additional all-to-all communication operation. I'm a bit confused about why this setup does not support fp16 training.
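To make my understanding of that extra step concrete, here is a minimal single-process sketch (my own illustration, not code from the repo) of the all-to-all exchange that ep>1 adds: each of the ep ranks holds one chunk of tokens per expert, and the all-to-all swaps chunk j of rank i with chunk i of rank j, so every rank ends up with exactly the tokens routed to its local expert.

```python
def all_to_all(chunks_per_rank):
    """Simulate an all-to-all among ep ranks.

    chunks_per_rank[i][j] = the tokens rank i wants to send to rank j.
    Returns received, where received[i] = the chunks rank i collects,
    one from each source rank (i.e. the transpose of the send layout).
    """
    ep = len(chunks_per_rank)
    return [[chunks_per_rank[src][dst] for src in range(ep)]
            for dst in range(ep)]

# ep = 2: rank 0 routes tokens "a" to expert 0 and "b" to expert 1;
# rank 1 routes "c" to expert 0 and "d" to expert 1.
sent = [["a", "b"], ["c", "d"]]
received = all_to_all(sent)
print(received)  # -> [['a', 'c'], ['b', 'd']]
```

After the exchange, rank 0 holds all tokens for expert 0 and rank 1 holds all tokens for expert 1. My question is why this exchange (which in a real run would carry fp16 activations) would be incompatible with fp16 training.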