Hello.
When using tensor parallelism on BLOOM (tp_size = 8), we find that the cross-entropy loss computed by mpu.cross_entropy differs from torch.nn.functional.cross_entropy: the difference is about 1% on our data.
Looking at the implementation of mpu.cross_entropy, we see that each rank computes the loss over its partition_vocab_size, which is 8 times smaller than the full vocab_size (tp_size = 8). We suspect this is what causes the difference above.
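To check our understanding, below is a minimal single-process sketch of what we believe the vocab-parallel cross-entropy math to be. The tp ranks are simulated by slicing the vocab dimension, and the all-reduce steps are replaced by explicit maxes/sums over the slices; the variable names are ours, not the ones from mpu.

```python
# Sketch of vocab-parallel cross entropy, simulated on one process
# (assumption: this mirrors the all-reduce steps in mpu.cross_entropy).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tp_size = 8
batch, vocab = 4, 256            # vocab must be divisible by tp_size
partition = vocab // tp_size

logits = torch.randn(batch, vocab, dtype=torch.float32)
target = torch.randint(0, vocab, (batch,))

# Reference: full-vocab cross entropy
ref = F.cross_entropy(logits, target)

# Simulated vocab-parallel computation: each "rank" sees only its slice.
shards = logits.split(partition, dim=1)

# Step 1: global max over the vocab (all-reduce MAX across ranks)
local_max = torch.stack([s.max(dim=1).values for s in shards])   # [tp, batch]
global_max = local_max.max(dim=0).values                         # [batch]

# Step 2: sum of exp over the full vocab (all-reduce SUM across ranks)
sum_exp = sum(torch.exp(s - global_max.unsqueeze(1)).sum(dim=1) for s in shards)

# Step 3: each rank contributes the target logit only if the target index
# falls inside its partition (all-reduce SUM across ranks)
target_logit = torch.zeros(batch)
for rank, s in enumerate(shards):
    lo, hi = rank * partition, (rank + 1) * partition
    mask = (target >= lo) & (target < hi)
    target_logit[mask] = s[mask, target[mask] - lo]

# loss_i = log(sum_j exp(logit_j - max)) - (logit_target - max)
loss = (torch.log(sum_exp) - (target_logit - global_max)).mean()

print(ref.item(), loss.item())
```

When we run this simulation, the two values agree up to float rounding, so mathematically the partitioned computation should be exact; that makes the ~1% gap we observe surprising.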
In this case, is this implementation correct? And does it still guarantee the same loss (and model quality) when using tensor parallelism?