In the loss calculation of the AR model, why isn't the mask applied? The current code is:
total_loss = F.cross_entropy(logits, targets, reduction=reduction)
Normally, shouldn't it be:
total_loss = F.cross_entropy(logits.masked_select(y_mask), targets.masked_select(y_mask), reduction=reduction)
Is there a reason the mask is left out of the loss?
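
To make the question concrete, here is a minimal sketch of what I mean by a masked loss. It is not the repository's code. It assumes `logits` has shape (B, T, C), `targets` has shape (B, T), and a boolean mask of shape (B, T) that is True at real (non-padded) positions; if `y_mask` marks padding instead, it would need to be inverted first. The names `masked_cross_entropy`, `masked_cross_entropy_ignore_index`, and `valid_mask` are made up for illustration.

```python
import torch
import torch.nn.functional as F


def masked_cross_entropy(logits, targets, valid_mask, reduction="sum"):
    """Cross-entropy over valid positions only (hypothetical helper).

    Assumed shapes: logits (B, T, C), targets (B, T), valid_mask (B, T) bool,
    where True marks a real token and False marks padding.
    """
    # Boolean indexing with a (B, T) mask keeps the class dimension:
    # logits[valid_mask] is (N, C) and targets[valid_mask] is (N,).
    return F.cross_entropy(logits[valid_mask], targets[valid_mask], reduction=reduction)


def masked_cross_entropy_ignore_index(logits, targets, valid_mask, reduction="sum"):
    """Same loss without indexing, using cross_entropy's ignore_index."""
    # Overwrite padded targets with -100 so they contribute nothing to the loss.
    ignored_targets = targets.masked_fill(~valid_mask, -100)
    # For sequence inputs, F.cross_entropy expects the class dimension second: (B, C, T).
    return F.cross_entropy(
        logits.transpose(1, 2), ignored_targets, ignore_index=-100, reduction=reduction
    )
```

Note that `masked_select` on the (B, T, C) logits would return a flattened 1-D tensor, so boolean indexing or `ignore_index` is probably the easier way to express this.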