In the function train_one_epoch in src/training/train.py (lines 156 to 162), the loss is computed and backpropagated as follows:
losses = loss(**inputs, **inputs_no_accum, output_dict=True)
del inputs
del inputs_no_accum
total_loss = sum(losses.values())
losses["loss"] = total_loss
backward(total_loss, scaler)
Shouldn't total_loss be averaged over the gradient accumulation steps before backward() is called?
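
To make the concern concrete, here is a minimal sketch of the generic case, not the code from train.py. It assumes a mean-reduced loss computed independently on equally sized micro-batches; the helper names (mse_grad, accumulated_grad) and the toy data are hypothetical. Under those assumptions, summing the per-micro-batch gradients without scaling gives a gradient accum_steps times larger than the full-batch gradient, while dividing each micro-batch loss by accum_steps recovers it.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, accum_steps, average):
    """Accumulate gradients over equally sized micro-batches.

    If average is True, each micro-batch gradient is divided by accum_steps,
    which is equivalent to dividing the micro-batch loss before backward().
    """
    chunk = len(xs) // accum_steps
    grad = 0.0
    for i in range(accum_steps):
        sl = slice(i * chunk, (i + 1) * chunk)
        g = mse_grad(w, xs[sl], ys[sl])
        if average:
            g /= accum_steps
        grad += g  # mimics gradients summing up across backward() calls
    return grad


# Toy data (hypothetical): targets follow y = 2x, current weight is 0.5.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = mse_grad(w, xs, ys)
unscaled = accumulated_grad(w, xs, ys, accum_steps=2, average=False)
scaled = accumulated_grad(w, xs, ys, accum_steps=2, average=True)

print(full, unscaled, scaled)
```

Whether this scaling is needed in train_one_epoch depends on how loss() reduces over the accumulated inputs: if each call already computes a loss normalized over the whole accumulated batch, the extra division may not apply.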