Currently we are not dividing the raw attention scores by the square root of the head dimension, as standard scaled dot-product attention requires. Without that scaling, the variance of the logits grows with the head dimension, which can saturate the softmax and destabilize optimization. This may explain some of the loss spikes seen during training of the tiny models.
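
A minimal sketch of the fix, assuming a PyTorch-style attention implementation (the function and tensor names here are illustrative, not the project's actual code):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention with the standard 1/sqrt(d_k) scaling.

    q, k, v: (batch, heads, seq_len, head_dim) tensors (assumed layout).
    """
    head_dim = q.size(-1)
    # The 1/sqrt(head_dim) factor keeps logit variance roughly constant
    # as head_dim grows; omitting it pushes softmax toward saturation
    # and near-zero gradients, a known source of training instability.
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```

If the projections already fold in a scale, an equivalent alternative is to divide the queries by `math.sqrt(head_dim)` before the matmul; either way the factor should be applied exactly once.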