I noticed that we do not have dropout after the self-attention:
```python
# MHSA
x_mhsa_ln = self.self_att_layer_norm(x_ffn1_out)
x_mhsa = self.self_att(x_mhsa_ln, axis=spatial_dim)
x_mhsa_out = x_mhsa + x_ffn1_out
```
This is different from the standard Transformer, which applies dropout to the output of each sub-layer before it is added to the residual. It is also different from the paper, whose MHSA module likewise applies dropout before the residual connection.
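For reference, a minimal sketch of the same block with dropout applied to the self-attention output before the residual addition; the `self.dropout` module name is hypothetical and not part of the current code:

```python
# MHSA, with dropout on the attention output before the residual add
x_mhsa_ln = self.self_att_layer_norm(x_ffn1_out)
x_mhsa = self.self_att(x_mhsa_ln, axis=spatial_dim)
x_mhsa_drop = self.dropout(x_mhsa)  # hypothetical dropout module, name assumed
x_mhsa_out = x_mhsa_drop + x_ffn1_out
```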
Originally posted by @albertz in #233 (comment)