Thank you for sharing the code.
According to the paper (Appendix A, second paragraph), dropout is not used in the attention layers.
At line 205, the residual and the attention result are concatenated, but I think they should be added elementwise and then passed through a layer_norm (Figure 8 of the ANP paper). Is there a reason for this modification?
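For concreteness, here is a minimal sketch of how I'd expect the block to look based on the paper, covering both points. It assumes PyTorch and that the query and attention output share the same dimensionality; the class and names are hypothetical, not taken from your repo:

```python
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Hypothetical sketch of the attention step as described in the ANP paper."""

    def __init__(self, dim, num_heads):
        super().__init__()
        # Appendix A: no dropout in the attention layers
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=0.0,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, key, value):
        attn_out, _ = self.attn(query, key, value)
        # Figure 8: residual added elementwise, then layer-normalized,
        # rather than torch.cat([query, attn_out], dim=-1)
        return self.norm(query + attn_out)
```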
Thanks,
Deep Pandey