Thank you for sharing the code.
According to the paper (Appendix A, second paragraph), dropout is not used in the attention layers.
At line 205, the residual and the attention result are concatenated, but I think they should be added elementwise and then passed through a layer_norm (Figure 8 of the ANP paper). Is there a reason for this modification?
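For concreteness, here is a minimal sketch of how I'd expect the block to look based on the paper, covering both points. It assumes PyTorch and that the query and attention output share the same dimensionality; the class and names are hypothetical, not taken from your repo:

```python
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Hypothetical sketch of the attention step as described in the ANP paper."""

    def __init__(self, dim, num_heads):
        super().__init__()
        # Appendix A: no dropout in the attention layers
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=0.0,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, key, value):
        attn_out, _ = self.attn(query, key, value)
        # Figure 8: residual added elementwise, then layer-normalized,
        # rather than torch.cat([query, attn_out], dim=-1)
        return self.norm(query + attn_out)
```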
Thanks,
Deep Pandey