I have gone through the code for the attention head, and it seems to me that it is wildly different from what is described in the paper. It starts with 3x3x1024 convolutions that take up over 50% of the model's parameters. The whole thing is bizarre, and it even includes 1x1x1024 convolutions at the end of both sub-heads. Also, the residual connection from the branch outputs is missing.
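To make the comparison concrete, here is a minimal PyTorch sketch of the head as I read it from the code. The class and layer names are my own, the channel width comes from the 1024-wide convolutions mentioned above, and the exact layer counts and activations are guesses; treat it as a schematic of the structure, not a copy of the actual implementation.

```python
import torch
import torch.nn as nn


class AttentionHeadAsImplemented(nn.Module):
    """Sketch of the head as it appears in the code (names and exact
    layer counts are assumptions based on the description above)."""

    def __init__(self, in_channels: int = 1024):
        super().__init__()
        # Heavy 3x3 convolutions at the entry of the head -- these alone
        # account for the bulk of the model's parameter count.
        self.entry = nn.Sequential(
            nn.Conv2d(in_channels, 1024, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Two sub-heads, each ending in a 1x1x1024 convolution,
        # which is not what the paper describes.
        self.sub_head_a = nn.Sequential(
            nn.Conv2d(1024, 1024, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(1024, 1024, kernel_size=1),
        )
        self.sub_head_b = nn.Sequential(
            nn.Conv2d(1024, 1024, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(1024, 1024, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        h = self.entry(x)
        # The branch outputs are returned directly: there is no residual
        # connection from the branch outputs, unlike in the paper.
        return self.sub_head_a(h), self.sub_head_b(h)
```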
An illustration that shows the difference:

Could this be the reason for the non-reproducible results? I don't think it is, but I would be curious to find out.