The article says that bi-directional attention is used in the Transformer phase, but I don't see it implemented in /Transformer/model.py, which appears to use CausalSelfAttention. Is bi-directional attention applied somewhere else, or does CausalSelfAttention get used throughout?
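For reference, this is the distinction I mean, as a minimal sketch (the shapes and mask here are illustrative and assume a nanoGPT-style CausalSelfAttention, not code taken from model.py):

```python
import torch
import torch.nn.functional as F

B, T, C = 1, 4, 8                                # batch, sequence length, embedding dim
q = torch.randn(B, T, C)
k = torch.randn(B, T, C)
v = torch.randn(B, T, C)

att = (q @ k.transpose(-2, -1)) / (C ** 0.5)     # raw attention scores, shape (B, T, T)

# Causal (what CausalSelfAttention does): each position attends
# only to itself and earlier positions via a lower-triangular mask.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
causal_out = F.softmax(att.masked_fill(~causal_mask, float("-inf")), dim=-1) @ v

# Bi-directional: no mask, every position attends to the full sequence.
bidir_out = F.softmax(att, dim=-1) @ v
```

From reading model.py I only see the causal (masked) variant, so I'm unsure where the bi-directional attention described in the article comes in.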