Hi, we plan to integrate your model into FuxiCTR, a library for CTR prediction.
However, we found that the DisentangledSelfAttention layer used in DESTINE.py does not seem consistent with the paper's description of the unary attention weights.
- The paper multiplies mean_query by the key to get the unary attention weights.
- The code uses a projection of the key itself to map its dimension to num_heads, and then applies a softmax to get the unary attention weights. See code.
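
To make the difference concrete, here is a minimal NumPy sketch of the two variants as we understand them. This is not copied from DESTINE.py or from the paper: the function names, the tensor shapes, the `w_unary` matrix, and the choice of taking the softmax over the field axis are all our assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def unary_from_mean_query(query, key):
    """Paper-style variant (our reading): dot product of the mean query with each key.

    query, key: arrays of shape (num_heads, num_fields, head_dim) (assumed layout).
    Returns unary weights of shape (num_heads, num_fields).
    """
    mean_query = query.mean(axis=1, keepdims=True)   # (H, 1, d), averaged over fields
    scores = (mean_query * key).sum(axis=-1)         # (H, F)
    return softmax(scores, axis=-1)                  # normalize over fields

def unary_from_key_projection(key_full, w_unary):
    """Code-style variant (our reading): a learned linear map from key to num_heads.

    key_full: (num_fields, embed_dim); w_unary: (embed_dim, num_heads), hypothetical weights.
    Returns unary weights of shape (num_heads, num_fields).
    """
    scores = key_full @ w_unary                      # (F, H)
    return softmax(scores, axis=0).T                 # normalize over fields, then (H, F)

# Tiny random example: 2 heads, 4 fields, head dimension 3.
rng = np.random.default_rng(0)
H, F, d = 2, 4, 3
query = rng.normal(size=(H, F, d))
key = rng.normal(size=(H, F, d))
paper_weights = unary_from_mean_query(query, key)

key_full = key.transpose(1, 0, 2).reshape(F, H * d)  # concatenate heads per field
w_unary = rng.normal(size=(H * d, H))
code_weights = unary_from_key_projection(key_full, w_unary)
```

In this sketch both variants return one distribution over fields per head, but the first scores each key against a vector computed from the input queries, while the second scores it against learned weights that do not depend on the queries.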
Is one of the two versions wrong? If not, could you suggest which one we should implement in FuxiCTR: the version in your code or the version in the paper? Thanks!