Description:
While applying NATTEN to speaker verification tasks, I found that using surrounding padding improves performance (see "Neighborhood Attention Transformer with Progressive Channel Fusion").
To implement this properly, the padded positions must be masked by adding -inf to the corresponding attention scores before softmax.
Currently, I work around this by computing the QK and AV steps separately instead of using the fused kernel. However, this approach increases memory consumption.
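For reference, here is a minimal NumPy sketch of the unfused approach I mean (the function name, shapes, and clipped-window edge handling are my own simplifications, not NATTEN's actual API): compute the neighborhood QK scores, set scores for padded key positions to -inf, apply softmax, then do the AV step.

```python
import numpy as np

def na1d_with_mask(q, k, v, kernel_size, valid_len):
    # q, k, v: (L, d). Each query attends to a window of kernel_size keys
    # around it; key positions >= valid_len are padding and are masked out.
    L, d = q.shape
    half = kernel_size // 2
    out = np.zeros_like(q)
    for i in range(L):
        # Clip the window to sequence bounds (simplified edge handling).
        lo = max(0, min(i - half, L - kernel_size))
        idx = np.arange(lo, lo + kernel_size)
        scores = q[i] @ k[idx].T / np.sqrt(d)   # QK step
        scores[idx >= valid_len] = -np.inf      # mask padded keys
        w = np.exp(scores - scores.max())       # softmax over the window
        w /= w.sum()
        out[i] = w @ v[idx]                     # AV step
    return out
```

Because the masked positions receive zero weight after softmax, the output for valid queries is independent of whatever values sit in the padded slots. The fused kernel avoids materializing the per-window `scores`/`w` tensors, which is exactly what this workaround gives up.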
Feature Request:
Support for an attention mask (specifically, adding -inf at padded positions).
Ideally, attention masking would also be supported in the fused kernel implementation.
A 1D implementation of the paper is available at https://github.com/ChenNan1996/PCF-NAT
Question:
Do you have any plans to implement attention mask support in the future?