I’m working with a Transformer model that routinely processes sequences up to 12 288 tokens.
For the attention bias I currently create a dense attn_bias of shape 12 288 × 12 288.
Right now I am running out of memory in my multi-head attention because of these large tensors: a single fp32 bias of that shape is already about 0.6 GB, and the attention-score matrix repeats the same (seq, seq) footprint for every head and every batch element.
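For context, here is a simplified sketch of my current setup (sizes and names are illustrative, not my actual code):

```python
import torch

batch, heads, seq, head_dim = 1, 8, 12288, 64  # illustrative sizes

# Dense bias over the full sequence; this is the tensor that hurts.
attn_bias = torch.zeros(seq, seq)

q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

# The (batch, heads, seq, seq) score tensor is where the OOM actually happens.
scores = q @ k.transpose(-2, -1) / head_dim**0.5 + attn_bias
out = scores.softmax(dim=-1) @ v
```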
I could build those smaller blocks on the fly from a sparse attn_bias tensor, but I am not sure whether xformers supports this kind of processing.
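To make the idea concrete, this is roughly the plain-PyTorch loop I have in mind; build_bias_block is a hypothetical helper standing in for whatever reconstructs one dense slice of the bias from the sparse representation:

```python
import torch
import torch.nn.functional as F

def chunked_attention(q, k, v, build_bias_block, chunk=1024):
    # q, k, v: (batch, heads, seq, head_dim).
    # build_bias_block(start, end): hypothetical helper returning the dense
    # (end - start, seq) slice of the bias, built on the fly.
    scale = q.shape[-1] ** -0.5
    seq = q.shape[2]
    out = torch.empty_like(q)
    for start in range(0, seq, chunk):
        end = min(start + chunk, seq)
        # Scores for one query block only: (batch, heads, block, seq).
        scores = q[:, :, start:end] @ k.transpose(-2, -1) * scale
        scores = scores + build_bias_block(start, end)  # broadcasts over batch/heads
        out[:, :, start:end] = F.softmax(scores, dim=-1) @ v
    return out
```

This caps the peak score/bias memory at (block, seq) instead of (seq, seq), but a Python loop like this is obviously slower than a fused kernel, which is why I am asking whether xformers can do it natively.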
I would be grateful for any help. Are there other packages that could help me solve this problem?
Maciek