I’m working with a Transformer model that routinely processes sequences up to 12 288 tokens.
For the attention bias I currently create a dense attn_bias of shape 12 288 × 12 288.
Right now I am running out of memory in my multi-head attention because of these large tensors: a single fp32 bias of that shape is already about 0.6 GB, and the attention-score matrix repeats the same (seq, seq) footprint for every head and every batch element.
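For context, here is a simplified sketch of my current setup (sizes and names are illustrative, not my actual code):

```python
import torch

batch, heads, seq, head_dim = 1, 8, 12288, 64  # illustrative sizes

# Dense bias over the full sequence; this is the tensor that hurts.
attn_bias = torch.zeros(seq, seq)

q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

# The (batch, heads, seq, seq) score tensor is where the OOM actually happens.
scores = q @ k.transpose(-2, -1) / head_dim**0.5 + attn_bias
out = scores.softmax(dim=-1) @ v
```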
I could build those smaller blocks on the fly from a sparse attn_bias tensor, but I am not sure whether xformers supports this kind of processing.
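To make the idea concrete, this is roughly the plain-PyTorch loop I have in mind; build_bias_block is a hypothetical helper standing in for whatever reconstructs one dense slice of the bias from the sparse representation:

```python
import torch
import torch.nn.functional as F

def chunked_attention(q, k, v, build_bias_block, chunk=1024):
    # q, k, v: (batch, heads, seq, head_dim).
    # build_bias_block(start, end): hypothetical helper returning the dense
    # (end - start, seq) slice of the bias, built on the fly.
    scale = q.shape[-1] ** -0.5
    seq = q.shape[2]
    out = torch.empty_like(q)
    for start in range(0, seq, chunk):
        end = min(start + chunk, seq)
        # Scores for one query block only: (batch, heads, block, seq).
        scores = q[:, :, start:end] @ k.transpose(-2, -1) * scale
        scores = scores + build_bias_block(start, end)  # broadcasts over batch/heads
        out[:, :, start:end] = F.softmax(scores, dim=-1) @ v
    return out
```

This caps the peak score/bias memory at (block, seq) instead of (seq, seq), but a Python loop like this is obviously slower than a fused kernel, which is why I am asking whether xformers can do it natively.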
I would be grateful for any help. Are there other packages that could help me solve this problem?
Maciek