
[Bug] VariableBlockSparseAttention fails when num_kv_head * num_blocks_per_row > 32768 #1367

@KevinZeng08

Description


I appreciate the variable block sparse attention functionality, but I ran into an issue while using it.

This is my setting: seq_len = 49152, num_kv_head = 48, num_blocks_per_row = num_blocks_per_col = 768. Because num_kv_head * num_blocks_per_row = 36864 > 32768, the following error is raised:

self._kv_lens_buffer[: len(kv_lens_arr_host)].copy_(
RuntimeError: The size of tensor a (32768) must match the size of tensor b (36864) at non-singleton dimension 0
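For reference, a quick check of why this configuration overflows the buffer (assuming, as the shape mismatch above suggests, that kv_lens_arr_host has one entry per KV head per block row):

```python
num_kv_head = 48
num_blocks_per_row = 768

required_len = num_kv_head * num_blocks_per_row   # 36864 entries needed
preallocated_len = 32768                          # current static buffer size
assert required_len > preallocated_len            # hence the copy_ size mismatch
```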

The root cause is that _kv_lens_buffer is currently allocated statically, with a fixed size of 32768:

self._kv_lens_buffer = torch.empty(
    (32768,), dtype=torch.int32, device=self.device
)

The current static allocation of _kv_lens_buffer does not scale with growing KV head counts and context lengths. Could we allocate _kv_lens_buffer dynamically in the plan method, or simply increase its pre-allocated size? A rough sketch of the dynamic option is below.
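As a minimal sketch (not the library's actual code), assuming the wrapper keeps _kv_lens_buffer and kv_lens_arr_host as shown in the snippets above, plan could reallocate the buffer on demand; the helper name ensure_kv_lens_buffer here is hypothetical:

```python
import torch

def ensure_kv_lens_buffer(buf: torch.Tensor, required_len: int) -> torch.Tensor:
    """Return an int32 buffer with at least `required_len` elements,
    reallocating on the same device only if the current one is too small."""
    if buf.numel() >= required_len:
        return buf
    return torch.empty((required_len,), dtype=torch.int32, device=buf.device)

# Inside plan(), before the copy that currently fails:
#   self._kv_lens_buffer = ensure_kv_lens_buffer(
#       self._kv_lens_buffer, len(kv_lens_arr_host)
#   )
#   self._kv_lens_buffer[: len(kv_lens_arr_host)].copy_(kv_lens_arr_host)
```

Alternatively, the pre-allocated size could simply be raised, but that only moves the threshold; growing the buffer (or sizing it from the plan arguments) avoids the hard limit entirely.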
