
[Bug] VariableBlockSparseAttention fails when num_kv_head * num_blocks_per_row > 32768 #1367

@KevinZeng08

Description


I appreciate the variable block sparse attention functionality, but I ran into an issue while using it.

This is my setting: seq_len = 49152, num_kv_head = 48, num_blocks_per_row = num_blocks_per_col = 768. Because num_kv_head * num_blocks_per_row = 36864 > 32768, the following error is raised:

self._kv_lens_buffer[: len(kv_lens_arr_host)].copy_(
RuntimeError: The size of tensor a (32768) must match the size of tensor b (36864) at non-singleton dimension 0
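For reference, a quick check of why this configuration overflows the buffer (assuming, as the shape mismatch above suggests, that kv_lens_arr_host has one entry per KV head per block row):

```python
num_kv_head = 48
num_blocks_per_row = 768

required_len = num_kv_head * num_blocks_per_row   # 36864 entries needed
preallocated_len = 32768                          # current static buffer size
assert required_len > preallocated_len            # hence the copy_ size mismatch
```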

The root cause is that _kv_lens_buffer is currently allocated statically, with a fixed size of 32768:

self._kv_lens_buffer = torch.empty(
    (32768,), dtype=torch.int32, device=self.device
)

The current static allocation of _kv_lens_buffer does not scale with growing KV head counts and context lengths. Could we allocate _kv_lens_buffer dynamically in the plan method, or simply increase its pre-allocated size? A rough sketch of the dynamic option is below.
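As a minimal sketch (not the library's actual code), assuming the wrapper keeps _kv_lens_buffer and kv_lens_arr_host as shown in the snippets above, plan could reallocate the buffer on demand; the helper name ensure_kv_lens_buffer here is hypothetical:

```python
import torch

def ensure_kv_lens_buffer(buf: torch.Tensor, required_len: int) -> torch.Tensor:
    """Return an int32 buffer with at least `required_len` elements,
    reallocating on the same device only if the current one is too small."""
    if buf.numel() >= required_len:
        return buf
    return torch.empty((required_len,), dtype=torch.int32, device=buf.device)

# Inside plan(), before the copy that currently fails:
#   self._kv_lens_buffer = ensure_kv_lens_buffer(
#       self._kv_lens_buffer, len(kv_lens_arr_host)
#   )
#   self._kv_lens_buffer[: len(kv_lens_arr_host)].copy_(kv_lens_arr_host)
```

Alternatively, the pre-allocated size could simply be raised, but that only moves the threshold; growing the buffer (or sizing it from the plan arguments) avoids the hard limit entirely.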
