Description
When I tried running the provided web demo on my 4090, the model weights do manage to fit into VRAM, but as soon as I test any prompt, the context immediately overflows. I tried changing the code here to include attn_implementation="flash_attention_2", but I get an error saying flash attention is not supported.
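For reference, this is roughly the change I made (a minimal sketch, assuming the demo loads the model through transformers' `AutoModelForCausalLM.from_pretrained`; the model ID below is a placeholder, not the actual repo name):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/model-8b-thinking"  # placeholder for whatever the demo actually loads

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",  # this line triggers the "not supported" error
)
```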
Is there any other practical way to reduce this excessively large context footprint? I managed to use DF11 to losslessly compress the model weights so that my VRAM usage is only 15GB before entering any prompt, and yet the prompt "Tell me a short story in only 3 sentences" still causes the context to balloon to the point of OOM. I also tried setting the use_sliding_window option to True, and that seemed to have no effect at all.
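For completeness, this is roughly how I tried enabling sliding-window attention (a sketch, assuming the option is exposed on the model config the way it is for other transformers models; the exact field names for this model may differ):

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(MODEL_ID)
config.use_sliding_window = True   # had no observable effect on memory usage
# config.sliding_window = 4096     # window size, if the config exposes it

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    config=config,
    torch_dtype="bfloat16",
    device_map="cuda",
)
```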
Honestly, it feels absurd that on an 8B model, even with 9GB of VRAM left over for context, only extremely simple single-turn prompts like "Hi" work well. Rough extrapolation suggests that even a 48GB GPU would only fit 3-5 simple rounds of conversation, which is very poor scaling and completely invalidates the practicality of the thinking variant of this model.
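To make that extrapolation concrete, here is the back-of-the-envelope estimate I am working from (a sketch only; the layer count, head counts, and head dimension below are assumptions taken from typical 8B-class configs, not this model's actual values):

```python
# All hyperparameters here are assumptions (typical 8B-class values), not the real config.
num_layers   = 36
num_heads    = 32
num_kv_heads = 8
head_dim     = 128
bytes_bf16   = 2

def kv_cache_bytes(seq_len: int) -> int:
    # K and V for every layer and every KV head, in bf16
    return 2 * num_layers * num_kv_heads * head_dim * bytes_bf16 * seq_len

def eager_attn_scores_bytes(seq_len: int) -> int:
    # Full attention-score matrix per layer when flash attention is unavailable
    return num_heads * seq_len * seq_len * bytes_bf16

for seq in (1_000, 10_000, 30_000):
    print(f"{seq:>6} tokens: KV cache ~{kv_cache_bytes(seq)/1e9:.1f} GB, "
          f"eager attention scores ~{eager_attn_scores_bytes(seq)/1e9:.1f} GB per layer")
```

Under these assumptions the KV cache itself stays modest, but a thinking trace of tens of thousands of tokens combined with a non-flash attention path (which materializes the full score matrices) would easily blow past the remaining ~9GB, which matches what I am seeing.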