
Flash attention not supported? #76

@mingyi456

Description


When I tried running the provided web demo on my 4090, the model weights do fit into VRAM, but as soon as I test any prompt, the context immediately overflows memory. I tried changing the code here to include attn_implementation="flash_attention_2", but I get an error saying flash attention is not supported.
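For reference, the change I attempted looked roughly like the sketch below. The model id is a placeholder for the actual demo checkpoint, and the surrounding loading code is my guess at what the demo does, not a copy of it:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/model-8b"  # placeholder, not the real repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",  # this is the line that raises the "not supported" error
)
```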

Is there any other practical solution to the problem of an excessively large context footprint? I managed to use DF11 to losslessly compress the model weights so that my VRAM usage is only 15GB before entering any prompt, yet the prompt "Tell me a short story in only 3 sentences" still balloons the context to the point of OOM. I also tried setting the use_sliding_window option to True, but that seemed to have no effect at all.
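The sliding-window attempt looked roughly like this, assuming the model config exposes a use_sliding_window flag (as Qwen2-style configs do); the model id is again a placeholder:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "org/model-8b"  # placeholder

config = AutoConfig.from_pretrained(model_id)
config.use_sliding_window = True  # had no observable effect on KV-cache growth

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
```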

Honestly, it feels absurd that on an 8B model, even with 9GB of VRAM left over for context, only extremely simple, single-turn prompts like "Hi" work well. Extrapolating roughly, even a 48GB GPU would only fit 3-5 simple rounds of conversation, which is very poor scaling and makes the thinking variant of this model impractical.
