Description
When I tried running the provided web demo on my 4090, the model weights do manage to fit into VRAM, but as soon as I test any prompt, the context immediately overflows. I tried changing the code here to include attn_implementation="flash_attention_2", but I get an error saying flash attention is not supported.
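For reference, this is roughly the change I made (a minimal sketch, assuming the demo loads the model through transformers' `AutoModelForCausalLM.from_pretrained`; the model ID below is a placeholder, not the actual repo name):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/model-8b-thinking"  # placeholder for whatever the demo actually loads

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",  # this line triggers the "not supported" error
)
```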
Is there any other practical way to reduce this excessively large context footprint? I managed to use DF11 to losslessly compress the model weights so that my VRAM usage is only 15GB before entering any prompt, and yet the prompt "Tell me a short story in only 3 sentences" still causes the context to balloon to the point of OOM. I also tried setting the use_sliding_window option to True, and that seemed to have no effect at all.
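For completeness, this is roughly how I tried enabling sliding-window attention (a sketch, assuming the option is exposed on the model config the way it is for other transformers models; the exact field names for this model may differ):

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(MODEL_ID)
config.use_sliding_window = True   # had no observable effect on memory usage
# config.sliding_window = 4096     # window size, if the config exposes it

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    config=config,
    torch_dtype="bfloat16",
    device_map="cuda",
)
```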
Honestly, it feels absurd that on an 8B model, even with 9GB of VRAM left over for context, only extremely simple single-turn prompts like "Hi" work well. Rough extrapolation suggests that even a 48GB GPU would only fit 3-5 simple rounds of conversation, which is very poor scaling and completely invalidates the practicality of the thinking variant of this model.
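To make that extrapolation concrete, here is the back-of-the-envelope estimate I am working from (a sketch only; the layer count, head counts, and head dimension below are assumptions taken from typical 8B-class configs, not this model's actual values):

```python
# All hyperparameters here are assumptions (typical 8B-class values), not the real config.
num_layers   = 36
num_heads    = 32
num_kv_heads = 8
head_dim     = 128
bytes_bf16   = 2

def kv_cache_bytes(seq_len: int) -> int:
    # K and V for every layer and every KV head, in bf16
    return 2 * num_layers * num_kv_heads * head_dim * bytes_bf16 * seq_len

def eager_attn_scores_bytes(seq_len: int) -> int:
    # Full attention-score matrix per layer when flash attention is unavailable
    return num_heads * seq_len * seq_len * bytes_bf16

for seq in (1_000, 10_000, 30_000):
    print(f"{seq:>6} tokens: KV cache ~{kv_cache_bytes(seq)/1e9:.1f} GB, "
          f"eager attention scores ~{eager_attn_scores_bytes(seq)/1e9:.1f} GB per layer")
```

Under these assumptions the KV cache itself stays modest, but a thinking trace of tens of thousands of tokens combined with a non-flash attention path (which materializes the full score matrices) would easily blow past the remaining ~9GB, which matches what I am seeing.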