What happened?
I'm not sure whether this is a parameter exclusive to DeepSeek. Either way, I disabled FA (flash attention), and context creation then requested a 7734 MiB compute buffer on device 0, which failed with an out-of-memory error:
llama_new_context_with_model: n_ctx = 20000
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 1024
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1093.75 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 781.25 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 703.12 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 781.25 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 781.25 MiB
llama_kv_cache_init: CUDA5 KV buffer size = 703.12 MiB
llama_kv_cache_init: CUDA6 KV buffer size = 781.25 MiB
llama_kv_cache_init: CUDA7 KV buffer size = 781.25 MiB
llama_kv_cache_init: CUDA8 KV buffer size = 312.50 MiB
llama_kv_cache_init: CUDA9 KV buffer size = 312.50 MiB
llama_kv_cache_init: CUDA10 KV buffer size = 234.38 MiB
llama_new_context_with_model: KV self size = 7265.62 MiB, K (f16): 3632.81 MiB, V (f16): 3632.81 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7734.13 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 8109821952
llama_new_context_with_model: failed to allocate compute buffers
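
As a rough sanity check on these numbers (a back-of-envelope sketch, not how the allocator actually reserves memory): the per-token KV footprint follows directly from the logged KV self size, and without FA the KQ attention scores are materialized in f32, so the attention scratch scales with n_ctx * n_ubatch * n_head. The head count of 96 below is an assumption for illustration, not something taken from the log.

```python
# Rough sanity check of the logged sizes (pure arithmetic, no llama.cpp APIs).
MiB = 1024 * 1024

n_ctx    = 20000  # from the log
n_ubatch = 1024   # from the log

# KV cache: 7265.62 MiB total in f16, split evenly between K and V.
kv_total_bytes = 7265.62 * MiB
k_elems_per_token = kv_total_bytes / 2 / n_ctx / 2  # half is K, 2 bytes per f16
print(f"K elements per token: {k_elems_per_token:,.0f}")  # ~95,232 = n_layer * n_embd_k_gqa

# Without FA the KQ attention scores are stored in f32, so the attention
# scratch scales with n_ctx * n_ubatch * n_head.
# ASSUMPTION: n_head = 96 (not in the log; substitute your model's head count).
n_head = 96
kq_bytes = n_ctx * n_ubatch * n_head * 4  # 4 bytes per f32 score
print(f"KQ scores for one layer: {kq_bytes / MiB:,.0f} MiB")  # ~7,500 MiB
```

Under that assumption the single-layer KQ tensor alone is ~7,500 MiB, the same order as the 7734.13 MiB the allocator tries to reserve. With FA enabled the scores are computed in tiles rather than materialized at full n_ctx * n_ubatch size, which would explain why the compute buffer is so much smaller when flash_attn = 1.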
Name and Version
What operating system are you seeing the problem on?
Linux
Relevant log output