Closed
Labels
bug (Something isn't working), high severity (Used to report high severity bugs in llama.cpp; malfunctioning hinders important workflow)
Description
Name and Version
Using ghcr.io/ggerganov/llama.cpp@sha256:cb0f16e6eae440da844b3a80b8c15e82ac4b2b8f6637f674b10b263452e649aa
Operating systems
Linux
GGML backends
CUDA
Hardware
Nvidia H100
CUDA 12.2
Models
Qwen2.5-32B-Instruct-GGUF
Problem description & steps to reproduce
When I run the llama.cpp server (using the server-cuda Docker image), I get the following error after the first token is emitted:
/app/ggml/src/ggml-cuda/ggml-cuda.cu:70: CUDA error
jade1 | CUDA error: an illegal memory access was encountered
jade1 | current device: 0, in function ggml_backend_cuda_synchronize at /app/ggml/src/ggml-cuda/ggml-cuda.cu:2273
jade1 | cudaStreamSynchronize(cuda_ctx->stream())
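The exact invocation was not captured, but the server is started roughly as in the sketch below. The volume mount, model filename/quantization, and -ngl value are assumptions; the context size, slot count, and port match the log output further down.

```sh
# Approximate reproduction command (paths and model filename are placeholders;
# n_ctx, parallel slots, and port match the values seen in the log output below)
docker run --gpus all -v /path/to/models:/models -p 15029:15029 \
  ghcr.io/ggerganov/llama.cpp@sha256:cb0f16e6eae440da844b3a80b8c15e82ac4b2b8f6637f674b10b263452e649aa \
  -m /models/Qwen2.5-32B-Instruct-Q4_K_M.gguf \
  -c 100000 -np 4 -ngl 99 \
  --host 0.0.0.0 --port 15029
```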
First Bad Commit
Not sure.
Relevant log output
jade1 | .................................................................................................
jade1 | llama_new_context_with_model: n_seq_max = 4
jade1 | llama_new_context_with_model: n_ctx = 100000
jade1 | llama_new_context_with_model: n_ctx_per_seq = 25000
jade1 | llama_new_context_with_model: n_batch = 2048
jade1 | llama_new_context_with_model: n_ubatch = 512
jade1 | llama_new_context_with_model: flash_attn = 0
jade1 | llama_new_context_with_model: freq_base = 1000000.0
jade1 | llama_new_context_with_model: freq_scale = 1
jade1 | llama_new_context_with_model: n_ctx_per_seq (25000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
jade1 | llama_kv_cache_init: CUDA0 KV buffer size = 25000.00 MiB
jade1 | llama_new_context_with_model: KV self size = 25000.00 MiB, K (f16): 12500.00 MiB, V (f16): 12500.00 MiB
jade1 | llama_new_context_with_model: CUDA_Host output buffer size = 2.32 MiB
jade1 | llama_new_context_with_model: CUDA0 compute buffer size = 8047.82 MiB
jade1 | llama_new_context_with_model: CUDA_Host compute buffer size = 205.32 MiB
jade1 | llama_new_context_with_model: graph nodes = 2246
jade1 | llama_new_context_with_model: graph splits = 2
jade1 | common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
jade1 | request: GET /health 172.18.0.3 503
jade1 | srv init: initializing slots, n_slots = 4
jade1 | slot init: id 0 | task -1 | new slot n_ctx_slot = 25000
jade1 | slot init: id 1 | task -1 | new slot n_ctx_slot = 25000
jade1 | slot init: id 2 | task -1 | new slot n_ctx_slot = 25000
jade1 | slot init: id 3 | task -1 | new slot n_ctx_slot = 25000
jade1 | main: model loaded
jade1 | main: chat template, built_in: 1, chat_example: '<|im_start|>system
jade1 | You are a helpful assistant<|im_end|>
jade1 | <|im_start|>user
jade1 | Hello<|im_end|>
jade1 | <|im_start|>assistant
jade1 | Hi there<|im_end|>
jade1 | <|im_start|>user
jade1 | How are you?<|im_end|>
jade1 | <|im_start|>assistant
jade1 | '
jade1 | main: server is listening on http://0.0.0.0:15029 - starting the main loop
jade1 | srv update_slots: all slots are idle
ai-worker-4 | 2024-12-09T12:42:45.057140Z INFO main{worker_id="2FOIM8HI"}:worker_loop{state="ready" job_id="58f316b9-83a4-4a74-8f3a-2df8bd943960"}: Starting completion opts=LlamaCompletionTask { target: Title, system: Some("Write a short subject that summarizes what the user says or asks for. Write only the subject and nothing else. Be concise."), turns: None, llama: LlamaOptions { temperature: Some(0.2), dynatemp_range: None, dynatemp_exponent: None, top_k: None, top_p: None, min_p: None, n_predict: Some(1024), n_keep: None, stop: ["<|", "\n\n"], tfs_z: None, typical_p: None, repeat_penalty: None, repeat_last_n: None, penalize_nl: None, presence_penalty: None, frequency_penalty: None, penalty_prompt: None, mirostat: None, mirostat_tau: None, mirostat_eta: None, grammar: None, json_schema: None, seed: None, ignore_eos: None, logit_bias: [], n_probs: None, min_keep: None, image_data: [], id_slot: None, system_prompt: None, samplers: [] } }
jade1 | slot launch_slot_: id 0 | task 0 | processing task
jade1 | slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 25000, n_keep = 0, n_prompt_tokens = 60
jade1 | slot update_slots: id 0 | task 0 | kv cache rm [0, end)
jade1 | slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 60, n_tokens = 60, progress = 1.000000
jade1 | slot update_slots: id 0 | task 0 | prompt done, n_past = 60, n_tokens = 60
jade1 | request: GET /health 172.18.0.3 200
jade1 | /app/ggml/src/ggml-cuda/ggml-cuda.cu:70: CUDA error
jade1 | CUDA error: an illegal memory access was encountered
jade1 | current device: 0, in function ggml_backend_cuda_synchronize at /app/ggml/src/ggml-cuda/ggml-cuda.cu:2273
jade1 | cudaStreamSynchronize(cuda_ctx->stream())
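For completeness, the request that triggers the crash is a short title-generation completion sent by our worker. A hand-made equivalent along these lines shows the shape of the request (the /completion endpoint and field names follow the llama.cpp server API; the prompt is abbreviated, and I have not confirmed that a single manual request is enough to reproduce the crash):

```sh
# Hypothetical minimal request mirroring the worker's completion options
# (temperature, n_predict, and stop sequences taken from the worker log above)
curl http://localhost:15029/completion -d '{
  "prompt": "<|im_start|>system\nWrite a short subject that summarizes what the user says or asks for. Write only the subject and nothing else. Be concise.<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n",
  "temperature": 0.2,
  "n_predict": 1024,
  "stop": ["<|", "\n\n"]
}'
```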