
Eval bug: llama-server: illegal memory access was encountered #10739

@eamonnmag

Description


Name and Version

Using ghcr.io/ggerganov/llama.cpp@sha256:cb0f16e6eae440da844b3a80b8c15e82ac4b2b8f6637f674b10b263452e649aa

Operating systems

Linux

GGML backends

CUDA

Hardware

NVIDIA H100
CUDA 12.2

Models

Qwen2.5-32B-Instruct-GGUF

URL https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-GGUF/resolve/main/qwen2.5-32b-instruct-q6_k-00001-of-00007.gguf
...

Problem description & steps to reproduce

When I run the llama.cpp server (using the server-cuda Docker image), I get this error after the first token is emitted:

/app/ggml/src/ggml-cuda/ggml-cuda.cu:70: CUDA error
jade1        | CUDA error: an illegal memory access was encountered
jade1        |   current device: 0, in function ggml_backend_cuda_synchronize at /app/ggml/src/ggml-cuda/ggml-cuda.cu:2273
jade1        |   cudaStreamSynchronize(cuda_ctx->stream())
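
For context, CUDA kernel launches are asynchronous, so an illegal memory access inside a kernel usually only surfaces at the next synchronization point; that is why the abort is reported inside ggml_backend_cuda_synchronize rather than in whichever kernel actually faulted. The standalone sketch below is not llama.cpp code, just an illustration of that reporting behaviour using a deliberately invalid device write:

```
// Standalone illustration (not llama.cpp code): a fault inside an async kernel
// is only reported by the next synchronizing CUDA call, which is why the
// error above points at cudaStreamSynchronize() rather than at the kernel
// that performed the illegal access.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bad_write(float *p) {
    // p is a null device pointer, so this store is an illegal memory access.
    p[threadIdx.x] = 1.0f;
}

int main() {
    bad_write<<<1, 32>>>(nullptr);

    // The launch itself typically reports no error...
    printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    // ...the illegal access only surfaces once the stream is synchronized,
    // mirroring the failure point shown in the log above.
    cudaError_t err = cudaStreamSynchronize(0);
    printf("after sync:   %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

In practice this means the reported location does not identify the offending kernel; running the server under compute-sanitizer (or with CUDA_LAUNCH_BLOCKING=1) is usually needed to pinpoint it.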

First Bad Commit

Not sure.

Relevant log output

jade1        | .................................................................................................
jade1        | llama_new_context_with_model: n_seq_max     = 4
jade1        | llama_new_context_with_model: n_ctx         = 100000
jade1        | llama_new_context_with_model: n_ctx_per_seq = 25000
jade1        | llama_new_context_with_model: n_batch       = 2048
jade1        | llama_new_context_with_model: n_ubatch      = 512
jade1        | llama_new_context_with_model: flash_attn    = 0
jade1        | llama_new_context_with_model: freq_base     = 1000000.0
jade1        | llama_new_context_with_model: freq_scale    = 1
jade1        | llama_new_context_with_model: n_ctx_per_seq (25000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
jade1        | llama_kv_cache_init:      CUDA0 KV buffer size = 25000.00 MiB
jade1        | llama_new_context_with_model: KV self size  = 25000.00 MiB, K (f16): 12500.00 MiB, V (f16): 12500.00 MiB
jade1        | llama_new_context_with_model:  CUDA_Host  output buffer size =     2.32 MiB
jade1        | llama_new_context_with_model:      CUDA0 compute buffer size =  8047.82 MiB
jade1        | llama_new_context_with_model:  CUDA_Host compute buffer size =   205.32 MiB
jade1        | llama_new_context_with_model: graph nodes  = 2246
jade1        | llama_new_context_with_model: graph splits = 2
jade1        | common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
jade1        | request: GET /health 172.18.0.3 503
jade1        | srv          init: initializing slots, n_slots = 4
jade1        | slot         init: id  0 | task -1 | new slot n_ctx_slot = 25000
jade1        | slot         init: id  1 | task -1 | new slot n_ctx_slot = 25000
jade1        | slot         init: id  2 | task -1 | new slot n_ctx_slot = 25000
jade1        | slot         init: id  3 | task -1 | new slot n_ctx_slot = 25000
jade1        | main: model loaded
jade1        | main: chat template, built_in: 1, chat_example: '<|im_start|>system
jade1        | You are a helpful assistant<|im_end|>
jade1        | <|im_start|>user
jade1        | Hello<|im_end|>
jade1        | <|im_start|>assistant
jade1        | Hi there<|im_end|>
jade1        | <|im_start|>user
jade1        | How are you?<|im_end|>
jade1        | <|im_start|>assistant
jade1        | '
jade1        | main: server is listening on http://0.0.0.0:15029 - starting the main loop
jade1        | srv  update_slots: all slots are idle
ai-worker-4  | 2024-12-09T12:42:45.057140Z  INFO main{worker_id="2FOIM8HI"}:worker_loop{state="ready" job_id="58f316b9-83a4-4a74-8f3a-2df8bd943960"}: Starting completion opts=LlamaCompletionTask { target: Title, system: Some("Write a short subject that summarizes what the user says or asks for. Write only the subject and nothing else. Be concise."), turns: None, llama: LlamaOptions { temperature: Some(0.2), dynatemp_range: None, dynatemp_exponent: None, top_k: None, top_p: None, min_p: None, n_predict: Some(1024), n_keep: None, stop: ["<|", "\n\n"], tfs_z: None, typical_p: None, repeat_penalty: None, repeat_last_n: None, penalize_nl: None, presence_penalty: None, frequency_penalty: None, penalty_prompt: None, mirostat: None, mirostat_tau: None, mirostat_eta: None, grammar: None, json_schema: None, seed: None, ignore_eos: None, logit_bias: [], n_probs: None, min_keep: None, image_data: [], id_slot: None, system_prompt: None, samplers: [] } }
jade1        | slot launch_slot_: id  0 | task 0 | processing task
jade1        | slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 25000, n_keep = 0, n_prompt_tokens = 60
jade1        | slot update_slots: id  0 | task 0 | kv cache rm [0, end)
jade1        | slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 60, n_tokens = 60, progress = 1.000000
jade1        | slot update_slots: id  0 | task 0 | prompt done, n_past = 60, n_tokens = 60
jade1        | request: GET /health 172.18.0.3 200
jade1        | /app/ggml/src/ggml-cuda/ggml-cuda.cu:70: CUDA error
jade1        | CUDA error: an illegal memory access was encountered
jade1        |   current device: 0, in function ggml_backend_cuda_synchronize at /app/ggml/src/ggml-cuda/ggml-cuda.cu:2273
jade1        |   cudaStreamSynchronize(cuda_ctx->stream())

Metadata

Assignees

No one assigned

Labels

bug: Something isn't working
high severity: Used to report high severity bugs in llama.cpp (malfunctioning hinders important workflow)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
