
Eval bug: llama-server: illegal memory access was encountered #10739

@eamonnmag

Description


Name and Version

Using ghcr.io/ggerganov/llama.cpp@sha256:cb0f16e6eae440da844b3a80b8c15e82ac4b2b8f6637f674b10b263452e649aa

Operating systems

Linux

GGML backends

CUDA

Hardware

NVIDIA H100
CUDA 12.2

Models

Qwen2.5-32B-Instruct-GGUF

URL https://huggingface.co/Qwen/Qwen2.5-32B-Instruct-GGUF/resolve/main/qwen2.5-32b-instruct-q6_k-00001-of-00007.gguf
...

Problem description & steps to reproduce

When I run the llama.cpp server (using the server-cuda Docker image), I get this error after the first token is emitted:

/app/ggml/src/ggml-cuda/ggml-cuda.cu:70: CUDA error
jade1        | CUDA error: an illegal memory access was encountered
jade1        |   current device: 0, in function ggml_backend_cuda_synchronize at /app/ggml/src/ggml-cuda/ggml-cuda.cu:2273
jade1        |   cudaStreamSynchronize(cuda_ctx->stream())
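
For context, CUDA kernel launches are asynchronous, so an illegal memory access inside a kernel usually only surfaces at the next synchronization point; that is why the abort is reported inside ggml_backend_cuda_synchronize rather than in whichever kernel actually faulted. The standalone sketch below is not llama.cpp code, just an illustration of that reporting behaviour using a deliberately invalid device write:

```
// Standalone illustration (not llama.cpp code): a fault inside an async kernel
// is only reported by the next synchronizing CUDA call, which is why the
// error above points at cudaStreamSynchronize() rather than at the kernel
// that performed the illegal access.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bad_write(float *p) {
    // p is a null device pointer, so this store is an illegal memory access.
    p[threadIdx.x] = 1.0f;
}

int main() {
    bad_write<<<1, 32>>>(nullptr);

    // The launch itself typically reports no error...
    printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    // ...the illegal access only surfaces once the stream is synchronized,
    // mirroring the failure point shown in the log above.
    cudaError_t err = cudaStreamSynchronize(0);
    printf("after sync:   %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

In practice this means the reported location does not identify the offending kernel; running the server under compute-sanitizer (or with CUDA_LAUNCH_BLOCKING=1) is usually needed to pinpoint it.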

First Bad Commit

Not sure.

Relevant log output

jade1        | .................................................................................................
jade1        | llama_new_context_with_model: n_seq_max     = 4
jade1        | llama_new_context_with_model: n_ctx         = 100000
jade1        | llama_new_context_with_model: n_ctx_per_seq = 25000
jade1        | llama_new_context_with_model: n_batch       = 2048
jade1        | llama_new_context_with_model: n_ubatch      = 512
jade1        | llama_new_context_with_model: flash_attn    = 0
jade1        | llama_new_context_with_model: freq_base     = 1000000.0
jade1        | llama_new_context_with_model: freq_scale    = 1
jade1        | llama_new_context_with_model: n_ctx_per_seq (25000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
jade1        | llama_kv_cache_init:      CUDA0 KV buffer size = 25000.00 MiB
jade1        | llama_new_context_with_model: KV self size  = 25000.00 MiB, K (f16): 12500.00 MiB, V (f16): 12500.00 MiB
jade1        | llama_new_context_with_model:  CUDA_Host  output buffer size =     2.32 MiB
jade1        | llama_new_context_with_model:      CUDA0 compute buffer size =  8047.82 MiB
jade1        | llama_new_context_with_model:  CUDA_Host compute buffer size =   205.32 MiB
jade1        | llama_new_context_with_model: graph nodes  = 2246
jade1        | llama_new_context_with_model: graph splits = 2
jade1        | common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
jade1        | request: GET /health 172.18.0.3 503
jade1        | srv          init: initializing slots, n_slots = 4
jade1        | slot         init: id  0 | task -1 | new slot n_ctx_slot = 25000
jade1        | slot         init: id  1 | task -1 | new slot n_ctx_slot = 25000
jade1        | slot         init: id  2 | task -1 | new slot n_ctx_slot = 25000
jade1        | slot         init: id  3 | task -1 | new slot n_ctx_slot = 25000
jade1        | main: model loaded
jade1        | main: chat template, built_in: 1, chat_example: '<|im_start|>system
jade1        | You are a helpful assistant<|im_end|>
jade1        | <|im_start|>user
jade1        | Hello<|im_end|>
jade1        | <|im_start|>assistant
jade1        | Hi there<|im_end|>
jade1        | <|im_start|>user
jade1        | How are you?<|im_end|>
jade1        | <|im_start|>assistant
jade1        | '
jade1        | main: server is listening on http://0.0.0.0:15029 - starting the main loop
jade1        | srv  update_slots: all slots are idle
ai-worker-4  | 2024-12-09T12:42:45.057140Z  INFO main{worker_id="2FOIM8HI"}:worker_loop{state="ready" job_id="58f316b9-83a4-4a74-8f3a-2df8bd943960"}: Starting completion opts=LlamaCompletionTask { target: Title, system: Some("Write a short subject that summarizes what the user says or asks for. Write only the subject and nothing else. Be concise."), turns: None, llama: LlamaOptions { temperature: Some(0.2), dynatemp_range: None, dynatemp_exponent: None, top_k: None, top_p: None, min_p: None, n_predict: Some(1024), n_keep: None, stop: ["<|", "\n\n"], tfs_z: None, typical_p: None, repeat_penalty: None, repeat_last_n: None, penalize_nl: None, presence_penalty: None, frequency_penalty: None, penalty_prompt: None, mirostat: None, mirostat_tau: None, mirostat_eta: None, grammar: None, json_schema: None, seed: None, ignore_eos: None, logit_bias: [], n_probs: None, min_keep: None, image_data: [], id_slot: None, system_prompt: None, samplers: [] } }
jade1        | slot launch_slot_: id  0 | task 0 | processing task
jade1        | slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 25000, n_keep = 0, n_prompt_tokens = 60
jade1        | slot update_slots: id  0 | task 0 | kv cache rm [0, end)
jade1        | slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 60, n_tokens = 60, progress = 1.000000
jade1        | slot update_slots: id  0 | task 0 | prompt done, n_past = 60, n_tokens = 60
jade1        | request: GET /health 172.18.0.3 200
jade1        | /app/ggml/src/ggml-cuda/ggml-cuda.cu:70: CUDA error
jade1        | CUDA error: an illegal memory access was encountered
jade1        |   current device: 0, in function ggml_backend_cuda_synchronize at /app/ggml/src/ggml-cuda/ggml-cuda.cu:2273
jade1        |   cudaStreamSynchronize(cuda_ctx->stream())

Metadata

Assignees

No one assigned

Labels

bug: Something isn't working
high severity: Used to report high severity bugs in llama.cpp (malfunctioning hinders important workflow)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
