Error 503 when loading model using ghcr.io/ggml-org/llama.cpp:server-cuda #146

@PrideIsLife

Description

Describe the bug
When attempting to load a model through llama-swap using the llama.cpp Docker image ghcr.io/ggml-org/llama.cpp:server-cuda instead of ghcr.io/ggerganov/llama.cpp:server-cuda, the following response is returned:

curl http://localhost:9292/v1/chat/completions -d '{
  "model": "gpt-4.1-nano",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "python",
        "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
        "parameters": {
          "type": "object",
          "properties": {
            "code": {
              "type": "string",
              "description": "The code to run in the ipython interpreter."
            }
          },
          "required": ["code"]
        }
      }
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "Print a hello world message with python."
    }
  ],
  "stream": true
}'
{"error":{"code":503,"message":"Loading model","type":"unavailable_error"}}

Expected behaviour
I expected the same behavior as when using the ghcr.io/ggerganov/llama.cpp:server-cuda Docker image.

curl http://localhost:9292/v1/chat/completions -d '{
  "model": "gpt-4.1-nano",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "python",
        "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
        "parameters": {
          "type": "object",
          "properties": {
            "code": {
              "type": "string",
              "description": "The code to run in the ipython interpreter."
            }
          },
          "required": ["code"]
        }
      }
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "Print a hello world message with python."
    }
  ],
  "stream": true
}'

With that image, the following error is still thrown, but only after the model has been loaded into memory, which is expected because that version does not support response streaming when tools are used:

{"error":{"code":500,"message":"Cannot use tools with stream","type":"server_error"}}

Operating system and version

  • OS: Linux Ubuntu 24.04
  • GPUs: 2× NVIDIA GeForce RTX 3090

llama-swap configuration

```yaml
"gpt-4.1-nano":
    proxy: http://172.17.0.1:9601 # (default Docker bridge)
    cmd: >
      docker run --init --rm
      -v /opt/llama-swap/models:/models
      --gpus '"device=0,1"'
      -p 9601:9501  
      --name "Qwen3-30B-A3B-Q8_0-container"
      ghcr.io/ggml-org/llama.cpp:server-cuda
      --host 0.0.0.0
      --port 9501
      --metrics
      --model /models/Qwen3-30B-A3B-Q8_0.gguf
      --gpu-layers 200
      --slots
      -np 8
      --ctx-size 45000
      --temp 0.6
      --min-p 0.01
      --keep -1
      --threads 24
      --jinja
      --verbose
      -fa
    concurrencyLimit: 900000
    cmd_stop: "docker stop -t 2 Qwen3-30B-A3B-Q8_0-container"
    checkEndpoint: /v1/models
    ttl: 300
```
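
My guess (not verified, so treat the endpoint behavior as an assumption) is that the 503 happens because checkEndpoint: /v1/models answers 200 on the new image while the model is still loading, so llama-swap marks the upstream as ready too early. If recent llama.cpp builds keep /health at 503 ("Loading model") until the weights are fully loaded, pointing the health check there should keep llama-swap in the starting state long enough:

```yaml
"gpt-4.1-nano":
    # ...same proxy, cmd, cmd_stop, concurrencyLimit and ttl as above...
    # Assumption: /health returns 503 until the model is loaded, then 200,
    # so llama-swap only transitions to "ready" once requests can be served.
    checkEndpoint: /health
```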

Proxy Logs

llama-swap listening on :8080
[DEBUG] Exclusive mode for group graphiti, stopping other process groups
[DEBUG] <gpt-4.1-nano> swapState() State transitioned from stopped to starting
[DEBUG] <gpt-4.1-nano> Connection refused on http://172.17.0.1:9601/v1/models, giving up in 300s (normal during startup)
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
build: 5478 (f5cd27b7) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 24, n_threads_batch = 24, total_threads = 48

system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 9501, http threads: 47
main: loading model
srv    load_model: loading model '/models/Qwen3-30B-A3B-Q8_0.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from /models/Qwen3-30B-A3B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 30B A3B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv   7:                    qwen3moe.context_length u32              = 32768
llama_model_loader: - kv   8:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv   9:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  10:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  11:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  15:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  16:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  17:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  18:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 7
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q8_0:  338 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 30.25 GiB (8.51 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 2048
print_info: n_layer          = 48
print_info: n_head           = 32
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 30B.A3B
print_info: model params     = 30.53 B
print_info: general.name     = Qwen3 30B A3B
print_info: n_ff_exp         = 768
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
load_tensors: layer  25 assigned to device CUDA1, is_swa = 0
load_tensors: layer  26 assigned to device CUDA1, is_swa = 0
load_tensors: layer  27 assigned to device CUDA1, is_swa = 0
load_tensors: layer  28 assigned to device CUDA1, is_swa = 0
load_tensors: layer  29 assigned to device CUDA1, is_swa = 0
load_tensors: layer  30 assigned to device CUDA1, is_swa = 0
load_tensors: layer  31 assigned to device CUDA1, is_swa = 0
load_tensors: layer  32 assigned to device CUDA1, is_swa = 0
load_tensors: layer  33 assigned to device CUDA1, is_swa = 0
load_tensors: layer  34 assigned to device CUDA1, is_swa = 0
load_tensors: layer  35 assigned to device CUDA1, is_swa = 0
load_tensors: layer  36 assigned to device CUDA1, is_swa = 0
load_tensors: layer  37 assigned to device CUDA1, is_swa = 0
load_tensors: layer  38 assigned to device CUDA1, is_swa = 0
load_tensors: layer  39 assigned to device CUDA1, is_swa = 0
load_tensors: layer  40 assigned to device CUDA1, is_swa = 0
load_tensors: layer  41 assigned to device CUDA1, is_swa = 0
load_tensors: layer  42 assigned to device CUDA1, is_swa = 0
load_tensors: layer  43 assigned to device CUDA1, is_swa = 0
load_tensors: layer  44 assigned to device CUDA1, is_swa = 0
load_tensors: layer  45 assigned to device CUDA1, is_swa = 0
load_tensors: layer  46 assigned to device CUDA1, is_swa = 0
load_tensors: layer  47 assigned to device CUDA1, is_swa = 0
load_tensors: layer  48 assigned to device CUDA1, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size = 15803.55 MiB
load_tensors:        CUDA1 model buffer size = 14854.57 MiB
load_tensors:   CPU_Mapped model buffer size =   315.30 MiB
.......................[INFO] <gpt-4.1-nano> Health check passed on http://172.17.0.1:9601/v1/models
[DEBUG] <gpt-4.1-nano> swapState() State transitioned from starting to ready
srv  log_server_r: request: GET /v1/models 172.17.0.1 200
srv  log_server_r: request:
srv  log_server_r: response: {"models":[{"name":"/models/Qwen3-30B-A3B-Q8_0.gguf","model":"/models/Qwen3-30B-A3B-Q8_0.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"/models/Qwen3-30B-A3B-Q8_0.gguf","object":"model","created":1748187361,"owned_by":"llamacpp","meta":null}]}
srv  log_server_r: request: POST /v1/chat/completions 172.17.0.1 503
srv  log_server_r: request:
srv  log_server_r: response: {"error":{"code":503,"message":"Loading model","type":"unavailable_error"}}
[DEBUG] <gpt-4.1-nano> request /v1/chat/completions - start: 5.254679349s, total: 5.256184319s
[INFO] Request 172.26.0.1 "POST /v1/chat/completions HTTP/1.1" 503 75 "curl/8.5.0" 5.256381983s
............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 8
llama_context: n_ctx         = 45000
llama_context: n_ctx_per_seq = 5625
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (5625) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     4.64 MiB
create_memory: n_ctx = 45056 (padded)
llama_kv_cache_unified: layer   0: dev = CUDA0
llama_kv_cache_unified: layer   1: dev = CUDA0
llama_kv_cache_unified: layer   2: dev = CUDA0
llama_kv_cache_unified: layer   3: dev = CUDA0
llama_kv_cache_unified: layer   4: dev = CUDA0
llama_kv_cache_unified: layer   5: dev = CUDA0
llama_kv_cache_unified: layer   6: dev = CUDA0
llama_kv_cache_unified: layer   7: dev = CUDA0
... 
The model then proceeds to load normally.
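
One way to confirm the timing issue (again assuming /health stays at 503 until loading finishes) is to poll the upstream llama.cpp server directly on its published port, bypassing llama-swap, and watch when it actually becomes ready:

# Poll the container directly until /health returns 200 (-f makes curl fail on HTTP errors)
while ! curl -sf http://172.17.0.1:9601/health > /dev/null; do
  echo "still loading..."
  sleep 2
done
echo "upstream server is ready"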

Metadata

Labels: bug (Something isn't working)
Assignees: none
Milestone: none