Description
Name and Version
./llama-server --version
load_backend: loaded RPC backend from E:\devel\lamacpp_release\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from E:\devel\lamacpp_release\ggml-vulkan.dll
load_backend: loaded CPU backend from E:\devel\lamacpp_release\ggml-cpu-haswell.dll
version: 6106 (5fd160b)
built with clang version 19.1.5 for x86_64-pc-windows-msvc
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
-ngl 15 --n-cpu-moe 10 -c 16384 -t -1 currently fails to load the model.
-ngl 15 --n-cpu-moe 10 -c 16384 -t -1 -np 2 successfully loads it.
Problem description & steps to reproduce
To reproduce, run llama-server with the two command lines below. This happens on both Windows and Linux; on Linux I experienced it with both drivers (RADV and AMDVLK).
-ngl 15 --n-cpu-moe 10 -c 16384 -t -1 currently fails to load the model.
-ngl 15 --n-cpu-moe 10 -c 16384 -t -1 -np 2 successfully loads it.
First Bad Commit
I believe this has been present since support for this model was added.
Relevant log output
loading model with:
-ngl 15 --n-cpu-moe 10 -c 16384 -t -1
load_backend: loaded RPC backend from E:\devel\lamacpp_release\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from E:\devel\lamacpp_release\ggml-vulkan.dll
load_backend: loaded CPU backend from E:\devel\lamacpp_release\ggml-cpu-haswell.dll
build: 6106 (5fd160bb) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 16, n_threads_batch = 16, total_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 15
main: loading model
srv load_model: loading model 'E:\devel\OLLAMA_MODELS\llama.cpp\ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 5700 XT) - 8176 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 459 tensors from E:\devel\OLLAMA_MODELS\llama.cpp\ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt-oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Prerelease 100 20b Hf
llama_model_loader: - kv 3: general.finetune str = hf
llama_model_loader: - kv 4: general.basename str = prerelease-100
llama_model_loader: - kv 5: general.size_label str = 20B
llama_model_loader: - kv 6: gpt-oss.block_count u32 = 24
llama_model_loader: - kv 7: gpt-oss.context_length u32 = 131072
llama_model_loader: - kv 8: gpt-oss.embedding_length u32 = 2880
llama_model_loader: - kv 9: gpt-oss.feed_forward_length u32 = 2880
llama_model_loader: - kv 10: gpt-oss.attention.head_count u32 = 64
llama_model_loader: - kv 11: gpt-oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: gpt-oss.rope.freq_base f32 = 150000.000000
llama_model_loader: - kv 13: gpt-oss.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: gpt-oss.expert_count u32 = 32
llama_model_loader: - kv 15: gpt-oss.expert_used_count u32 = 4
llama_model_loader: - kv 16: gpt-oss.attention.key_length u32 = 64
llama_model_loader: - kv 17: gpt-oss.attention.value_length u32 = 64
llama_model_loader: - kv 18: gpt-oss.attention.sliding_window u32 = 128
llama_model_loader: - kv 19: gpt-oss.expert_feed_forward_length u32 = 2880
llama_model_loader: - kv 20: gpt-oss.rope.scaling.type str = yarn
llama_model_loader: - kv 21: gpt-oss.rope.scaling.factor f32 = 32.000000
llama_model_loader: - kv 22: gpt-oss.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = gpt-4o
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,201088] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,201088] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,446189] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 199998
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 199999
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 199999
llama_model_loader: - kv 31: tokenizer.chat_template str = {#-\n In addition to the normal input...
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: general.file_type u32 = 38
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q8_0: 98 tensors
llama_model_loader: - type mxfp4: 72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 11.27 GiB (4.63 BPW)
load: printing all EOG tokens:
load: - 199999 ('<|endoftext|>')
load: - 200002 ('<|return|>')
load: - 200007 ('<|end|>')
load: - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch = gpt-oss
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2880
print_info: n_layer = 24
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2880
print_info: n_expert = 32
print_info: n_expert_used = 4
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Prerelease 100 20b Hf
print_info: n_ff_exp = 2880
print_info: vocab type = BPE
print_info: n_vocab = 201088
print_info: n_merges = 446189
print_info: BOS token = 199998 '<|startoftext|>'
print_info: EOS token = 199999 '<|endoftext|>'
print_info: EOT token = 199999 '<|endoftext|>'
print_info: PAD token = 199999 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 199999 '<|endoftext|>'
print_info: EOG token = 200002 '<|return|>'
print_info: EOG token = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
load_tensors: offloading 15 repeating layers to GPU
load_tensors: offloaded 15/25 layers to GPU
load_tensors: Vulkan0 model buffer size = 6072.10 MiB
load_tensors: CPU_Mapped model buffer size = 5491.36 MiB
........srv log_server_r: request: GET /health 127.0.0.1 503
...............srv log_server_r: request: GET /health 127.0.0.1 503
..............srv log_server_r: request: GET /health 127.0.0.1 503
.............................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: kv_unified = false
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.77 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 16384 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 256.00 MiB
llama_kv_cache_unified: CPU KV buffer size = 128.00 MiB
llama_kv_cache_unified: size = 384.00 MiB ( 16384 cells, 12 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 640 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 8.75 MiB
llama_kv_cache_unified: CPU KV buffer size = 6.25 MiB
llama_kv_cache_unified: size = 15.00 MiB ( 640 cells, 12 layers, 1/1 seqs), K (f16): 7.50 MiB, V (f16): 7.50 MiB
ggml_vulkan: Device memory allocation of size 2223917056 failed.
ggml_vulkan: Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory
ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 2223917056
graph_reserve: failed to allocate compute buffers
llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
common_init_from_params: failed to create context with model 'E:\devel\OLLAMA_MODELS\llama.cpp\ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf'
srv log_server_r: request: GET /health 127.0.0.1 503
srv load_model: failed to load model, 'E:\devel\OLLAMA_MODELS\llama.cpp\ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
but loading the model with:
-ngl 15 --n-cpu-moe 10 -c 16384 -t -1 -np 2
load_backend: loaded RPC backend from E:\devel\lamacpp_release\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from E:\devel\lamacpp_release\ggml-vulkan.dll
load_backend: loaded CPU backend from E:\devel\lamacpp_release\ggml-cpu-haswell.dll
build: 6106 (5fd160bb) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 16, n_threads_batch = 16, total_threads = 16
system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 15
main: loading model
srv load_model: loading model 'E:\devel\OLLAMA_MODELS\llama.cpp\ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 5700 XT) - 8176 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 459 tensors from E:\devel\OLLAMA_MODELS\llama.cpp\ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt-oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Prerelease 100 20b Hf
llama_model_loader: - kv 3: general.finetune str = hf
llama_model_loader: - kv 4: general.basename str = prerelease-100
llama_model_loader: - kv 5: general.size_label str = 20B
llama_model_loader: - kv 6: gpt-oss.block_count u32 = 24
llama_model_loader: - kv 7: gpt-oss.context_length u32 = 131072
llama_model_loader: - kv 8: gpt-oss.embedding_length u32 = 2880
llama_model_loader: - kv 9: gpt-oss.feed_forward_length u32 = 2880
llama_model_loader: - kv 10: gpt-oss.attention.head_count u32 = 64
llama_model_loader: - kv 11: gpt-oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: gpt-oss.rope.freq_base f32 = 150000.000000
llama_model_loader: - kv 13: gpt-oss.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: gpt-oss.expert_count u32 = 32
llama_model_loader: - kv 15: gpt-oss.expert_used_count u32 = 4
llama_model_loader: - kv 16: gpt-oss.attention.key_length u32 = 64
llama_model_loader: - kv 17: gpt-oss.attention.value_length u32 = 64
llama_model_loader: - kv 18: gpt-oss.attention.sliding_window u32 = 128
llama_model_loader: - kv 19: gpt-oss.expert_feed_forward_length u32 = 2880
llama_model_loader: - kv 20: gpt-oss.rope.scaling.type str = yarn
llama_model_loader: - kv 21: gpt-oss.rope.scaling.factor f32 = 32.000000
llama_model_loader: - kv 22: gpt-oss.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = gpt-4o
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,201088] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,201088] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,446189] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 199998
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 199999
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 199999
llama_model_loader: - kv 31: tokenizer.chat_template str = {#-\n In addition to the normal input...
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: general.file_type u32 = 38
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q8_0: 98 tensors
llama_model_loader: - type mxfp4: 72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 11.27 GiB (4.63 BPW)
srv log_server_r: request: GET /health 127.0.0.1 503
load: printing all EOG tokens:
load: - 199999 ('<|endoftext|>')
load: - 200002 ('<|return|>')
load: - 200007 ('<|end|>')
load: - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch = gpt-oss
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2880
print_info: n_layer = 24
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2880
print_info: n_expert = 32
print_info: n_expert_used = 4
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 20.91 B
print_info: general.name = Prerelease 100 20b Hf
print_info: n_ff_exp = 2880
print_info: vocab type = BPE
print_info: n_vocab = 201088
print_info: n_merges = 446189
print_info: BOS token = 199998 '<|startoftext|>'
print_info: EOS token = 199999 '<|endoftext|>'
print_info: EOT token = 199999 '<|endoftext|>'
print_info: PAD token = 199999 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 199999 '<|endoftext|>'
print_info: EOG token = 200002 '<|return|>'
print_info: EOG token = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
load_tensors: offloading 15 repeating layers to GPU
load_tensors: offloaded 15/25 layers to GPU
load_tensors: Vulkan0 model buffer size = 6072.10 MiB
load_tensors: CPU_Mapped model buffer size = 5491.36 MiB
........srv log_server_r: request: GET /health 127.0.0.1 503
.....srv log_server_r: request: GET /health 127.0.0.1 503
..........srv log_server_r: request: GET /health 127.0.0.1 503
..srv log_server_r: request: GET /health 127.0.0.1 503
............srv log_server_r: request: GET /health 127.0.0.1 503
...srv log_server_r: request: GET /health 127.0.0.1 503
..........................................
llama_context: constructing llama_context
llama_context: n_seq_max = 2
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: kv_unified = false
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: requested n_seq_max (2) > 1, but swa_full is not enabled -- performance may be degraded: https://github.com/ggml-org/llama.cpp/pull/13845#issuecomment-2924800573
llama_context: CPU output buffer size = 1.53 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 8192 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 256.00 MiB
llama_kv_cache_unified: CPU KV buffer size = 128.00 MiB
llama_kv_cache_unified: size = 384.00 MiB ( 8192 cells, 12 layers, 2/2 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 640 cells
llama_kv_cache_unified: Vulkan0 KV buffer size = 17.50 MiB
llama_kv_cache_unified: CPU KV buffer size = 12.50 MiB
llama_kv_cache_unified: size = 30.00 MiB ( 640 cells, 12 layers, 2/2 seqs), K (f16): 15.00 MiB, V (f16): 15.00 MiB
srv log_server_r: request: GET /health 127.0.0.1 503
llama_context: Vulkan0 compute buffer size = 1085.02 MiB
llama_context: Vulkan_Host compute buffer size = 66.26 MiB
llama_context: graph nodes = 1446
llama_context: graph splits = 219 (with bs=512), 5 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv log_server_r: request: GET /health 127.0.0.1 503
srv init: initializing slots, n_slots = 2
slot init: id 0 | task -1 | new slot n_ctx_slot = 8192
slot init: id 1 | task -1 | new slot n_ctx_slot = 8192
main: model loaded
main: chat template, chat_template: {#-
In addition to the normal inputs of `messages` and `tools`, this template also accepts the
following kwargs:
- "builtin_tools": A list, can contain "browser" and/or "python".
- "model_identity": A string that optionally describes the model identity.
- "reasoning_effort": A string that describes the reasoning effort, defaults to "medium".
#}
{#- Tool Definition Rendering ============================================== #}
{%- macro render_typescript_type(param_spec, required_params, is_nullable=false) -%}
{%- if param_spec.type == "array" -%}
{%- if param_spec['items'] -%}
{%- if param_spec['items']['type'] == "string" -%}
{{- "string[]" }}
{%- elif param_spec['items']['type'] == "number" -%}
{{- "number[]" }}
{%- elif param_spec['items']['type'] == "integer" -%}
{{- "number[]" }}
{%- elif param_spec['items']['type'] == "boolean" -%}
{{- "boolean[]" }}
{%- else -%}
{%- set inner_type = render_typescript_type(param_spec['items'], required_params) -%}
{%- if inner_type == "object | object" or inner_type|length > 50 -%}
{{- "any[]" }}
{%- else -%}
{{- inner_type + "[]" }}
{%- endif -%}
{%- endif -%}
{%- if param_spec.nullable -%}
{{- " | null" }}
{%- endif -%}
{%- else -%}
{{- "any[]" }}
{%- if param_spec.nullable -%}
{{- " | null" }}
{%- endif -%}
{%- endif -%}
{%- elif param_spec.type is defined and param_spec.type is iterable and param_spec.type is not string and param_spec.type is not mapping and param_spec.type[0] is defined -%}
{#- Handle array of types like ["object", "object"] from Union[dict, list] #}
{%- if param_spec.type | length > 1 -%}
{{- param_spec.type | join(" | ") }}
{%- else -%}
{{- param_spec.type[0] }}
{%- endif -%}
{%- elif param_spec.oneOf -%}
{#- Handle oneOf schemas - check for complex unions and fallback to any #}
{%- set has_object_variants = false -%}
{%- for variant in param_spec.oneOf -%}
{%- if variant.type == "object" -%}
{%- set has_object_variants = true -%}
{%- endif -%}
{%- endfor -%}
{%- if has_object_variants and param_spec.oneOf|length > 1 -%}
{{- "any" }}
{%- else -%}
{%- for variant in param_spec.oneOf -%}
{{- render_typescript_type(variant, required_params) -}}
{%- if variant.description %}
{{- "// " + variant.description }}
{%- endif -%}
{%- if variant.default is defined %}
{{ "// default: " + variant.default|tojson }}
{%- endif -%}
{%- if not loop.last %}
{{- " | " }}
{% endif -%}
{%- endfor -%}
{%- endif -%}
{%- elif param_spec.type == "string" -%}
{%- if param_spec.enum -%}
{{- '"' + param_spec.enum|join('" | "') + '"' -}}
{%- else -%}
{{- "string" }}
{%- if param_spec.nullable %}
{{- " | null" }}
{%- endif -%}
{%- endif -%}
{%- elif param_spec.type == "number" -%}
{{- "number" }}
{%- elif param_spec.type == "integer" -%}
{{- "number" }}
{%- elif param_spec.type == "boolean" -%}
{{- "boolean" }}
{%- elif param_spec.type == "object" -%}
{%- if param_spec.properties -%}
{{- "{\n" }}
{%- for prop_name, prop_spec in param_spec.properties.items() -%}
{{- prop_name -}}
{%- if prop_name not in (param_spec.required or []) -%}
{{- "?" }}
{%- endif -%}
{{- ": " }}
{{ render_typescript_type(prop_spec, param_spec.required or []) }}
{%- if not loop.last -%}
{{-", " }}
{%- endif -%}
{%- endfor -%}
{{- "}" }}
{%- else -%}
{{- "object" }}
{%- endif -%}
{%- else -%}
{{- "any" }}
{%- endif -%}
{%- endmacro -%}
{%- macro render_tool_namespace(namespace_name, tools) -%}
{{- "## " + namespace_name + "\n\n" }}
{{- "namespace " + namespace_name + " {\n\n" }}
{%- for tool in tools %}
{%- set tool = tool.function %}
{{- "// " + tool.description + "\n" }}
{{- "type "+ tool.name + " = (" }}
{%- if tool.parameters and tool.parameters.properties -%}
{{- "_: " }}
{{- "{\n" }}
{%- for param_name, param_spec in tool.parameters.properties.items() %}
{{- "// " + param_spec.description + "\n" }}
{{- param_name }}
{%- if param_name not in (tool.parameters.required or []) -%}
{{- "?" }}
{%- endif -%}
{{- ": " }}
{{- render_typescript_type(param_spec, tool.parameters.required or []) }}
{%- if param_spec.default is defined -%}
{%- if param_spec.oneOf %}
{{- "// default: " + param_spec.default }}
{%- else %}
{{- ", // default: " + param_spec.default|tojson }}
{%- endif -%}
{%- endif -%}
{%- if not loop.last %}
{{- ",\n" }}
{%- endif -%}
{%- endfor %}
{{- ",\n}) => any;\n" }}
{%- else -%}
{{- "\n}) => any;\n" }}
{%- endif -%}
{%- endfor %}
{{- "\n} // namespace " + namespace_name }}
{%- endmacro -%}
{%- macro render_builtin_tools(browser_tool, python_tool) -%}
{%- if browser_tool %}
{{- "## browser\n\n" }}
{{- "// Tool for browsing.\n" }}
{{- "// The `cursor` appears in brackets before each browsing display: `[{cursor}]`.\n" }}
{{- "// Cite information from the tool using the following format:\n" }}
{{- "// `【{cursor}†L{line_start}(-L{line_end})?】`, for example: `【6†L9-L11】` or `【8†L3】`.\n" }}
{{- "// Do not quote more than 10 words directly from the tool output.\n" }}
{{- "// sources=web (default: web)\n" }}
{{- "namespace browser {\n\n" }}
{{- "// Searches for information related to `query` and displays `topn` results.\n" }}
{{- "type search = (_: {\n" }}
{{- "query: string,\n" }}
{{- "topn?: number, // default: 10\n" }}
{{- "source?: string,\n" }}
{{- "}) => any;\n\n" }}
{{- "// Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.\n" }}
{{- "// Valid link ids are displayed with the formatting: `【{id}†.*】`.\n" }}
{{- "// If `cursor` is not provided, the most recent page is implied.\n" }}
{{- "// If `id` is a string, it is treated as a fully qualified URL associated with `source`.\n" }}
{{- "// If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.\n" }}
{{- "// Use this function without `id` to scroll to a new location of an opened page.\n" }}
{{- "type open = (_: {\n" }}
{{- "id?: number | string, // default: -1\n" }}
{{- "cursor?: number, // default: -1\n" }}
{{- "loc?: number, // default: -1\n" }}
{{- "num_lines?: number, // default: -1\n" }}
{{- "view_source?: boolean, // default: false\n" }}
{{- "source?: string,\n" }}
{{- "}) => any;\n\n" }}
{{- "// Finds exact matches of `pattern` in the current page, or the page given by `cursor`.\n" }}
{{- "type find = (_: {\n" }}
{{- "pattern: string,\n" }}
{{- "cursor?: number, // default: -1\n" }}
{{- "}) => any;\n\n" }}
{{- "} // namespace browser\n\n" }}
{%- endif -%}
{%- if python_tool %}
{{- "## python\n\n" }}
{{- "Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).\n\n" }}
{{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.\n\n" }}
{%- endif -%}
{%- endmacro -%}
{#- System Message Construction ============================================ #}
{%- macro build_system_message() -%}
{%- if model_identity is not defined %}
{{- "You are ChatGPT, a large language model trained by OpenAI.\n" -}}
{%- else %}
{{- model_identity }}
{%- endif %}
{{- "Knowledge cutoff: 2024-06\n" }}
{{- "Current date: " + strftime_now("%Y-%m-%d") + "\n\n" }}
{%- if reasoning_effort is not defined %}
{%- set reasoning_effort = "medium" %}
{%- endif %}
{{- "reasoning: " + reasoning_effort + "\n\n" }}
{%- if builtin_tools %}
{{- "# Tools\n\n" }}
{%- set available_builtin_tools = namespace(browser=false, python=false) %}
{%- for tool in builtin_tools %}
{%- if tool == "browser" %}
{%- set available_builtin_tools.browser = true %}
{%- elif tool == "python" %}
{%- set available_builtin_tools.python = true %}
{%- endif %}
{%- endfor %}
{{- render_builtin_tools(available_builtin_tools.browser, available_builtin_tools.python) }}
{%- endif -%}
{{- "# Valid channels: analysis, commentary, final. Channel must be included for every message.\n" }}
{{- "Calls to these tools must go to the commentary channel: 'functions'." }}
{%- endmacro -%}
{#- Main Template Logic ================================================= #}
{#- Set defaults #}
{#- Render system message #}
{{- "<|start|>system<|message|>" }}
{{- build_system_message() }}
{{- "<|end|>" }}
{#- Extract developer message #}
{%- if messages[0].role == "developer" or messages[0].role == "system" %}
{%- set developer_message = messages[0].content %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set developer_message = "" %}
{%- set loop_messages = messages %}
{%- endif %}
{#- Render developer message #}
{%- if developer_message or tools %}
{{- "<|start|>developer<|message|>" }}
{%- if developer_message %}
{{- "# Instructions\n\n" }}
{{- developer_message }}
{%- endif %}
{%- if tools -%}
{{- "\n\n" }}
{{- "# Tools\n\n" }}
{{- render_tool_namespace("functions", tools) }}
{%- endif -%}
{{- "<|end|>" }}
{%- endif %}
{#- Render messages #}
{%- set last_tool_call = namespace(name=none) %}
{%- for message in loop_messages -%}
{#- At this point only assistant/user/tool messages should remain #}
{%- if message.role == 'assistant' -%}
{%- if "tool_calls" in message %}
{#- We assume max 1 tool call per message, and so we infer the tool call name #}
{#- in "tool" messages from the most recent assistant tool call name #}
{%- set tool_call = message.tool_calls[0] %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{%- if message.content %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}
{%- endif %}
{{- "<|start|>assistant to=" }}
{{- "functions." + tool_call.name + "<|channel|>commentary json<|message|>" }}
{{- tool_call.arguments|tojson }}
{{- "<|end|>" }}
{%- set last_tool_call.name = tool_call.name %}
{%- elif "thinking" in message and loop.last and not add_generation_prompt %}
{#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
{#- This is a situation that should only occur in training, never in inference. #}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
{#- <|return|> indicates the end of generation, but <|end|> does not #}
{#- <|return|> should never be an input to the model, but we include it as the final token #}
{#- when training, so the model learns to emit it. #}
{{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}
{%- set last_tool_call.name = none %}
{%- elif "thinking" in message %}
{#- CoT is dropped during all previous turns, so we never render it for inference #}
{{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
{%- set last_tool_call.name = none %}
{%- elif loop.last and not add_generation_prompt %}
{#- <|return|> indicates the end of generation, but <|end|> does not #}
{#- <|return|> should never be an input to the model, but we include it as the final token #}
{#- when training, so the model learns to emit it. #}
{{- "<|start|>assistant<|message|>" + message.content + "<|return|>" }}
{%- else %}
{{- "<|start|>assistant<|message|>" + message.content + "<|end|>" }}
{%- set last_tool_call.name = none %}
{%- endif %}
{%- elif message.role == 'tool' -%}
{%- if last_tool_call.name is none %}
{{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
{%- endif %}
{{- "<|start|>functions." + last_tool_call.name }}
{{- " to=assistant<|channel|>commentary<|message|>" + message.content|tojson + "<|end|>" }}
{%- else -%}
{{- "<|start|>user<|message|>" + message.content + "<|end|>" }}
{%- endif -%}
{%- endfor -%}
{#- Generation prompt #}
{%- if add_generation_prompt -%}
<|start|>assistant
{%- endif -%}
, example_format: '<|start|>system<|message|>You are a helpful assistant<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|message|>Hi there<|return|><|start|>user<|message|>How are you?<|end|><|start|>assistant'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
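For context, one plausible reading of the two logs above: the -np 1 run requests a single Vulkan0 compute buffer of 2223917056 bytes, while the -np 2 run (with n_ctx_per_seq halved to 8192) only needs a 1085.02 MiB compute buffer. Many Vulkan drivers report a per-allocation limit (VkPhysicalDeviceMaintenance3Properties::maxMemoryAllocationSize) of 2 GiB; whether that is the actual limit on this driver is an assumption. A minimal sketch of the arithmetic, assuming a 2 GiB limit:

```python
# Hypothetical check: compare the failed Vulkan allocation from the log
# against an assumed 2 GiB per-allocation limit (a value many drivers
# report via maxMemoryAllocationSize -- not verified for this driver).
failed_alloc = 2_223_917_056   # bytes, from the ggml_vulkan error above
assumed_limit = 2 * 1024**3    # 2 GiB = 2147483648 bytes

print(f"requested: {failed_alloc / 1024**2:.2f} MiB")   # 2120.89 MiB
print(f"limit:     {assumed_limit / 1024**2:.2f} MiB")  # 2048.00 MiB
print("exceeds assumed limit:", failed_alloc > assumed_limit)  # True
```

This would be consistent with "Requested buffer size exceeds device memory allocation limit: ErrorOutOfDeviceMemory" in the failing run, and with the 1085.02 MiB buffer fitting in the -np 2 run.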