Help in converting llama-cpp command to KoboldCpp #1684

albertosottile · 2025-08-12T22:18:13Z

albertosottile
Aug 12, 2025

I got this command from Reddit to run GLM-4.5-Air on my machine (Windows, 2x3090 + 64 GB DDR5):

.\llama-server.exe -m I:\LLM\GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf --host 0.0.0.0 --port 30000 -c 20480 --cache-type-k q8_0 --cache-type-v q8_0 -t 8 -ngl 99 -ts 2/1 --n-cpu-moe 19 --flash-attn --cache-reuse 128 --mlock --numa distribute

The model runs perfectly with this command and produces about 10 t/s. How the tensors are split:

load_tensors: offloaded 48/48 layers to GPU
load_tensors:        CUDA0 model buffer size = 20349.63 MiB
load_tensors:        CUDA1 model buffer size = 21586.17 MiB
load_tensors:   CPU_Mapped model buffer size = 26874.49 MiB

I tried to run the same model with latest Koboldcpp (the backend that I normally use for all the other models) but apparently I failed to load it due to insufficient VRAM.

Here is the full output from the terminal:

C:\Users\Alberto\Documents\kobold>koboldcpp.exe
***
Welcome to KoboldCpp - Version 1.97.4
For command line arguments, please refer to --help
***
Auto Selected CUDA Backend (flag=0)

Multi-Part GGUF detected. Layer estimates may not be very accurate - recommend setting layers manually.
Loading Chat Completions Adapter: C:\Users\Alberto\AppData\Local\Temp\_MEI212122\kcpp_adapters\AutoGuess.json
Chat Completions Adapter Loaded
System: Windows 10.0.19045 AMD64 Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
Detected Available GPU Memory: 24576 MB
Detected Available RAM: 53417 MB
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(admin=False, admindir='', adminpassword='', analyze='', benchmark=None, blasbatchsize=512, blasthreads=8, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=20480, debugmode=0, defaultgenamt=512, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=True, forceversion=0, foreground=False, gpulayers=99, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=True, lora=None, loramult=1.0, maingpu=-1, maxrequestsize=32, mmproj=None, mmprojcpu=False, model=[], model_param='I:/LLM/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf', moecpu=19, moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', overridekv=None, overridetensors=None, password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=1, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile=None, sdclamped=0, sdclampedsoft=0, sdclipg='', sdclipl='', sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdphotomaker='', sdquant=False, sdt5xxl='', sdthreads=8, sdtiledvae=768, sdvae='', sdvaeauto=False, showgui=False, singleinstance=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=8, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, usecpu=False, usecuda=['normal', 'mmq'], usemlock=True, usemmap=False, useswa=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
==========
Loading Text Model: I:\LLM\GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf

The reported GGUF Arch is: glm4moe
Arch Category: 9

---
Identified as GGUF model.
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
CUDA MMQ: True
---
Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...
---
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Overriding 19 MoE layers to CPU...
Handling Override Tensors for backends: CUDA0 CUDA1 CPU

Override Tensor: blk\.0\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.1\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.2\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.3\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.4\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.5\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.6\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.7\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.8\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.9\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.10\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.11\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.12\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.13\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.14\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.15\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.16\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.17\.ffn_(up|down|gate)_exps to CPU
Override Tensor: blk\.18\.ffn_(up|down|gate)_exps to CPU
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 55 key-value pairs and 803 tensors from I:\LLM\GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size   = 68.01 GiB (5.29 BPW)
init_tokenizer: initializing tokenizer for type 2
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151329 ('<|endoftext|>')
load:   - 151336 ('<|user|>')
load:   - 151338 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch             = glm4moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 47
print_info: n_head           = 96
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 12
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 10944
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 106B.A12B
print_info: model params     = 110.47 B
print_info: general.name     = Glm-4.5-Air
print_info: vocab type       = BPE
print_info: n_vocab          = 151552
print_info: n_merges         = 318088
print_info: BOS token        = 151331 '[gMASK]'
print_info: EOS token        = 151329 '<|endoftext|>'
print_info: EOT token        = 151336 '<|user|>'
print_info: EOM token        = 151338 '<|observation|>'
print_info: UNK token        = 151329 '<|endoftext|>'
print_info: PAD token        = 151330 '[MASK]'
print_info: LF token         = 198 'ÄS'
print_info: FIM PRE token    = 151347 '<|code_prefix|>'
print_info: FIM SUF token    = 151349 '<|code_suffix|>'
print_info: FIM MID token    = 151348 '<|code_middle|>'
print_info: EOG token        = 151329 '<|endoftext|>'
print_info: EOG token        = 151336 '<|user|>'
print_info: EOG token        = 151338 '<|observation|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = false)
tensor blk.1.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.1.ffn_down_exps.weight (748 MiB q8_0) buffer type overridden to CUDA_Host
tensor blk.1.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.2.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.2.ffn_down_exps.weight (748 MiB q8_0) buffer type overridden to CUDA_Host
tensor blk.2.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.3.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.3.ffn_down_exps.weight (748 MiB q8_0) buffer type overridden to CUDA_Host
tensor blk.3.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.4.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.4.ffn_down_exps.weight (748 MiB q8_0) buffer type overridden to CUDA_Host
tensor blk.4.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.5.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.5.ffn_down_exps.weight (484 MiB q5_0) buffer type overridden to CUDA_Host
tensor blk.5.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.6.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.6.ffn_down_exps.weight (484 MiB q5_0) buffer type overridden to CUDA_Host
tensor blk.6.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.7.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.7.ffn_down_exps.weight (748 MiB q8_0) buffer type overridden to CUDA_Host
tensor blk.7.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.8.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.8.ffn_down_exps.weight (484 MiB q5_0) buffer type overridden to CUDA_Host
tensor blk.8.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.9.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.9.ffn_down_exps.weight (484 MiB q5_0) buffer type overridden to CUDA_Host
tensor blk.9.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.10.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.10.ffn_down_exps.weight (748 MiB q8_0) buffer type overridden to CUDA_Host
tensor blk.10.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.11.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.11.ffn_down_exps.weight (484 MiB q5_0) buffer type overridden to CUDA_Host
tensor blk.11.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.12.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.12.ffn_down_exps.weight (484 MiB q5_0) buffer type overridden to CUDA_Host
tensor blk.12.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.13.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.13.ffn_down_exps.weight (748 MiB q8_0) buffer type overridden to CUDA_Host
tensor blk.13.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.14.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.14.ffn_down_exps.weight (484 MiB q5_0) buffer type overridden to CUDA_Host
tensor blk.14.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.15.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.15.ffn_down_exps.weight (484 MiB q5_0) buffer type overridden to CUDA_Host
tensor blk.15.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.16.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.16.ffn_down_exps.weight (748 MiB q8_0) buffer type overridden to CUDA_Host
tensor blk.16.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.17.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.17.ffn_down_exps.weight (484 MiB q5_0) buffer type overridden to CUDA_Host
tensor blk.17.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.18.ffn_gate_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
tensor blk.18.ffn_down_exps.weight (484 MiB q5_0) buffer type overridden to CUDA_Host
tensor blk.18.ffn_up_exps.weight (396 MiB q4_K) buffer type overridden to CUDA_Host
model has unused tensor blk.46.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.attn_q.weight (size = 28311552 bytes) -- ignoring
model has unused tensor blk.46.attn_k.weight (size = 2359296 bytes) -- ignoring
model has unused tensor blk.46.attn_v.weight (size = 2359296 bytes) -- ignoring
model has unused tensor blk.46.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.46.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.46.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.46.attn_output.weight (size = 28311552 bytes) -- ignoring
model has unused tensor blk.46.post_attention_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.ffn_gate_inp.weight (size = 2097152 bytes) -- ignoring
model has unused tensor blk.46.exp_probs_b.bias (size = 512 bytes) -- ignoring
model has unused tensor blk.46.ffn_gate_exps.weight (size = 415236096 bytes) -- ignoring
model has unused tensor blk.46.ffn_down_exps.weight (size = 784334848 bytes) -- ignoring
model has unused tensor blk.46.ffn_up_exps.weight (size = 415236096 bytes) -- ignoring
model has unused tensor blk.46.ffn_gate_shexp.weight (size = 3244032 bytes) -- ignoring
model has unused tensor blk.46.ffn_down_shexp.weight (size = 6127616 bytes) -- ignoring
model has unused tensor blk.46.ffn_up_shexp.weight (size = 3244032 bytes) -- ignoring
model has unused tensor blk.46.nextn.eh_proj.weight (size = 18874368 bytes) -- ignoring
model has unused tensor blk.46.nextn.embed_tokens.weight (size = 349175808 bytes) -- ignoring
model has unused tensor blk.46.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.nextn.shared_head_head.weight (size = 349175808 bytes) -- ignoring
model has unused tensor blk.46.nextn.shared_head_norm.weight (size = 16384 bytes) -- ignoring
load_tensors: relocated tensors: 55 of 780
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 33206.30 MiB on device 1: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA1 buffer of size 34819329280
llama_model_load: error loading model: unable to allocate CUDA1 buffer
llama_model_load_from_file_impl: failed to load model
Traceback (most recent call last):
  File "koboldcpp.py", line 7576, in <module>
    main(launch_args=parser.parse_args(),default_args=parser.parse_args([]))
  File "koboldcpp.py", line 6588, in main
    kcpp_main_process(args,global_memory,using_gui_launcher)
  File "koboldcpp.py", line 7038, in kcpp_main_process
    loadok = load_model(modelname)
  File "koboldcpp.py", line 1398, in load_model
    ret = handle.load_model(inputs)
OSError: exception: access violation reading 0x0000000000000004
[23236] Failed to execute script 'koboldcpp' due to unhandled exception!

Can you tell me what I am doing wrong w.r.t. llama-cpp? Thanks.

Answered by wbruna

Aug 13, 2025

Namespace( (...) tensor_split=None,

This option would be the equivalent of -ts on your llama-server command line, so something like: --tensor_split 2 1 .

View full answer

wbruna · 2025-08-13T01:28:14Z

wbruna
Aug 13, 2025

Namespace( (...) tensor_split=None,

This option would be the equivalent of -ts on your llama-server command line, so something like: --tensor_split 2 1 .

6 replies

wbruna Aug 13, 2025

I thought this option was meant to split the tensors between the two GPUs, which is something I do not want to do here.

Well, that's what it is doing with that llama-server option -ts 2/1 on the command line you mentioned:

$ llama-server --help | grep -A 2 -- -ts
(...)
-ts, --tensor-split N0,N1,N2,... fraction of the model to offload to each GPU, comma-separated list of
proportions, e.g. 3,1
(env: LLAMA_ARG_TENSOR_SPLIT)

The --tensor_split 2 1 would replicate what I guessed that -ts 2/1 meant. I'm not exactly sure how it is interpreting the '/', but the model is being split between the cards, judging by the VRAM usage.

But if you prefer using a single card, one option would be increasing the moecpu value to offload more layers to the CPU. With those memory values, I'd guess around 30 should work.

albertosottile Aug 13, 2025
Author

The model is split between the cards but, equally:

load_tensors:        CUDA0 model buffer size = 20349.63 MiB
load_tensors:        CUDA1 model buffer size = 21586.17 MiB
load_tensors:   CPU_Mapped model buffer size = 26874.49 MiB

so, I do not understand what the "2 over 1" ratio means.

But if you prefer using a single card, one option would be increasing the moecpu value to offload more layers to the CPU. With those memory values, I'd guess around 30 should work.

No, I actually wanted to end up here but, I do not understand why "2/1" is the right answer.

wbruna Aug 13, 2025

Alright, just use --tensor_split 1 1 then.

I do not understand why "2/1" is the right answer.

It's not, it's just that the command line you provided included that. According to the llama-server help, -ts 1,1 would make more sense for an equal split.

albertosottile Aug 13, 2025
Author

But that's exactly the thing: running with -ts 1/1 or with --tensor_split 1 1 results in:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 33206.30 MiB on device 1: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA1 buffer of size 34819329280
llama_model_load: error loading model: unable to allocate CUDA1 buffer
llama_model_load_from_file_impl: failed to load model

Therefore, I appreciate the original command and your answer (because they work). But, I still do not understand why they work :D

wbruna Aug 14, 2025

Oh, my mistake: so --tensor_split 2 1 did replicate the split in the same way as before.

My guess is, the splitting is done at tensor level, and some tensors are bigger than others. So in this case, the two-thirds on one GPU have roughly the same size as the third on the other.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Help in converting llama-cpp command to KoboldCpp #1684

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 6 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Help in converting llama-cpp command to KoboldCpp #1684

Uh oh!

Uh oh!

albertosottile Aug 12, 2025

Replies: 1 comment · 6 replies

Uh oh!

wbruna Aug 13, 2025

Uh oh!

wbruna Aug 13, 2025

Uh oh!

albertosottile Aug 13, 2025 Author

Uh oh!

wbruna Aug 13, 2025

Uh oh!

Uh oh!

albertosottile Aug 13, 2025 Author

Uh oh!

wbruna Aug 14, 2025

albertosottile
Aug 12, 2025

Replies: 1 comment 6 replies

wbruna
Aug 13, 2025

albertosottile Aug 13, 2025
Author

albertosottile Aug 13, 2025
Author