
Eval bug: Gemma 3 Vision GGUF fails to generate output with llama-gemma3-cli.exe on Windows (while LLaVA works) #12784

@ismailsemihsenturk

Description

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
version: 5061 (916c83b)
built with MSVC 19.43.34809.0 for x64

Operating systems

Windows

GGML backends

CUDA

Hardware

AMD Ryzen 9 9950X3D
NVIDIA GeForce RTX 5090

Models

Gemma3
https://huggingface.co/google/gemma-3-4b-it-qat-q4_0-gguf/tree/main
https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/tree/main

Problem description & steps to reproduce

Subject: Gemma 3 Vision GGUF Fails (Silent Exit) with llama-gemma3-cli & llama-llava-cli on Windows (Text works, LLaVA works)

Body:

Hi llama.cpp team,

I'm encountering difficulties running Gemma 3's vision capabilities with GGUF models on Windows 11, and I'm hoping you might be able to shed some light on it.

Environment:

  • OS: Windows 11 Pro
  • GPU: NVIDIA GeForce RTX 5090
  • CPU: AMD Ryzen 9 9950X3D
  • CUDA Toolkit: 12.8
  • Build Tools: Visual Studio 2022 Community (MSVC 19.43.34809.0)
  • llama.cpp Build: Commit 916c83bf (build 5061), compiled from source on Windows with MSVC, GGML_CUDA=ON.
  • Python 3.11

Problem Description:

When attempting multimodal inference with a Gemma 3 GGUF model (unsloth/gemma-3-4b-it-GGUF Q4_K_M) and its corresponding mmproj file (mmproj-F32.gguf from the same repo), both llama-gemma3-cli.exe and llama-llava-cli.exe fail to produce any text output.

The process appears to load the LLM and the CLIP projector successfully. Logs show GPU offloading is complete, and the clip_ctx: CLIP using CUDA0 backend message appears. VRAM usage temporarily spikes (e.g., ~2.5GB idle to ~6.8GB) during this loading phase, but then immediately drops back down. The application then either hangs indefinitely or exits silently without generating any response tokens or printing any error messages after the CLIP context line.

Working Scenarios (Confirmation of Setup):

  1. LLaVA 1.5 Vision Works: Using llama-llava-cli.exe with a known-good LLaVA 1.5 7B GGUF model and its corresponding mmproj-model-f16.gguf works perfectly on the same system. It loads, processes the image, and generates the expected text description. (See successful command/log snippet below).
  2. Gemma 3 Text-Only Works: Using the standard llama-cli.exe with the exact same gemma-3-4b-it-Q4_K_M.gguf model file (but without specifying --mmproj or --image) works correctly for text-only generation. It loads the model onto the GPU and produces text output based on the prompt. I tested both the Unsloth and Google Q4 versions, each with the F32 and BF16 mmproj files.

These successful tests indicate that the base llama.cpp CUDA build, GPU interaction (RTX 5090 / CC 12.0 detection seems okay), and the core functionality for other multimodal models (LLaVA) are working correctly on my Windows setup. The issue seems highly specific to the Gemma 3 multimodal inference pathway.

Failed Command (Gemma 3 Vision - Tried with llama-gemma3-cli.exe):

D:\llamacpp\llama.cpp\build\bin\Release>.\llama-gemma3-cli.exe -m D:\ai_models\unsloth-gemma-3-4b-it-gguf\gemma-3-4b-it-Q4_K_M.gguf --mmproj D:\ai_models\unsloth-gemma-3-4b-it-gguf\mmproj-BF16.gguf --image C:/Users/Bukra/Desktop/photo_test.jpg -p "describe this image" --temp 0.2 -ngl 100 -v
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
build: 5061 (916c83bf) with MSVC 19.43.34809.0 for x64
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) - 30843 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 444 tensors from D:\ai_models\unsloth-gemma-3-4b-it-gguf\gemma-3-4b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma-3-4B-It
llama_model_loader: - kv   3:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   4:                         general.size_label str              = 3.9B
llama_model_loader: - kv   5:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   6:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv   7:                    gemma3.embedding_length u32              = 2560
llama_model_loader: - kv   8:                         gemma3.block_count u32              = 34
llama_model_loader: - kv   9:                 gemma3.feed_forward_length u32              = 10240
llama_model_loader: - kv  10:                gemma3.attention.head_count u32              = 8
llama_model_loader: - kv  11:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  13:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  14:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  15:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  16:             gemma3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  17:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  18:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,262208]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,262208]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,262208]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 106
llama_model_loader: - kv  26:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  29:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  31:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  205 tensors
llama_model_loader: - type q4_K:  204 tensors
llama_model_loader: - type q6_K:   35 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 2.31 GiB (5.12 BPW)
init_tokenizer: initializing tokenizer for type 1
load: control token:      1 '<eos>' is not marked as EOG
load: control token:      0 '<pad>' is not marked as EOG
load: control token:     46 '<unused40>' is not marked as EOG
load: control token:      3 '<unk>' is not marked as EOG
load: control token:     54 '<unused48>' is not marked as EOG
load: control token:     23 '<unused17>' is not marked as EOG
load: control token: 261532 '<unused5630>' is not marked as EOG
load: control token:      2 '<bos>' is not marked as EOG
load: control token:     30 '<unused24>' is not marked as EOG
load: control token: 260607 '<unused4705>' is not marked as EOG
load: control token:     74 '<unused68>' is not marked as EOG
load: control token: 259984 '<unused4082>' is not marked as EOG
load: control token: 259158 '<unused3256>' is not marked as EOG
load: control token:     57 '<unused51>' is not marked as EOG
load: control token:      4 '<mask>' is not marked as EOG
load: control token:     43 '<unused37>' is not marked as EOG
load: control token:     34 '<unused28>' is not marked as EOG
load: control token:      6 '<unused0>' is not marked as EOG
load: control token:     36 '<unused30>' is not marked as EOG
load: control token: 258673 '<unused2771>' is not marked as EOG
load: control token:     13 '<unused7>' is not marked as EOG
load: control token:     26 '<unused20>' is not marked as EOG
load: control token:     14 '<unused8>' is not marked as EOG
load: control token:     27 '<unused21>' is not marked as EOG
load: control token:     15 '<unused9>' is not marked as EOG
load: control token:     16 '<unused10>' is not marked as EOG
load: control token:     17 '<unused11>' is not marked as EOG
load: control token:     18 '<unused12>' is not marked as EOG
load: control token: 260041 '<unused4139>' is not marked as EOG
load: control token:     19 '<unused13>' is not marked as EOG
load: control token:     20 '<unused14>' is not marked as EOG
load: control token:     21 '<unused15>' is not marked as EOG
load: control token:     55 '<unused49>' is not marked as EOG
load: control token:     22 '<unused16>' is not marked as EOG
load: control token:     53 '<unused47>' is not marked as EOG
load: control token:     24 '<unused18>' is not marked as EOG
load: control token:     52 '<unused46>' is not marked as EOG
load: control token:     25 '<unused19>' is not marked as EOG
load: control token:     28 '<unused22>' is not marked as EOG
load: control token:     29 '<unused23>' is not marked as EOG
load: control token: 260701 '<unused4799>' is not marked as EOG
load: control token:     31 '<unused25>' is not marked as EOG
load: control token:     45 '<unused39>' is not marked as EOG
load: control token:     32 '<unused26>' is not marked as EOG
load: control token:     44 '<unused38>' is not marked as EOG
load: control token:     33 '<unused27>' is not marked as EOG
load: control token:     47 '<unused41>' is not marked as EOG
load: control token: 260528 '<unused4626>' is not marked as EOG
load: control token:     48 '<unused42>' is not marked as EOG
load: control token:     49 '<unused43>' is not marked as EOG
load: control token: 260522 '<unused4620>' is not marked as EOG
load: control token: 260887 '<unused4985>' is not marked as EOG
load: control token:     50 '<unused44>' is not marked as EOG
load: control token:     51 '<unused45>' is not marked as EOG
load: control token:     56 '<unused50>' is not marked as EOG
load: control token: 259159 '<unused3257>' is not marked as EOG
load: control token: 258653 '<unused2751>' is not marked as EOG
load: control token:     58 '<unused52>' is not marked as EOG
load: control token:     59 '<unused53>' is not marked as EOG
load: control token:     60 '<unused54>' is not marked as EOG
load: control token:     61 '<unused55>' is not marked as EOG
load: control token: 259987 '<unused4085>' is not marked as EOG
load: control token:     62 '<unused56>' is not marked as EOG
load: control token: 257650 '<unused1748>' is not marked as EOG
load: control token:     63 '<unused57>' is not marked as EOG
load: control token:     64 '<unused58>' is not marked as EOG
load: control token: 260792 '<unused4890>' is not marked as EOG
load: control token:     65 '<unused59>' is not marked as EOG
load: control token:     66 '<unused60>' is not marked as EOG
load: control token: 259074 '<unused3172>' is not marked as EOG
load: control token:     67 '<unused61>' is not marked as EOG
load: control token:     68 '<unused62>' is not marked as EOG
load: control token:     69 '<unused63>' is not marked as EOG
load: control token:     87 '<unused81>' is not marked as EOG
load: control token: 257138 '<unused1236>' is not marked as EOG
load: control token:     88 '<unused82>' is not marked as EOG
load: control token: 257139 '<unused1237>' is not marked as EOG
load: control token:     89 '<unused83>' is not marked as EOG
load: control token: 257132 '<unused1230>' is not marked as EOG
load: control token:     90 '<unused84>' is not marked as EOG
load: control token: 257989 '<unused2087>' is not marked as EOG
load: control token: 257133 '<unused1231>' is not marked as EOG
load: control token:     91 '<unused85>' is not marked as EOG
load: control token: 257134 '<unused1232>' is not marked as EOG
load: control token:     92 '<unused86>' is not marked as EOG
load: control token: 256220 '<unused318>' is not marked as EOG
load: control token: 257135 '<unused1233>' is not marked as EOG
load: control token:     93 '<unused87>' is not marked as EOG
load: control token:     94 '<unused88>' is not marked as EOG
load: control token: 256218 '<unused316>' is not marked as EOG
load: control token:     95 '<unused89>' is not marked as EOG
load: control token: 257126 '<unused1224>' is not marked as EOG
load: control token:     96 '<unused90>' is not marked as EOG
load: control token: 257127 '<unused1225>' is not marked as EOG
load: control token:     97 '<unused91>' is not marked as EOG
load: control token: 257128 '<unused1226>' is not marked as EOG
load: control token:     98 '<unused92>' is not marked as EOG
load: control token: 257129 '<unused1227>' is not marked as EOG
load: control token:     99 '<unused93>' is not marked as EOG
load: control token: 257122 '<unused1220>' is not marked as EOG
load: control token:    100 '<unused94>' is not marked as EOG
load: control token: 257123 '<unused1221>' is not marked as EOG
load: control token:    101 '<unused95>' is not marked as EOG
load: control token: 257124 '<unused1222>' is not marked as EOG
load: control token:    102 '<unused96>' is not marked as EOG
load: control token: 257125 '<unused1223>' is not marked as EOG
load: control token:    103 '<unused97>' is not marked as EOG
load: control token:    104 '<unused98>' is not marked as EOG
load: control token:    105 '<start_of_turn>' is not marked as EOG
load: control token: 262128 '<unused6226>' is not marked as EOG
load: control token: 261646 '<unused5744>' is not marked as EOG
load: control token: 257704 '<unused1802>' is not marked as EOG
load: control token: 257599 '<unused1697>' is not marked as EOG
load: control token: 260246 '<unused4344>' is not marked as EOG
load: control token: 262111 '<unused6209>' is not marked as EOG
load: control token: 260590 '<unused4688>' is not marked as EOG
load: control token: 260591 '<unused4689>' is not marked as EOG
load: control token: 260213 '<unused4311>' is not marked as EOG
load: control token: 257971 '<unused2069>' is not marked as EOG
load: control token: 258690 '<unused2788>' is not marked as EOG
load: control token: 260860 '<unused4958>' is not marked as EOG
load: control token: 256429 '<unused527>' is not marked as EOG
load: control token: 258920 '<unused3018>' is not marked as EOG
load: control token: 262132 '<unused6230>' is not marked as EOG
load: control token: 256183 '<unused281>' is not marked as EOG
load: control token: 260490 '<unused4588>' is not marked as EOG
load: control token: 262130 '<unused6228>' is not marked as EOG
load: control token: 262131 '<unused6229>' is not marked as EOG
load: control token: 262133 '<unused6231>' is not marked as EOG
load: control token: 262134 '<unused6232>' is not marked as EOG
load: control token: 262139 '<unused6237>' is not marked as EOG
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2560
print_info: n_layer          = 34
print_info: n_head           = 8
print_info: n_head_kv        = 4
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: n_swa_pattern    = 6
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 10240
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 4B
print_info: model params     = 3.88 B
print_info: general.name     = Gemma-3-4B-It
print_info: vocab type       = SPM
print_info: n_vocab          = 262208
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 106 '<end_of_turn>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 1
load_tensors: layer   1 assigned to device CUDA0, is_swa = 1
load_tensors: layer   2 assigned to device CUDA0, is_swa = 1
load_tensors: layer   3 assigned to device CUDA0, is_swa = 1
load_tensors: layer   4 assigned to device CUDA0, is_swa = 1
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 1
load_tensors: layer   7 assigned to device CUDA0, is_swa = 1
load_tensors: layer   8 assigned to device CUDA0, is_swa = 1
load_tensors: layer   9 assigned to device CUDA0, is_swa = 1
load_tensors: layer  10 assigned to device CUDA0, is_swa = 1
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 1
load_tensors: layer  13 assigned to device CUDA0, is_swa = 1
load_tensors: layer  14 assigned to device CUDA0, is_swa = 1
load_tensors: layer  15 assigned to device CUDA0, is_swa = 1
load_tensors: layer  16 assigned to device CUDA0, is_swa = 1
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 1
load_tensors: layer  19 assigned to device CUDA0, is_swa = 1
load_tensors: layer  20 assigned to device CUDA0, is_swa = 1
load_tensors: layer  21 assigned to device CUDA0, is_swa = 1
load_tensors: layer  22 assigned to device CUDA0, is_swa = 1
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 1
load_tensors: layer  25 assigned to device CUDA0, is_swa = 1
load_tensors: layer  26 assigned to device CUDA0, is_swa = 1
load_tensors: layer  27 assigned to device CUDA0, is_swa = 1
load_tensors: layer  28 assigned to device CUDA0, is_swa = 1
load_tensors: layer  29 assigned to device CUDA0, is_swa = 0
load_tensors: layer  30 assigned to device CUDA0, is_swa = 1
load_tensors: layer  31 assigned to device CUDA0, is_swa = 1
load_tensors: layer  32 assigned to device CUDA0, is_swa = 1
load_tensors: layer  33 assigned to device CUDA0, is_swa = 1
load_tensors: layer  34 assigned to device CUDA0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 34 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 35/35 layers to GPU
load_tensors:        CUDA0 model buffer size =  2368.31 MiB
load_tensors:   CPU_Mapped model buffer size =   525.13 MiB
.................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
llama_context: n_ctx = 4096
llama_context: n_ctx = 4096 (padded)
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 34, can_shift = 1
init: layer   0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer   9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer  33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init:      CUDA0 KV buffer size =   544.00 MiB
llama_context: KV self size  =  544.00 MiB, K (f16):  272.00 MiB, V (f16):  272.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      CUDA0 compute buffer size =   517.12 MiB
llama_context:  CUDA_Host compute buffer size =    21.01 MiB
llama_context: graph nodes  = 1435
llama_context: graph splits = 2
clear_adapter_lora: call
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
clip_ctx: CLIP using CUDA0 backend

D:\llamacpp\llama.cpp\build\bin\Release>

(Result: Loads model & CLIP, VRAM spikes & drops, hangs/exits silently after clip_ctx log. No text output.)
(Note: Using llama-llava-cli.exe with the Gemma 3 files yields the same silent exit behavior after CLIP loading.)
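
(To help narrow down whether this is a hard crash or a clean early exit, a minimal next step — sketched below with my local paths from the failed command above — is to check the process exit code immediately after the silent exit:)

D:\llamacpp\llama.cpp\build\bin\Release>.\llama-gemma3-cli.exe -m D:\ai_models\unsloth-gemma-3-4b-it-gguf\gemma-3-4b-it-Q4_K_M.gguf --mmproj D:\ai_models\unsloth-gemma-3-4b-it-gguf\mmproj-BF16.gguf --image C:/Users/Bukra/Desktop/photo_test.jpg -p "describe this image" --temp 0.2 -ngl 100 -v
D:\llamacpp\llama.cpp\build\bin\Release>echo Exit code: %ERRORLEVEL%

(A non-zero value here would suggest the process is crashing somewhere in the Gemma 3 image/CLIP path rather than finishing without emitting tokens.)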

Successful Command (LLaVA Vision):

.\llama-llava-cli.exe ^
 -m D:\ai_models\ggml_llava-v1.5-7b\ggml-model-q4_k.gguf ^
 --mmproj D:\ai_models\ggml_llava-v1.5-7b\mmproj-model-f16.gguf ^
 --image C:/Users/Bukra/Desktop/photo_test.jpg ^
 -ngl 99 ^
 --temp 0.2 ^
 -p "USER: <image>\nDescribe this image\nASSISTANT:" ^
 -v

(Result: Correctly generated text description of the image: "Yes, there are many brightly colored buildings in the image, and there are also some people standing in front of them.")

Successful Command (Gemma 3 Text-Only):

D:\llamacpp\llama.cpp\build\bin\Release>.\llama-cli.exe -m D:/ai_models/unsloth-gemma-3-4b-it-gguf/gemma-3-4b-it-Q4_K_M.gguf -ngl 99 -p "Hi" -n 50
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
build: 5061 (916c83bf) with MSVC 19.43.34809.0 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) - 30843 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 444 tensors from D:/ai_models/unsloth-gemma-3-4b-it-gguf/gemma-3-4b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma-3-4B-It
llama_model_loader: - kv   3:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   4:                         general.size_label str              = 3.9B
llama_model_loader: - kv   5:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   6:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv   7:                    gemma3.embedding_length u32              = 2560
llama_model_loader: - kv   8:                         gemma3.block_count u32              = 34
llama_model_loader: - kv   9:                 gemma3.feed_forward_length u32              = 10240
llama_model_loader: - kv  10:                gemma3.attention.head_count u32              = 8
llama_model_loader: - kv  11:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  13:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  14:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  15:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  16:             gemma3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  17:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  18:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,262208]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,262208]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,262208]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 106
llama_model_loader: - kv  26:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  29:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  31:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  205 tensors
llama_model_loader: - type q4_K:  204 tensors
llama_model_loader: - type q6_K:   35 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 2.31 GiB (5.12 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2560
print_info: n_layer          = 34
print_info: n_head           = 8
print_info: n_head_kv        = 4
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: n_swa_pattern    = 6
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 10240
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 4B
print_info: model params     = 3.88 B
print_info: general.name     = Gemma-3-4B-It
print_info: vocab type       = SPM
print_info: n_vocab          = 262208
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 106 '<end_of_turn>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 34 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 35/35 layers to GPU
load_tensors:        CUDA0 model buffer size =  2368.31 MiB
load_tensors:   CPU_Mapped model buffer size =   525.13 MiB
.................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 34, can_shift = 1
init:      CUDA0 KV buffer size =   544.00 MiB
llama_context: KV self size  =  544.00 MiB, K (f16):  272.00 MiB, V (f16):  272.00 MiB
llama_context:      CUDA0 compute buffer size =   517.12 MiB
llama_context:  CUDA_Host compute buffer size =    21.01 MiB
llama_context: graph nodes  = 1435
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model


system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: interactive mode on.
sampler seed: 3434785435
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 50, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

user
Hi
model
Hi there! How can I help you today? 😊

Do you want to:

*   Chat about something?
*   Get some information?
*   Play a game?

(Result: Correctly generated text response.)

Additional Context:

I initially faced similar issues trying to get quantized Gemma 3 vision GGUFs working with Ollama on Windows, as discussed in this Unsloth issue: unslothai/unsloth#2248 (comment). This led me to try llama.cpp directly.

Question:

Is this silent failure with Gemma 3 vision GGUFs using llama-gemma3-cli.exe (or llama-llava-cli.exe) on Windows a known issue? Could it be related to:

  • The specific Unsloth GGUF conversion process for Gemma 3?
  • The compatibility of the mmproj-F32.gguf file with the Q4_K_M base model GGUF?
  • Bugs or limitations in the experimental llama-gemma3-cli.exe on Windows, especially with newer models like Gemma 3?
  • A required, specific prompt format (different from LLaVA's) for Gemma 3 vision that I'm missing?

Any guidance or suggestions for known-good Gemma 3 vision GGUF/mmproj combinations that are confirmed to work with llama.cpp on Windows would be greatly appreciated.
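
For completeness, in case the problem is a prompt-format mismatch on my side: my understanding (an assumption — I haven't verified this against the llama-gemma3-cli source) is that Gemma 3 uses its usual turn markers, with the image referenced via the <start_of_image> token, roughly like:

<start_of_turn>user
<start_of_image>describe this image<end_of_turn>
<start_of_turn>model

If llama-gemma3-cli.exe is expected to build this template automatically from -p and --image, then this is presumably not the cause.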

First Bad Commit

No response

Relevant log output

(Identical to the failed llama-gemma3-cli.exe command and log shown above under "Failed Command".)
