Description
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
version: 5061 (916c83b)
built with MSVC 19.43.34809.0 for x64
Operating systems
Windows
GGML backends
CUDA
Hardware
Ryzen 9950 X3D
Nvidia RTX 5090
Models
Gemma3
https://huggingface.co/google/gemma-3-4b-it-qat-q4_0-gguf/tree/main
https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/tree/main
Problem description & steps to reproduce
Subject: Gemma 3 Vision GGUF Fails (Silent Exit) with llama-gemma3-cli & llama-llava-cli on Windows (Text works, LLaVA works)
Body:
Hi llama.cpp team,
I'm encountering difficulties running Gemma 3 vision with GGUF models on Windows 11, and I'm hoping you can shed some light on it.
Environment:
- OS: Windows 11 Pro
- GPU: NVIDIA GeForce RTX 5090
- CPU: AMD Ryzen 9 9950 X3D
- CUDA Toolkit: 12.8
- Build Tools: Visual Studio 2022 Community (MSVC 19.43.34809.0)
- llama.cpp Build: Commit 916c83bf (build 5061), compiled from source on Windows with MSVC, GGML_CUDA=ON (build commands sketched below)
- Python 3.11
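For completeness, the build used roughly the standard CMake flow from the repository root (nothing beyond GGML_CUDA=ON was changed, so treat this as an approximation of the exact invocation):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release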
Problem Description:
When attempting multimodal inference with a Gemma 3 GGUF model (unsloth/gemma-3-4b-it-GGUF Q4_K_M) and its corresponding mmproj file (mmproj-F32.gguf from the same repo), both llama-gemma3-cli.exe and llama-llava-cli.exe fail to produce any text output.
The process appears to load the LLM and the CLIP projector successfully. Logs show GPU offloading is complete, and the clip_ctx: CLIP using CUDA0 backend message appears. VRAM usage temporarily spikes (e.g., ~2.5GB idle to ~6.8GB) during this loading phase, but then immediately drops back down. The application then either hangs indefinitely or exits silently without generating any response tokens or printing any error messages after the CLIP context line.
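In case it helps with triage, the same run can be redirected to a file so that anything printed just before the silent exit is captured (standard cmd redirection; the angle-bracket paths are placeholders for the full paths used in the failing command below):
.\llama-gemma3-cli.exe -m <model.gguf> --mmproj <mmproj.gguf> --image <image.jpg> -p "describe this image" -ngl 100 -v > gemma3_run.log 2>&1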
Working Scenarios (Confirmation of Setup):
- LLaVA 1.5 Vision Works: Using llama-llava-cli.exe with a known-good LLaVA 1.5 7B GGUF model and its corresponding mmproj-model-f16.gguf works perfectly on the same system. It loads, processes the image, and generates the expected text description. (See successful command/log snippet below.)
- Gemma 3 Text-Only Works: Using the standard llama-cli.exe with the exact same gemma-3-4b-it-Q4_K_M.gguf model file (but without specifying --mmproj or --image) works correctly for text-only generation. It loads the model to the GPU and produces text output based on the prompt. I tested both the unsloth and Google's q4 versions with F32 and BF16 mmproj files.
These successful tests indicate that the base llama.cpp CUDA build, GPU interaction (RTX 5090 / CC 12.0 detection seems okay), and the core functionality for other multimodal models (LLaVA) are working correctly on my Windows setup. The issue seems highly specific to the Gemma 3 multimodal inference pathway.
Failed Command (Gemma 3 Vision - Tried with llama-gemma3-cli.exe):
D:\llamacpp\llama.cpp\build\bin\Release>.\llama-gemma3-cli.exe -m D:\ai_models\unsloth-gemma-3-4b-it-gguf\gemma-3-4b-it-Q4_K_M.gguf --mmproj D:\ai_models\unsloth-gemma-3-4b-it-gguf\mmproj-BF16.gguf --image C:/Users/Bukra/Desktop/photo_test.jpg -p "describe this image" --temp 0.2 -ngl 100 -v
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
build: 5061 (916c83bf) with MSVC 19.43.34809.0 for x64
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) - 30843 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 444 tensors from D:\ai_models\unsloth-gemma-3-4b-it-gguf\gemma-3-4b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-4B-It
llama_model_loader: - kv 3: general.quantized_by str = Unsloth
llama_model_loader: - kv 4: general.size_label str = 3.9B
llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 6: gemma3.context_length u32 = 131072
llama_model_loader: - kv 7: gemma3.embedding_length u32 = 2560
llama_model_loader: - kv 8: gemma3.block_count u32 = 34
llama_model_loader: - kv 9: gemma3.feed_forward_length u32 = 10240
llama_model_loader: - kv 10: gemma3.attention.head_count u32 = 8
llama_model_loader: - kv 11: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 13: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 14: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 16: gemma3.attention.head_count_kv u32 = 4
llama_model_loader: - kv 17: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 18: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 19: tokenizer.ggml.model str = llama
llama_model_loader: - kv 20: tokenizer.ggml.pre str = default
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 26: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 31: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: general.file_type u32 = 15
llama_model_loader: - type f32: 205 tensors
llama_model_loader: - type q4_K: 204 tensors
llama_model_loader: - type q6_K: 35 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 2.31 GiB (5.12 BPW)
init_tokenizer: initializing tokenizer for type 1
load: control token: 1 '<eos>' is not marked as EOG
load: control token: 0 '<pad>' is not marked as EOG
load: control token: 46 '<unused40>' is not marked as EOG
load: control token: 3 '<unk>' is not marked as EOG
load: control token: 54 '<unused48>' is not marked as EOG
load: control token: 23 '<unused17>' is not marked as EOG
load: control token: 261532 '<unused5630>' is not marked as EOG
load: control token: 2 '<bos>' is not marked as EOG
load: control token: 30 '<unused24>' is not marked as EOG
load: control token: 260607 '<unused4705>' is not marked as EOG
load: control token: 74 '<unused68>' is not marked as EOG
load: control token: 259984 '<unused4082>' is not marked as EOG
load: control token: 259158 '<unused3256>' is not marked as EOG
load: control token: 57 '<unused51>' is not marked as EOG
load: control token: 4 '<mask>' is not marked as EOG
load: control token: 43 '<unused37>' is not marked as EOG
load: control token: 34 '<unused28>' is not marked as EOG
load: control token: 6 '<unused0>' is not marked as EOG
load: control token: 36 '<unused30>' is not marked as EOG
load: control token: 258673 '<unused2771>' is not marked as EOG
load: control token: 13 '<unused7>' is not marked as EOG
load: control token: 26 '<unused20>' is not marked as EOG
load: control token: 14 '<unused8>' is not marked as EOG
load: control token: 27 '<unused21>' is not marked as EOG
load: control token: 15 '<unused9>' is not marked as EOG
load: control token: 16 '<unused10>' is not marked as EOG
load: control token: 17 '<unused11>' is not marked as EOG
load: control token: 18 '<unused12>' is not marked as EOG
load: control token: 260041 '<unused4139>' is not marked as EOG
load: control token: 19 '<unused13>' is not marked as EOG
load: control token: 20 '<unused14>' is not marked as EOG
load: control token: 21 '<unused15>' is not marked as EOG
load: control token: 55 '<unused49>' is not marked as EOG
load: control token: 22 '<unused16>' is not marked as EOG
load: control token: 53 '<unused47>' is not marked as EOG
load: control token: 24 '<unused18>' is not marked as EOG
load: control token: 52 '<unused46>' is not marked as EOG
load: control token: 25 '<unused19>' is not marked as EOG
load: control token: 28 '<unused22>' is not marked as EOG
load: control token: 29 '<unused23>' is not marked as EOG
load: control token: 260701 '<unused4799>' is not marked as EOG
load: control token: 31 '<unused25>' is not marked as EOG
load: control token: 45 '<unused39>' is not marked as EOG
load: control token: 32 '<unused26>' is not marked as EOG
load: control token: 44 '<unused38>' is not marked as EOG
load: control token: 33 '<unused27>' is not marked as EOG
load: control token: 47 '<unused41>' is not marked as EOG
load: control token: 260528 '<unused4626>' is not marked as EOG
load: control token: 48 '<unused42>' is not marked as EOG
load: control token: 49 '<unused43>' is not marked as EOG
load: control token: 260522 '<unused4620>' is not marked as EOG
load: control token: 260887 '<unused4985>' is not marked as EOG
load: control token: 50 '<unused44>' is not marked as EOG
load: control token: 51 '<unused45>' is not marked as EOG
load: control token: 56 '<unused50>' is not marked as EOG
load: control token: 259159 '<unused3257>' is not marked as EOG
load: control token: 258653 '<unused2751>' is not marked as EOG
load: control token: 58 '<unused52>' is not marked as EOG
load: control token: 59 '<unused53>' is not marked as EOG
load: control token: 60 '<unused54>' is not marked as EOG
load: control token: 61 '<unused55>' is not marked as EOG
load: control token: 259987 '<unused4085>' is not marked as EOG
load: control token: 62 '<unused56>' is not marked as EOG
load: control token: 257650 '<unused1748>' is not marked as EOG
load: control token: 63 '<unused57>' is not marked as EOG
load: control token: 64 '<unused58>' is not marked as EOG
load: control token: 260792 '<unused4890>' is not marked as EOG
load: control token: 65 '<unused59>' is not marked as EOG
load: control token: 66 '<unused60>' is not marked as EOG
load: control token: 259074 '<unused3172>' is not marked as EOG
load: control token: 67 '<unused61>' is not marked as EOG
load: control token: 68 '<unused62>' is not marked as EOG
load: control token: 69 '<unused63>' is not marked as EOG
load: control token: 87 '<unused81>' is not marked as EOG
load: control token: 257138 '<unused1236>' is not marked as EOG
load: control token: 88 '<unused82>' is not marked as EOG
load: control token: 257139 '<unused1237>' is not marked as EOG
load: control token: 89 '<unused83>' is not marked as EOG
load: control token: 257132 '<unused1230>' is not marked as EOG
load: control token: 90 '<unused84>' is not marked as EOG
load: control token: 257989 '<unused2087>' is not marked as EOG
load: control token: 257133 '<unused1231>' is not marked as EOG
load: control token: 91 '<unused85>' is not marked as EOG
load: control token: 257134 '<unused1232>' is not marked as EOG
load: control token: 92 '<unused86>' is not marked as EOG
load: control token: 256220 '<unused318>' is not marked as EOG
load: control token: 257135 '<unused1233>' is not marked as EOG
load: control token: 93 '<unused87>' is not marked as EOG
load: control token: 94 '<unused88>' is not marked as EOG
load: control token: 256218 '<unused316>' is not marked as EOG
load: control token: 95 '<unused89>' is not marked as EOG
load: control token: 257126 '<unused1224>' is not marked as EOG
load: control token: 96 '<unused90>' is not marked as EOG
load: control token: 257127 '<unused1225>' is not marked as EOG
load: control token: 97 '<unused91>' is not marked as EOG
load: control token: 257128 '<unused1226>' is not marked as EOG
load: control token: 98 '<unused92>' is not marked as EOG
load: control token: 257129 '<unused1227>' is not marked as EOG
load: control token: 99 '<unused93>' is not marked as EOG
load: control token: 257122 '<unused1220>' is not marked as EOG
load: control token: 100 '<unused94>' is not marked as EOG
load: control token: 257123 '<unused1221>' is not marked as EOG
load: control token: 101 '<unused95>' is not marked as EOG
load: control token: 257124 '<unused1222>' is not marked as EOG
load: control token: 102 '<unused96>' is not marked as EOG
load: control token: 257125 '<unused1223>' is not marked as EOG
load: control token: 103 '<unused97>' is not marked as EOG
load: control token: 104 '<unused98>' is not marked as EOG
load: control token: 105 '<start_of_turn>' is not marked as EOG
load: control token: 262128 '<unused6226>' is not marked as EOG
load: control token: 261646 '<unused5744>' is not marked as EOG
load: control token: 257704 '<unused1802>' is not marked as EOG
load: control token: 257599 '<unused1697>' is not marked as EOG
load: control token: 260246 '<unused4344>' is not marked as EOG
load: control token: 262111 '<unused6209>' is not marked as EOG
load: control token: 260590 '<unused4688>' is not marked as EOG
load: control token: 260591 '<unused4689>' is not marked as EOG
load: control token: 260213 '<unused4311>' is not marked as EOG
load: control token: 257971 '<unused2069>' is not marked as EOG
load: control token: 258690 '<unused2788>' is not marked as EOG
load: control token: 260860 '<unused4958>' is not marked as EOG
load: control token: 256429 '<unused527>' is not marked as EOG
load: control token: 258920 '<unused3018>' is not marked as EOG
load: control token: 262132 '<unused6230>' is not marked as EOG
load: control token: 256183 '<unused281>' is not marked as EOG
load: control token: 260490 '<unused4588>' is not marked as EOG
load: control token: 262130 '<unused6228>' is not marked as EOG
load: control token: 262131 '<unused6229>' is not marked as EOG
load: control token: 262133 '<unused6231>' is not marked as EOG
load: control token: 262134 '<unused6232>' is not marked as EOG
load: control token: 262139 '<unused6237>' is not marked as EOG
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2560
print_info: n_layer = 34
print_info: n_head = 8
print_info: n_head_kv = 4
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: n_swa_pattern = 6
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 10240
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 4B
print_info: model params = 3.88 B
print_info: general.name = Gemma-3-4B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device CUDA0, is_swa = 1
load_tensors: layer 1 assigned to device CUDA0, is_swa = 1
load_tensors: layer 2 assigned to device CUDA0, is_swa = 1
load_tensors: layer 3 assigned to device CUDA0, is_swa = 1
load_tensors: layer 4 assigned to device CUDA0, is_swa = 1
load_tensors: layer 5 assigned to device CUDA0, is_swa = 0
load_tensors: layer 6 assigned to device CUDA0, is_swa = 1
load_tensors: layer 7 assigned to device CUDA0, is_swa = 1
load_tensors: layer 8 assigned to device CUDA0, is_swa = 1
load_tensors: layer 9 assigned to device CUDA0, is_swa = 1
load_tensors: layer 10 assigned to device CUDA0, is_swa = 1
load_tensors: layer 11 assigned to device CUDA0, is_swa = 0
load_tensors: layer 12 assigned to device CUDA0, is_swa = 1
load_tensors: layer 13 assigned to device CUDA0, is_swa = 1
load_tensors: layer 14 assigned to device CUDA0, is_swa = 1
load_tensors: layer 15 assigned to device CUDA0, is_swa = 1
load_tensors: layer 16 assigned to device CUDA0, is_swa = 1
load_tensors: layer 17 assigned to device CUDA0, is_swa = 0
load_tensors: layer 18 assigned to device CUDA0, is_swa = 1
load_tensors: layer 19 assigned to device CUDA0, is_swa = 1
load_tensors: layer 20 assigned to device CUDA0, is_swa = 1
load_tensors: layer 21 assigned to device CUDA0, is_swa = 1
load_tensors: layer 22 assigned to device CUDA0, is_swa = 1
load_tensors: layer 23 assigned to device CUDA0, is_swa = 0
load_tensors: layer 24 assigned to device CUDA0, is_swa = 1
load_tensors: layer 25 assigned to device CUDA0, is_swa = 1
load_tensors: layer 26 assigned to device CUDA0, is_swa = 1
load_tensors: layer 27 assigned to device CUDA0, is_swa = 1
load_tensors: layer 28 assigned to device CUDA0, is_swa = 1
load_tensors: layer 29 assigned to device CUDA0, is_swa = 0
load_tensors: layer 30 assigned to device CUDA0, is_swa = 1
load_tensors: layer 31 assigned to device CUDA0, is_swa = 1
load_tensors: layer 32 assigned to device CUDA0, is_swa = 1
load_tensors: layer 33 assigned to device CUDA0, is_swa = 1
load_tensors: layer 34 assigned to device CUDA0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 34 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 35/35 layers to GPU
load_tensors: CUDA0 model buffer size = 2368.31 MiB
load_tensors: CPU_Mapped model buffer size = 525.13 MiB
.................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 1.00 MiB
llama_context: n_ctx = 4096
llama_context: n_ctx = 4096 (padded)
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 34, can_shift = 1
init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: CUDA0 KV buffer size = 544.00 MiB
llama_context: KV self size = 544.00 MiB, K (f16): 272.00 MiB, V (f16): 272.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: CUDA0 compute buffer size = 517.12 MiB
llama_context: CUDA_Host compute buffer size = 21.01 MiB
llama_context: graph nodes = 1435
llama_context: graph splits = 2
clear_adapter_lora: call
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
clip_ctx: CLIP using CUDA0 backend
D:\llamacpp\llama.cpp\build\bin\Release>
(Result: Loads model & CLIP, VRAM spikes & drops, hangs/exits silently after the clip_ctx log. No text output.)
(Note: Using llama-llava-cli.exe with the Gemma 3 files yields the same silent exit behavior after CLIP loading.)
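(Diagnostic idea: checking the exit code right after the silent return should tell a crash apart from a clean exit; this is plain cmd, and a large negative value such as -1073741819 would point to an access violation.)
echo %ERRORLEVEL%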
Successful Command (LLaVA Vision):
.\llama-llava-cli.exe ^
-m D:\ai_models\ggml_llava-v1.5-7b\ggml-model-q4_k.gguf ^
--mmproj D:\ai_models\ggml_llava-v1.5-7b\mmproj-model-f16.gguf ^
--image C:/Users/Bukra/Desktop/photo_test.jpg ^
-ngl 99 ^
--temp 0.2 ^
-p "USER: <image>\nDescribe this image\nASSISTANT:" ^
-v
(Result: Correctly generated text description of the image: "Yes, there are many brightly colored buildings in the image, and there are also some people standing in front of them.")
Successful Command (Gemma 3 Text-Only):
D:\llamacpp\llama.cpp\build\bin\Release>.\llama-cli.exe -m D:/ai_models/unsloth-gemma-3-4b-it-gguf/gemma-3-4b-it-Q4_K_M.gguf -ngl 99 -p "Hi" -n 50
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
build: 5061 (916c83bf) with MSVC 19.43.34809.0 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) - 30843 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 444 tensors from D:/ai_models/unsloth-gemma-3-4b-it-gguf/gemma-3-4b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-4B-It
llama_model_loader: - kv 3: general.quantized_by str = Unsloth
llama_model_loader: - kv 4: general.size_label str = 3.9B
llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 6: gemma3.context_length u32 = 131072
llama_model_loader: - kv 7: gemma3.embedding_length u32 = 2560
llama_model_loader: - kv 8: gemma3.block_count u32 = 34
llama_model_loader: - kv 9: gemma3.feed_forward_length u32 = 10240
llama_model_loader: - kv 10: gemma3.attention.head_count u32 = 8
llama_model_loader: - kv 11: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 13: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 14: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 16: gemma3.attention.head_count_kv u32 = 4
llama_model_loader: - kv 17: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 18: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 19: tokenizer.ggml.model str = llama
llama_model_loader: - kv 20: tokenizer.ggml.pre str = default
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 26: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 31: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: general.file_type u32 = 15
llama_model_loader: - type f32: 205 tensors
llama_model_loader: - type q4_K: 204 tensors
llama_model_loader: - type q6_K: 35 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 2.31 GiB (5.12 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2560
print_info: n_layer = 34
print_info: n_head = 8
print_info: n_head_kv = 4
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: n_swa_pattern = 6
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 10240
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 4B
print_info: model params = 3.88 B
print_info: general.name = Gemma-3-4B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 34 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 35/35 layers to GPU
load_tensors: CUDA0 model buffer size = 2368.31 MiB
load_tensors: CPU_Mapped model buffer size = 525.13 MiB
.................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 1.00 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 34, can_shift = 1
init: CUDA0 KV buffer size = 544.00 MiB
llama_context: KV self size = 544.00 MiB, K (f16): 272.00 MiB, V (f16): 272.00 MiB
llama_context: CUDA0 compute buffer size = 517.12 MiB
llama_context: CUDA_Host compute buffer size = 21.01 MiB
llama_context: graph nodes = 1435
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 16
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 3434785435
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 50, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
user
Hi
model
Hi there! How can I help you today? 😊
Do you want to:
* Chat about something?
* Get some information?
* Play a game?
(Result: Correctly generated text response.)
Additional Context:
I initially faced similar issues trying to get quantized Gemma 3 vision GGUFs working with Ollama on Windows, as discussed in this Unsloth issue: unslothai/unsloth#2248 (comment). This led me to try llama.cpp directly.
Question:
Is this silent failure with Gemma 3 vision GGUFs using llama-gemma3-cli.exe (or llama-llava-cli.exe) on Windows a known issue? Could it be related to:
- The specific Unsloth GGUF conversion process for Gemma 3?
- The compatibility of the mmproj-F32.gguf file with the Q4_K_M base model GGUF? (A quick metadata check is sketched below.)
- Bugs or limitations in the experimental llama-gemma3-cli.exe on Windows, especially with newer models like Gemma 3?
- A required, specific prompt format (different from LLaVA's) for Gemma 3 vision that I'm missing?
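(As a sanity check on the mmproj question above, dumping the file's GGUF metadata should show whether it is really a Gemma 3 vision projector. This assumes the gguf Python package from llama.cpp's gguf-py is installed; the exact key names it prints may differ.)
pip install gguf
gguf-dump D:\ai_models\unsloth-gemma-3-4b-it-gguf\mmproj-F32.gguf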
Any guidance or suggestions for known-good Gemma 3 vision GGUF/mmproj combinations that are confirmed to work with llama.cpp on Windows would be greatly appreciated.
First Bad Commit
No response
Relevant log output
D:\llamacpp\llama.cpp\build\bin\Release>.\llama-gemma3-cli.exe -m D:\ai_models\unsloth-gemma-3-4b-it-gguf\gemma-3-4b-it-Q4_K_M.gguf --mmproj D:\ai_models\unsloth-gemma-3-4b-it-gguf\mmproj-BF16.gguf --image C:/Users/Bukra/Desktop/photo_test.jpg -p "describe this image" --temp 0.2 -ngl 100 -v
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
build: 5061 (916c83bf) with MSVC 19.43.34809.0 for x64
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) - 30843 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 444 tensors from D:\ai_models\unsloth-gemma-3-4b-it-gguf\gemma-3-4b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-4B-It
llama_model_loader: - kv 3: general.quantized_by str = Unsloth
llama_model_loader: - kv 4: general.size_label str = 3.9B
llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 6: gemma3.context_length u32 = 131072
llama_model_loader: - kv 7: gemma3.embedding_length u32 = 2560
llama_model_loader: - kv 8: gemma3.block_count u32 = 34
llama_model_loader: - kv 9: gemma3.feed_forward_length u32 = 10240
llama_model_loader: - kv 10: gemma3.attention.head_count u32 = 8
llama_model_loader: - kv 11: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 13: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 14: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 16: gemma3.attention.head_count_kv u32 = 4
llama_model_loader: - kv 17: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 18: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 19: tokenizer.ggml.model str = llama
llama_model_loader: - kv 20: tokenizer.ggml.pre str = default
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 26: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 31: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: general.file_type u32 = 15
llama_model_loader: - type f32: 205 tensors
llama_model_loader: - type q4_K: 204 tensors
llama_model_loader: - type q6_K: 35 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 2.31 GiB (5.12 BPW)
init_tokenizer: initializing tokenizer for type 1
load: control token: 1 '<eos>' is not marked as EOG
load: control token: 0 '<pad>' is not marked as EOG
load: control token: 46 '<unused40>' is not marked as EOG
load: control token: 3 '<unk>' is not marked as EOG
load: control token: 54 '<unused48>' is not marked as EOG
load: control token: 23 '<unused17>' is not marked as EOG
load: control token: 261532 '<unused5630>' is not marked as EOG
load: control token: 2 '<bos>' is not marked as EOG
load: control token: 30 '<unused24>' is not marked as EOG
load: control token: 260607 '<unused4705>' is not marked as EOG
load: control token: 74 '<unused68>' is not marked as EOG
load: control token: 259984 '<unused4082>' is not marked as EOG
load: control token: 259158 '<unused3256>' is not marked as EOG
load: control token: 57 '<unused51>' is not marked as EOG
load: control token: 4 '<mask>' is not marked as EOG
load: control token: 43 '<unused37>' is not marked as EOG
load: control token: 34 '<unused28>' is not marked as EOG
load: control token: 6 '<unused0>' is not marked as EOG
load: control token: 36 '<unused30>' is not marked as EOG
load: control token: 258673 '<unused2771>' is not marked as EOG
load: control token: 13 '<unused7>' is not marked as EOG
load: control token: 26 '<unused20>' is not marked as EOG
load: control token: 14 '<unused8>' is not marked as EOG
load: control token: 27 '<unused21>' is not marked as EOG
load: control token: 15 '<unused9>' is not marked as EOG
load: control token: 16 '<unused10>' is not marked as EOG
load: control token: 17 '<unused11>' is not marked as EOG
load: control token: 18 '<unused12>' is not marked as EOG
load: control token: 260041 '<unused4139>' is not marked as EOG
load: control token: 19 '<unused13>' is not marked as EOG
load: control token: 20 '<unused14>' is not marked as EOG
load: control token: 21 '<unused15>' is not marked as EOG
load: control token: 55 '<unused49>' is not marked as EOG
load: control token: 22 '<unused16>' is not marked as EOG
load: control token: 53 '<unused47>' is not marked as EOG
load: control token: 24 '<unused18>' is not marked as EOG
load: control token: 52 '<unused46>' is not marked as EOG
load: control token: 25 '<unused19>' is not marked as EOG
load: control token: 28 '<unused22>' is not marked as EOG
load: control token: 29 '<unused23>' is not marked as EOG
load: control token: 260701 '<unused4799>' is not marked as EOG
load: control token: 31 '<unused25>' is not marked as EOG
load: control token: 45 '<unused39>' is not marked as EOG
load: control token: 32 '<unused26>' is not marked as EOG
load: control token: 44 '<unused38>' is not marked as EOG
load: control token: 33 '<unused27>' is not marked as EOG
load: control token: 47 '<unused41>' is not marked as EOG
load: control token: 260528 '<unused4626>' is not marked as EOG
load: control token: 48 '<unused42>' is not marked as EOG
load: control token: 49 '<unused43>' is not marked as EOG
load: control token: 260522 '<unused4620>' is not marked as EOG
load: control token: 260887 '<unused4985>' is not marked as EOG
load: control token: 50 '<unused44>' is not marked as EOG
load: control token: 51 '<unused45>' is not marked as EOG
load: control token: 56 '<unused50>' is not marked as EOG
load: control token: 259159 '<unused3257>' is not marked as EOG
load: control token: 258653 '<unused2751>' is not marked as EOG
load: control token: 58 '<unused52>' is not marked as EOG
load: control token: 59 '<unused53>' is not marked as EOG
load: control token: 60 '<unused54>' is not marked as EOG
load: control token: 61 '<unused55>' is not marked as EOG
load: control token: 259987 '<unused4085>' is not marked as EOG
load: control token: 62 '<unused56>' is not marked as EOG
load: control token: 257650 '<unused1748>' is not marked as EOG
load: control token: 63 '<unused57>' is not marked as EOG
load: control token: 64 '<unused58>' is not marked as EOG
load: control token: 260792 '<unused4890>' is not marked as EOG
load: control token: 65 '<unused59>' is not marked as EOG
load: control token: 66 '<unused60>' is not marked as EOG
load: control token: 259074 '<unused3172>' is not marked as EOG
load: control token: 67 '<unused61>' is not marked as EOG
load: control token: 68 '<unused62>' is not marked as EOG
load: control token: 69 '<unused63>' is not marked as EOG
load: control token: 87 '<unused81>' is not marked as EOG
load: control token: 257138 '<unused1236>' is not marked as EOG
load: control token: 88 '<unused82>' is not marked as EOG
load: control token: 257139 '<unused1237>' is not marked as EOG
load: control token: 89 '<unused83>' is not marked as EOG
load: control token: 257132 '<unused1230>' is not marked as EOG
load: control token: 90 '<unused84>' is not marked as EOG
load: control token: 257989 '<unused2087>' is not marked as EOG
load: control token: 257133 '<unused1231>' is not marked as EOG
load: control token: 91 '<unused85>' is not marked as EOG
load: control token: 257134 '<unused1232>' is not marked as EOG
load: control token: 92 '<unused86>' is not marked as EOG
load: control token: 256220 '<unused318>' is not marked as EOG
load: control token: 257135 '<unused1233>' is not marked as EOG
load: control token: 93 '<unused87>' is not marked as EOG
load: control token: 94 '<unused88>' is not marked as EOG
load: control token: 256218 '<unused316>' is not marked as EOG
load: control token: 95 '<unused89>' is not marked as EOG
load: control token: 257126 '<unused1224>' is not marked as EOG
load: control token: 96 '<unused90>' is not marked as EOG
load: control token: 257127 '<unused1225>' is not marked as EOG
load: control token: 97 '<unused91>' is not marked as EOG
load: control token: 257128 '<unused1226>' is not marked as EOG
load: control token: 98 '<unused92>' is not marked as EOG
load: control token: 257129 '<unused1227>' is not marked as EOG
load: control token: 99 '<unused93>' is not marked as EOG
load: control token: 257122 '<unused1220>' is not marked as EOG
load: control token: 100 '<unused94>' is not marked as EOG
load: control token: 257123 '<unused1221>' is not marked as EOG
load: control token: 101 '<unused95>' is not marked as EOG
load: control token: 257124 '<unused1222>' is not marked as EOG
load: control token: 102 '<unused96>' is not marked as EOG
load: control token: 257125 '<unused1223>' is not marked as EOG
load: control token: 103 '<unused97>' is not marked as EOG
load: control token: 104 '<unused98>' is not marked as EOG
load: control token: 105 '<start_of_turn>' is not marked as EOG
load: control token: 262128 '<unused6226>' is not marked as EOG
load: control token: 261646 '<unused5744>' is not marked as EOG
load: control token: 257704 '<unused1802>' is not marked as EOG
load: control token: 257599 '<unused1697>' is not marked as EOG
load: control token: 260246 '<unused4344>' is not marked as EOG
load: control token: 262111 '<unused6209>' is not marked as EOG
load: control token: 260590 '<unused4688>' is not marked as EOG
load: control token: 260591 '<unused4689>' is not marked as EOG
load: control token: 260213 '<unused4311>' is not marked as EOG
load: control token: 257971 '<unused2069>' is not marked as EOG
load: control token: 258690 '<unused2788>' is not marked as EOG
load: control token: 260860 '<unused4958>' is not marked as EOG
load: control token: 256429 '<unused527>' is not marked as EOG
load: control token: 258920 '<unused3018>' is not marked as EOG
load: control token: 262132 '<unused6230>' is not marked as EOG
load: control token: 256183 '<unused281>' is not marked as EOG
load: control token: 260490 '<unused4588>' is not marked as EOG
load: control token: 262130 '<unused6228>' is not marked as EOG
load: control token: 262131 '<unused6229>' is not marked as EOG
load: control token: 262133 '<unused6231>' is not marked as EOG
load: control token: 262134 '<unused6232>' is not marked as EOG
load: control token: 262139 '<unused6237>' is not marked as EOG
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2560
print_info: n_layer = 34
print_info: n_head = 8
print_info: n_head_kv = 4
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: n_swa_pattern = 6
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 10240
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 4B
print_info: model params = 3.88 B
print_info: general.name = Gemma-3-4B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device CUDA0, is_swa = 1
load_tensors: layer 1 assigned to device CUDA0, is_swa = 1
load_tensors: layer 2 assigned to device CUDA0, is_swa = 1
load_tensors: layer 3 assigned to device CUDA0, is_swa = 1
load_tensors: layer 4 assigned to device CUDA0, is_swa = 1
load_tensors: layer 5 assigned to device CUDA0, is_swa = 0
load_tensors: layer 6 assigned to device CUDA0, is_swa = 1
load_tensors: layer 7 assigned to device CUDA0, is_swa = 1
load_tensors: layer 8 assigned to device CUDA0, is_swa = 1
load_tensors: layer 9 assigned to device CUDA0, is_swa = 1
load_tensors: layer 10 assigned to device CUDA0, is_swa = 1
load_tensors: layer 11 assigned to device CUDA0, is_swa = 0
load_tensors: layer 12 assigned to device CUDA0, is_swa = 1
load_tensors: layer 13 assigned to device CUDA0, is_swa = 1
load_tensors: layer 14 assigned to device CUDA0, is_swa = 1
load_tensors: layer 15 assigned to device CUDA0, is_swa = 1
load_tensors: layer 16 assigned to device CUDA0, is_swa = 1
load_tensors: layer 17 assigned to device CUDA0, is_swa = 0
load_tensors: layer 18 assigned to device CUDA0, is_swa = 1
load_tensors: layer 19 assigned to device CUDA0, is_swa = 1
load_tensors: layer 20 assigned to device CUDA0, is_swa = 1
load_tensors: layer 21 assigned to device CUDA0, is_swa = 1
load_tensors: layer 22 assigned to device CUDA0, is_swa = 1
load_tensors: layer 23 assigned to device CUDA0, is_swa = 0
load_tensors: layer 24 assigned to device CUDA0, is_swa = 1
load_tensors: layer 25 assigned to device CUDA0, is_swa = 1
load_tensors: layer 26 assigned to device CUDA0, is_swa = 1
load_tensors: layer 27 assigned to device CUDA0, is_swa = 1
load_tensors: layer 28 assigned to device CUDA0, is_swa = 1
load_tensors: layer 29 assigned to device CUDA0, is_swa = 0
load_tensors: layer 30 assigned to device CUDA0, is_swa = 1
load_tensors: layer 31 assigned to device CUDA0, is_swa = 1
load_tensors: layer 32 assigned to device CUDA0, is_swa = 1
load_tensors: layer 33 assigned to device CUDA0, is_swa = 1
load_tensors: layer 34 assigned to device CUDA0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q6_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 34 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 35/35 layers to GPU
load_tensors: CUDA0 model buffer size = 2368.31 MiB
load_tensors: CPU_Mapped model buffer size = 525.13 MiB
.................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 1.00 MiB
llama_context: n_ctx = 4096
llama_context: n_ctx = 4096 (padded)
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 34, can_shift = 1
init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 24: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 25: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 26: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 27: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 28: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 29: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 30: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 31: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 32: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: layer 33: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CUDA0
init: CUDA0 KV buffer size = 544.00 MiB
llama_context: KV self size = 544.00 MiB, K (f16): 272.00 MiB, V (f16): 272.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: CUDA0 compute buffer size = 517.12 MiB
llama_context: CUDA_Host compute buffer size = 21.01 MiB
llama_context: graph nodes = 1435
llama_context: graph splits = 2
clear_adapter_lora: call
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
clip_ctx: CLIP using CUDA0 backend
D:\llamacpp\llama.cpp\build\bin\Release>