Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from E:\neuro\LLM-server\llamacpp\new\ggml-cuda.dll
load_backend: loaded RPC backend from E:\neuro\LLM-server\llamacpp\new\ggml-rpc.dll
load_backend: loaded CPU backend from E:\neuro\LLM-server\llamacpp\new\ggml-cpu-alderlake.dll
version: 5527 (763d06e)
built with clang version 18.1.8 for x86_64-pc-windows-msvc
Operating systems
Windows
GGML backends
CUDA
Hardware
i9-13900F
RTX 4090 + RTX 3090
Models
Mistral-Small-3.1-24B-Instruct-2503-UD-Q6_K_XL
Problem description & steps to reproduce
An error occurs when sending a request with an image to the OpenAI-compatible API endpoint:
CUDA error: an illegal memory access was encountered
current device: 1, in function ggml_backend_cuda_synchronize at C:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2461
cudaStreamSynchronize(cuda_ctx->stream())
C:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error
The same request works correctly without an image.
The problem does not occur with llama-b5504-bin-win-cuda-12.4-x64.
The bug occurs only with Mistral models; Qwen and Gemma work correctly. A minimal request that reproduces the crash is sketched below.
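
To reproduce against the server configured in the log below (a hedged sketch; the port and endpoint follow the launch command, and the base64 payload can be any small PNG, e.g. the one in the captured request further down):

curl http://localhost:5000/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d @request.json

where request.json contains:

{
  "stream": true,
  "model": "local",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe image" },
        { "type": "image_url", "image_url": { "url": "data:image/png;base64,<any small PNG>" } }
      ]
    }
  ]
}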
First Bad Commit
The issue started appearing in llama-b5505-bin-win-cuda-12.4-x64.
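
Since llama.cpp tags every master build, the offending change should be listable directly from the tag range (a sketch, assuming the b5504 and b5505 release tags are fetched in a local clone):

REM list the commit(s) introduced between the last good and first bad builds
git log b5504..b5505 --oneline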
Relevant log output
set LLAMA_ARG_MODEL=e:\neuro\LLM-server\models\mistral-small-3.1-24b-instruct-2503\Mistral-Small-3.1-24B-Instruct-2503-UD-Q5_K_XL.gguf
set LLAMA_ARG_MMPROJ=e:\neuro\LLM-server\models\mistral-small-3.1-24b-instruct-2503\mmproj-F16.gguf
set LLAMA_ARG_DEVICE=CUDA0,CUDA1
set LLAMA_ARG_N_GPU_LAYERS=99
set LLAMA_ARG_TENSOR_SPLIT=17,24
set LLAMA_ARG_MODEL_DRAFT=
set LLAMA_ARG_N_GPU_LAYERS_DRAFT=99
set CONTEXT_SIZE=131072
set CACHE_TYPE=f16
set LLAMA_ARG_CTX_SIZE=131072
set LLAMA_ARG_CACHE_TYPE_K=f16
set LLAMA_ARG_CACHE_TYPE_V=f16
set LLAMA_ARG_CTX_SIZE_DRAFT=131072
set LLAMA_ARG_DRAFT_MAX=16
set LLAMA_ARG_SPLIT_MODE=layer
llama-server.exe --host 0.0.0.0 --port 5000 --n-predict -1 --keep -1 --threads 30 --no-webui --flash-attn --no-mmap -v --temp 0.2 --top-k 64 --top-p 0.7 --min-p 0.01 --jinja --device-draft CUDA0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from E:\neuro\LLM-server\llamacpp\new\ggml-cuda.dll
load_backend: loaded RPC backend from E:\neuro\LLM-server\llamacpp\new\ggml-rpc.dll
load_backend: loaded CPU backend from E:\neuro\LLM-server\llamacpp\new\ggml-cpu-alderlake.dll
build: 5527 (763d06ed) with clang version 18.1.8 for x86_64-pc-windows-msvc
system info: n_threads = 30, n_threads_batch = 30, total_threads = 32
system_info: n_threads = 30 (n_threads_batch = 30) / 32 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 5000, http threads: 31
main: loading model
srv load_model: loading model 'e:\LLM-server\models\mistral-small-3.1-24b-instruct-2503\Mistral-Small-3.1-24B-Instruct-2503-UD-Q5_K_XL.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 363 tensors from e:\neuro\LLM-server\models\mistral-small-3.1-24b-instruct-2503\Mistral-Small-3.1-24B-Instruct-2503-UD-Q5_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Mistral-Small-3.1-24B-Instruct-2503
llama_model_loader: - kv 3: general.version str = 2503
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Mistral-Small-3.1-24B-Instruct-2503
llama_model_loader: - kv 6: general.quantized_by str = Unsloth
llama_model_loader: - kv 7: general.size_label str = 24B
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: llama.block_count u32 = 40
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 5120
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 32768
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 1000000000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: llama.attention.key_length u32 = 128
llama_model_loader: - kv 18: llama.attention.value_length u32 = 128
llama_model_loader: - kv 19: llama.vocab_size u32 = 131072
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = tekken
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 11
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- set today = strftime_now("%Y-%m-%...
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 17
llama_model_loader: - kv 36: quantize.imatrix.file str = Mistral-Small-3.1-24B-Instruct-2503-G...
llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_Mistral-Small-3.1...
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 280
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 576
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 1 tensors
llama_model_loader: - type q4_K: 20 tensors
llama_model_loader: - type q5_K: 184 tensors
llama_model_loader: - type q6_K: 77 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q5_K - Medium
print_info: file size = 15.61 GiB (5.69 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 475 '<SPECIAL_475>' is not marked as EOG
load: control token: 787 '<SPECIAL_787>' is not marked as EOG
load: control token: 59 '<SPECIAL_59>' is not marked as EOG
.......
.......
load: control token: 992 '<SPECIAL_992>' is not marked as EOG
load: control token: 993 '<SPECIAL_993>' is not marked as EOG
load: control token: 997 '<SPECIAL_997>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5120
print_info: n_layer = 40
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 32768
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 1000000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 13B
print_info: model params = 23.57 B
print_info: general.name = Mistral-Small-3.1-24B-Instruct-2503
print_info: vocab type = BPE
print_info: n_vocab = 131072
print_info: n_merges = 269443
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: PAD token = 11 '<pad>'
print_info: LF token = 1010 'Ċ'
print_info: EOG token = 2 '</s>'
print_info: max token length = 150
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer 0 assigned to device CUDA0, is_swa = 0
load_tensors: layer 1 assigned to device CUDA0, is_swa = 0
load_tensors: layer 2 assigned to device CUDA0, is_swa = 0
load_tensors: layer 3 assigned to device CUDA0, is_swa = 0
load_tensors: layer 4 assigned to device CUDA0, is_swa = 0
load_tensors: layer 5 assigned to device CUDA0, is_swa = 0
load_tensors: layer 6 assigned to device CUDA0, is_swa = 0
load_tensors: layer 7 assigned to device CUDA0, is_swa = 0
load_tensors: layer 8 assigned to device CUDA0, is_swa = 0
load_tensors: layer 9 assigned to device CUDA0, is_swa = 0
load_tensors: layer 10 assigned to device CUDA0, is_swa = 0
load_tensors: layer 11 assigned to device CUDA0, is_swa = 0
load_tensors: layer 12 assigned to device CUDA0, is_swa = 0
load_tensors: layer 13 assigned to device CUDA0, is_swa = 0
load_tensors: layer 14 assigned to device CUDA0, is_swa = 0
load_tensors: layer 15 assigned to device CUDA0, is_swa = 0
load_tensors: layer 16 assigned to device CUDA0, is_swa = 0
load_tensors: layer 17 assigned to device CUDA1, is_swa = 0
load_tensors: layer 18 assigned to device CUDA1, is_swa = 0
load_tensors: layer 19 assigned to device CUDA1, is_swa = 0
load_tensors: layer 20 assigned to device CUDA1, is_swa = 0
load_tensors: layer 21 assigned to device CUDA1, is_swa = 0
load_tensors: layer 22 assigned to device CUDA1, is_swa = 0
load_tensors: layer 23 assigned to device CUDA1, is_swa = 0
load_tensors: layer 24 assigned to device CUDA1, is_swa = 0
load_tensors: layer 25 assigned to device CUDA1, is_swa = 0
load_tensors: layer 26 assigned to device CUDA1, is_swa = 0
load_tensors: layer 27 assigned to device CUDA1, is_swa = 0
load_tensors: layer 28 assigned to device CUDA1, is_swa = 0
load_tensors: layer 29 assigned to device CUDA1, is_swa = 0
load_tensors: layer 30 assigned to device CUDA1, is_swa = 0
load_tensors: layer 31 assigned to device CUDA1, is_swa = 0
load_tensors: layer 32 assigned to device CUDA1, is_swa = 0
load_tensors: layer 33 assigned to device CUDA1, is_swa = 0
load_tensors: layer 34 assigned to device CUDA1, is_swa = 0
load_tensors: layer 35 assigned to device CUDA1, is_swa = 0
load_tensors: layer 36 assigned to device CUDA1, is_swa = 0
load_tensors: layer 37 assigned to device CUDA1, is_swa = 0
load_tensors: layer 38 assigned to device CUDA1, is_swa = 0
load_tensors: layer 39 assigned to device CUDA1, is_swa = 0
load_tensors: layer 40 assigned to device CUDA1, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors: CUDA0 model buffer size = 6401.91 MiB
load_tensors: CUDA1 model buffer size = 9139.71 MiB
load_tensors: CPU model buffer size = 440.00 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
.......................................load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
.......................................................load_all_data: no device found for buffer type CPU for async uploads
..
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000000.0
llama_context: freq_scale = 1
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 0.50 MiB
create_memory: n_ctx = 131072 (padded)
llama_kv_cache_unified: layer 0: dev = CUDA0
llama_kv_cache_unified: layer 1: dev = CUDA0
llama_kv_cache_unified: layer 2: dev = CUDA0
llama_kv_cache_unified: layer 3: dev = CUDA0
llama_kv_cache_unified: layer 4: dev = CUDA0
llama_kv_cache_unified: layer 5: dev = CUDA0
llama_kv_cache_unified: layer 6: dev = CUDA0
llama_kv_cache_unified: layer 7: dev = CUDA0
llama_kv_cache_unified: layer 8: dev = CUDA0
llama_kv_cache_unified: layer 9: dev = CUDA0
llama_kv_cache_unified: layer 10: dev = CUDA0
llama_kv_cache_unified: layer 11: dev = CUDA0
llama_kv_cache_unified: layer 12: dev = CUDA0
llama_kv_cache_unified: layer 13: dev = CUDA0
llama_kv_cache_unified: layer 14: dev = CUDA0
llama_kv_cache_unified: layer 15: dev = CUDA0
llama_kv_cache_unified: layer 16: dev = CUDA0
llama_kv_cache_unified: layer 17: dev = CUDA1
llama_kv_cache_unified: layer 18: dev = CUDA1
llama_kv_cache_unified: layer 19: dev = CUDA1
llama_kv_cache_unified: layer 20: dev = CUDA1
llama_kv_cache_unified: layer 21: dev = CUDA1
llama_kv_cache_unified: layer 22: dev = CUDA1
llama_kv_cache_unified: layer 23: dev = CUDA1
llama_kv_cache_unified: layer 24: dev = CUDA1
llama_kv_cache_unified: layer 25: dev = CUDA1
llama_kv_cache_unified: layer 26: dev = CUDA1
llama_kv_cache_unified: layer 27: dev = CUDA1
llama_kv_cache_unified: layer 28: dev = CUDA1
llama_kv_cache_unified: layer 29: dev = CUDA1
llama_kv_cache_unified: layer 30: dev = CUDA1
llama_kv_cache_unified: layer 31: dev = CUDA1
llama_kv_cache_unified: layer 32: dev = CUDA1
llama_kv_cache_unified: layer 33: dev = CUDA1
llama_kv_cache_unified: layer 34: dev = CUDA1
llama_kv_cache_unified: layer 35: dev = CUDA1
llama_kv_cache_unified: layer 36: dev = CUDA1
llama_kv_cache_unified: layer 37: dev = CUDA1
llama_kv_cache_unified: layer 38: dev = CUDA1
llama_kv_cache_unified: layer 39: dev = CUDA1
llama_kv_cache_unified: CUDA0 KV buffer size = 8704.00 MiB
llama_kv_cache_unified: CUDA1 KV buffer size = 11776.00 MiB
llama_kv_cache_unified: size = 20480.00 MiB (131072 cells, 40 layers, 1 seqs), K (f16): 10240.00 MiB, V (f16): 10240.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 65536
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: CUDA0 compute buffer size = 1356.01 MiB
llama_context: CUDA1 compute buffer size = 818.02 MiB
llama_context: CUDA_Host compute buffer size = 1034.02 MiB
llama_context: graph nodes = 1287
llama_context: graph splits = 3
clear_adapter_lora: call
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
Failed to infer a tool call example (possible template bug)
clip_model_loader: model name: Mistral-Small-3.1-24B-Instruct-2503
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 223
clip_model_loader: n_kv: 23
clip_model_loader: has vision encoder
clip_model_loader: tensor[0]: n_dims = 1, name = v.token_embd.img_break, tensor_size=20480, offset=0, shape:[5120, 1, 1, 1], type = f32
clip_model_loader: tensor[1]: n_dims = 2, name = mm.1.weight, tensor_size=10485760, offset=20480, shape:[1024, 5120, 1, 1], type = f16
clip_model_loader: tensor[2]: n_dims = 2, name = mm.2.weight, tensor_size=52428800, offset=10506240, shape:[5120, 5120, 1, 1], type = f16
clip_model_loader: tensor[3]: n_dims = 1, name = mm.input_norm.weight, tensor_size=4096, offset=62935040, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[4]: n_dims = 2, name = mm.patch_merger.weight, tensor_size=8388608, offset=62939136, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[5]: n_dims = 1, name = v.pre_ln.weight, tensor_size=4096, offset=71327744, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[6]: n_dims = 4, name = v.patch_embd.weight, tensor_size=1204224, offset=71331840, shape:[14, 14, 3, 1024], type = f16
clip_model_loader: tensor[7]: n_dims = 2, name = v.blk.0.attn_k.weight, tensor_size=2097152, offset=72536064, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[8]: n_dims = 2, name = v.blk.0.attn_out.weight, tensor_size=2097152, offset=74633216, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[9]: n_dims = 2, name = v.blk.0.attn_q.weight, tensor_size=2097152, offset=76730368, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[10]: n_dims = 2, name = v.blk.0.attn_v.weight, tensor_size=2097152, offset=78827520, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[11]: n_dims = 1, name = v.blk.0.ln1.weight, tensor_size=4096, offset=80924672, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[12]: n_dims = 2, name = v.blk.0.ffn_down.weight, tensor_size=8388608, offset=80928768, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[13]: n_dims = 2, name = v.blk.0.ffn_gate.weight, tensor_size=8388608, offset=89317376, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[14]: n_dims = 2, name = v.blk.0.ffn_up.weight, tensor_size=8388608, offset=97705984, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[15]: n_dims = 1, name = v.blk.0.ln2.weight, tensor_size=4096, offset=106094592, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[16]: n_dims = 2, name = v.blk.1.attn_k.weight, tensor_size=2097152, offset=106098688, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[17]: n_dims = 2, name = v.blk.1.attn_out.weight, tensor_size=2097152, offset=108195840, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[18]: n_dims = 2, name = v.blk.1.attn_q.weight, tensor_size=2097152, offset=110292992, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[19]: n_dims = 2, name = v.blk.1.attn_v.weight, tensor_size=2097152, offset=112390144, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[20]: n_dims = 1, name = v.blk.1.ln1.weight, tensor_size=4096, offset=114487296, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[21]: n_dims = 2, name = v.blk.1.ffn_down.weight, tensor_size=8388608, offset=114491392, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[22]: n_dims = 2, name = v.blk.1.ffn_gate.weight, tensor_size=8388608, offset=122880000, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[23]: n_dims = 2, name = v.blk.1.ffn_up.weight, tensor_size=8388608, offset=131268608, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[24]: n_dims = 1, name = v.blk.1.ln2.weight, tensor_size=4096, offset=139657216, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[25]: n_dims = 2, name = v.blk.10.attn_k.weight, tensor_size=2097152, offset=139661312, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[26]: n_dims = 2, name = v.blk.10.attn_out.weight, tensor_size=2097152, offset=141758464, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[27]: n_dims = 2, name = v.blk.10.attn_q.weight, tensor_size=2097152, offset=143855616, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[28]: n_dims = 2, name = v.blk.10.attn_v.weight, tensor_size=2097152, offset=145952768, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[29]: n_dims = 1, name = v.blk.10.ln1.weight, tensor_size=4096, offset=148049920, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[30]: n_dims = 2, name = v.blk.10.ffn_down.weight, tensor_size=8388608, offset=148054016, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[31]: n_dims = 2, name = v.blk.10.ffn_gate.weight, tensor_size=8388608, offset=156442624, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[32]: n_dims = 2, name = v.blk.10.ffn_up.weight, tensor_size=8388608, offset=164831232, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[33]: n_dims = 1, name = v.blk.10.ln2.weight, tensor_size=4096, offset=173219840, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[34]: n_dims = 2, name = v.blk.11.attn_k.weight, tensor_size=2097152, offset=173223936, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[35]: n_dims = 2, name = v.blk.11.attn_out.weight, tensor_size=2097152, offset=175321088, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[36]: n_dims = 2, name = v.blk.11.attn_q.weight, tensor_size=2097152, offset=177418240, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[37]: n_dims = 2, name = v.blk.11.attn_v.weight, tensor_size=2097152, offset=179515392, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[38]: n_dims = 1, name = v.blk.11.ln1.weight, tensor_size=4096, offset=181612544, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[39]: n_dims = 2, name = v.blk.11.ffn_down.weight, tensor_size=8388608, offset=181616640, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[40]: n_dims = 2, name = v.blk.11.ffn_gate.weight, tensor_size=8388608, offset=190005248, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[41]: n_dims = 2, name = v.blk.11.ffn_up.weight, tensor_size=8388608, offset=198393856, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[42]: n_dims = 1, name = v.blk.11.ln2.weight, tensor_size=4096, offset=206782464, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[43]: n_dims = 2, name = v.blk.12.attn_k.weight, tensor_size=2097152, offset=206786560, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[44]: n_dims = 2, name = v.blk.12.attn_out.weight, tensor_size=2097152, offset=208883712, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[45]: n_dims = 2, name = v.blk.12.attn_q.weight, tensor_size=2097152, offset=210980864, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[46]: n_dims = 2, name = v.blk.12.attn_v.weight, tensor_size=2097152, offset=213078016, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[47]: n_dims = 1, name = v.blk.12.ln1.weight, tensor_size=4096, offset=215175168, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[48]: n_dims = 2, name = v.blk.12.ffn_down.weight, tensor_size=8388608, offset=215179264, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[49]: n_dims = 2, name = v.blk.12.ffn_gate.weight, tensor_size=8388608, offset=223567872, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[50]: n_dims = 2, name = v.blk.12.ffn_up.weight, tensor_size=8388608, offset=231956480, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[51]: n_dims = 1, name = v.blk.12.ln2.weight, tensor_size=4096, offset=240345088, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[52]: n_dims = 2, name = v.blk.13.attn_k.weight, tensor_size=2097152, offset=240349184, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[53]: n_dims = 2, name = v.blk.13.attn_out.weight, tensor_size=2097152, offset=242446336, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[54]: n_dims = 2, name = v.blk.13.attn_q.weight, tensor_size=2097152, offset=244543488, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[55]: n_dims = 2, name = v.blk.13.attn_v.weight, tensor_size=2097152, offset=246640640, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[56]: n_dims = 1, name = v.blk.13.ln1.weight, tensor_size=4096, offset=248737792, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[57]: n_dims = 2, name = v.blk.13.ffn_down.weight, tensor_size=8388608, offset=248741888, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[58]: n_dims = 2, name = v.blk.13.ffn_gate.weight, tensor_size=8388608, offset=257130496, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[59]: n_dims = 2, name = v.blk.13.ffn_up.weight, tensor_size=8388608, offset=265519104, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[60]: n_dims = 1, name = v.blk.13.ln2.weight, tensor_size=4096, offset=273907712, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[61]: n_dims = 2, name = v.blk.14.attn_k.weight, tensor_size=2097152, offset=273911808, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[62]: n_dims = 2, name = v.blk.14.attn_out.weight, tensor_size=2097152, offset=276008960, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[63]: n_dims = 2, name = v.blk.14.attn_q.weight, tensor_size=2097152, offset=278106112, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[64]: n_dims = 2, name = v.blk.14.attn_v.weight, tensor_size=2097152, offset=280203264, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[65]: n_dims = 1, name = v.blk.14.ln1.weight, tensor_size=4096, offset=282300416, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[66]: n_dims = 2, name = v.blk.14.ffn_down.weight, tensor_size=8388608, offset=282304512, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[67]: n_dims = 2, name = v.blk.14.ffn_gate.weight, tensor_size=8388608, offset=290693120, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[68]: n_dims = 2, name = v.blk.14.ffn_up.weight, tensor_size=8388608, offset=299081728, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[69]: n_dims = 1, name = v.blk.14.ln2.weight, tensor_size=4096, offset=307470336, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[70]: n_dims = 2, name = v.blk.15.attn_k.weight, tensor_size=2097152, offset=307474432, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[71]: n_dims = 2, name = v.blk.15.attn_out.weight, tensor_size=2097152, offset=309571584, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[72]: n_dims = 2, name = v.blk.15.attn_q.weight, tensor_size=2097152, offset=311668736, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[73]: n_dims = 2, name = v.blk.15.attn_v.weight, tensor_size=2097152, offset=313765888, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[74]: n_dims = 1, name = v.blk.15.ln1.weight, tensor_size=4096, offset=315863040, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[75]: n_dims = 2, name = v.blk.15.ffn_down.weight, tensor_size=8388608, offset=315867136, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[76]: n_dims = 2, name = v.blk.15.ffn_gate.weight, tensor_size=8388608, offset=324255744, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[77]: n_dims = 2, name = v.blk.15.ffn_up.weight, tensor_size=8388608, offset=332644352, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[78]: n_dims = 1, name = v.blk.15.ln2.weight, tensor_size=4096, offset=341032960, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[79]: n_dims = 2, name = v.blk.16.attn_k.weight, tensor_size=2097152, offset=341037056, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[80]: n_dims = 2, name = v.blk.16.attn_out.weight, tensor_size=2097152, offset=343134208, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[81]: n_dims = 2, name = v.blk.16.attn_q.weight, tensor_size=2097152, offset=345231360, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[82]: n_dims = 2, name = v.blk.16.attn_v.weight, tensor_size=2097152, offset=347328512, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[83]: n_dims = 1, name = v.blk.16.ln1.weight, tensor_size=4096, offset=349425664, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[84]: n_dims = 2, name = v.blk.16.ffn_down.weight, tensor_size=8388608, offset=349429760, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[85]: n_dims = 2, name = v.blk.16.ffn_gate.weight, tensor_size=8388608, offset=357818368, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[86]: n_dims = 2, name = v.blk.16.ffn_up.weight, tensor_size=8388608, offset=366206976, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[87]: n_dims = 1, name = v.blk.16.ln2.weight, tensor_size=4096, offset=374595584, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[88]: n_dims = 2, name = v.blk.17.attn_k.weight, tensor_size=2097152, offset=374599680, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[89]: n_dims = 2, name = v.blk.17.attn_out.weight, tensor_size=2097152, offset=376696832, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[90]: n_dims = 2, name = v.blk.17.attn_q.weight, tensor_size=2097152, offset=378793984, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[91]: n_dims = 2, name = v.blk.17.attn_v.weight, tensor_size=2097152, offset=380891136, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[92]: n_dims = 1, name = v.blk.17.ln1.weight, tensor_size=4096, offset=382988288, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[93]: n_dims = 2, name = v.blk.17.ffn_down.weight, tensor_size=8388608, offset=382992384, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[94]: n_dims = 2, name = v.blk.17.ffn_gate.weight, tensor_size=8388608, offset=391380992, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[95]: n_dims = 2, name = v.blk.17.ffn_up.weight, tensor_size=8388608, offset=399769600, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[96]: n_dims = 1, name = v.blk.17.ln2.weight, tensor_size=4096, offset=408158208, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[97]: n_dims = 2, name = v.blk.18.attn_k.weight, tensor_size=2097152, offset=408162304, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[98]: n_dims = 2, name = v.blk.18.attn_out.weight, tensor_size=2097152, offset=410259456, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[99]: n_dims = 2, name = v.blk.18.attn_q.weight, tensor_size=2097152, offset=412356608, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[100]: n_dims = 2, name = v.blk.18.attn_v.weight, tensor_size=2097152, offset=414453760, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[101]: n_dims = 1, name = v.blk.18.ln1.weight, tensor_size=4096, offset=416550912, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[102]: n_dims = 2, name = v.blk.18.ffn_down.weight, tensor_size=8388608, offset=416555008, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[103]: n_dims = 2, name = v.blk.18.ffn_gate.weight, tensor_size=8388608, offset=424943616, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[104]: n_dims = 2, name = v.blk.18.ffn_up.weight, tensor_size=8388608, offset=433332224, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[105]: n_dims = 1, name = v.blk.18.ln2.weight, tensor_size=4096, offset=441720832, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[106]: n_dims = 2, name = v.blk.19.attn_k.weight, tensor_size=2097152, offset=441724928, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[107]: n_dims = 2, name = v.blk.19.attn_out.weight, tensor_size=2097152, offset=443822080, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[108]: n_dims = 2, name = v.blk.19.attn_q.weight, tensor_size=2097152, offset=445919232, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[109]: n_dims = 2, name = v.blk.19.attn_v.weight, tensor_size=2097152, offset=448016384, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[110]: n_dims = 1, name = v.blk.19.ln1.weight, tensor_size=4096, offset=450113536, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[111]: n_dims = 2, name = v.blk.19.ffn_down.weight, tensor_size=8388608, offset=450117632, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[112]: n_dims = 2, name = v.blk.19.ffn_gate.weight, tensor_size=8388608, offset=458506240, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[113]: n_dims = 2, name = v.blk.19.ffn_up.weight, tensor_size=8388608, offset=466894848, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[114]: n_dims = 1, name = v.blk.19.ln2.weight, tensor_size=4096, offset=475283456, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[115]: n_dims = 2, name = v.blk.2.attn_k.weight, tensor_size=2097152, offset=475287552, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[116]: n_dims = 2, name = v.blk.2.attn_out.weight, tensor_size=2097152, offset=477384704, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[117]: n_dims = 2, name = v.blk.2.attn_q.weight, tensor_size=2097152, offset=479481856, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[118]: n_dims = 2, name = v.blk.2.attn_v.weight, tensor_size=2097152, offset=481579008, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[119]: n_dims = 1, name = v.blk.2.ln1.weight, tensor_size=4096, offset=483676160, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[120]: n_dims = 2, name = v.blk.2.ffn_down.weight, tensor_size=8388608, offset=483680256, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[121]: n_dims = 2, name = v.blk.2.ffn_gate.weight, tensor_size=8388608, offset=492068864, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[122]: n_dims = 2, name = v.blk.2.ffn_up.weight, tensor_size=8388608, offset=500457472, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[123]: n_dims = 1, name = v.blk.2.ln2.weight, tensor_size=4096, offset=508846080, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[124]: n_dims = 2, name = v.blk.20.attn_k.weight, tensor_size=2097152, offset=508850176, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[125]: n_dims = 2, name = v.blk.20.attn_out.weight, tensor_size=2097152, offset=510947328, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[126]: n_dims = 2, name = v.blk.20.attn_q.weight, tensor_size=2097152, offset=513044480, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[127]: n_dims = 2, name = v.blk.20.attn_v.weight, tensor_size=2097152, offset=515141632, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[128]: n_dims = 1, name = v.blk.20.ln1.weight, tensor_size=4096, offset=517238784, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[129]: n_dims = 2, name = v.blk.20.ffn_down.weight, tensor_size=8388608, offset=517242880, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[130]: n_dims = 2, name = v.blk.20.ffn_gate.weight, tensor_size=8388608, offset=525631488, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[131]: n_dims = 2, name = v.blk.20.ffn_up.weight, tensor_size=8388608, offset=534020096, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[132]: n_dims = 1, name = v.blk.20.ln2.weight, tensor_size=4096, offset=542408704, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[133]: n_dims = 2, name = v.blk.21.attn_k.weight, tensor_size=2097152, offset=542412800, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[134]: n_dims = 2, name = v.blk.21.attn_out.weight, tensor_size=2097152, offset=544509952, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[135]: n_dims = 2, name = v.blk.21.attn_q.weight, tensor_size=2097152, offset=546607104, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[136]: n_dims = 2, name = v.blk.21.attn_v.weight, tensor_size=2097152, offset=548704256, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[137]: n_dims = 1, name = v.blk.21.ln1.weight, tensor_size=4096, offset=550801408, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[138]: n_dims = 2, name = v.blk.21.ffn_down.weight, tensor_size=8388608, offset=550805504, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[139]: n_dims = 2, name = v.blk.21.ffn_gate.weight, tensor_size=8388608, offset=559194112, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[140]: n_dims = 2, name = v.blk.21.ffn_up.weight, tensor_size=8388608, offset=567582720, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[141]: n_dims = 1, name = v.blk.21.ln2.weight, tensor_size=4096, offset=575971328, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[142]: n_dims = 2, name = v.blk.22.attn_k.weight, tensor_size=2097152, offset=575975424, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[143]: n_dims = 2, name = v.blk.22.attn_out.weight, tensor_size=2097152, offset=578072576, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[144]: n_dims = 2, name = v.blk.22.attn_q.weight, tensor_size=2097152, offset=580169728, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[145]: n_dims = 2, name = v.blk.22.attn_v.weight, tensor_size=2097152, offset=582266880, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[146]: n_dims = 1, name = v.blk.22.ln1.weight, tensor_size=4096, offset=584364032, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[147]: n_dims = 2, name = v.blk.22.ffn_down.weight, tensor_size=8388608, offset=584368128, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[148]: n_dims = 2, name = v.blk.22.ffn_gate.weight, tensor_size=8388608, offset=592756736, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[149]: n_dims = 2, name = v.blk.22.ffn_up.weight, tensor_size=8388608, offset=601145344, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[150]: n_dims = 1, name = v.blk.22.ln2.weight, tensor_size=4096, offset=609533952, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[151]: n_dims = 2, name = v.blk.23.attn_k.weight, tensor_size=2097152, offset=609538048, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[152]: n_dims = 2, name = v.blk.23.attn_out.weight, tensor_size=2097152, offset=611635200, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[153]: n_dims = 2, name = v.blk.23.attn_q.weight, tensor_size=2097152, offset=613732352, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[154]: n_dims = 2, name = v.blk.23.attn_v.weight, tensor_size=2097152, offset=615829504, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[155]: n_dims = 1, name = v.blk.23.ln1.weight, tensor_size=4096, offset=617926656, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[156]: n_dims = 2, name = v.blk.23.ffn_down.weight, tensor_size=8388608, offset=617930752, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[157]: n_dims = 2, name = v.blk.23.ffn_gate.weight, tensor_size=8388608, offset=626319360, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[158]: n_dims = 2, name = v.blk.23.ffn_up.weight, tensor_size=8388608, offset=634707968, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[159]: n_dims = 1, name = v.blk.23.ln2.weight, tensor_size=4096, offset=643096576, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[160]: n_dims = 2, name = v.blk.3.attn_k.weight, tensor_size=2097152, offset=643100672, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[161]: n_dims = 2, name = v.blk.3.attn_out.weight, tensor_size=2097152, offset=645197824, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[162]: n_dims = 2, name = v.blk.3.attn_q.weight, tensor_size=2097152, offset=647294976, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[163]: n_dims = 2, name = v.blk.3.attn_v.weight, tensor_size=2097152, offset=649392128, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[164]: n_dims = 1, name = v.blk.3.ln1.weight, tensor_size=4096, offset=651489280, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[165]: n_dims = 2, name = v.blk.3.ffn_down.weight, tensor_size=8388608, offset=651493376, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[166]: n_dims = 2, name = v.blk.3.ffn_gate.weight, tensor_size=8388608, offset=659881984, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[167]: n_dims = 2, name = v.blk.3.ffn_up.weight, tensor_size=8388608, offset=668270592, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[168]: n_dims = 1, name = v.blk.3.ln2.weight, tensor_size=4096, offset=676659200, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[169]: n_dims = 2, name = v.blk.4.attn_k.weight, tensor_size=2097152, offset=676663296, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[170]: n_dims = 2, name = v.blk.4.attn_out.weight, tensor_size=2097152, offset=678760448, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[171]: n_dims = 2, name = v.blk.4.attn_q.weight, tensor_size=2097152, offset=680857600, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[172]: n_dims = 2, name = v.blk.4.attn_v.weight, tensor_size=2097152, offset=682954752, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[173]: n_dims = 1, name = v.blk.4.ln1.weight, tensor_size=4096, offset=685051904, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[174]: n_dims = 2, name = v.blk.4.ffn_down.weight, tensor_size=8388608, offset=685056000, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[175]: n_dims = 2, name = v.blk.4.ffn_gate.weight, tensor_size=8388608, offset=693444608, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[176]: n_dims = 2, name = v.blk.4.ffn_up.weight, tensor_size=8388608, offset=701833216, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[177]: n_dims = 1, name = v.blk.4.ln2.weight, tensor_size=4096, offset=710221824, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[178]: n_dims = 2, name = v.blk.5.attn_k.weight, tensor_size=2097152, offset=710225920, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[179]: n_dims = 2, name = v.blk.5.attn_out.weight, tensor_size=2097152, offset=712323072, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[180]: n_dims = 2, name = v.blk.5.attn_q.weight, tensor_size=2097152, offset=714420224, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[181]: n_dims = 2, name = v.blk.5.attn_v.weight, tensor_size=2097152, offset=716517376, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[182]: n_dims = 1, name = v.blk.5.ln1.weight, tensor_size=4096, offset=718614528, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[183]: n_dims = 2, name = v.blk.5.ffn_down.weight, tensor_size=8388608, offset=718618624, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[184]: n_dims = 2, name = v.blk.5.ffn_gate.weight, tensor_size=8388608, offset=727007232, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[185]: n_dims = 2, name = v.blk.5.ffn_up.weight, tensor_size=8388608, offset=735395840, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[186]: n_dims = 1, name = v.blk.5.ln2.weight, tensor_size=4096, offset=743784448, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[187]: n_dims = 2, name = v.blk.6.attn_k.weight, tensor_size=2097152, offset=743788544, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[188]: n_dims = 2, name = v.blk.6.attn_out.weight, tensor_size=2097152, offset=745885696, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[189]: n_dims = 2, name = v.blk.6.attn_q.weight, tensor_size=2097152, offset=747982848, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[190]: n_dims = 2, name = v.blk.6.attn_v.weight, tensor_size=2097152, offset=750080000, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[191]: n_dims = 1, name = v.blk.6.ln1.weight, tensor_size=4096, offset=752177152, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[192]: n_dims = 2, name = v.blk.6.ffn_down.weight, tensor_size=8388608, offset=752181248, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[193]: n_dims = 2, name = v.blk.6.ffn_gate.weight, tensor_size=8388608, offset=760569856, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[194]: n_dims = 2, name = v.blk.6.ffn_up.weight, tensor_size=8388608, offset=768958464, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[195]: n_dims = 1, name = v.blk.6.ln2.weight, tensor_size=4096, offset=777347072, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[196]: n_dims = 2, name = v.blk.7.attn_k.weight, tensor_size=2097152, offset=777351168, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[197]: n_dims = 2, name = v.blk.7.attn_out.weight, tensor_size=2097152, offset=779448320, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[198]: n_dims = 2, name = v.blk.7.attn_q.weight, tensor_size=2097152, offset=781545472, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[199]: n_dims = 2, name = v.blk.7.attn_v.weight, tensor_size=2097152, offset=783642624, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[200]: n_dims = 1, name = v.blk.7.ln1.weight, tensor_size=4096, offset=785739776, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[201]: n_dims = 2, name = v.blk.7.ffn_down.weight, tensor_size=8388608, offset=785743872, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[202]: n_dims = 2, name = v.blk.7.ffn_gate.weight, tensor_size=8388608, offset=794132480, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[203]: n_dims = 2, name = v.blk.7.ffn_up.weight, tensor_size=8388608, offset=802521088, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[204]: n_dims = 1, name = v.blk.7.ln2.weight, tensor_size=4096, offset=810909696, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[205]: n_dims = 2, name = v.blk.8.attn_k.weight, tensor_size=2097152, offset=810913792, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[206]: n_dims = 2, name = v.blk.8.attn_out.weight, tensor_size=2097152, offset=813010944, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[207]: n_dims = 2, name = v.blk.8.attn_q.weight, tensor_size=2097152, offset=815108096, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[208]: n_dims = 2, name = v.blk.8.attn_v.weight, tensor_size=2097152, offset=817205248, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[209]: n_dims = 1, name = v.blk.8.ln1.weight, tensor_size=4096, offset=819302400, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[210]: n_dims = 2, name = v.blk.8.ffn_down.weight, tensor_size=8388608, offset=819306496, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[211]: n_dims = 2, name = v.blk.8.ffn_gate.weight, tensor_size=8388608, offset=827695104, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[212]: n_dims = 2, name = v.blk.8.ffn_up.weight, tensor_size=8388608, offset=836083712, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[213]: n_dims = 1, name = v.blk.8.ln2.weight, tensor_size=4096, offset=844472320, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[214]: n_dims = 2, name = v.blk.9.attn_k.weight, tensor_size=2097152, offset=844476416, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[215]: n_dims = 2, name = v.blk.9.attn_out.weight, tensor_size=2097152, offset=846573568, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[216]: n_dims = 2, name = v.blk.9.attn_q.weight, tensor_size=2097152, offset=848670720, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[217]: n_dims = 2, name = v.blk.9.attn_v.weight, tensor_size=2097152, offset=850767872, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[218]: n_dims = 1, name = v.blk.9.ln1.weight, tensor_size=4096, offset=852865024, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[219]: n_dims = 2, name = v.blk.9.ffn_down.weight, tensor_size=8388608, offset=852869120, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[220]: n_dims = 2, name = v.blk.9.ffn_gate.weight, tensor_size=8388608, offset=861257728, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[221]: n_dims = 2, name = v.blk.9.ffn_up.weight, tensor_size=8388608, offset=869646336, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[222]: n_dims = 1, name = v.blk.9.ln2.weight, tensor_size=4096, offset=878034944, shape:[1024, 1, 1, 1], type = f32
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector: pixtral
load_hparams: n_embd: 1024
load_hparams: n_head: 16
load_hparams: n_ff: 4096
load_hparams: n_layer: 24
load_hparams: ffn_op: silu
load_hparams: projection_dim: 5120
--- vision hparams ---
load_hparams: image_size: 1540
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 0
load_hparams: model size: 837.36 MiB
load_hparams: metadata size: 0.08 MiB
load_tensors: loaded 223 tensors from e:\neuro\LLM-server\models\mistral-small-3.1-24b-instruct-2503\mmproj-F16.gguf
alloc_compute_meta: CUDA0 compute buffer size = 2.97 MiB
alloc_compute_meta: CPU compute buffer size = 0.14 MiB
srv load_model: loaded multimodal model, 'e:\neuro\LLM-server\models\mistral-small-3.1-24b-instruct-2503\mmproj-F16.gguf'
srv load_model: ctx_shift is not supported by multimodal, it will be disabled
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 131072
slot reset: id 0 | task -1 |
main: model loaded
main: chat template, chat_template: {%- set today = strftime_now("%Y-%m-%d") %}
{%- set default_system_message = "You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.\nYour knowledge base was last updated on 2023-10-01. The current date is " + today + ".\n\nWhen you're not sure about some information, you say that you don't have the information and don't make up anything.\nIf the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \"What are some good restaurants around me?\" => \"Where are you?\" or \"When is the next flight to Tokyo\" => \"Where do you travel from?\")" %}
{{- bos_token }}
{%- if messages[0]['role'] == 'system' %}
{%- if messages[0]['content'] is string %}
{%- set system_message = messages[0]['content'] %}
{%- else %}
{%- set system_message = messages[0]['content'][0]['text'] %}
{%- endif %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set system_message = default_system_message %}
{%- set loop_messages = messages %}
{%- endif %}
{{- '[SYSTEM_PROMPT]' + system_message + '[/SYSTEM_PROMPT]' }}
{%- for message in loop_messages %}
{%- if message['role'] == 'user' %}
{%- if message['content'] is string %}
{{- '[INST]' + message['content'] + '[/INST]' }}
{%- else %}
{{- '[INST]' }}
{%- for block in message['content'] %}
{%- if block['type'] == 'text' %}
{{- block['text'] }}
{%- elif block['type'] in ['image', 'image_url'] %}
{{- '[IMG]' }}
{%- else %}
{{- raise_exception('Only text and image blocks are supported in message content!') }}
{%- endif %}
{%- endfor %}
{{- '[/INST]' }}
{%- endif %}
{%- elif message['role'] == 'system' %}
{%- if message['content'] is string %}
{{- '[SYSTEM_PROMPT]' + message['content'] + '[/SYSTEM_PROMPT]' }}
{%- else %}
{{- '[SYSTEM_PROMPT]' + message['content'][0]['text'] + '[/SYSTEM_PROMPT]' }}
{%- endif %}
{%- elif message['role'] == 'assistant' %}
{%- if message['content'] is string %}
{{- message['content'] + eos_token }}
{%- else %}
{{- message['content'][0]['text'] + eos_token }}
{%- endif %}
{%- else %}
{{- raise_exception('Only user, system and assistant roles are supported!') }}
{%- endif %}
{%- endfor %}, example_format: '[SYSTEM_PROMPT]You are a helpful assistant[/SYSTEM_PROMPT][INST]Hello[/INST]Hi there</s>[INST]How are you?[/INST]'
main: server is listening on http://0.0.0.0:5000 - starting the main loop
que start_loop: processing new tasks
que start_loop: update slots
srv update_slots: all slots are idle
srv kv_cache_cle: clearing KV cache
que start_loop: waiting for new tasks
request: {"stream": true, "model": "local", "messages": [{"role": "user", "content": [{"type": "text", "text": "Describe image"}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAIAAACQkWg2AAACpElEQVQozzXOTW8TRxgA4Jl3ZnbW9q7XjYP5SAkI0gOCUBVVqnrosTf+SnvpBfW/oEqNhOiZQ0+thAoWglRNcJRAaETsmBgHl3iz3o/ZmXlfTn1+wcN/vPdDoFQrjjqdpLu8FLfj1dUrUgaOZLulvPeLfJFl2fBwlM5TYwzY2lrrqrIqiwJtraQEznYGuz//dO/sNA215owjIgBYW5d5DowxRCREdG6Rl0KKvzcHDx7+/qK/+W7yQWttncuyxfx0XlemqgwgESJ6j0gUR4391weDnf2Oxu/v3r1//9fhcMQYzWazqqq8s5wxII+I5BEZkQD+5PHz337ZeLk9uHylN/r3YHg4XOQLQnLOVcYSIuD/JQ58OpvnRd1OOsej8aONjcl4Mj6aaK0ZY4wx65yxDjgRQ4/oAWBrcPDP5lbSTa6trXU6S4TYf/pMCAh0wBhDT4gEnAitA6LSuH5/25RVs9W8dG01L81Xd9bP9ZZf7b2pqqo2NWMMgEuyNuq0VdT+46+XJ5NpA2g2mZbpfPT2KAh1b+XSdHISxU1EFAKUlCCEyLz68/HW3vZuEKisMMfjY4H+m+++NaZev7N+ceWCc16pIGw0wmZD7g1n76b76ftJM07K9D/gPFn6LLPMzM9ufLnOOQghlQp0GAprOXDR0kFVFCpKbFn4ugIpFPPe2ens7GR4eK7XlTqM4kgITkTkHDRarSBqS18zW2qtlYBAcsdEZ7nbTuIbt2+GgRodvK1KEwaKbC0/P5+M388/2qrRbBAiU0IIasfJ1S+up8dHQHj+4oVA8PR0TsBMvpB1nnWbgCzx1nLOrfOtuHV1Zbm3pPtvit2dV8XpR1ebPMu8dUeHI/71rTUBoLQCDkqA8+i5DLklxhFJ6SBqhpzwwywFYGeL8hNBop2+ysnEKQAAAABJRU5ErkJggg=="}}]}]}
add_text: [SYSTEM_PROMPT]You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.
Your knowledge base was last updated on 2023-10-01. The current date is 2025-05-29.
When you're not sure about some information, you say that you don't have the information and don't make up anything.
If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. "What are some good restaurants around me?" => "Where are you?" or "When is the next flight to Tokyo" => "Where do you travel from?")[/SYSTEM_PROMPT][INST]Describe image
image_tokens->nx = 1
image_tokens->ny = 1
batch_f32 size = 1
add_text: [IMG_END]
add_text: [/INST]
srv params_from_: Grammar:
srv params_from_: Grammar lazy: false
srv params_from_: Chat format: Content-only
srv add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que post: new task, id = 0/1, front = 0
que start_loop: processing new tasks
que start_loop: processing task, id = 0
slot get_availabl: id 0 | task -1 | selected slot by lru, t_last = -1
slot reset: id 0 | task -1 |
slot launch_slot_: id 0 | task 0 | launching slot : {"id":0,"id_task":0,"n_ctx":131072,"speculative":false,"is_processing":false,"non_causal":false,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.20000000298023224,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":64,"top_p":0.699999988079071,"min_p":0.009999999776482582,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":131072,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":true,"thinking_forced_open":false,"samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"<s>[SYSTEM_PROMPT]You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.\nYour knowledge base was last updated on 2023-10-01. The current date is 2025-05-29.\n\nWhen you're not sure about some information, you say that you don't have the information and don't make up anything.\nIf the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \"What are some good restaurants around me?\" => \"Where are you?\" or \"When is the next flight to Tokyo\" => \"Where do you travel from?\")[/SYSTEM_PROMPT][INST]Describe image\n[IMG_END][/INST]","next_token":{"has_next_token":true,"has_new_line":false,"n_remain":-1,"n_decoded":0,"stopping_word":""}}
slot launch_slot_: id 0 | task 0 | processing task
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 1, front = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 183
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 180, n_tokens = 180, progress = 0.983607
srv update_slots: decoding batch, n_tokens = 180
set_embeddings: value = 0
clear_adapter_lora: call
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 1
encoding image slice...
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 2, front = 0
slot update_slots: id 0 | task 0 | kv cache rm [180, end)
srv process_chun: processing image...
image slice encoded in 110 ms
decoding image batch 1/1, n_tokens_batch = 1
image decoded (batch 1/1) in 13 ms
srv process_chun: image processed in 124 ms
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 183, n_tokens = 2, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 183, n_tokens = 2
srv update_slots: decoding batch, n_tokens = 2
set_embeddings: value = 0
clear_adapter_lora: call
CUDA error: an illegal memory access was encountered
current device: 1, in function ggml_backend_cuda_synchronize at C:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2461
cudaStreamSynchronize(cuda_ctx->stream())
C:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error
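
If it helps with triage, the faulting kernel could likely be pinpointed by rerunning the failing request with synchronous kernel launches under NVIDIA's compute-sanitizer (a hedged suggestion, not yet tried here; launch flags abbreviated from the command above):

REM force synchronous launches so the error is attributed to the offending call
set CUDA_LAUNCH_BLOCKING=1
REM memcheck reports the first illegal access with the kernel name and a backtrace
compute-sanitizer --tool memcheck llama-server.exe --host 0.0.0.0 --port 5000 --flash-attn --no-mmap -v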