
Eval bug: CUDA error: an illegal memory access was encountered on mistral-small-3.1-24b-instruct with mmproj #13879

@hronoas

Description


Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from E:\neuro\LLM-server\llamacpp\new\ggml-cuda.dll
load_backend: loaded RPC backend from E:\neuro\LLM-server\llamacpp\new\ggml-rpc.dll
load_backend: loaded CPU backend from E:\neuro\LLM-server\llamacpp\new\ggml-cpu-alderlake.dll
version: 5527 (763d06e)
built with clang version 18.1.8 for x86_64-pc-windows-msvc

Operating systems

Windows

GGML backends

CUDA

Hardware

i9-13900F
RTX 4090 + RTX 3090

Models

Mistral-Small-3.1-24B-Instruct-2503-UD-Q6_K_XL

Problem description & steps to reproduce

The server crashes with a CUDA error when an image is included in a request to the OpenAI-compatible API endpoint:

CUDA error: an illegal memory access was encountered
current device: 1, in function ggml_backend_cuda_synchronize at C:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2461
cudaStreamSynchronize(cuda_ctx->stream())
C:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error

The same request works correctly without an image.
The problem does not occur with build llama-b5504-bin-win-cuda-12.4-x64.
The bug occurs only with Mistral models; Qwen and Gemma work correctly.
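
For reference, the crash can be triggered with a single request to the OpenAI-compatible chat completions endpoint, mirroring the request captured in the log below; a minimal sketch, assuming the server is listening on port 5000 as in the launch command and with <BASE64_PNG> standing in for any valid base64-encoded PNG:

curl http://localhost:5000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"local\", \"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"Describe image\"}, {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,<BASE64_PNG>\"}}]}]}"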

First Bad Commit

The issue first appeared in llama-b5505-bin-win-cuda-12.4-x64.
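
Since the regression window is a single release (b5504 good, b5505 bad), the first bad commit could be pinned down with git bisect. A sketch, assuming the release tags b5504/b5505 exist in the checkout and that each step rebuilds llama-server with CUDA and replays the image request above:

git bisect start
git bisect bad b5505
git bisect good b5504
:: at each commit suggested by bisect, rebuild the server:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release --target llama-server
:: rerun the image request, then mark the commit accordingly:
git bisect good
:: ...or: git bisect bad
git bisect reset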

Relevant log output

set LLAMA_ARG_MODEL=e:\neuro\LLM-server\models\mistral-small-3.1-24b-instruct-2503\Mistral-Small-3.1-24B-Instruct-2503-UD-Q5_K_XL.gguf
set LLAMA_ARG_MMPROJ=e:\neuro\LLM-server\models\mistral-small-3.1-24b-instruct-2503\mmproj-F16.gguf
set LLAMA_ARG_DEVICE=CUDA0,CUDA1
set LLAMA_ARG_N_GPU_LAYERS=99
set LLAMA_ARG_TENSOR_SPLIT=17,24
set LLAMA_ARG_MODEL_DRAFT=
set LLAMA_ARG_N_GPU_LAYERS_DRAFT=99
set CONTEXT_SIZE=131072
set CACHE_TYPE=f16
set LLAMA_ARG_CTX_SIZE=131072
set LLAMA_ARG_CACHE_TYPE_K=f16
set LLAMA_ARG_CACHE_TYPE_V=f16
set LLAMA_ARG_CTX_SIZE_DRAFT=131072
set LLAMA_ARG_DRAFT_MAX=16
set LLAMA_ARG_SPLIT_MODE=layer
llama-server.exe --host 0.0.0.0 --port 5000 --n-predict -1 --keep -1 --threads 30 --no-webui --flash-attn --no-mmap -v --temp 0.2 --top-k 64 --top-p 0.7 --min-p 0.01 --jinja --device-draft CUDA0

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from E:\neuro\LLM-server\llamacpp\new\ggml-cuda.dll
load_backend: loaded RPC backend from E:\neuro\LLM-server\llamacpp\new\ggml-rpc.dll
load_backend: loaded CPU backend from E:\neuro\LLM-server\llamacpp\new\ggml-cpu-alderlake.dll
build: 5527 (763d06ed) with clang version 18.1.8 for x86_64-pc-windows-msvc
system info: n_threads = 30, n_threads_batch = 30, total_threads = 32

system_info: n_threads = 30 (n_threads_batch = 30) / 32 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Web UI is disabled
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 5000, http threads: 31
main: loading model
srv    load_model: loading model 'e:\LLM-server\models\mistral-small-3.1-24b-instruct-2503\Mistral-Small-3.1-24B-Instruct-2503-UD-Q5_K_XL.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22994 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 363 tensors from e:\neuro\LLM-server\models\mistral-small-3.1-24b-instruct-2503\Mistral-Small-3.1-24B-Instruct-2503-UD-Q5_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Mistral-Small-3.1-24B-Instruct-2503
llama_model_loader: - kv   3:                            general.version str              = 2503
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Mistral-Small-3.1-24B-Instruct-2503
llama_model_loader: - kv   6:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   7:                         general.size_label str              = 24B
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                          llama.block_count u32              = 40
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 32768
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 1000000000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                           llama.vocab_size u32              = 131072
llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,269443]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ �...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 11
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- set today = strftime_now("%Y-%m-%...
llama_model_loader: - kv  33:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                          general.file_type u32              = 17
llama_model_loader: - kv  36:                      quantize.imatrix.file str              = Mistral-Small-3.1-24B-Instruct-2503-G...
llama_model_loader: - kv  37:                   quantize.imatrix.dataset str              = unsloth_calibration_Mistral-Small-3.1...
llama_model_loader: - kv  38:             quantize.imatrix.entries_count i32              = 280
llama_model_loader: - kv  39:              quantize.imatrix.chunks_count i32              = 576
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q4_K:   20 tensors
llama_model_loader: - type q5_K:  184 tensors
llama_model_loader: - type q6_K:   77 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q5_K - Medium
print_info: file size   = 15.61 GiB (5.69 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token:    475 '<SPECIAL_475>' is not marked as EOG
load: control token:    787 '<SPECIAL_787>' is not marked as EOG
load: control token:     59 '<SPECIAL_59>' is not marked as EOG
.......
.......
load: control token:    992 '<SPECIAL_992>' is not marked as EOG
load: control token:    993 '<SPECIAL_993>' is not marked as EOG
load: control token:    997 '<SPECIAL_997>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 1000
load: token to piece cache size = 0.8498 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 40
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 32768
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 13B
print_info: model params     = 23.57 B
print_info: general.name     = Mistral-Small-3.1-24B-Instruct-2503
print_info: vocab type       = BPE
print_info: n_vocab          = 131072
print_info: n_merges         = 269443
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 11 '<pad>'
print_info: LF token         = 1010 'Ċ'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 150
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA1, is_swa = 0
load_tensors: layer  18 assigned to device CUDA1, is_swa = 0
load_tensors: layer  19 assigned to device CUDA1, is_swa = 0
load_tensors: layer  20 assigned to device CUDA1, is_swa = 0
load_tensors: layer  21 assigned to device CUDA1, is_swa = 0
load_tensors: layer  22 assigned to device CUDA1, is_swa = 0
load_tensors: layer  23 assigned to device CUDA1, is_swa = 0
load_tensors: layer  24 assigned to device CUDA1, is_swa = 0
load_tensors: layer  25 assigned to device CUDA1, is_swa = 0
load_tensors: layer  26 assigned to device CUDA1, is_swa = 0
load_tensors: layer  27 assigned to device CUDA1, is_swa = 0
load_tensors: layer  28 assigned to device CUDA1, is_swa = 0
load_tensors: layer  29 assigned to device CUDA1, is_swa = 0
load_tensors: layer  30 assigned to device CUDA1, is_swa = 0
load_tensors: layer  31 assigned to device CUDA1, is_swa = 0
load_tensors: layer  32 assigned to device CUDA1, is_swa = 0
load_tensors: layer  33 assigned to device CUDA1, is_swa = 0
load_tensors: layer  34 assigned to device CUDA1, is_swa = 0
load_tensors: layer  35 assigned to device CUDA1, is_swa = 0
load_tensors: layer  36 assigned to device CUDA1, is_swa = 0
load_tensors: layer  37 assigned to device CUDA1, is_swa = 0
load_tensors: layer  38 assigned to device CUDA1, is_swa = 0
load_tensors: layer  39 assigned to device CUDA1, is_swa = 0
load_tensors: layer  40 assigned to device CUDA1, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:        CUDA0 model buffer size =  6401.91 MiB
load_tensors:        CUDA1 model buffer size =  9139.71 MiB
load_tensors:          CPU model buffer size =   440.00 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
.......................................load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
.......................................................load_all_data: no device found for buffer type CPU for async uploads
..
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 131072
llama_context: n_ctx_per_seq = 131072
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000000.0
llama_context: freq_scale    = 1
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     0.50 MiB
create_memory: n_ctx = 131072 (padded)
llama_kv_cache_unified: layer   0: dev = CUDA0
llama_kv_cache_unified: layer   1: dev = CUDA0
llama_kv_cache_unified: layer   2: dev = CUDA0
llama_kv_cache_unified: layer   3: dev = CUDA0
llama_kv_cache_unified: layer   4: dev = CUDA0
llama_kv_cache_unified: layer   5: dev = CUDA0
llama_kv_cache_unified: layer   6: dev = CUDA0
llama_kv_cache_unified: layer   7: dev = CUDA0
llama_kv_cache_unified: layer   8: dev = CUDA0
llama_kv_cache_unified: layer   9: dev = CUDA0
llama_kv_cache_unified: layer  10: dev = CUDA0
llama_kv_cache_unified: layer  11: dev = CUDA0
llama_kv_cache_unified: layer  12: dev = CUDA0
llama_kv_cache_unified: layer  13: dev = CUDA0
llama_kv_cache_unified: layer  14: dev = CUDA0
llama_kv_cache_unified: layer  15: dev = CUDA0
llama_kv_cache_unified: layer  16: dev = CUDA0
llama_kv_cache_unified: layer  17: dev = CUDA1
llama_kv_cache_unified: layer  18: dev = CUDA1
llama_kv_cache_unified: layer  19: dev = CUDA1
llama_kv_cache_unified: layer  20: dev = CUDA1
llama_kv_cache_unified: layer  21: dev = CUDA1
llama_kv_cache_unified: layer  22: dev = CUDA1
llama_kv_cache_unified: layer  23: dev = CUDA1
llama_kv_cache_unified: layer  24: dev = CUDA1
llama_kv_cache_unified: layer  25: dev = CUDA1
llama_kv_cache_unified: layer  26: dev = CUDA1
llama_kv_cache_unified: layer  27: dev = CUDA1
llama_kv_cache_unified: layer  28: dev = CUDA1
llama_kv_cache_unified: layer  29: dev = CUDA1
llama_kv_cache_unified: layer  30: dev = CUDA1
llama_kv_cache_unified: layer  31: dev = CUDA1
llama_kv_cache_unified: layer  32: dev = CUDA1
llama_kv_cache_unified: layer  33: dev = CUDA1
llama_kv_cache_unified: layer  34: dev = CUDA1
llama_kv_cache_unified: layer  35: dev = CUDA1
llama_kv_cache_unified: layer  36: dev = CUDA1
llama_kv_cache_unified: layer  37: dev = CUDA1
llama_kv_cache_unified: layer  38: dev = CUDA1
llama_kv_cache_unified: layer  39: dev = CUDA1
llama_kv_cache_unified:      CUDA0 KV buffer size =  8704.00 MiB
llama_kv_cache_unified:      CUDA1 KV buffer size = 11776.00 MiB
llama_kv_cache_unified: size = 20480.00 MiB (131072 cells,  40 layers,  1 seqs), K (f16): 10240.00 MiB, V (f16): 10240.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 65536
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      CUDA0 compute buffer size =  1356.01 MiB
llama_context:      CUDA1 compute buffer size =   818.02 MiB
llama_context:  CUDA_Host compute buffer size =  1034.02 MiB
llama_context: graph nodes  = 1287
llama_context: graph splits = 3
clear_adapter_lora: call
common_init_from_params: setting dry_penalty_last_n to ctx_size = 131072
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
Failed to infer a tool call example (possible template bug)
clip_model_loader: model name:   Mistral-Small-3.1-24B-Instruct-2503
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    223
clip_model_loader: n_kv:         23

clip_model_loader: has vision encoder
clip_model_loader: tensor[0]: n_dims = 1, name = v.token_embd.img_break, tensor_size=20480, offset=0, shape:[5120, 1, 1, 1], type = f32
clip_model_loader: tensor[1]: n_dims = 2, name = mm.1.weight, tensor_size=10485760, offset=20480, shape:[1024, 5120, 1, 1], type = f16
clip_model_loader: tensor[2]: n_dims = 2, name = mm.2.weight, tensor_size=52428800, offset=10506240, shape:[5120, 5120, 1, 1], type = f16
clip_model_loader: tensor[3]: n_dims = 1, name = mm.input_norm.weight, tensor_size=4096, offset=62935040, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[4]: n_dims = 2, name = mm.patch_merger.weight, tensor_size=8388608, offset=62939136, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[5]: n_dims = 1, name = v.pre_ln.weight, tensor_size=4096, offset=71327744, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[6]: n_dims = 4, name = v.patch_embd.weight, tensor_size=1204224, offset=71331840, shape:[14, 14, 3, 1024], type = f16
clip_model_loader: tensor[7]: n_dims = 2, name = v.blk.0.attn_k.weight, tensor_size=2097152, offset=72536064, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[8]: n_dims = 2, name = v.blk.0.attn_out.weight, tensor_size=2097152, offset=74633216, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[9]: n_dims = 2, name = v.blk.0.attn_q.weight, tensor_size=2097152, offset=76730368, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[10]: n_dims = 2, name = v.blk.0.attn_v.weight, tensor_size=2097152, offset=78827520, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[11]: n_dims = 1, name = v.blk.0.ln1.weight, tensor_size=4096, offset=80924672, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[12]: n_dims = 2, name = v.blk.0.ffn_down.weight, tensor_size=8388608, offset=80928768, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[13]: n_dims = 2, name = v.blk.0.ffn_gate.weight, tensor_size=8388608, offset=89317376, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[14]: n_dims = 2, name = v.blk.0.ffn_up.weight, tensor_size=8388608, offset=97705984, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[15]: n_dims = 1, name = v.blk.0.ln2.weight, tensor_size=4096, offset=106094592, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[16]: n_dims = 2, name = v.blk.1.attn_k.weight, tensor_size=2097152, offset=106098688, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[17]: n_dims = 2, name = v.blk.1.attn_out.weight, tensor_size=2097152, offset=108195840, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[18]: n_dims = 2, name = v.blk.1.attn_q.weight, tensor_size=2097152, offset=110292992, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[19]: n_dims = 2, name = v.blk.1.attn_v.weight, tensor_size=2097152, offset=112390144, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[20]: n_dims = 1, name = v.blk.1.ln1.weight, tensor_size=4096, offset=114487296, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[21]: n_dims = 2, name = v.blk.1.ffn_down.weight, tensor_size=8388608, offset=114491392, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[22]: n_dims = 2, name = v.blk.1.ffn_gate.weight, tensor_size=8388608, offset=122880000, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[23]: n_dims = 2, name = v.blk.1.ffn_up.weight, tensor_size=8388608, offset=131268608, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[24]: n_dims = 1, name = v.blk.1.ln2.weight, tensor_size=4096, offset=139657216, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[25]: n_dims = 2, name = v.blk.10.attn_k.weight, tensor_size=2097152, offset=139661312, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[26]: n_dims = 2, name = v.blk.10.attn_out.weight, tensor_size=2097152, offset=141758464, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[27]: n_dims = 2, name = v.blk.10.attn_q.weight, tensor_size=2097152, offset=143855616, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[28]: n_dims = 2, name = v.blk.10.attn_v.weight, tensor_size=2097152, offset=145952768, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[29]: n_dims = 1, name = v.blk.10.ln1.weight, tensor_size=4096, offset=148049920, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[30]: n_dims = 2, name = v.blk.10.ffn_down.weight, tensor_size=8388608, offset=148054016, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[31]: n_dims = 2, name = v.blk.10.ffn_gate.weight, tensor_size=8388608, offset=156442624, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[32]: n_dims = 2, name = v.blk.10.ffn_up.weight, tensor_size=8388608, offset=164831232, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[33]: n_dims = 1, name = v.blk.10.ln2.weight, tensor_size=4096, offset=173219840, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[34]: n_dims = 2, name = v.blk.11.attn_k.weight, tensor_size=2097152, offset=173223936, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[35]: n_dims = 2, name = v.blk.11.attn_out.weight, tensor_size=2097152, offset=175321088, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[36]: n_dims = 2, name = v.blk.11.attn_q.weight, tensor_size=2097152, offset=177418240, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[37]: n_dims = 2, name = v.blk.11.attn_v.weight, tensor_size=2097152, offset=179515392, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[38]: n_dims = 1, name = v.blk.11.ln1.weight, tensor_size=4096, offset=181612544, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[39]: n_dims = 2, name = v.blk.11.ffn_down.weight, tensor_size=8388608, offset=181616640, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[40]: n_dims = 2, name = v.blk.11.ffn_gate.weight, tensor_size=8388608, offset=190005248, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[41]: n_dims = 2, name = v.blk.11.ffn_up.weight, tensor_size=8388608, offset=198393856, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[42]: n_dims = 1, name = v.blk.11.ln2.weight, tensor_size=4096, offset=206782464, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[43]: n_dims = 2, name = v.blk.12.attn_k.weight, tensor_size=2097152, offset=206786560, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[44]: n_dims = 2, name = v.blk.12.attn_out.weight, tensor_size=2097152, offset=208883712, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[45]: n_dims = 2, name = v.blk.12.attn_q.weight, tensor_size=2097152, offset=210980864, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[46]: n_dims = 2, name = v.blk.12.attn_v.weight, tensor_size=2097152, offset=213078016, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[47]: n_dims = 1, name = v.blk.12.ln1.weight, tensor_size=4096, offset=215175168, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[48]: n_dims = 2, name = v.blk.12.ffn_down.weight, tensor_size=8388608, offset=215179264, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[49]: n_dims = 2, name = v.blk.12.ffn_gate.weight, tensor_size=8388608, offset=223567872, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[50]: n_dims = 2, name = v.blk.12.ffn_up.weight, tensor_size=8388608, offset=231956480, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[51]: n_dims = 1, name = v.blk.12.ln2.weight, tensor_size=4096, offset=240345088, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[52]: n_dims = 2, name = v.blk.13.attn_k.weight, tensor_size=2097152, offset=240349184, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[53]: n_dims = 2, name = v.blk.13.attn_out.weight, tensor_size=2097152, offset=242446336, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[54]: n_dims = 2, name = v.blk.13.attn_q.weight, tensor_size=2097152, offset=244543488, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[55]: n_dims = 2, name = v.blk.13.attn_v.weight, tensor_size=2097152, offset=246640640, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[56]: n_dims = 1, name = v.blk.13.ln1.weight, tensor_size=4096, offset=248737792, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[57]: n_dims = 2, name = v.blk.13.ffn_down.weight, tensor_size=8388608, offset=248741888, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[58]: n_dims = 2, name = v.blk.13.ffn_gate.weight, tensor_size=8388608, offset=257130496, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[59]: n_dims = 2, name = v.blk.13.ffn_up.weight, tensor_size=8388608, offset=265519104, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[60]: n_dims = 1, name = v.blk.13.ln2.weight, tensor_size=4096, offset=273907712, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[61]: n_dims = 2, name = v.blk.14.attn_k.weight, tensor_size=2097152, offset=273911808, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[62]: n_dims = 2, name = v.blk.14.attn_out.weight, tensor_size=2097152, offset=276008960, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[63]: n_dims = 2, name = v.blk.14.attn_q.weight, tensor_size=2097152, offset=278106112, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[64]: n_dims = 2, name = v.blk.14.attn_v.weight, tensor_size=2097152, offset=280203264, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[65]: n_dims = 1, name = v.blk.14.ln1.weight, tensor_size=4096, offset=282300416, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[66]: n_dims = 2, name = v.blk.14.ffn_down.weight, tensor_size=8388608, offset=282304512, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[67]: n_dims = 2, name = v.blk.14.ffn_gate.weight, tensor_size=8388608, offset=290693120, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[68]: n_dims = 2, name = v.blk.14.ffn_up.weight, tensor_size=8388608, offset=299081728, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[69]: n_dims = 1, name = v.blk.14.ln2.weight, tensor_size=4096, offset=307470336, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[70]: n_dims = 2, name = v.blk.15.attn_k.weight, tensor_size=2097152, offset=307474432, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[71]: n_dims = 2, name = v.blk.15.attn_out.weight, tensor_size=2097152, offset=309571584, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[72]: n_dims = 2, name = v.blk.15.attn_q.weight, tensor_size=2097152, offset=311668736, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[73]: n_dims = 2, name = v.blk.15.attn_v.weight, tensor_size=2097152, offset=313765888, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[74]: n_dims = 1, name = v.blk.15.ln1.weight, tensor_size=4096, offset=315863040, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[75]: n_dims = 2, name = v.blk.15.ffn_down.weight, tensor_size=8388608, offset=315867136, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[76]: n_dims = 2, name = v.blk.15.ffn_gate.weight, tensor_size=8388608, offset=324255744, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[77]: n_dims = 2, name = v.blk.15.ffn_up.weight, tensor_size=8388608, offset=332644352, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[78]: n_dims = 1, name = v.blk.15.ln2.weight, tensor_size=4096, offset=341032960, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[79]: n_dims = 2, name = v.blk.16.attn_k.weight, tensor_size=2097152, offset=341037056, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[80]: n_dims = 2, name = v.blk.16.attn_out.weight, tensor_size=2097152, offset=343134208, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[81]: n_dims = 2, name = v.blk.16.attn_q.weight, tensor_size=2097152, offset=345231360, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[82]: n_dims = 2, name = v.blk.16.attn_v.weight, tensor_size=2097152, offset=347328512, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[83]: n_dims = 1, name = v.blk.16.ln1.weight, tensor_size=4096, offset=349425664, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[84]: n_dims = 2, name = v.blk.16.ffn_down.weight, tensor_size=8388608, offset=349429760, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[85]: n_dims = 2, name = v.blk.16.ffn_gate.weight, tensor_size=8388608, offset=357818368, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[86]: n_dims = 2, name = v.blk.16.ffn_up.weight, tensor_size=8388608, offset=366206976, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[87]: n_dims = 1, name = v.blk.16.ln2.weight, tensor_size=4096, offset=374595584, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[88]: n_dims = 2, name = v.blk.17.attn_k.weight, tensor_size=2097152, offset=374599680, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[89]: n_dims = 2, name = v.blk.17.attn_out.weight, tensor_size=2097152, offset=376696832, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[90]: n_dims = 2, name = v.blk.17.attn_q.weight, tensor_size=2097152, offset=378793984, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[91]: n_dims = 2, name = v.blk.17.attn_v.weight, tensor_size=2097152, offset=380891136, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[92]: n_dims = 1, name = v.blk.17.ln1.weight, tensor_size=4096, offset=382988288, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[93]: n_dims = 2, name = v.blk.17.ffn_down.weight, tensor_size=8388608, offset=382992384, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[94]: n_dims = 2, name = v.blk.17.ffn_gate.weight, tensor_size=8388608, offset=391380992, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[95]: n_dims = 2, name = v.blk.17.ffn_up.weight, tensor_size=8388608, offset=399769600, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[96]: n_dims = 1, name = v.blk.17.ln2.weight, tensor_size=4096, offset=408158208, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[97]: n_dims = 2, name = v.blk.18.attn_k.weight, tensor_size=2097152, offset=408162304, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[98]: n_dims = 2, name = v.blk.18.attn_out.weight, tensor_size=2097152, offset=410259456, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[99]: n_dims = 2, name = v.blk.18.attn_q.weight, tensor_size=2097152, offset=412356608, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[100]: n_dims = 2, name = v.blk.18.attn_v.weight, tensor_size=2097152, offset=414453760, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[101]: n_dims = 1, name = v.blk.18.ln1.weight, tensor_size=4096, offset=416550912, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[102]: n_dims = 2, name = v.blk.18.ffn_down.weight, tensor_size=8388608, offset=416555008, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[103]: n_dims = 2, name = v.blk.18.ffn_gate.weight, tensor_size=8388608, offset=424943616, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[104]: n_dims = 2, name = v.blk.18.ffn_up.weight, tensor_size=8388608, offset=433332224, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[105]: n_dims = 1, name = v.blk.18.ln2.weight, tensor_size=4096, offset=441720832, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[106]: n_dims = 2, name = v.blk.19.attn_k.weight, tensor_size=2097152, offset=441724928, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[107]: n_dims = 2, name = v.blk.19.attn_out.weight, tensor_size=2097152, offset=443822080, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[108]: n_dims = 2, name = v.blk.19.attn_q.weight, tensor_size=2097152, offset=445919232, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[109]: n_dims = 2, name = v.blk.19.attn_v.weight, tensor_size=2097152, offset=448016384, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[110]: n_dims = 1, name = v.blk.19.ln1.weight, tensor_size=4096, offset=450113536, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[111]: n_dims = 2, name = v.blk.19.ffn_down.weight, tensor_size=8388608, offset=450117632, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[112]: n_dims = 2, name = v.blk.19.ffn_gate.weight, tensor_size=8388608, offset=458506240, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[113]: n_dims = 2, name = v.blk.19.ffn_up.weight, tensor_size=8388608, offset=466894848, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[114]: n_dims = 1, name = v.blk.19.ln2.weight, tensor_size=4096, offset=475283456, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[115]: n_dims = 2, name = v.blk.2.attn_k.weight, tensor_size=2097152, offset=475287552, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[116]: n_dims = 2, name = v.blk.2.attn_out.weight, tensor_size=2097152, offset=477384704, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[117]: n_dims = 2, name = v.blk.2.attn_q.weight, tensor_size=2097152, offset=479481856, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[118]: n_dims = 2, name = v.blk.2.attn_v.weight, tensor_size=2097152, offset=481579008, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[119]: n_dims = 1, name = v.blk.2.ln1.weight, tensor_size=4096, offset=483676160, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[120]: n_dims = 2, name = v.blk.2.ffn_down.weight, tensor_size=8388608, offset=483680256, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[121]: n_dims = 2, name = v.blk.2.ffn_gate.weight, tensor_size=8388608, offset=492068864, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[122]: n_dims = 2, name = v.blk.2.ffn_up.weight, tensor_size=8388608, offset=500457472, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[123]: n_dims = 1, name = v.blk.2.ln2.weight, tensor_size=4096, offset=508846080, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[124]: n_dims = 2, name = v.blk.20.attn_k.weight, tensor_size=2097152, offset=508850176, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[125]: n_dims = 2, name = v.blk.20.attn_out.weight, tensor_size=2097152, offset=510947328, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[126]: n_dims = 2, name = v.blk.20.attn_q.weight, tensor_size=2097152, offset=513044480, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[127]: n_dims = 2, name = v.blk.20.attn_v.weight, tensor_size=2097152, offset=515141632, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[128]: n_dims = 1, name = v.blk.20.ln1.weight, tensor_size=4096, offset=517238784, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[129]: n_dims = 2, name = v.blk.20.ffn_down.weight, tensor_size=8388608, offset=517242880, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[130]: n_dims = 2, name = v.blk.20.ffn_gate.weight, tensor_size=8388608, offset=525631488, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[131]: n_dims = 2, name = v.blk.20.ffn_up.weight, tensor_size=8388608, offset=534020096, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[132]: n_dims = 1, name = v.blk.20.ln2.weight, tensor_size=4096, offset=542408704, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[133]: n_dims = 2, name = v.blk.21.attn_k.weight, tensor_size=2097152, offset=542412800, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[134]: n_dims = 2, name = v.blk.21.attn_out.weight, tensor_size=2097152, offset=544509952, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[135]: n_dims = 2, name = v.blk.21.attn_q.weight, tensor_size=2097152, offset=546607104, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[136]: n_dims = 2, name = v.blk.21.attn_v.weight, tensor_size=2097152, offset=548704256, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[137]: n_dims = 1, name = v.blk.21.ln1.weight, tensor_size=4096, offset=550801408, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[138]: n_dims = 2, name = v.blk.21.ffn_down.weight, tensor_size=8388608, offset=550805504, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[139]: n_dims = 2, name = v.blk.21.ffn_gate.weight, tensor_size=8388608, offset=559194112, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[140]: n_dims = 2, name = v.blk.21.ffn_up.weight, tensor_size=8388608, offset=567582720, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[141]: n_dims = 1, name = v.blk.21.ln2.weight, tensor_size=4096, offset=575971328, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[142]: n_dims = 2, name = v.blk.22.attn_k.weight, tensor_size=2097152, offset=575975424, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[143]: n_dims = 2, name = v.blk.22.attn_out.weight, tensor_size=2097152, offset=578072576, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[144]: n_dims = 2, name = v.blk.22.attn_q.weight, tensor_size=2097152, offset=580169728, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[145]: n_dims = 2, name = v.blk.22.attn_v.weight, tensor_size=2097152, offset=582266880, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[146]: n_dims = 1, name = v.blk.22.ln1.weight, tensor_size=4096, offset=584364032, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[147]: n_dims = 2, name = v.blk.22.ffn_down.weight, tensor_size=8388608, offset=584368128, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[148]: n_dims = 2, name = v.blk.22.ffn_gate.weight, tensor_size=8388608, offset=592756736, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[149]: n_dims = 2, name = v.blk.22.ffn_up.weight, tensor_size=8388608, offset=601145344, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[150]: n_dims = 1, name = v.blk.22.ln2.weight, tensor_size=4096, offset=609533952, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[151]: n_dims = 2, name = v.blk.23.attn_k.weight, tensor_size=2097152, offset=609538048, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[152]: n_dims = 2, name = v.blk.23.attn_out.weight, tensor_size=2097152, offset=611635200, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[153]: n_dims = 2, name = v.blk.23.attn_q.weight, tensor_size=2097152, offset=613732352, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[154]: n_dims = 2, name = v.blk.23.attn_v.weight, tensor_size=2097152, offset=615829504, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[155]: n_dims = 1, name = v.blk.23.ln1.weight, tensor_size=4096, offset=617926656, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[156]: n_dims = 2, name = v.blk.23.ffn_down.weight, tensor_size=8388608, offset=617930752, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[157]: n_dims = 2, name = v.blk.23.ffn_gate.weight, tensor_size=8388608, offset=626319360, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[158]: n_dims = 2, name = v.blk.23.ffn_up.weight, tensor_size=8388608, offset=634707968, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[159]: n_dims = 1, name = v.blk.23.ln2.weight, tensor_size=4096, offset=643096576, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[160]: n_dims = 2, name = v.blk.3.attn_k.weight, tensor_size=2097152, offset=643100672, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[161]: n_dims = 2, name = v.blk.3.attn_out.weight, tensor_size=2097152, offset=645197824, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[162]: n_dims = 2, name = v.blk.3.attn_q.weight, tensor_size=2097152, offset=647294976, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[163]: n_dims = 2, name = v.blk.3.attn_v.weight, tensor_size=2097152, offset=649392128, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[164]: n_dims = 1, name = v.blk.3.ln1.weight, tensor_size=4096, offset=651489280, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[165]: n_dims = 2, name = v.blk.3.ffn_down.weight, tensor_size=8388608, offset=651493376, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[166]: n_dims = 2, name = v.blk.3.ffn_gate.weight, tensor_size=8388608, offset=659881984, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[167]: n_dims = 2, name = v.blk.3.ffn_up.weight, tensor_size=8388608, offset=668270592, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[168]: n_dims = 1, name = v.blk.3.ln2.weight, tensor_size=4096, offset=676659200, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[169]: n_dims = 2, name = v.blk.4.attn_k.weight, tensor_size=2097152, offset=676663296, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[170]: n_dims = 2, name = v.blk.4.attn_out.weight, tensor_size=2097152, offset=678760448, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[171]: n_dims = 2, name = v.blk.4.attn_q.weight, tensor_size=2097152, offset=680857600, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[172]: n_dims = 2, name = v.blk.4.attn_v.weight, tensor_size=2097152, offset=682954752, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[173]: n_dims = 1, name = v.blk.4.ln1.weight, tensor_size=4096, offset=685051904, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[174]: n_dims = 2, name = v.blk.4.ffn_down.weight, tensor_size=8388608, offset=685056000, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[175]: n_dims = 2, name = v.blk.4.ffn_gate.weight, tensor_size=8388608, offset=693444608, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[176]: n_dims = 2, name = v.blk.4.ffn_up.weight, tensor_size=8388608, offset=701833216, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[177]: n_dims = 1, name = v.blk.4.ln2.weight, tensor_size=4096, offset=710221824, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[178]: n_dims = 2, name = v.blk.5.attn_k.weight, tensor_size=2097152, offset=710225920, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[179]: n_dims = 2, name = v.blk.5.attn_out.weight, tensor_size=2097152, offset=712323072, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[180]: n_dims = 2, name = v.blk.5.attn_q.weight, tensor_size=2097152, offset=714420224, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[181]: n_dims = 2, name = v.blk.5.attn_v.weight, tensor_size=2097152, offset=716517376, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[182]: n_dims = 1, name = v.blk.5.ln1.weight, tensor_size=4096, offset=718614528, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[183]: n_dims = 2, name = v.blk.5.ffn_down.weight, tensor_size=8388608, offset=718618624, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[184]: n_dims = 2, name = v.blk.5.ffn_gate.weight, tensor_size=8388608, offset=727007232, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[185]: n_dims = 2, name = v.blk.5.ffn_up.weight, tensor_size=8388608, offset=735395840, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[186]: n_dims = 1, name = v.blk.5.ln2.weight, tensor_size=4096, offset=743784448, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[187]: n_dims = 2, name = v.blk.6.attn_k.weight, tensor_size=2097152, offset=743788544, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[188]: n_dims = 2, name = v.blk.6.attn_out.weight, tensor_size=2097152, offset=745885696, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[189]: n_dims = 2, name = v.blk.6.attn_q.weight, tensor_size=2097152, offset=747982848, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[190]: n_dims = 2, name = v.blk.6.attn_v.weight, tensor_size=2097152, offset=750080000, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[191]: n_dims = 1, name = v.blk.6.ln1.weight, tensor_size=4096, offset=752177152, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[192]: n_dims = 2, name = v.blk.6.ffn_down.weight, tensor_size=8388608, offset=752181248, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[193]: n_dims = 2, name = v.blk.6.ffn_gate.weight, tensor_size=8388608, offset=760569856, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[194]: n_dims = 2, name = v.blk.6.ffn_up.weight, tensor_size=8388608, offset=768958464, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[195]: n_dims = 1, name = v.blk.6.ln2.weight, tensor_size=4096, offset=777347072, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[196]: n_dims = 2, name = v.blk.7.attn_k.weight, tensor_size=2097152, offset=777351168, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[197]: n_dims = 2, name = v.blk.7.attn_out.weight, tensor_size=2097152, offset=779448320, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[198]: n_dims = 2, name = v.blk.7.attn_q.weight, tensor_size=2097152, offset=781545472, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[199]: n_dims = 2, name = v.blk.7.attn_v.weight, tensor_size=2097152, offset=783642624, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[200]: n_dims = 1, name = v.blk.7.ln1.weight, tensor_size=4096, offset=785739776, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[201]: n_dims = 2, name = v.blk.7.ffn_down.weight, tensor_size=8388608, offset=785743872, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[202]: n_dims = 2, name = v.blk.7.ffn_gate.weight, tensor_size=8388608, offset=794132480, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[203]: n_dims = 2, name = v.blk.7.ffn_up.weight, tensor_size=8388608, offset=802521088, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[204]: n_dims = 1, name = v.blk.7.ln2.weight, tensor_size=4096, offset=810909696, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[205]: n_dims = 2, name = v.blk.8.attn_k.weight, tensor_size=2097152, offset=810913792, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[206]: n_dims = 2, name = v.blk.8.attn_out.weight, tensor_size=2097152, offset=813010944, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[207]: n_dims = 2, name = v.blk.8.attn_q.weight, tensor_size=2097152, offset=815108096, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[208]: n_dims = 2, name = v.blk.8.attn_v.weight, tensor_size=2097152, offset=817205248, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[209]: n_dims = 1, name = v.blk.8.ln1.weight, tensor_size=4096, offset=819302400, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[210]: n_dims = 2, name = v.blk.8.ffn_down.weight, tensor_size=8388608, offset=819306496, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[211]: n_dims = 2, name = v.blk.8.ffn_gate.weight, tensor_size=8388608, offset=827695104, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[212]: n_dims = 2, name = v.blk.8.ffn_up.weight, tensor_size=8388608, offset=836083712, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[213]: n_dims = 1, name = v.blk.8.ln2.weight, tensor_size=4096, offset=844472320, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[214]: n_dims = 2, name = v.blk.9.attn_k.weight, tensor_size=2097152, offset=844476416, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[215]: n_dims = 2, name = v.blk.9.attn_out.weight, tensor_size=2097152, offset=846573568, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[216]: n_dims = 2, name = v.blk.9.attn_q.weight, tensor_size=2097152, offset=848670720, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[217]: n_dims = 2, name = v.blk.9.attn_v.weight, tensor_size=2097152, offset=850767872, shape:[1024, 1024, 1, 1], type = f16
clip_model_loader: tensor[218]: n_dims = 1, name = v.blk.9.ln1.weight, tensor_size=4096, offset=852865024, shape:[1024, 1, 1, 1], type = f32
clip_model_loader: tensor[219]: n_dims = 2, name = v.blk.9.ffn_down.weight, tensor_size=8388608, offset=852869120, shape:[4096, 1024, 1, 1], type = f16
clip_model_loader: tensor[220]: n_dims = 2, name = v.blk.9.ffn_gate.weight, tensor_size=8388608, offset=861257728, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[221]: n_dims = 2, name = v.blk.9.ffn_up.weight, tensor_size=8388608, offset=869646336, shape:[1024, 4096, 1, 1], type = f16
clip_model_loader: tensor[222]: n_dims = 1, name = v.blk.9.ln2.weight, tensor_size=4096, offset=878034944, shape:[1024, 1, 1, 1], type = f32
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector:          pixtral
load_hparams: n_embd:             1024
load_hparams: n_head:             16
load_hparams: n_ff:               4096
load_hparams: n_layer:            24
load_hparams: ffn_op:             silu
load_hparams: projection_dim:     5120

--- vision hparams ---
load_hparams: image_size:         1540
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  0
load_hparams: n_wa_pattern:       0

load_hparams: model size:         837.36 MiB
load_hparams: metadata size:      0.08 MiB
load_tensors: loaded 223 tensors from e:\neuro\LLM-server\models\mistral-small-3.1-24b-instruct-2503\mmproj-F16.gguf
alloc_compute_meta:      CUDA0 compute buffer size =     2.97 MiB
alloc_compute_meta:        CPU compute buffer size =     0.14 MiB
srv    load_model: loaded multimodal model, 'e:\neuro\LLM-server\models\mistral-small-3.1-24b-instruct-2503\mmproj-F16.gguf'
srv    load_model: ctx_shift is not supported by multimodal, it will be disabled
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 131072
slot        reset: id  0 | task -1 |
main: model loaded
main: chat template, chat_template: {%- set today = strftime_now("%Y-%m-%d") %}
{%- set default_system_message = "You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.\nYour knowledge base was last updated on 2023-10-01. The current date is " + today + ".\n\nWhen you're not sure about some information, you say that you don't have the information and don't make up anything.\nIf the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \"What are some good restaurants around me?\" => \"Where are you?\" or \"When is the next flight to Tokyo\" => \"Where do you travel from?\")" %}

{{- bos_token }}

{%- if messages[0]['role'] == 'system' %}
    {%- if messages[0]['content'] is string %}
        {%- set system_message = messages[0]['content'] %}
    {%- else %}
        {%- set system_message = messages[0]['content'][0]['text'] %}
    {%- endif %}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set system_message = default_system_message %}
    {%- set loop_messages = messages %}
{%- endif %}
{{- '[SYSTEM_PROMPT]' + system_message + '[/SYSTEM_PROMPT]' }}

{%- for message in loop_messages %}
    {%- if message['role'] == 'user' %}
        {%- if message['content'] is string %}
            {{- '[INST]' + message['content'] + '[/INST]' }}
        {%- else %}
            {{- '[INST]' }}
            {%- for block in message['content'] %}
                {%- if block['type'] == 'text' %}
                    {{- block['text'] }}
                {%- elif block['type'] in ['image', 'image_url'] %}
                    {{- '[IMG]' }}
                {%- else %}
                    {{- raise_exception('Only text and image blocks are supported in message content!') }}
                {%- endif %}
            {%- endfor %}
            {{- '[/INST]' }}
        {%- endif %}
    {%- elif message['role'] == 'system' %}
        {%- if message['content'] is string %}
            {{- '[SYSTEM_PROMPT]' + message['content'] + '[/SYSTEM_PROMPT]' }}
        {%- else %}
            {{- '[SYSTEM_PROMPT]' + message['content'][0]['text'] + '[/SYSTEM_PROMPT]' }}
        {%- endif %}
    {%- elif message['role'] == 'assistant' %}
        {%- if message['content'] is string %}
            {{- message['content'] + eos_token }}
        {%- else %}
            {{- message['content'][0]['text'] + eos_token }}
        {%- endif %}
    {%- else %}
        {{- raise_exception('Only user, system and assistant roles are supported!') }}
    {%- endif %}
{%- endfor %}, example_format: '[SYSTEM_PROMPT]You are a helpful assistant[/SYSTEM_PROMPT][INST]Hello[/INST]Hi there</s>[INST]How are you?[/INST]'
main: server is listening on http://0.0.0.0:5000 - starting the main loop
que    start_loop: processing new tasks
que    start_loop: update slots
srv  update_slots: all slots are idle
srv  kv_cache_cle: clearing KV cache
que    start_loop: waiting for new tasks
request: {"stream": true, "model": "local", "messages": [{"role": "user", "content": [{"type": "text", "text": "Describe image"}, {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAIAAACQkWg2AAACpElEQVQozzXOTW8TRxgA4Jl3ZnbW9q7XjYP5SAkI0gOCUBVVqnrosTf+SnvpBfW/oEqNhOiZQ0+thAoWglRNcJRAaETsmBgHl3iz3o/ZmXlfTn1+wcN/vPdDoFQrjjqdpLu8FLfj1dUrUgaOZLulvPeLfJFl2fBwlM5TYwzY2lrrqrIqiwJtraQEznYGuz//dO/sNA215owjIgBYW5d5DowxRCREdG6Rl0KKvzcHDx7+/qK/+W7yQWttncuyxfx0XlemqgwgESJ6j0gUR4391weDnf2Oxu/v3r1//9fhcMQYzWazqqq8s5wxII+I5BEZkQD+5PHz337ZeLk9uHylN/r3YHg4XOQLQnLOVcYSIuD/JQ58OpvnRd1OOsej8aONjcl4Mj6aaK0ZY4wx65yxDjgRQ4/oAWBrcPDP5lbSTa6trXU6S4TYf/pMCAh0wBhDT4gEnAitA6LSuH5/25RVs9W8dG01L81Xd9bP9ZZf7b2pqqo2NWMMgEuyNuq0VdT+46+XJ5NpA2g2mZbpfPT2KAh1b+XSdHISxU1EFAKUlCCEyLz68/HW3vZuEKisMMfjY4H+m+++NaZev7N+ceWCc16pIGw0wmZD7g1n76b76ftJM07K9D/gPFn6LLPMzM9ufLnOOQghlQp0GAprOXDR0kFVFCpKbFn4ugIpFPPe2ens7GR4eK7XlTqM4kgITkTkHDRarSBqS18zW2qtlYBAcsdEZ7nbTuIbt2+GgRodvK1KEwaKbC0/P5+M388/2qrRbBAiU0IIasfJ1S+up8dHQHj+4oVA8PR0TsBMvpB1nnWbgCzx1nLOrfOtuHV1Zbm3pPtvit2dV8XpR1ebPMu8dUeHI/71rTUBoLQCDkqA8+i5DLklxhFJ6SBqhpzwwywFYGeL8hNBop2+ysnEKQAAAABJRU5ErkJggg=="}}]}]}
add_text: [SYSTEM_PROMPT]You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.
Your knowledge base was last updated on 2023-10-01. The current date is 2025-05-29.

When you're not sure about some information, you say that you don't have the information and don't make up anything.
If the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. "What are some good restaurants around me?" => "Where are you?" or "When is the next flight to Tokyo" => "Where do you travel from?")[/SYSTEM_PROMPT][INST]Describe image

image_tokens->nx = 1
image_tokens->ny = 1
batch_f32 size = 1
add_text: [IMG_END]
add_text: [/INST]
srv  params_from_: Grammar:
srv  params_from_: Grammar lazy: false
srv  params_from_: Chat format: Content-only
srv  add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que          post: new task, id = 0/1, front = 0
que    start_loop: processing new tasks
que    start_loop: processing task, id = 0
slot get_availabl: id  0 | task -1 | selected slot by lru, t_last = -1
slot        reset: id  0 | task -1 |
slot launch_slot_: id  0 | task 0 | launching slot : {"id":0,"id_task":0,"n_ctx":131072,"speculative":false,"is_processing":false,"non_causal":false,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.20000000298023224,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":64,"top_p":0.699999988079071,"min_p":0.009999999776482582,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":131072,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":true,"thinking_forced_open":false,"samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":0,"speculative.p_min":0.75,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"<s>[SYSTEM_PROMPT]You are Mistral Small 3, a Large Language Model (LLM) created by Mistral AI, a French startup headquartered in Paris.\nYour knowledge base was last updated on 2023-10-01. The current date is 2025-05-29.\n\nWhen you're not sure about some information, you say that you don't have the information and don't make up anything.\nIf the user's question is not clear, ambiguous, or does not provide enough context for you to accurately answer the question, you do not try to answer it right away and you rather ask the user to clarify their request (e.g. \"What are some good restaurants around me?\" => \"Where are you?\" or \"When is the next flight to Tokyo\" => \"Where do you travel from?\")[/SYSTEM_PROMPT][INST]Describe image\n[IMG_END][/INST]","next_token":{"has_next_token":true,"has_new_line":false,"n_remain":-1,"n_decoded":0,"stopping_word":""}}
slot launch_slot_: id  0 | task 0 | processing task
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 1, front = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 183
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 180, n_tokens = 180, progress = 0.983607
srv  update_slots: decoding batch, n_tokens = 180
set_embeddings: value = 0
clear_adapter_lora: call
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 1
encoding image slice...
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 2, front = 0
slot update_slots: id  0 | task 0 | kv cache rm [180, end)
srv  process_chun: processing image...
image slice encoded in 110 ms
decoding image batch 1/1, n_tokens_batch = 1
image decoded (batch 1/1) in 13 ms
srv  process_chun: image processed in 124 ms
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 183, n_tokens = 2, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 183, n_tokens = 2
srv  update_slots: decoding batch, n_tokens = 2
set_embeddings: value = 0
clear_adapter_lora: call
CUDA error: an illegal memory access was encountered
  current device: 1, in function ggml_backend_cuda_synchronize at C:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2461
  cudaStreamSynchronize(cuda_ctx->stream())
C:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error
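
If it helps triage: the faulting kernel can usually be localized more precisely than the cudaStreamSynchronize call that merely reports it. A sketch, assuming the CUDA toolkit's compute-sanitizer is on PATH (CUDA_LAUNCH_BLOCKING forces synchronous launches so the error surfaces at the offending call):

set CUDA_LAUNCH_BLOCKING=1
compute-sanitizer --tool memcheck llama-server.exe --host 0.0.0.0 --port 5000 (remaining arguments as above)

Since the system info reports USE_GRAPHS = 1, setting GGML_CUDA_DISABLE_GRAPHS=1 (if the build honors it) may also help narrow down whether CUDA graph capture is involved.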
