
bug: segfault on q4_0 repacking on windows avx2 cpu only #16479

@LostRuins

Description

This is a continuation of #14904 (comment).
I encounter a segfault in llama-cli with Q4_0 models that use CPU repacking on Windows with AVX2.
Running on GPU is fine, and everything also works fine if repacking is disabled.

Model used: gemma-3-4b-it-Q4_0.gguf
gcc version 12.2.0 (GCC)
Win64DevKit on Windows 10 LTSC
i9-13980HX CPU

First commit with issue: #14904

Compiling in debug mode triggers no assertions. Here is my debug build running under gdb:

D:\llama.cpp\bin>gdb --args llama-cli.exe --model D:\llama.cpp\gemma-3-4b-it-qat-Q4_0.gguf -p hello
Reading symbols from llama-cli.exe...
(gdb) run
Starting program: D:\llama.cpp\bin\llama-cli.exe --model D:\llama.cpp\gemma-3-4b-it-qat-Q4_0.gguf -p hello
[New Thread 26384.0x2ed8]
[New Thread 26384.0x9dfc]
[New Thread 26384.0x92d0]
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (13th Gen Intel(R) Core(TM) i9-13980HX)
[New Thread 26384.0x3898]
[Thread 26384.0x3898 exited with code 0]
[New Thread 26384.0x3190]
build: 6713 (d2ee056e) with cc (GCC) 12.2.0 for x86_64-w64-mingw32 (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 41 key-value pairs and 444 tensors from D:\llama.cpp\gemma-3-4b-it-qat-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 3 4b It Qat Q4_0 Unquantized
llama_model_loader: - kv   3:                           general.finetune str              = it-qat-unquantized
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 4B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 4b It
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,4]       = ["gemma3", "gemma", "google", "image-...
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 2560
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 34
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 10240
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 8
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  20:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  22:             gemma3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  23:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  24:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,262208]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  28:                      tokenizer.ggml.scores arr[f32,262208]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,262208]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  32:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  38:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 2
llama_model_loader: - type  f32:  205 tensors
llama_model_loader: - type q4_0:  238 tensors
llama_model_loader: - type q8_0:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 2.35 GiB (5.19 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 1 ('<eos>')
load:   - 106 ('<end_of_turn>')
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2560
print_info: n_layer          = 34
print_info: n_head           = 8
print_info: n_head_kv        = 4
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 10240
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 4B
print_info: model params     = 3.88 B
print_info: general.name     = Gemma 3 4b It Qat Q4_0 Unquantized
print_info: vocab type       = SPM
print_info: n_vocab          = 262208
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_REPACK model buffer size =  1721.25 MiB
load_tensors:   CPU_Mapped model buffer size =  2402.82 MiB
.........................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache:        CPU KV buffer size =    80.00 MiB
llama_kv_cache: size =   80.00 MiB (  4096 cells,   5 layers,  1/1 seqs), K (f16):   40.00 MiB, V (f16):   40.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 1536 cells
llama_kv_cache:        CPU KV buffer size =   174.00 MiB
llama_kv_cache: size =  174.00 MiB (  1536 cells,  29 layers,  1/1 seqs), K (f16):   87.00 MiB, V (f16):   87.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:        CPU compute buffer size =   517.12 MiB
llama_context: graph nodes  = 1369
llama_context: graph splits = 1
common_init_from_params: added <eos> logit bias = -inf
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
[New Thread 26384.0x8b08]
[New Thread 26384.0x6fcc]
[New Thread 26384.0x86a4]
[New Thread 26384.0x538c]
[New Thread 26384.0x5718]
[New Thread 26384.0x2c40]
[New Thread 26384.0x52b4]
[New Thread 26384.0x4680]
[New Thread 26384.0x909c]
[New Thread 26384.0x94d4]
[New Thread 26384.0x9870]
[New Thread 26384.0x7270]
[New Thread 26384.0x737c]
[New Thread 26384.0x8cf8]
[New Thread 26384.0x5bc8]
Thread 19 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 26384.0x737c]
0x00007ff7c408965b in mul_sum_us8_pairs_acc_int32x8 (acc=..., ax=..., sy=...) at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:151
151     static inline __m256i mul_sum_us8_pairs_acc_int32x8(const __m256i acc, const __m256i ax, const __m256i sy) {

I then obtained the backtrace:

(gdb) bt
#0  0x00007ff7c408965b in mul_sum_us8_pairs_acc_int32x8 (acc=..., ax=..., sy=...)
    at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:151
#1  0x00007ff7c4089803 in mul_sum_i8_pairs_acc_int32x8 (acc=..., x=..., y=...)
    at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:173
#2  0x00007ff7c40ae911 in gemv_q4_b32_8x8_q8_0_lut_avx<block<4, 8> >(int, float * __restrict__, size_t, const void * __restrict__, const void * __restrict__, int, int, __m256i) (n=2560, s=0x202907a2480, bs=2048, vx=0x2038bc5a080,
    vy=0x202ee906740, nr=1, nc=128, signextendlut=...) at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:603
#3  0x00007ff7c408d012 in ggml_gemv_q4_0_8x8_q8_0(int, float * __restrict__, size_t, const void * __restrict__, const void * __restrict__, int, int) (n=2560, s=0x202907a2480, bs=2048, vx=0x2038bc5a080, vy=0x202ee906740, nr=1, nc=128)
    at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:1383
#4  0x00007ff7c408781f in ggml::cpu::repack::gemv<block_q4_0, 8ll, 8ll, (ggml_type)8> (n=2560, s=0x202907a2480,
    bs=2048, vx=0x2038bc5a080, vy=0x202ee906740, nr=1, nc=128) at D:/llama.cpp/ggml/src/ggml-cpu/repack.cpp:1501
#5  0x00007ff7c425f23b in ggml::cpu::repack::tensor_traits<block_q4_0, 8ll, 8ll, (ggml_type)8>::forward_mul_mat (
    this=0x7ff7c4573730 <ggml_repack_get_optimal_repack_type(ggml_tensor const*)::q4_0_8x8_q8_0>,
    params=0x1a4ebff750, op=0x202f0228940) at D:/llama.cpp/ggml/src/ggml-cpu/repack.cpp:1666
#6  0x00007ff7c425e864 in ggml::cpu::repack::tensor_traits<block_q4_0, 8ll, 8ll, (ggml_type)8>::compute_forward (
    this=0x7ff7c4573730 <ggml_repack_get_optimal_repack_type(ggml_tensor const*)::q4_0_8x8_q8_0>,
    params=0x1a4ebff750, op=0x202f0228940) at D:/llama.cpp/ggml/src/ggml-cpu/repack.cpp:1591
#7  0x00007ff7c4088a21 in ggml_cpu_extra_compute_forward (params=0x1a4ebff750, op=0x202f0228940)
    at D:/llama.cpp/ggml/src/ggml-cpu/traits.cpp:17
#8  0x00007ff7c40bd4b0 in ggml_compute_forward (params=0x1a4ebff750, tensor=0x202f0228940)
    at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:1669
#9  0x00007ff7c40bf1f8 in ggml_graph_compute_thread (data=0x202f0aa9d60)
    at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2883
#10 0x00007ff7c40c0153 in ggml_graph_compute._omp_fn.0 () at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3172
#11 0x00007ff7c41c9c75 in gomp_thread_start ()
#12 0x00007ff7c41ea993 in pthread_create_wrapper ()
#13 0x00007ffbaa8daf5a in msvcrt!_beginthreadex () from C:\Windows\System32\msvcrt.dll
#14 0x00007ffbaa8db02c in msvcrt!_endthreadex () from C:\Windows\System32\msvcrt.dll
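
For reference, frame #0 lands in the AVX2 integer dot-product helper used by the repacked GEMV kernels. Below is a minimal standalone sketch of the non-VNNI path, assuming it mirrors the mul_sum_*_pairs helpers used elsewhere in ggml (the exact preprocessor branches in repack.cpp may differ). Note that it operates purely on register operands, so a memory fault here would point at how its __m256i arguments are materialized rather than at the arithmetic itself.

// Minimal sketch of the frame #0/#1 helpers (non-VNNI AVX2 fallback); an
// approximation for illustration, not the verbatim source of repack.cpp.
#include <immintrin.h>

// widen 16-bit products to 32-bit lanes and add them to the accumulator
static inline __m256i sum_i16_pairs_acc_int32x8(const __m256i acc, const __m256i x) {
    const __m256i ones = _mm256_set1_epi16(1);
    return _mm256_add_epi32(acc, _mm256_madd_epi16(ones, x));
}

// frame #0: dot product of unsigned*signed int8 pairs, accumulated into int32x8
static inline __m256i mul_sum_us8_pairs_acc_int32x8(const __m256i acc, const __m256i ax, const __m256i sy) {
    const __m256i dot = _mm256_maddubs_epi16(ax, sy);
    return sum_i16_pairs_acc_int32x8(acc, dot);
}

// frame #1: signed*signed variant, built on top of the unsigned*signed one
static inline __m256i mul_sum_i8_pairs_acc_int32x8(const __m256i acc, const __m256i x, const __m256i y) {
    const __m256i ax = _mm256_sign_epi8(x, x); // |x|
    const __m256i sy = _mm256_sign_epi8(y, x); // y with x's signs applied
    return mul_sum_us8_pairs_acc_int32x8(acc, ax, sy);
}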

I ran it again with -t 1

D:\llama.cpp\bin>gdb --args llama-cli.exe --model D:\llama.cpp\gemma-3-4b-it-qat-Q4_0.gguf -p hello -t 1
Reading symbols from llama-cli.exe...
(gdb) run
Starting program: D:\llama.cpp\bin\llama-cli.exe --model D:\llama.cpp\gemma-3-4b-it-qat-Q4_0.gguf -p hello -t 1
[New Thread 18200.0x6cd0]
[New Thread 18200.0x2bd4]
[New Thread 18200.0x901c]
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (13th Gen Intel(R) Core(TM) i9-13980HX)
[New Thread 18200.0x8fe0]
[Thread 18200.0x8fe0 exited with code 0]
[New Thread 18200.0x9a0c]
build: 6713 (d2ee056e) with cc (GCC) 12.2.0 for x86_64-w64-mingw32 (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 41 key-value pairs and 444 tensors from D:\llama.cpp\gemma-3-4b-it-qat-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 3 4b It Qat Q4_0 Unquantized
llama_model_loader: - kv   3:                           general.finetune str              = it-qat-unquantized
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 4B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 4b It
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,4]       = ["gemma3", "gemma", "google", "image-...
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 2560
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 34
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 10240
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 8
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  20:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  22:             gemma3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  23:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  24:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,262208]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  28:                      tokenizer.ggml.scores arr[f32,262208]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,262208]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  32:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  38:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 2
llama_model_loader: - type  f32:  205 tensors
llama_model_loader: - type q4_0:  238 tensors
llama_model_loader: - type q8_0:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 2.35 GiB (5.19 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 1 ('<eos>')
load:   - 106 ('<end_of_turn>')
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2560
print_info: n_layer          = 34
print_info: n_head           = 8
print_info: n_head_kv        = 4
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 10240
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 4B
print_info: model params     = 3.88 B
print_info: general.name     = Gemma 3 4b It Qat Q4_0 Unquantized
print_info: vocab type       = SPM
print_info: n_vocab          = 262208
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_REPACK model buffer size =  1721.25 MiB
load_tensors:   CPU_Mapped model buffer size =  2402.82 MiB
.........................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache:        CPU KV buffer size =    80.00 MiB
llama_kv_cache: size =   80.00 MiB (  4096 cells,   5 layers,  1/1 seqs), K (f16):   40.00 MiB, V (f16):   40.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 1536 cells
llama_kv_cache:        CPU KV buffer size =   174.00 MiB
llama_kv_cache: size =  174.00 MiB (  1536 cells,  29 layers,  1/1 seqs), K (f16):   87.00 MiB, V (f16):   87.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:        CPU compute buffer size =   517.12 MiB
llama_context: graph nodes  = 1369
llama_con
Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ff7c408cfd7 in ggml_gemv_q4_0_8x8_q8_0(int, float * __restrict__, size_t, const void * __restrict__, const void * __restrict__, int, int) (n=2560, s=0x11e2d22c880, bs=2048, vx=0x11dac138080, vy=0x11d0efd6720, nr=1, nc=2048) at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:1383
1383            gemv_q4_b32_8x8_q8_0_lut_avx<block_q4_0x8>(n, s, bs, vx, vy, nr, nc, signextendlut);
(gdb) bt
#0  0x00007ff7c408cfd7 in ggml_gemv_q4_0_8x8_q8_0(int, float * __restrict__, size_t, const void * __restrict__, const void * __restrict__, int, int) (n=2560, s=0x11e2d22c880, bs=2048, vx=0x11dac138080, vy=0x11d0efd6720, nr=1, nc=2048)
    at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:1383
#1  0x00007ff7c408781f in ggml::cpu::repack::gemv<block_q4_0, 8ll, 8ll, (ggml_type)8> (n=2560, s=0x11e2d22c880,
    bs=2048, vx=0x11dac138080, vy=0x11d0efd6720, nr=1, nc=2048) at D:/llama.cpp/ggml/src/ggml-cpu/repack.cpp:1501
#2  0x00007ff7c425f23b in ggml::cpu::repack::tensor_traits<block_q4_0, 8ll, 8ll, (ggml_type)8>::forward_mul_mat (
    this=0x7ff7c4573730 <ggml_repack_get_optimal_repack_type(ggml_tensor const*)::q4_0_8x8_q8_0>,
    params=0x17599f8020, op=0x11d10977940) at D:/llama.cpp/ggml/src/ggml-cpu/repack.cpp:1666
#3  0x00007ff7c425e864 in ggml::cpu::repack::tensor_traits<block_q4_0, 8ll, 8ll, (ggml_type)8>::compute_forward (
    this=0x7ff7c4573730 <ggml_repack_get_optimal_repack_type(ggml_tensor const*)::q4_0_8x8_q8_0>,
    params=0x17599f8020, op=0x11d10977940) at D:/llama.cpp/ggml/src/ggml-cpu/repack.cpp:1591
#4  0x00007ff7c4088a21 in ggml_cpu_extra_compute_forward (params=0x17599f8020, op=0x11d10977940)
    at D:/llama.cpp/ggml/src/ggml-cpu/traits.cpp:17
#5  0x00007ff7c40bd4b0 in ggml_compute_forward (params=0x17599f8020, tensor=0x11d10977940)
    at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:1669
#6  0x00007ff7c40bf1f8 in ggml_graph_compute_thread (data=0x11d111e9880)
    at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2883
#7  0x00007ff7c40bf7d6 in ggml_graph_compute (cgraph=0x11d1124cf38, cplan=0x17599f8350)
    at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3176
#8  0x00007ff7c407d5fb in ggml_backend_cpu_graph_compute (backend=0x11d0ef5aca0, cgraph=0x11d1124cf38)
    at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.cpp:186
#9  0x00007ff7c41938a9 in ggml_backend_graph_compute_async (backend=0x11d0ef5aca0, cgraph=0x11d1124cf38)
    at D:/llama.cpp/ggml/src/ggml-backend.cpp:359
#10 0x00007ff7c419842f in ggml_backend_sched_compute_splits (sched=0x11d0ef564a0)
    at D:/llama.cpp/ggml/src/ggml-backend.cpp:1553
#11 0x00007ff7c4199130 in ggml_backend_sched_graph_compute_async (sched=0x11d0ef564a0, graph=0x11d10950060)
    at D:/llama.cpp/ggml/src/ggml-backend.cpp:1753
#12 0x00007ff7c3fd82a0 in llama_context::graph_compute (this=0x11d14448660, gf=0x11d10950060, batched=true)
    at D:/llama.cpp/src/llama-context.cpp:1460
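
Both runs fault around the same call: repack.cpp:1383 apparently builds a 256-bit sign-extension LUT inside ggml_gemv_q4_0_8x8_q8_0 and passes it by value into the templated gemv_q4_b32_8x8_q8_0_lut_avx kernel (frame #3 in the first backtrace, frame #0 here, with the source line shown above). The snippet below is only a standalone illustration of that call shape, with placeholder names and a placeholder LUT, not the project's actual code.

// Standalone illustration of the call shape at repack.cpp:1383: a 256-bit
// lookup table built in the GEMV wrapper and handed by value to the kernel.
// All names and the LUT contents are placeholders for illustration only.
#include <immintrin.h>
#include <cstdio>

__attribute__((noinline))
static int gemv_kernel_stub(int nc, __m256i signextendlut) {
    // stand-in for gemv_q4_b32_8x8_q8_0_lut_avx: consume the by-value 256-bit argument
    return nc + _mm_cvtsi128_si32(_mm256_castsi256_si128(signextendlut));
}

int main() {
    // placeholder LUT: sign-extend 4-bit quant values 0..15 to -8..7
    alignas(32) signed char lut_bytes[32];
    for (int i = 0; i < 32; ++i) lut_bytes[i] = (signed char)((i & 15) - 8);
    const __m256i signextendlut = _mm256_load_si256((const __m256i *) lut_bytes);
    printf("%d\n", gemv_kernel_stub(128, signextendlut));
    return 0;
}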
