Conversation

@ggerganov (Member) commented Jul 27, 2025

Repack 8x block_iq4_nl into block_iq4_nlx8 + add AVX implementation

  • Reuse the existing block_q4_0x8 GEMV/GEMM implementation (the logic is the same, only the nibbles -> bytes lookup table differs; see the sketch below)
  • Clean up some UNUSED macros (not exhaustive)
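Roughly, the repacked block and the table difference look like this (a minimal sketch assuming the same 8-block interleaving as block_q4_0x8; field names and exact byte order may differ from the actual code):

    #include <stdint.h>

    #define QK4_NL 32
    typedef uint16_t ggml_half;                 // fp16 storage

    typedef struct {
        ggml_half d;                            // per-block scale
        uint8_t   qs[QK4_NL / 2];               // 32 x 4-bit indices into the IQ4_NL codebook
    } block_iq4_nl;

    // 8 consecutive block_iq4_nl repacked into one unit so the AVX kernels can
    // fetch the scales and quants of 8 rows with a few wide loads
    typedef struct {
        ggml_half d[8];                         // scales of the 8 source blocks
        uint8_t   qs[QK4_NL * 4];               // 8 x 16 quant bytes, interleaved for SIMD loads
    } block_iq4_nlx8;

    // The nibbles -> bytes table is handed to the shared GEMV/GEMM kernel as a
    // __m256i lookup table (the "signextendlut" parameter visible in the
    // backtraces later in this thread). The q4_0 path effectively maps each
    // 4-bit index to its signed value in [-8, 7]; the iq4_nl path maps it
    // through the non-linear codebook (values as in ggml's kvalues_iq4nl):
    static const int8_t iq4nl_values[16] = { -127, -104, -83, -65, -49, -35, -22, -10,
                                                1,   13,  25,  38,  53,  69,  89, 113 };

Because only this table differs, the repacked IQ4_NL blocks can go through the same AVX GEMV/GEMM code path as block_q4_0x8.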

TODOs:

  • Test the __AVX512F__ path after the refactoring

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Jul 27, 2025
@ggerganov (Member Author) commented:

@Srihari-mcw Since you have access to AVX512, could you run this branch with an iq4_nl quantization and verify that the perplexity is within the norm?

@Srihari-mcw (Collaborator) commented Jul 30, 2025

@Srihari-mcw Since you have access to AVX512, could you run this branch with an iq4_nl quantization and verify that the perplexity is within the norm?

Sure, will check and get back on the same. Thanks

@ggerganov force-pushed the gg/repack-iq4_nl-avx2 branch from e2661ed to d1788b7 on July 30, 2025 12:36
@Srihari-mcw (Collaborator) commented:

Hi @ggerganov, we tested perplexity with the Meta Llama 2 7B model quantized to IQ4_NL and observed the following results on an AVX512 machine (AMD Ryzen 5 7600X). The perplexity values look close enough.

model            | perplexity (Final estimate PPL) | commit id
llama 7B IQ4_NL  | 5.8822 +/- 0.03282              | Base - 00131d6e
llama 7B IQ4_NL  | 5.8828 +/- 0.03283              | PR branch - d1788b72
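
For reference, such a check is typically done by quantizing the model to IQ4_NL and running the perplexity tool over the wikitext-2 test set on both the base and PR branches, e.g. (a sketch with hypothetical file names; the exact dataset and flags used above are not stated):

    # hypothetical paths: quantize the f16 model to IQ4_NL, then measure perplexity
    ./bin/llama-quantize  llama-2-7b-f16.gguf llama-2-7b-iq4_nl.gguf IQ4_NL
    ./bin/llama-perplexity -m llama-2-7b-iq4_nl.gguf -f wikitext-2-raw/wiki.test.raw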

@ggerganov force-pushed the gg/repack-iq4_nl-avx2 branch from d1788b7 to 0de01ed on August 13, 2025 06:39
@ggerganov merged commit 00f35d5 into master on Aug 13, 2025 (58 checks passed)
@ggerganov deleted the gg/repack-iq4_nl-avx2 branch on August 13, 2025 08:09
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 6, 2025
@LostRuins (Collaborator) commented:

Hello @ggerganov, I believe this commit breaks CPU-repacked q4_0 models when compiled with gcc/g++ and w64devkit. Oddly, the CI binaries produced with MSVC seem perfectly fine.

I am able to reproduce this when building from the latest version of llama.cpp, getting a segmentation fault when running it.

Model used: gemma-3-4b-it-Q4_0.gguf
gcc version 12.2.0 (GCC)
Win64DevKit on Windows 10 LTSC

Let me know if you need more information. There might be some small modifications needed to get it to compile (such as #14953) or to set up CURL, but those are unrelated to this issue.

Once built, running llama-cli --model gemma-3-4b-it-Q4_0.gguf -p "hello" is enough to trigger the segfault.

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Oct 8, 2025
@ggerganov (Member Author) commented:

I can't reproduce on my Ryzen:

gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

make -j && ./bin/llama-cli -hf ggml-org/gemma-3-4b-it-qat-GGUF -p "hello"

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

@LostRuins (Collaborator) commented Oct 8, 2025

Can you try using q4_0? I think -hf defaults to q4_k_m

Edit: https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q4_0.gguf?download=true

@ggerganov (Member Author) commented:

It is Q4_0 - we don't upload Q4_K for QAT models.

@LostRuins (Collaborator) commented:

Alright, let me see if I can figure it out. It might be a Windows thing, as it happens for me on the model you linked too.

D:\llama.cpp\bin>llama-cli.exe --model D:\llama.cpp\gemma-3-4b-it-qat-Q4_0.gguf -p "hello"
build: 6713 (d2ee056e) with cc (GCC) 12.2.0 for x86_64-w64-mingw32
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 41 key-value pairs and 444 tensors from D:\llama.cpp\gemma-3-4b-it-qat-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 3 4b It Qat Q4_0 Unquantized
llama_model_loader: - kv   3:                           general.finetune str              = it-qat-unquantized
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 4B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 4b It
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,4]       = ["gemma3", "gemma", "google", "image-...
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 2560
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 34
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 10240
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 8
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  20:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  22:             gemma3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  23:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  24:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,262208]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  28:                      tokenizer.ggml.scores arr[f32,262208]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,262208]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  32:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  38:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 2
llama_model_loader: - type  f32:  205 tensors
llama_model_loader: - type q4_0:  238 tensors
llama_model_loader: - type q8_0:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 2.35 GiB (5.19 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 1 ('<eos>')
load:   - 106 ('<end_of_turn>')
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2560
print_info: n_layer          = 34
print_info: n_head           = 8
print_info: n_head_kv        = 4
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 10240
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 4B
print_info: model params     = 3.88 B
print_info: general.name     = Gemma 3 4b It Qat Q4_0 Unquantized
print_info: vocab type       = SPM
print_info: n_vocab          = 262208
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_REPACK model buffer size =  1721.25 MiB
load_tensors:   CPU_Mapped model buffer size =  2402.82 MiB
.........................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache:        CPU KV buffer size =    80.00 MiB
llama_kv_cache: size =   80.00 MiB (  4096 cells,   5 layers,  1/1 seqs), K (f16):   40.00 MiB, V (f16):   40.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 1536 cells
llama_kv_cache:        CPU KV buffer size =   174.00 MiB
llama_kv_cache: size =  174.00 MiB (  1536 cells,  29 layers,  1/1 seqs), K (f16):   87.00 MiB, V (f16):   87.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:        CPU compute buffer size =   517.12 MiB
llama_context: graph nodes  = 1369
llama_context: graph splits = 1
common_init_from_params: added <eos> logit bias = -inf
common_init_from_params: added <end_of_turn> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
D:\llama.cpp\bin>

@ggerganov (Member Author) commented:

Try to run it with a Debug build to see if you hit any asserts. What CPU do you use? Does it have AVX512?
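For example, a debug build with the standard CMake flags (adjust to your generator/toolchain):

    cmake -B build -DCMAKE_BUILD_TYPE=Debug
    cmake --build build -j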

@LostRuins (Collaborator) commented:

Hi @ggerganov, I am using a laptop with an i9-13980HX CPU. I don't think it has AVX512 support.
Compiling in debug produces no asserts. Here is my debug build run under gdb.

D:\llama.cpp\bin>gdb --args llama-cli.exe --model D:\llama.cpp\gemma-3-4b-it-qat-Q4_0.gguf -p hello
Reading symbols from llama-cli.exe...
(gdb) run
Starting program: D:\llama.cpp\bin\llama-cli.exe --model D:\llama.cpp\gemma-3-4b-it-qat-Q4_0.gguf -p hello
[New Thread 26384.0x2ed8]
[New Thread 26384.0x9dfc]
[New Thread 26384.0x92d0]
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (13th Gen Intel(R) Core(TM) i9-13980HX)
[New Thread 26384.0x3898]
[Thread 26384.0x3898 exited with code 0]
[New Thread 26384.0x3190]
build: 6713 (d2ee056e) with cc (GCC) 12.2.0 for x86_64-w64-mingw32 (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
[... model loader, print_info, and llama_context output identical to the run above ...]
common_init_from_params: setting [New Thread 26384.0x8b08]
[New Thread 26384.0x6fcc]
d[New Thread 26384.0x86a4]
ry[New Thread 26384.0x538c]
_p[New Thread 26384.0x5718]
ena[New Thread 26384.0x2c40]
[New Thread 26384.0x52b4]
l[New Thread 26384.0x4680]
[New Thread 26384.0x909c]
[New Thread 26384.0x94d4]
ty[New Thread 26384.0x9870]
[New Thread 26384.0x7270]
_[New Thread 26384.0x737c]
[New Thread 26384.0x8cf8]
[New Thread 26384.0x5bc8]
last_n to ctx_size = 409
Thread 19 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 26384.0x737c]
0x00007ff7c408965b in mul_sum_us8_pairs_acc_int32x8 (acc=..., ax=..., sy=...) at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:151
151     static inline __m256i mul_sum_us8_pairs_acc_int32x8(const __m256i acc, const __m256i ax, const __m256i sy) {

I then obtain the backtrace.

(gdb) bt
#0  0x00007ff7c408965b in mul_sum_us8_pairs_acc_int32x8 (acc=..., ax=..., sy=...)
    at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:151
#1  0x00007ff7c4089803 in mul_sum_i8_pairs_acc_int32x8 (acc=..., x=..., y=...)
    at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:173
#2  0x00007ff7c40ae911 in gemv_q4_b32_8x8_q8_0_lut_avx<block<4, 8> >(int, float * __restrict__, size_t, const void * __restrict__, const void * __restrict__, int, int, __m256i) (n=2560, s=0x202907a2480, bs=2048, vx=0x2038bc5a080,
    vy=0x202ee906740, nr=1, nc=128, signextendlut=...) at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:603
#3  0x00007ff7c408d012 in ggml_gemv_q4_0_8x8_q8_0(int, float * __restrict__, size_t, const void * __restrict__, const void * __restrict__, int, int) (n=2560, s=0x202907a2480, bs=2048, vx=0x2038bc5a080, vy=0x202ee906740, nr=1, nc=128)
    at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:1383
#4  0x00007ff7c408781f in ggml::cpu::repack::gemv<block_q4_0, 8ll, 8ll, (ggml_type)8> (n=2560, s=0x202907a2480,
    bs=2048, vx=0x2038bc5a080, vy=0x202ee906740, nr=1, nc=128) at D:/llama.cpp/ggml/src/ggml-cpu/repack.cpp:1501
#5  0x00007ff7c425f23b in ggml::cpu::repack::tensor_traits<block_q4_0, 8ll, 8ll, (ggml_type)8>::forward_mul_mat (
    this=0x7ff7c4573730 <ggml_repack_get_optimal_repack_type(ggml_tensor const*)::q4_0_8x8_q8_0>,
    params=0x1a4ebff750, op=0x202f0228940) at D:/llama.cpp/ggml/src/ggml-cpu/repack.cpp:1666
#6  0x00007ff7c425e864 in ggml::cpu::repack::tensor_traits<block_q4_0, 8ll, 8ll, (ggml_type)8>::compute_forward (
    this=0x7ff7c4573730 <ggml_repack_get_optimal_repack_type(ggml_tensor const*)::q4_0_8x8_q8_0>,
    params=0x1a4ebff750, op=0x202f0228940) at D:/llama.cpp/ggml/src/ggml-cpu/repack.cpp:1591
#7  0x00007ff7c4088a21 in ggml_cpu_extra_compute_forward (params=0x1a4ebff750, op=0x202f0228940)
    at D:/llama.cpp/ggml/src/ggml-cpu/traits.cpp:17
#8  0x00007ff7c40bd4b0 in ggml_compute_forward (params=0x1a4ebff750, tensor=0x202f0228940)
    at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:1669
#9  0x00007ff7c40bf1f8 in ggml_graph_compute_thread (data=0x202f0aa9d60)
    at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2883
#10 0x00007ff7c40c0153 in ggml_graph_compute._omp_fn.0 () at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3172
#11 0x00007ff7c41c9c75 in gomp_thread_start ()
#12 0x00007ff7c41ea993 in pthread_create_wrapper ()
#13 0x00007ffbaa8daf5a in msvcrt!_beginthreadex () from C:\Windows\System32\msvcrt.dll
#14 0x00007ffbaa8db02c in msvcrt!_endthreadex () from C:\Windows\System32\msvcrt.dll

I ran it again with -t 1

D:\llama.cpp\bin>gdb --args llama-cli.exe --model D:\llama.cpp\gemma-3-4b-it-qat-Q4_0.gguf -p hello -t 1
Reading symbols from llama-cli.exe...
(gdb) run
Starting program: D:\llama.cpp\bin\llama-cli.exe --model D:\llama.cpp\gemma-3-4b-it-qat-Q4_0.gguf -p hello -t 1
[New Thread 18200.0x6cd0]
[New Thread 18200.0x2bd4]
[New Thread 18200.0x901c]
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (13th Gen Intel(R) Core(TM) i9-13980HX)
[New Thread 18200.0x8fe0]
[Thread 18200.0x8fe0 exited with code 0]
[New Thread 18200.0x9a0c]
build: 6713 (d2ee056e) with cc (GCC) 12.2.0 for x86_64-w64-mingw32 (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
[... model loader, print_info, and llama_context output identical to the runs above ...]
llama_con
Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007ff7c408cfd7 in ggml_gemv_q4_0_8x8_q8_0(int, float * __restrict__, size_t, const void * __restrict__, const void * __restrict__, int, int) (n=2560, s=0x11e2d22c880, bs=2048, vx=0x11dac138080, vy=0x11d0efd6720, nr=1, nc=2048) at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:1383
1383            gemv_q4_b32_8x8_q8_0_lut_avx<block_q4_0x8>(n, s, bs, vx, vy, nr, nc, signextendlut);
(gdb) bt
#0  0x00007ff7c408cfd7 in ggml_gemv_q4_0_8x8_q8_0(int, float * __restrict__, size_t, const void * __restrict__, const void * __restrict__, int, int) (n=2560, s=0x11e2d22c880, bs=2048, vx=0x11dac138080, vy=0x11d0efd6720, nr=1, nc=2048)
    at D:/llama.cpp/ggml/src/ggml-cpu/arch/x86/repack.cpp:1383
#1  0x00007ff7c408781f in ggml::cpu::repack::gemv<block_q4_0, 8ll, 8ll, (ggml_type)8> (n=2560, s=0x11e2d22c880,
    bs=2048, vx=0x11dac138080, vy=0x11d0efd6720, nr=1, nc=2048) at D:/llama.cpp/ggml/src/ggml-cpu/repack.cpp:1501
#2  0x00007ff7c425f23b in ggml::cpu::repack::tensor_traits<block_q4_0, 8ll, 8ll, (ggml_type)8>::forward_mul_mat (
    this=0x7ff7c4573730 <ggml_repack_get_optimal_repack_type(ggml_tensor const*)::q4_0_8x8_q8_0>,
    params=0x17599f8020, op=0x11d10977940) at D:/llama.cpp/ggml/src/ggml-cpu/repack.cpp:1666
#3  0x00007ff7c425e864 in ggml::cpu::repack::tensor_traits<block_q4_0, 8ll, 8ll, (ggml_type)8>::compute_forward (
    this=0x7ff7c4573730 <ggml_repack_get_optimal_repack_type(ggml_tensor const*)::q4_0_8x8_q8_0>,
    params=0x17599f8020, op=0x11d10977940) at D:/llama.cpp/ggml/src/ggml-cpu/repack.cpp:1591
#4  0x00007ff7c4088a21 in ggml_cpu_extra_compute_forward (params=0x17599f8020, op=0x11d10977940)
    at D:/llama.cpp/ggml/src/ggml-cpu/traits.cpp:17
#5  0x00007ff7c40bd4b0 in ggml_compute_forward (params=0x17599f8020, tensor=0x11d10977940)
    at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:1669
#6  0x00007ff7c40bf1f8 in ggml_graph_compute_thread (data=0x11d111e9880)
    at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:2883
#7  0x00007ff7c40bf7d6 in ggml_graph_compute (cgraph=0x11d1124cf38, cplan=0x17599f8350)
    at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.c:3176
#8  0x00007ff7c407d5fb in ggml_backend_cpu_graph_compute (backend=0x11d0ef5aca0, cgraph=0x11d1124cf38)
    at D:/llama.cpp/ggml/src/ggml-cpu/ggml-cpu.cpp:186
#9  0x00007ff7c41938a9 in ggml_backend_graph_compute_async (backend=0x11d0ef5aca0, cgraph=0x11d1124cf38)
    at D:/llama.cpp/ggml/src/ggml-backend.cpp:359
#10 0x00007ff7c419842f in ggml_backend_sched_compute_splits (sched=0x11d0ef564a0)
    at D:/llama.cpp/ggml/src/ggml-backend.cpp:1553
#11 0x00007ff7c4199130 in ggml_backend_sched_graph_compute_async (sched=0x11d0ef564a0, graph=0x11d10950060)
    at D:/llama.cpp/ggml/src/ggml-backend.cpp:1753
#12 0x00007ff7c3fd82a0 in llama_context::graph_compute (this=0x11d14448660, gf=0x11d10950060, batched=true)
    at D:/llama.cpp/src/llama-context.cpp:1460

When run without gdb, the terminal output stops at the same place as in my previous comment. Under gdb it appears to be truncated earlier (possibly just a flush/buffering issue?).

@LostRuins (Collaborator) commented:

On Occam's advice, I have created a standalone issue for this: #16479
