
Eval bug: amx : segfault using ggml_backend_amx_buffer_interface as get_tensor() is null #11115

@dranger003

Description


Name and Version

$ ./build/Debug/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
  Device 2: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
register_backend: registered backend CUDA (3 devices)
register_device: registered device CUDA0 (NVIDIA RTX 5000 Ada Generation)
register_device: registered device CUDA1 (NVIDIA RTX 5000 Ada Generation)
register_device: registered device CUDA2 (NVIDIA RTX 5000 Ada Generation)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) w5-3435X)
version: 4430 (ecebbd29)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu

Operating systems

Linux

GGML backends

AMX, CPU, CUDA

Hardware

1x Intel Xeon w5-3435X + 3x RTX 5000 Ada

Models

deepseek-ai/DeepSeek-V3 (Q3_K_M)

Problem description & steps to reproduce

This is the first time I have encountered this issue on this system. It seems to only happen with a long prompt when using DeepSeek-V3; a short prompt works fine.

Looking at amx.cpp, I see that .get_tensor is nullptr (on purpose?), which explains the segfault. What is not clear is why it is null, and why it is being called despite being null. See below for details.

system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
$ git branch -vv
* master ecebbd29 [origin/master] llama : remove unused headers (#11109)
$ cmake -S . -B build/Debug -DCMAKE_BUILD_TYPE=Debug -DGGML_NATIVE=ON -DGGML_CUDA=ON && cmake --build build/Debug -j
./build/Debug/bin/llama-cli -t 8 -ngl 9 -ts 3,3,3 -c 4096 --temp 0.0 -sp -m ggml-deepseek-v3-q3_k.gguf -p "You are a helpful assistant.<|User|>A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. If left unattended together, the wolf would eat the goat, or the goat would eat the cabbage. How can they cross the river without anything being eaten?<|Assistant|>"
Thread 1 "llama-cli" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7ef4bd9 in ggml_backend_tensor_get (tensor=0x55555a07ada0, data=0x7fff6e0ff010, offset=0, size=66060288) at ~/src/llama.cpp/ggml/src/ggml-backend.cpp:281
#2  0x00007ffff7ef51f8 in ggml_backend_tensor_copy (src=0x55555a07ada0, dst=0x7ffa98b3f450) at ~/src/llama.cpp/ggml/src/ggml-backend.cpp:385
#3  0x00007ffff7ef8f31 in ggml_backend_sched_compute_splits (sched=0x55555878bb90) at ~/src/llama.cpp/ggml/src/ggml-backend.cpp:1391
#4  0x00007ffff7ef9be5 in ggml_backend_sched_graph_compute_async (sched=0x55555878bb90, graph=0x7fff762bd030) at ~/src/llama.cpp/ggml/src/ggml-backend.cpp:1588
#5  0x00007ffff7c5d846 in llama_graph_compute (lctx=..., gf=0x7fff762bd030, n_threads=8, threadpool=0x555557e9b940) at ~/src/llama.cpp/src/llama.cpp:10690
#6  0x00007ffff7c5e572 in llama_decode_impl (lctx=..., inp_batch=...) at ~/src/llama.cpp/src/llama.cpp:10891
#7  0x00007ffff7c63514 in llama_decode (ctx=0x55555a22a440, batch=...) at ~/src/llama.cpp/src/llama.cpp:12209
#8  0x00005555555b2e14 in main (argc=16, argv=0x7fffffffe768) at ~/src/llama.cpp/examples/main/main.cpp:611
(gdb) frame 1
#1  0x00007ffff7ef4bd9 in ggml_backend_tensor_get (tensor=0x55555a07ada0, data=0x7fff6e0ff010, offset=0, size=66060288) at ~/src/llama.cpp/ggml/src/ggml-backend.cpp:281
281         buf->iface.get_tensor(buf, tensor, data, offset, size);
(gdb) print *buf
$1 = {iface = {free_buffer = 0x7ffff78d1a30 <ggml_backend_amx_buffer_free_buffer(ggml_backend_buffer_t)>, get_base = 0x7ffff78d1a4f <ggml_backend_amx_buffer_get_base(ggml_backend_buffer_t)>,
    init_tensor = 0x7ffff78d1a61 <ggml_backend_amx_buffer_init_tensor(ggml_backend_buffer_t, ggml_tensor*)>,
    memset_tensor = 0x7ffff78d1a92 <ggml_backend_amx_buffer_memset_tensor(ggml_backend_buffer_t, ggml_tensor*, uint8_t, size_t, size_t)>,
    set_tensor = 0x7ffff78d1ad6 <ggml_backend_amx_buffer_set_tensor(ggml_backend_buffer_t, ggml_tensor*, void const*, size_t, size_t)>, get_tensor = 0x0, cpy_tensor = 0x0,
    clear = 0x7ffff78d1b88 <ggml_backend_amx_buffer_clear(ggml_backend_buffer_t, uint8_t)>, reset = 0x0}, buft = 0x7ffff79bade0 <ggml_backend_amx_buffer_type()::ggml_backend_buffer_type_amx>, context = 0x7fa85c29f040,
  size = 4225105920, usage = GGML_BACKEND_BUFFER_USAGE_WEIGHTS}
(gdb) 
// ...
static ggml_backend_buffer_i ggml_backend_amx_buffer_interface = {
    /* .free_buffer     = */ ggml_backend_amx_buffer_free_buffer,
    /* .get_base        = */ ggml_backend_amx_buffer_get_base,
    /* .init_tensor     = */ ggml_backend_amx_buffer_init_tensor,
    /* .memset_tensor   = */ ggml_backend_amx_buffer_memset_tensor,
    /* .set_tensor      = */ ggml_backend_amx_buffer_set_tensor,
    /* .get_tensor      = */ nullptr,                                 // <<=== issue seems here?
    /* .cpy_tensor      = */ nullptr,
    /* .clear           = */ ggml_backend_amx_buffer_clear,
    /* .reset           = */ nullptr,
};
// ...
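For reference, the faulting call is the dispatch through this interface in ggml_backend_tensor_get (frame #1, ggml-backend.cpp:281 in the backtrace above). The sketch below is not the upstream code (only the final call is confirmed by gdb); it just illustrates how a hypothetical guard on the function pointer would turn the jump to address 0 into a readable assertion, given that the AMX buffer leaves get_tensor and cpy_tensor unset.
// ...
// ggml/src/ggml-backend.cpp -- sketch around line 281; only the final call is taken from the backtrace
void ggml_backend_tensor_get(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
    ggml_backend_buffer_t buf = tensor->buffer;

    // hypothetical guard: ggml_backend_amx_buffer_interface leaves get_tensor as nullptr,
    // so calling through it jumps to address 0 and raises the SIGSEGV shown above
    GGML_ASSERT(buf->iface.get_tensor != NULL && "buffer does not support get_tensor");

    buf->iface.get_tensor(buf, tensor, data, offset, size);   // <- frame #1, ggml-backend.cpp:281
}
// ...
Whether the right fix is such a guard, an actual ggml_backend_amx_buffer_get_tensor, or avoiding this copy path entirely is for the maintainers to decide; if the weights in this buffer are stored repacked for AMX, a raw byte copy back may not even be meaningful.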
Full output below.
$ gdb --args ./build/Debug/bin/llama-cli -t 8 -ngl 9 -ts 3,3,3 -c 4096 --temp 0.0 -sp -m /md0/models/deepseek-ai/ggml-deepseek-v3-q3_k.gguf -p "You are a helpful assistant.<|User|>A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. If left unattended together, the wolf would eat the goat, or the goat would eat the cabbage. How can they cross the river without anything being eaten?<|Assistant|>"
GNU gdb (GDB) 15.2
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./build/Debug/bin/llama-cli...
(gdb) r
Starting program: ~/src/llama.cpp/build/Debug/bin/llama-cli -t 8 -ngl 9 -ts 3,3,3 -c 4096 --temp 0.0 -sp -m /md0/models/deepseek-ai/ggml-deepseek-v3-q3_k.gguf -p You\ are\ a\ helpful\ assistant.\<|User|\>A\ farmer\ with\ a\ wolf,\ a\ goat,\ and\ a\ cabbage\ must\ cross\ a\ river\ by\ boat.\ The\ boat\ can\ carry\ only\ the\ farmer\ and\ a\ single\ item.\ If\ left\ unattended\ together,\ the\ wolf\ would\ eat\ the\ goat,\ or\ the\ goat\ would\ eat\ the\ cabbage.\ How\ can\ they\ cross\ the\ river\ without\ anything\ being\ eaten\?\<|Assistant|\>

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n])
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7fffc75ff000 (LWP 74420)]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
  Device 2: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
[New Thread 0x7fffb8dde000 (LWP 74430)]
[New Thread 0x7fffb85dd000 (LWP 74431)]
[New Thread 0x7fffb6bbb000 (LWP 74432)]
[New Thread 0x7fffb63ba000 (LWP 74433)]
register_backend: registered backend CUDA (3 devices)
register_device: registered device CUDA0 (NVIDIA RTX 5000 Ada Generation)
register_device: registered device CUDA1 (NVIDIA RTX 5000 Ada Generation)
register_device: registered device CUDA2 (NVIDIA RTX 5000 Ada Generation)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) w5-3435X)
[New Thread 0x7fffb4998000 (LWP 74434)]
build: 4430 (ecebbd29) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
[New Thread 0x7fff99bff000 (LWP 74435)]
[New Thread 0x7fff993fe000 (LWP 74436)]
llama_model_load_from_file: using device CUDA0 (NVIDIA RTX 5000 Ada Generation) - 31938 MiB free
llama_model_load_from_file: using device CUDA1 (NVIDIA RTX 5000 Ada Generation) - 31938 MiB free
llama_model_load_from_file: using device CUDA2 (NVIDIA RTX 5000 Ada Generation) - 31921 MiB free
llama_model_loader: loaded meta data with 45 key-value pairs and 1025 tensors from /md0/models/deepseek-ai/ggml-deepseek-v3-q3_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek-V3
llama_model_loader: - kv   3:                            general.version str              = V3
llama_model_loader: - kv   4:                           general.basename str              = models-deepseek-ai-DeepSeek
llama_model_loader: - kv   5:                         general.size_label str              = 256x20B
llama_model_loader: - kv   6:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   7:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   8:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv   9:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  10:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  11:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  12:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  13: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  15:                          general.file_type u32              = 12
llama_model_loader: - kv  16:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  17:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  18:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  19:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  20:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  21:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  22:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  23:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  24:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  25:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  26:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  27:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  28:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  29:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  30:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  31: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  32: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  38:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  41:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  42:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q3_K:  483 tensors
llama_model_loader: - type q4_K:  177 tensors
llama_model_loader: - type q5_K:    3 tensors
llama_model_loader: - type bf16:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 129280
llm_load_print_meta: n_merges         = 127741
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 128
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 24576
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18432
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = Q3_K - Medium
llm_load_print_meta: model params     = 671.03 B
llm_load_print_meta: model size       = 298.29 GiB (3.82 BPW)
llm_load_print_meta: general.name     = DeepSeek-V3
llm_load_print_meta: BOS token        = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 131 'Ä'
llm_load_print_meta: FIM PRE token    = 128801 '<|fim▁begin|>'
llm_load_print_meta: FIM SUF token    = 128800 '<|fim▁hole|>'
llm_load_print_meta: FIM MID token    = 128802 '<|fim▁end|>'
llm_load_print_meta: EOG token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
llm_load_tensors: offloading 9 repeating layers to GPU
llm_load_tensors: offloaded 9/62 layers to GPU
llm_load_tensors:        CUDA0 model buffer size = 15643.55 MiB
llm_load_tensors:        CUDA1 model buffer size = 15643.55 MiB
llm_load_tensors:        CUDA2 model buffer size = 15643.55 MiB
llm_load_tensors:          AMX model buffer size =  4029.38 MiB
llm_load_tensors:   CPU_Mapped model buffer size = 258518.15 MiB
...............[New Thread 0x7fff76dde000 (LWP 74437)]
[New Thread 0x7fb4cf1ff000 (LWP 74438)]
[New Thread 0x7fb4ce9fe000 (LWP 74439)]
[New Thread 0x7fa85c29e000 (LWP 74440)]
[New Thread 0x7fa85ba9d000 (LWP 74441)]
[New Thread 0x7fa85b29c000 (LWP 74442)]
[New Thread 0x7fa85aa9b000 (LWP 74443)]
[New Thread 0x7fa85a29a000 (LWP 74444)]
[New Thread 0x7fa859a99000 (LWP 74445)]
[New Thread 0x7fa859298000 (LWP 74446)]
[New Thread 0x7fa858a97000 (LWP 74447)]
[New Thread 0x7fa858296000 (LWP 74448)]
[New Thread 0x7fa857a95000 (LWP 74449)]
[New Thread 0x7fa857294000 (LWP 74450)]
[New Thread 0x7fa856a93000 (LWP 74451)]
[New Thread 0x7fa856292000 (LWP 74452)]
[New Thread 0x7fa855a91000 (LWP 74453)]
[New Thread 0x7fa855290000 (LWP 74454)]
[New Thread 0x7fa854a8f000 (LWP 74455)]
[New Thread 0x7fa85428e000 (LWP 74456)]
[New Thread 0x7fa853a8d000 (LWP 74457)]
[New Thread 0x7fa85328c000 (LWP 74458)]
[New Thread 0x7fa852a8b000 (LWP 74459)]
[New Thread 0x7fa85228a000 (LWP 74460)]
[New Thread 0x7fa851a89000 (LWP 74461)]
[New Thread 0x7fa851288000 (LWP 74462)]
[New Thread 0x7fa850a87000 (LWP 74463)]
[New Thread 0x7fa850286000 (LWP 74464)]
[New Thread 0x7fa84fa85000 (LWP 74465)]
[New Thread 0x7fa84f284000 (LWP 74466)]
[New Thread 0x7fa84ea83000 (LWP 74467)]
.....................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 0.025
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init:      CUDA0 KV buffer size =   960.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   960.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   960.00 MiB
llama_kv_cache_init:        CPU KV buffer size = 16640.00 MiB
llama_new_context_with_model: KV self size  = 19520.00 MiB, K (f16): 11712.00 MiB, V (f16): 7808.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  3630.00 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  1186.00 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  1186.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    88.01 MiB
llama_new_context_with_model: graph nodes  = 5025
llama_new_context_with_model: graph splits = 979 (with bs=512), 5 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[Thread 0x7fa84fa85000 (LWP 74465) exited]
[Thread 0x7fa851a89000 (LWP 74461) exited]
[Thread 0x7fa852a8b000 (LWP 74459) exited]
[Thread 0x7fa85328c000 (LWP 74458) exited]
[Thread 0x7fa853a8d000 (LWP 74457) exited]
[Thread 0x7fa856292000 (LWP 74452) exited]
[Thread 0x7fa857294000 (LWP 74450) exited]
[Thread 0x7fa857a95000 (LWP 74449) exited]
[Thread 0x7fa859298000 (LWP 74446) exited]
[Thread 0x7fa850286000 (LWP 74464) exited]
[Thread 0x7fa84f284000 (LWP 74466) exited]
[Thread 0x7fa85428e000 (LWP 74456) exited]
[Thread 0x7fa854a8f000 (LWP 74455) exited]
[Thread 0x7fa855a91000 (LWP 74453) exited]
[Thread 0x7fa856a93000 (LWP 74451) exited]
[Thread 0x7fa858a97000 (LWP 74447) exited]
[Thread 0x7fa850a87000 (LWP 74463) exited]
[Thread 0x7fa85228a000 (LWP 74460) exited]
[Thread 0x7fa855290000 (LWP 74454) exited]
[Thread 0x7fa858296000 (LWP 74448) exited]
[Thread 0x7fa859a99000 (LWP 74445) exited]
[Thread 0x7fa851288000 (LWP 74462) exited]
[Thread 0x7fa85a29a000 (LWP 74444) exited]
[Thread 0x7fa84ea83000 (LWP 74467) exited]
main: llama threadpool init, n_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 604357139
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

<|begin▁of▁sentence|>You are a helpful assistant.<|User|>A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. If left unattended together, the wolf would eat the goat, or the goat would eat the cabbage. How can they cross the river without anything being eaten?<|Assistant|>
Thread 1 "llama-cli" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7ef4bd9 in ggml_backend_tensor_get (tensor=0x55555a07ada0, data=0x7fff6e0ff010, offset=0, size=66060288) at ~/src/llama.cpp/ggml/src/ggml-backend.cpp:281
#2  0x00007ffff7ef51f8 in ggml_backend_tensor_copy (src=0x55555a07ada0, dst=0x7ffa98b3f450) at ~/src/llama.cpp/ggml/src/ggml-backend.cpp:385
#3  0x00007ffff7ef8f31 in ggml_backend_sched_compute_splits (sched=0x55555878bb90) at ~/src/llama.cpp/ggml/src/ggml-backend.cpp:1391
#4  0x00007ffff7ef9be5 in ggml_backend_sched_graph_compute_async (sched=0x55555878bb90, graph=0x7fff762bd030) at ~/src/llama.cpp/ggml/src/ggml-backend.cpp:1588
#5  0x00007ffff7c5d846 in llama_graph_compute (lctx=..., gf=0x7fff762bd030, n_threads=8, threadpool=0x555557e9b940) at ~/src/llama.cpp/src/llama.cpp:10690
#6  0x00007ffff7c5e572 in llama_decode_impl (lctx=..., inp_batch=...) at ~/src/llama.cpp/src/llama.cpp:10891
#7  0x00007ffff7c63514 in llama_decode (ctx=0x55555a22a440, batch=...) at ~/src/llama.cpp/src/llama.cpp:12209
#8  0x00005555555b2e14 in main (argc=16, argv=0x7fffffffe768) at ~/src/llama.cpp/examples/main/main.cpp:611
(gdb) frame 1
#1  0x00007ffff7ef4bd9 in ggml_backend_tensor_get (tensor=0x55555a07ada0, data=0x7fff6e0ff010, offset=0, size=66060288) at ~/src/llama.cpp/ggml/src/ggml-backend.cpp:281
281         buf->iface.get_tensor(buf, tensor, data, offset, size);
(gdb) print *buf
$1 = {iface = {free_buffer = 0x7ffff78d1a30 <ggml_backend_amx_buffer_free_buffer(ggml_backend_buffer_t)>, get_base = 0x7ffff78d1a4f <ggml_backend_amx_buffer_get_base(ggml_backend_buffer_t)>,
    init_tensor = 0x7ffff78d1a61 <ggml_backend_amx_buffer_init_tensor(ggml_backend_buffer_t, ggml_tensor*)>,
    memset_tensor = 0x7ffff78d1a92 <ggml_backend_amx_buffer_memset_tensor(ggml_backend_buffer_t, ggml_tensor*, uint8_t, size_t, size_t)>,
    set_tensor = 0x7ffff78d1ad6 <ggml_backend_amx_buffer_set_tensor(ggml_backend_buffer_t, ggml_tensor*, void const*, size_t, size_t)>, get_tensor = 0x0, cpy_tensor = 0x0,
    clear = 0x7ffff78d1b88 <ggml_backend_amx_buffer_clear(ggml_backend_buffer_t, uint8_t)>, reset = 0x0}, buft = 0x7ffff79bade0 <ggml_backend_amx_buffer_type()::ggml_backend_buffer_type_amx>, context = 0x7fa85c29f040,
  size = 4225105920, usage = GGML_BACKEND_BUFFER_USAGE_WEIGHTS}
(gdb) 
However, it works when the prompt is shorter, as shown below.
$ gdb --args ./build/Debug/bin/llama-cli -t 8 -ngl 9 -ts 3,3,3 -c 4096 --temp 0.0 -sp -m /md0/models/deepseek-ai/ggml-deepseek-v3-q3_k.gguf -p "You are a helpful assistant.<|User|>Hello?<|Assistant|>"
GNU gdb (GDB) 15.2
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./build/Debug/bin/llama-cli...
(gdb) r
Starting program: ~/src/llama.cpp/build/Debug/bin/llama-cli -t 8 -ngl 9 -ts 3,3,3 -c 4096 --temp 0.0 -sp -m /md0/models/deepseek-ai/ggml-deepseek-v3-q3_k.gguf -p You\ are\ a\ helpful\ assistant.\<|User|\>Hello\?\<|Assistant|\>

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n])
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7fffc75ff000 (LWP 74552)]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
  Device 2: NVIDIA RTX 5000 Ada Generation, compute capability 8.9, VMM: yes
[New Thread 0x7fffb8dde000 (LWP 74562)]
[New Thread 0x7fffb85dd000 (LWP 74563)]
[New Thread 0x7fffb6bbb000 (LWP 74564)]
[New Thread 0x7fffb63ba000 (LWP 74565)]
register_backend: registered backend CUDA (3 devices)
register_device: registered device CUDA0 (NVIDIA RTX 5000 Ada Generation)
register_device: registered device CUDA1 (NVIDIA RTX 5000 Ada Generation)
register_device: registered device CUDA2 (NVIDIA RTX 5000 Ada Generation)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) w5-3435X)
[New Thread 0x7fffb4998000 (LWP 74566)]
build: 4430 (ecebbd29) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
[New Thread 0x7fff99bff000 (LWP 74567)]
[New Thread 0x7fff993fe000 (LWP 74568)]
llama_model_load_from_file: using device CUDA0 (NVIDIA RTX 5000 Ada Generation) - 31938 MiB free
llama_model_load_from_file: using device CUDA1 (NVIDIA RTX 5000 Ada Generation) - 31938 MiB free
llama_model_load_from_file: using device CUDA2 (NVIDIA RTX 5000 Ada Generation) - 31921 MiB free
llama_model_loader: loaded meta data with 45 key-value pairs and 1025 tensors from /md0/models/deepseek-ai/ggml-deepseek-v3-q3_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek-V3
llama_model_loader: - kv   3:                            general.version str              = V3
llama_model_loader: - kv   4:                           general.basename str              = models-deepseek-ai-DeepSeek
llama_model_loader: - kv   5:                         general.size_label str              = 256x20B
llama_model_loader: - kv   6:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   7:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   8:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv   9:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  10:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  11:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  12:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  13: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  15:                          general.file_type u32              = 12
llama_model_loader: - kv  16:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  17:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  18:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  19:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  20:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  21:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  22:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  23:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  24:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  25:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  26:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  27:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  28:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  29:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  30:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  31: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  32: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  38:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  41:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  42:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q3_K:  483 tensors
llama_model_loader: - type q4_K:  177 tensors
llama_model_loader: - type q5_K:    3 tensors
llama_model_loader: - type bf16:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 129280
llm_load_print_meta: n_merges         = 127741
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 128
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 24576
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18432
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = Q3_K - Medium
llm_load_print_meta: model params     = 671.03 B
llm_load_print_meta: model size       = 298.29 GiB (3.82 BPW)
llm_load_print_meta: general.name     = DeepSeek-V3
llm_load_print_meta: BOS token        = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 131 'Ä'
llm_load_print_meta: FIM PRE token    = 128801 '<|fim▁begin|>'
llm_load_print_meta: FIM SUF token    = 128800 '<|fim▁hole|>'
llm_load_print_meta: FIM MID token    = 128802 '<|fim▁end|>'
llm_load_print_meta: EOG token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
llm_load_tensors: offloading 9 repeating layers to GPU
llm_load_tensors: offloaded 9/62 layers to GPU
llm_load_tensors:        CUDA0 model buffer size = 15643.55 MiB
llm_load_tensors:        CUDA1 model buffer size = 15643.55 MiB
llm_load_tensors:        CUDA2 model buffer size = 15643.55 MiB
llm_load_tensors:          AMX model buffer size =  4029.38 MiB
llm_load_tensors:   CPU_Mapped model buffer size = 258518.15 MiB
...............[New Thread 0x7fff76dde000 (LWP 74570)]
[New Thread 0x7fb4cf1ff000 (LWP 74571)]
[New Thread 0x7fb4ce9fe000 (LWP 74572)]
[New Thread 0x7fa85c29e000 (LWP 74573)]
[New Thread 0x7fa85ba9d000 (LWP 74574)]
[New Thread 0x7fa85b29c000 (LWP 74575)]
[New Thread 0x7fa85aa9b000 (LWP 74576)]
[New Thread 0x7fa85a29a000 (LWP 74577)]
[New Thread 0x7fa859a99000 (LWP 74578)]
[New Thread 0x7fa859298000 (LWP 74579)]
[New Thread 0x7fa858a97000 (LWP 74580)]
[New Thread 0x7fa858296000 (LWP 74581)]
[New Thread 0x7fa857a95000 (LWP 74582)]
[New Thread 0x7fa857294000 (LWP 74583)]
[New Thread 0x7fa856a93000 (LWP 74584)]
[New Thread 0x7fa856292000 (LWP 74585)]
[New Thread 0x7fa855a91000 (LWP 74586)]
[New Thread 0x7fa855290000 (LWP 74587)]
[New Thread 0x7fa854a8f000 (LWP 74588)]
[New Thread 0x7fa85428e000 (LWP 74589)]
[New Thread 0x7fa853a8d000 (LWP 74590)]
[New Thread 0x7fa85328c000 (LWP 74591)]
[New Thread 0x7fa852a8b000 (LWP 74592)]
[New Thread 0x7fa85228a000 (LWP 74593)]
[New Thread 0x7fa851a89000 (LWP 74594)]
[New Thread 0x7fa851288000 (LWP 74595)]
[New Thread 0x7fa850a87000 (LWP 74596)]
[New Thread 0x7fa850286000 (LWP 74597)]
[New Thread 0x7fa84fa85000 (LWP 74598)]
[New Thread 0x7fa84f284000 (LWP 74599)]
[New Thread 0x7fa84ea83000 (LWP 74600)]
.....................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 0.025
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init:      CUDA0 KV buffer size =   960.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   960.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   960.00 MiB
llama_kv_cache_init:        CPU KV buffer size = 16640.00 MiB
llama_new_context_with_model: KV self size  = 19520.00 MiB, K (f16): 11712.00 MiB, V (f16): 7808.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  3630.00 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  1186.00 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  1186.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    88.01 MiB
llama_new_context_with_model: graph nodes  = 5025
llama_new_context_with_model: graph splits = 979 (with bs=512), 5 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[Thread 0x7fa851a89000 (LWP 74594) exited]
[Thread 0x7fa850286000 (LWP 74597) exited]
[Thread 0x7fa850a87000 (LWP 74596) exited]
[Thread 0x7fa85228a000 (LWP 74593) exited]
[Thread 0x7fa85328c000 (LWP 74591) exited]
[Thread 0x7fa854a8f000 (LWP 74588) exited]
[Thread 0x7fa856a93000 (LWP 74584) exited]
[Thread 0x7fa857294000 (LWP 74583) exited]
[Thread 0x7fa858296000 (LWP 74581) exited]
[Thread 0x7fa84f284000 (LWP 74599) exited]
[Thread 0x7fa84fa85000 (LWP 74598) exited]
[Thread 0x7fa853a8d000 (LWP 74590) exited]
[Thread 0x7fa857a95000 (LWP 74582) exited]
[Thread 0x7fa855290000 (LWP 74587) exited]
[Thread 0x7fa85a29a000 (LWP 74577) exited]
[Thread 0x7fa84ea83000 (LWP 74600) exited]
[Thread 0x7fa851288000 (LWP 74595) exited]
[Thread 0x7fa852a8b000 (LWP 74592) exited]
[Thread 0x7fa85428e000 (LWP 74589) exited]
[Thread 0x7fa856292000 (LWP 74585) exited]
[Thread 0x7fa859a99000 (LWP 74578) exited]
[Thread 0x7fa855a91000 (LWP 74586) exited]
[Thread 0x7fa858a97000 (LWP 74580) exited]
[Thread 0x7fa859298000 (LWP 74579) exited]
main: llama threadpool init, n_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 2662695136
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

<|begin▁of▁sentence|>You are a helpful assistant.<|User|>Hello?<|Assistant|>Hello! How can I assist you today? 😊<|end▁of▁sentence|> [end of text]


llama_perf_sampler_print:    sampling time =      12.79 ms /    23 runs   (    0.56 ms per token,  1798.28 tokens per second)
llama_perf_context_print:        load time =   88880.74 ms
llama_perf_context_print: prompt eval time =   18756.90 ms /    11 tokens ( 1705.17 ms per token,     0.59 tokens per second)
llama_perf_context_print:        eval time =   19047.49 ms /    11 runs   ( 1731.59 ms per token,     0.58 tokens per second)
llama_perf_context_print:       total time =   37895.42 ms /    22 tokens
[Thread 0x7fffb4998000 (LWP 74566) exited]
[Thread 0x7fa85aa9b000 (LWP 74576) exited]
[Thread 0x7fa85b29c000 (LWP 74575) exited]
[Thread 0x7fa85ba9d000 (LWP 74574) exited]
[Thread 0x7fa85c29e000 (LWP 74573) exited]
[Thread 0x7fb4ce9fe000 (LWP 74572) exited]
[Thread 0x7fb4cf1ff000 (LWP 74571) exited]
[Thread 0x7fff76dde000 (LWP 74570) exited]
[Thread 0x7fff993fe000 (LWP 74568) exited]
[Thread 0x7fff99bff000 (LWP 74567) exited]
[Thread 0x7fffb63ba000 (LWP 74565) exited]
[Thread 0x7fffb6bbb000 (LWP 74564) exited]
[Thread 0x7fffb85dd000 (LWP 74563) exited]
[Thread 0x7fffb8dde000 (LWP 74562) exited]
[Thread 0x7ffff03aa000 (LWP 74549) exited]
[Thread 0x7fffc75ff000 (LWP 74552) exited]
[New process 74549]
[Inferior 1 (process 74549) exited normally]
(gdb)

First Bad Commit

9394bbd

Relevant log output

See problem description.
