Eval bug: phi-4 crashes with new versions #13665

@porzione

Description

Name and Version

llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 3060)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (OpenBLAS)
register_backend: registered backend RPC (0 devices)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 7 3800X 8-Core Processor)
version: 5435 (a4090d117)
built with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

NVIDIA GeForce RTX 3060 12 GB
AMD Ryzen 7 3800X (8 cores)

Models

Original Microsoft release, phi-4-Q4_K.gguf

Problem description & steps to reproduce

The last working version is b5426; after that, llama-cli crashes with a segmentation fault.

Command: llama-cli -m /home/ftp/AI/microsoft/phi-4-gguf/phi-4-Q4_K.gguf -st --simple-io --flash-attn --no-display-prompt -ngl 41 --threads 8 --temp 0.25 --top-p 0.95 --ctx-size 2048 -p "$(cat ~/dox/AI/PROMPT.txt)"

Workaround: downgrade to b5426
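
If it helps triage, a backtrace from a debug build should show where the segfault happens. A minimal sketch, assuming the standard CMake build described in the llama.cpp docs (the prompt and paths are illustrative; adjust to your setup):

# build with debug symbols and CUDA enabled
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DGGML_CUDA=ON
cmake --build build -j

# re-run the failing command under gdb and capture the backtrace
gdb --args ./build/bin/llama-cli -m /home/ftp/AI/microsoft/phi-4-gguf/phi-4-Q4_K.gguf \
  -st --simple-io --flash-attn --no-display-prompt -ngl 41 --threads 8 \
  --temp 0.25 --top-p 0.95 --ctx-size 2048 -p "test"
(gdb) run
(gdb) bt full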

First Bad Commit

b5429 (supposedly)
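
The suspected range could be confirmed with a bisect between the two release tags. A rough sketch, assuming the b5426/b5429 tags are available in a checkout of the upstream repo and the build is repeated at each step:

git bisect start
git bisect bad b5429
git bisect good b5426
# at each bisect step: rebuild, re-run the failing command, then mark the result
cmake -B build -DGGML_CUDA=ON && cmake --build build -j
./build/bin/llama-cli -m /home/ftp/AI/microsoft/phi-4-gguf/phi-4-Q4_K.gguf \
  --flash-attn -ngl 41 --ctx-size 2048 -p "test"
git bisect good   # or: git bisect bad, depending on whether it segfaults
git bisect reset  # when finished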

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
register_backend: registered backend CUDA (1 devices)
register_device: registered device CUDA0 (NVIDIA GeForce RTX 3060)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (OpenBLAS)
register_backend: registered backend RPC (0 devices)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (AMD Ryzen 7 3800X 8-Core Processor)
build: 5435 (a4090d117) with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) - 10769 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 243 tensors from /home/ftp/AI/microsoft/phi-4-gguf/phi-4-Q4_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Phi 4
llama_model_loader: - kv   3:                            general.version str              = 4
llama_model_loader: - kv   4:                       general.organization str              = Microsoft
llama_model_loader: - kv   5:                           general.basename str              = phi
llama_model_loader: - kv   6:                         general.size_label str              = 15B
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/microsoft/phi-...
llama_model_loader: - kv   9:                               general.tags arr[str,7]       = ["phi", "nlp", "math", "code", "chat"...
llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  11:                        phi3.context_length u32              = 16384
llama_model_loader: - kv  12:  phi3.rope.scaling.original_context_length u32              = 16384
llama_model_loader: - kv  13:                      phi3.embedding_length u32              = 5120
llama_model_loader: - kv  14:                   phi3.feed_forward_length u32              = 17920
llama_model_loader: - kv  15:                           phi3.block_count u32              = 40
llama_model_loader: - kv  16:                  phi3.attention.head_count u32              = 40
llama_model_loader: - kv  17:               phi3.attention.head_count_kv u32              = 10
llama_model_loader: - kv  18:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                  phi3.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                        phi3.rope.freq_base f32              = 250000.000000
llama_model_loader: - kv  21:              phi3.attention.sliding_window u32              = 0
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = dbrx
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 100257
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 100265
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 100349
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {% for message in messages %}{% if (m...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_K:  101 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   21 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 8.43 GiB (4.94 BPW) 
load: special tokens cache size = 96
load: token to piece cache size = 0.6151 MB
print_info: arch             = phi3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 16384
print_info: n_embd           = 5120
print_info: n_layer          = 40
print_info: n_head           = 40
print_info: n_head_kv        = 10
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1280
print_info: n_embd_v_gqa     = 1280
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 17920
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 250000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 16384
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 14B
print_info: model params     = 14.66 B
print_info: general.name     = Phi 4
print_info: vocab type       = BPE
print_info: n_vocab          = 100352
print_info: n_merges         = 100000
print_info: BOS token        = 100257 '<|endoftext|>'
print_info: EOS token        = 100265 '<|im_end|>'
print_info: EOT token        = 100265 '<|im_end|>'
print_info: PAD token        = 100349 '<|dummy_85|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 100258 '<|fim_prefix|>'
print_info: FIM SUF token    = 100260 '<|fim_suffix|>'
print_info: FIM MID token    = 100259 '<|fim_middle|>'
print_info: EOG token        = 100257 '<|endoftext|>'
print_info: EOG token        = 100265 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 40 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 41/41 layers to GPU
load_tensors:        CUDA0 model buffer size =  8354.71 MiB
load_tensors:   CPU_Mapped model buffer size =   275.62 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 2048
llama_context: n_ctx_per_seq = 2048
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 250000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (2048) < n_ctx_train (16384) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.38 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   400.00 MiB
llama_kv_cache_unified: size =  400.00 MiB (  2048 cells,  40 layers), K (f16):  200.00 MiB, V (f16):  200.00 MiB
Segmentation fault (core dumped)
