Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 523 (84d5475)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
CPU: 13th Gen Intel(R) Core(TM) i7-13700T (24) @ 4.90 GHz
GPU: NVIDIA GeForce RTX 3090 [Discrete]
Memory: 9.56 GiB / 125.51 GiB (8%)
Models
Phi-3.5-mini-instruct-Q4_K_M.gguf
Problem description & steps to reproduce
The embeddings model runs, but since the update I get a segmentation fault:
strace llama-server -m Microsoft/quantized/Phi-3.5-mini-instruct-Q4_K_M.gguf
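As a possible triage step (a sketch, assuming a build with debug symbols and the same paths as the command above): running under gdb instead of strace should give a symbolized backtrace pointing at the crashing function rather than just the raw syscall stream.

```shell
# Sketch: run llama-server under gdb and print a backtrace on crash.
# Paths match the strace command above; adjust for your setup.
gdb --batch -ex run -ex bt --args \
  llama-server -m Microsoft/quantized/Phi-3.5-mini-instruct-Q4_K_M.gguf
```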
First Bad Commit
_No response_
Relevant log output
```shell
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f6dc, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f6d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f6dc, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f6d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f6dc, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f6d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f6dc, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f6d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f6dc, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 1
brk(0x55fe2961f000) = 0x55fe2961f000
futex(0x55fe21b0f6d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 1
fadvise64(35, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 2393232384, PROT_READ, MAP_SHARED|MAP_POPULATE, 35, 0) = 0x7f9705400000
madvise(0x7f9705400000, 2393232384, MADV_WILLNEED) = 0
futex(0x55fe21b0f6dc, FUTEX_WAKE_PRIVATE, 1) = 1
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 2281.66 MiB
..........................................futex(0x55fe21b0f680, FUTEX_WAIT_PRIVATE, 2, NULL....) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x55fe21b0f6d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 0
...............................munmap(0x7f9705400000, 737280.............) = 0
futex(0x55fe21b0f6dc, FUTEX_WAKE_PRIVATE, 1) = 1
close(35) = 0
.
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
futex(0x55fe21b0f6d8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55fe21b0f680, FUTEX_WAKE_PRIVATE, 1) = 1
mmap(NULL, 1610616832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f96a54b3000
llama_context: CPU output buffer size = 0.12 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
futex(0x55fe21b0f6dc, FUTEX_WAKE_PRIVATE, 1) = 1
init: CPU KV buffer size = 1536.00 MiB
llama_context: KV self size = 1536.00 MiB, K (f16): 768.00 MiB, V (f16): 768.00 MiB
mmap(NULL, 3280896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f97c04dc000
brk(0x55fe296cb000) = 0x55fe296cb000
brk(0x55fe29773000) = 0x55fe29773000
brk(0x55fe2981b000) = 0x55fe2981b000
mmap(NULL, 55316480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f96a1ff2000
futex(0x55fe21b0f6d8, FUTEX_WAKE_PRIVATE, 1) = 1
brk(0x55fe29840000) = 0x55fe29840000
ioctl(9, _IOC(_IOC_NONE, 0, 0x17, 0), 0x7ffc5d9f87a0) = 0
brk(0x55fe29868000) = 0x55fe29868000
brk(0x55fe298e5000) = 0x55fe298e5000
mmap(NULL, 436203520, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9687ff3000
munmap(0x7f9687ff3000, 53248) = 0
munmap(0x7f96a0000000, 33497088) = 0
ioctl(8, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2b, 0x30), 0x7ffc5d9f7e30) = 0
ioctl(9, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffc5d9f7b00) = 0
ioctl(9, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc5d9f56f0) = 0
mmap(0x7f969e400000, 20979712, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f969e400000
ioctl(11, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x27, 0x38), 0x7ffc5d9f7da0) = 0
ioctl(9, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffc5d9f7a80) = 0
ioctl(9, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc5d9f5670) = 0
futex(0x55fe21b0f6dc, FUTEX_WAKE_PRIVATE, 1) = 1
llama_context: CUDA0 compute buffer size = 354.00 MiB
llama_context: CUDA_Host compute buffer size = 20.01 MiB
llama_context: graph nodes = 1286
llama_context: graph splits = 293 (with bs=512), 6 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x48} ---
+++ killed by SIGSEGV +++
```