Name and Version
llama-cli version 6710 (74b8fc1)
Also tested with:
- Official xcframework (latest release from GitHub)
- Homebrew llama.cpp version 6710 and last 2 builds
Both exhibit the same crash.
Operating systems
Mac
GGML backends
Metal, CPU
Hardware
Apple M2 Ultra
Models
Base Model:
- Name: Hermes-3-Llama-3.2-3B
- Quantization: Q4_0
- Size: 1.8GB
- Source: https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B-GGUF
- File: Hermes-3-Llama-3.2-3B_q4_0.gguf
LoRA Adapter:
- Name: Hermes-3-Llama-3.2-3B_adapter
- Size: 93MB
- Format: GGUF
- Trained and tested on Windows (works fine with llama-cli there)
Problem description & steps to reproduce
LoRA adapter inference crashes on Mac with a graph size assertion error, but works perfectly on Windows. Base model inference and LoRA adapter loading both work fine on Mac - only inference with an active LoRA adapter crashes. (A minimal API-level reproduction sketch follows the lists below.)
Steps to reproduce:

1. Test the base model (this works):

llama-cli -m Hermes-3-Llama-3.2-3B_q4_0.gguf -n 20 -ngl 99 -c 2048 -b 256

Result: works perfectly, generates text.

2. Test with the LoRA adapter (this crashes):

llama-cli -m Hermes-3-Llama-3.2-3B_q4_0.gguf --lora gandalf_Hermes-3-Llama-3.2-3B_adapter.gguf -n 20 -ngl 99 -c 2048 -b 256

Result: crashes immediately with GGML_ASSERT.
Workarounds attempted (all failed):
- Reduced context: -c 512, -c 1024
- Reduced batch: -b 128
- CPU only: -ngl 0
- Limited threads: -t 4
- Different prompts
- Latest xcframework (just updated)
- Homebrew build (v6710)

Results by environment:
- Windows: Base + LoRA works perfectly
- Mac (xcframework): Base works, LoRA crashes
- Mac (Homebrew): Base works, LoRA crashes
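
For completeness, here is a minimal sketch of the equivalent C API call sequence (untested as a standalone file on my side; function names follow recent llama.h, while older headers spell the adapter calls llama_lora_adapter_init / llama_lora_adapter_set). The stack trace in the log below shows the abort already firing during the warmup llama_decode inside common_init_from_params, so a single-token decode should be enough to trigger it:

// Minimal API-level call sequence (sketch, assuming recent llama.h naming).
#include "llama.h"
#include <stdio.h>

int main(void) {
    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  // matches -ngl 99 (crash also occurs with 0)

    struct llama_model * model =
        llama_model_load_from_file("Hermes-3-Llama-3.2-3B_q4_0.gguf", mparams);
    if (model == NULL) { fprintf(stderr, "model load failed\n"); return 1; }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx   = 2048;     // matches -c 2048
    cparams.n_batch = 256;      // matches -b 256

    struct llama_context * ctx = llama_init_from_model(model, cparams);

    // Adapter loading succeeds on Mac ...
    struct llama_adapter_lora * adapter =
        llama_adapter_lora_init(model, "Hermes-3-Llama-3.2-3B_adapter.gguf");
    if (adapter == NULL) { fprintf(stderr, "adapter load failed\n"); return 1; }
    llama_set_adapter_lora(ctx, adapter, 1.0f);

    // ... but the first decode with the adapter active aborts in
    // ggml_backend_sched_alloc_graph with the GGML_ASSERT shown below.
    llama_token bos = llama_vocab_bos(llama_model_get_vocab(model));
    llama_decode(ctx, llama_batch_get_one(&bos, 1));

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}

This mirrors the failing llama-cli invocation minus sampling, which never gets a chance to run because the abort happens on the first graph allocation.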
First Bad Commit
Unable to determine - the issue is present in the latest release and the two previous builds tested.
Relevant log output
Command executed:
llama-cli -m models/hermes3/base/Hermes-3-Llama-3.2-3B_q4_0.gguf --lora models/hermes3/adapters/Hermes-3-Llama-3.2-3B_adapter.gguf -p "Tell me about wizards." -n 20 -ngl 99 -c 2048 -b 256
Output:
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_device_init: GPU name: Apple M2 Ultra
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 115448.73 MB
build: 6710 (74b8fc17) with Apple clang version 17.0.0 (clang-1700.3.19.1) for arm64-apple-darwin25.0.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M2 Ultra) (unknown id) - 110100 MiB free
[... model loading succeeds ...]
llama_lora_adapter_init_impl: applying lora adapter from 'models/hermes3/adapters/gandalf_Hermes-3-Llama-3.2-3B_adapter.gguf'
[... LoRA loading succeeds ...]
Process 40788 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x000000018d35142c libsystem_kernel.dylib`__wait4 + 8
Stack trace:
frame #0: libsystem_kernel.dylib`__wait4 + 8
frame #1: libggml-base.dylib`ggml_abort + 156
frame #2: libggml-base.dylib`ggml_backend_sched_alloc_graph + 464
frame #3: libllama.dylib`llama_context::process_ubatch + 516
frame #4: libllama.dylib`llama_context::decode + 1148
frame #5: libllama.dylib`llama_decode + 20
frame #6: llama-cli`common_init_from_params + 2168
frame #7: llama-cli`main + 636
Error message:
/Users/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-backend.cpp:1718: GGML_ASSERT((int)sched->hash_set.size >= graph->n_nodes + graph->n_leafs) failed
Exit: signal SIGABRT (abort)
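
Reading the failing assertion: it checks that the scheduler's preallocated hash set is at least as large as graph->n_nodes + graph->n_leafs. My (unconfirmed) interpretation is that the compute graph built while the LoRA adapter is active contains more nodes than the scheduler reserved when the context was created, while the base-model graph stays within that budget - which would match base inference working and only LoRA inference aborting.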