Closed
Labels
bug-unconfirmed, critical severity (Used to report critical severity bugs in llama.cpp, e.g. Crashing, Corrupted, Dataloss)
Description
What happened?
I'm trying to run llama-server on an Apple Silicon M2 running macOS Ventura. The same error occurs with both the latest release and a build from source. The model is Llama-3.2-3B-Instruct (F16) from Meta, which I converted to GGUF with convert_hf_to_gguf.py.
$ ./llama-server -m Llama-3.2-3B-Instruct-F16.gguf --verbose
Name and Version
From source
$ ./llama-cli --version
version: 4048 (a71d81c)
built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for arm64-apple-darwin22.6.0
From the release
$ ./llama-cli --version
version: 4044 (97404c4)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.6.0
What operating system are you seeing the problem on?
Mac
Relevant log output
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: using embedded metal library
ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:5134:17: error: zero-length arrays are not permitted in C++
    float4x4 lo[D16/NW4];
                ^~~~~~~
program_source:5391:18: note: in instantiation of function template specialization 'kernel_flash_attn_ext_vec<metal::matrix<half, 4, 4, void>, 1, &dequantize_f16, 64, 1, 32>' requested here
typedef decltype(kernel_flash_attn_ext_vec<half4x4, 1, dequantize_f16, 64>) flash_attn_ext_vec_t;
                 ^
program_source:5186:21: error: zero-length arrays are not permitted in C++
        float4x4 mq[D16/NW4];
                    ^~~~~~~
" UserInfo={NSLocalizedDescription=program_source:5134:17: error: zero-length arrays are not permitted in C++
    float4x4 lo[D16/NW4];
                ^~~~~~~
program_source:5391:18: note: in instantiation of function template specialization 'kernel_flash_attn_ext_vec<metal::matrix<half, 4, 4, void>, 1, &dequantize_f16, 64, 1, 32>' requested here
typedef decltype(kernel_flash_attn_ext_vec<half4x4, 1, dequantize_f16, 64>) flash_attn_ext_vec_t;
                 ^
program_source:5186:21: error: zero-length arrays are not permitted in C++
        float4x4 mq[D16/NW4];
                    ^~~~~~~
}
ggml_backend_metal_device_init: error: failed to allocate context
llama_new_context_with_model: failed to initialize Metal backend
common_init_from_params: failed to create context with model '/Users/username/dev/RAG/Llama-3.2-3B-Instruct-F16.gguf'
srv    load_model: failed to load model, '/Users/username/dev/RAG/Llama-3.2-3B-Instruct-F16.gguf'
main: exiting due to model loading error
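A possible reading of the compiler error, as a sketch only: the log shows the kernel being instantiated with a head size of 64 (the fourth template argument) and declaring arrays of length D16/NW4. The kernel source is not quoted here, so the definitions below are assumptions inferred from the identifier names: D16 = D/16, NW4 = NW/4, and a SIMD width NW of 32. The helper array_len is hypothetical and exists only for this illustration. Under those assumptions, integer division gives 4/8 = 0, which would explain the "zero-length arrays" message for D = 64, while a head size of 128 would give a length of 1.

```cpp
// Sketch under stated assumptions: D16 = D/16 and NW4 = NW/4 are guessed from
// the names in the log; array_len is a hypothetical helper, not llama.cpp code.
constexpr int array_len(int D, int NW) {
    // (D / 16) plays the role of D16, (NW / 4) the role of NW4.
    // Integer division truncates toward zero, so 4 / 8 evaluates to 0.
    return (D / 16) / (NW / 4);
}

// Head size 64 with an assumed SIMD width of 32: the array length collapses to 0.
static_assert(array_len(64, 32) == 0, "D=64 yields a zero-length array");

// Head size 128 with the same assumed width: the array length is 1.
static_assert(array_len(128, 32) == 1, "D=128 yields a one-element array");
```

If this reading is right, the failure depends on the head size of the model being loaded rather than on the Metal setup itself, but that would need to be confirmed against the actual kernel source.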