Is the macOS Precompiled Binary Preconfigured with LLAMA_METAL=1? #1498
-
Hello Community, I am working with the macOS precompiled binary provided in this repository and am wondering whether it is already built with LLAMA_METAL=1. Could someone confirm if this is the case, or whether additional setup is required to enable Metal support on macOS? My setup is:
I was previously running a koboldcpp build I compiled myself with Gemma-3-12b-it-MAX-HORROR-D_AU-Q5_K_M-imat.gguf (context size 128000) and generating at comfortable reading speeds (not a super helpful description, I know, and I apologize). After recently switching over to the latest macOS precompiled binary, it takes about half an hour to generate a paragraph. The flags I am using on boot are `--contextsize 128000 --gpulayers 5 --quiet --threads 6`. If I adjust the gpulayers or threads it crashes; this seems to be the most stable configuration I have tried. Thank you for your help! (I have searched the open discussions and the ones I have found are a year or more out of date, so I am checking to see if anything has changed.)

EDIT: The full command I am running with the latest precompiled binary for macOS is `./koboldcpp-mac-arm64 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --contextsize 128000 --gpulayers 5 --quiet --threads 8 --blasthreads 8`, and boy oh boy is it SLOW. D: I went to see how I was running it prior to the update: `python koboldcpp.py --noblas --threads 4 --blasthreads 4 --blasbatchsize 1024 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --nommap --usemlock --quiet`. Looking back at my original assortment of flags, some of them are redundant when I compare them to the Wiki. I got them from a Reddit post a while back when I was struggling to load any LLM, and for whatever reason it worked.

EDIT No. 2, Electric Boogaloo: So I gaslit myself into thinking maybe I just needed a smaller quant, so I dropped to `./koboldcpp-mac-arm64 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q5_K_M-imat.gguf --contextsize 128000 --gpulayers 5 --quiet --threads 8 --blasthreads 8`, which I know for a fact ran like Great Value butter previously, but it is now akin to molasses. Mercy me, what have I done. 🤦‍♂️

EDIT 3: I am happy to run whatever diagnostic tests I can and provide any additional details, as prior to this most recent update koboldcpp has been hands down the best backend. I have tried `--benchmark` but it just hangs forever. I am going to try to leave it running tonight and will update tomorrow if anything generates.
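For anyone else debugging this: one quick way to confirm the binary is actually using Metal is to grep the startup output for llama.cpp's Metal init lines. A minimal sketch, assuming the `ggml_metal_init` log tag that llama.cpp has historically printed (the exact string may vary by version):

```sh
# Launch with logs captured, then look for Metal initialization messages.
# "ggml_metal_init" is assumed from llama.cpp's historical log format.
./koboldcpp-mac-arm64 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf 2>&1 | tee boot.log | grep -i metal
```

If nothing Metal-related appears, the model is likely running on CPU regardless of `--gpulayers`.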
-
Yes, the arm binaries do have Metal enabled.

Why are you only offloading 5 layers though? I see you used `--gpulayers 5`, but why? With unified memory you should do something more like `--gpulayers 999`.

Let's try with a simple launch command:

`./koboldcpp-mac-arm64 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 999`

Tell me if this works.
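A quick sanity check on the 999: koboldcpp clamps `--gpulayers` to the model's real layer count, which appears in the loader's metadata dump. A hedged sketch; the `n_layer` field name is assumed from llama.cpp's log format and may differ across versions:

```sh
# Print the model's layer count from the load-time metadata dump.
# The "n_layer" field name is an assumption based on llama.cpp's logging.
./koboldcpp-mac-arm64 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf 2>&1 | grep -i "n_layer"
```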
-
Thank you for your help! Trying to increase the context to 16k caused it to crash out. I know in the past I was able to achieve 128k; it was not lightning fast, but it was a comfortable reading speed.
-
It crashed due to running out of memory (OOM). Your M1 only has 16 GB, correct? Are you sure you were able to load 128k of context before? You can try adding
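For a rough sense of why 128k can OOM: the KV cache alone for a 12B-class model is huge at that context. A back-of-envelope sketch; the layer/head/dim numbers are assumptions taken from the published Gemma 3 12B config, and this is an upper bound since Gemma 3 uses sliding-window attention on most layers, which shrinks the real figure considerably:

```sh
# KV cache ≈ 2 (K+V) × n_layer × n_kv_heads × head_dim × n_ctx × bytes/elem
# Assumed Gemma 3 12B shape: 48 layers, 8 KV heads, head_dim 256; fp16 cache; n_ctx 131072.
echo "$(( 2 * 48 * 8 * 256 * 131072 * 2 / 1024 / 1024 / 1024 )) GiB"   # prints: 48 GiB
```

Even generously discounted for the sliding-window layers, that on top of roughly 10 GB of Q6_K weights is a very tight fit in 16 GB.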
-
Something else I noticed that has changed is this "zsh: segmentation fault" error when using my original list of flags. I thought it was due to the most recent updates to koboldcpp, but I am now realizing that my OS was also recently updated. Sorry, I have been on autopilot these last few weeks. I am marking this as resolved and going to look into the segmentation fault error. If I manage to get it working again (I swear to the moon and back I was able to use context 128k; I have triple checked my terminal history) I will drop a message and see if there is anything that can be done for other Mac users. It seems like maybe the OS is preventing full access to memory? I am not entirely sure; I need coffee. Thank you again!
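On that last hunch: macOS does cap how much unified memory the GPU may wire, by default roughly two-thirds to three-quarters of RAM. On recent Apple Silicon builds the limit can be inspected and temporarily raised via a sysctl; the key name below, `iogpu.wired_limit_mb`, may not exist on older macOS versions, so treat it as an assumption:

```sh
# Inspect the GPU wired-memory limit (Apple Silicon; key availability varies by macOS version).
sysctl iogpu.wired_limit_mb
# Temporarily raise it to 12 GB on a 16 GB machine (needs sudo; resets on reboot).
sudo sysctl iogpu.wired_limit_mb=12288
```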