Request for Optimization Help: Running Large Context Models on M1 MacBook Pro (2020) #1501
Replies: 3 comments
-
Just a thought - by not offloading most layers (e.g. Configuration 2 or Configuration 3 with
-
Thank you for your insight! I wanted to provide an update based on further testing. When I increase --gpulayers beyond 5 (e.g., --gpulayers 10), the configuration fails with the following error:
It seems my system runs out of memory when attempting to offload more layers, so I’ve been sticking to configurations with --gpulayers 5 or fewer. I greatly appreciate all the work you do on this project and don’t want to send you on a wild goose chase. For now, I’m hoping to gather experiences from fellow Mac users to see if anyone has encountered a similar issue or found a workaround. I feel it in my bones that something has changed; I am just struggling to identify what. Thanks again for everything!
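In case it helps anyone compare notes, here is a rough sketch of how memory pressure could be watched while the model loads (vm_stat is built into macOS; the 5-second interval and the --gpulayers 10 run are just for illustration, not output from my tests):
vm_stat 5   # in a second terminal: print memory statistics every 5 seconds while koboldcpp loads
./koboldcpp-mac-arm64 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 10 --contextsize 128000 --threads 4 --flashattention
If "Pages compressed" and "Pageouts" climb sharply as soon as the extra layers are offloaded, that would point at genuine memory pressure rather than anything koboldcpp-specific.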
-
I see that Apple made adjustments to bounds checking with the most recent security content update, macOS 15.4.1. I have no idea to what extent, if at all, this would impact koboldcpp, but I am going to look into that tomorrow.
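For anyone who wants to confirm they are on the same update, a quick sketch using built-in macOS tools (nothing koboldcpp-specific):
softwareupdate --history   # lists recently installed updates, including security content updates
sw_vers                    # shows the current macOS product version and build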
-
Hi everyone,
I'm looking for advice on optimizing koboldcpp to run large-context models (128k context size) on my 2020 MacBook Pro M1. Below are the tests I've run, along with the results for different configurations.
System Details
Notes: I made a boo-boo and updated both my OS and koboldcpp on the same night, and something has definitely changed in how my Mac is managing memory. I am trying to get to the bottom of it, but I was hoping some other Mac users might have some insight. I understand that my machine is not a powerhouse; however, I believe that AI performance is not solely dependent on powerful hardware but also relies heavily on efficient algorithms, software optimization, and system architecture (think of how DeepSeek came to be). Prior to the updates I was able to run this exact same model with 128k context using some flags I found on Reddit, which I now understand were contradictory, but when I try to run them again post OS update, I get a segmentation fault.
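One way I could try to separate the two variables, assuming I can still download or check out the koboldcpp version I was on before the update, would be to run that older build on the new OS with a configuration that currently passes (Configuration 2 below); the binary name here is a placeholder for whatever the previous release is actually called:
./koboldcpp-mac-arm64-previous --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 5 --contextsize 128000 --threads 4 --flashattention
If the older build also misbehaves on the new OS, the OS update looks like the culprit; if it runs as before, the koboldcpp update does.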
Test Data
Configuration 0: Pre-Updates (I understand the flags are whack; I am learning and got this from a Reddit post, but that does not change the reality that it was working)
python koboldcpp.py --noblas --threads 4 --blasthreads 4 --blasbatchsize 1024 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --nommap --usemlock --quiet
Result: Ran at 'reading' speeds. Unfortunately I am unable to reproduce this: when I attempt to run with the original flags post OS update, I get a segmentation fault.
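If the crash details would help, here is a sketch for pulling them, assuming macOS wrote a crash report for the failed process (it usually does for user-space segmentation faults):
ls -t ~/Library/Logs/DiagnosticReports | head   # newest crash reports first
open ~/Library/Logs/DiagnosticReports           # or browse them in Console.app
The report should name the library the fault occurred in, which would help tell a koboldcpp issue apart from an OS-level change.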
Configuration 1: Post-Update
./koboldcpp-mac-arm64 --usemmap --threads 4 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf
Result: Fails with the following errors.
Configuration 2: Post-Update
./koboldcpp-mac-arm64 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 5 --contextsize 128000 --threads 4 --flashattention
Result: Passes
Processing Prompt: [BLAS] (113 / 113 tokens)
Generating Output: (29 / 330 tokens)
Performance:
[14:09:44] CtxLimit:142/128000, Amt:29/330, Init:0.01s, Process:16.41s (6.89T/s), Generate:400.01s (0.07T/s), Total:416.42s
Configuration 3: Post-Update
./koboldcpp-mac-arm64 --usemmap --threads 4 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 5 --flashattention
Result: Passes
Processing Prompt: [BLAS] (113 / 113 tokens)
Generating Output: (19 / 330 tokens)
Performance:
[13:44:38] CtxLimit:132/128000, Amt:19/330, Init:0.01s, Process:12.29s (9.19T/s), Generate:112.75s (0.17T/s), Total:125.04s
Configuration 4: Post-Update
./koboldcpp-mac-arm64 --usemmap --threads 4 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 10 --flashattention
Result: Fails with the following errors.
Configuration 5: Post-Update
./koboldcpp-mac-arm64 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --contextsize 128000 --threads 4 --flashattention
Result: Fails; the whole system freezes.
Configuration 6: Post-Update
./koboldcpp-mac-arm64 --usemmap --threads 6 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 5 --flashattention
Result: Passes
Processing Prompt: [BLAS] (113 / 113 tokens)
Generating Output: (29 / 330 tokens)
Performance:
[14:45:57] CtxLimit:142/128000, Amt:29/330, Init:0.01s, Process:18.37s (6.15T/s), Generate:267.75s (0.11T/s), Total:286.11s
Configuration 7: Post-Update
./koboldcpp-mac-arm64 --usemmap --threads 8 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 5 --flashattention
Result: Passes
Processing Prompt: [BLAS] (113 / 113 tokens)
Generating Output: (33 / 330 tokens)
Performance:
[16:08:26] CtxLimit:146/128000, Amt:33/330, Init:0.01s, Process:15.74s (7.18T/s), Generate:293.36s (0.11T/s), Total:309.09s
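If it is useful, here is a rough sketch of how I could sweep --gpulayers values instead of jumping straight from 5 to 10; it assumes the build supports the --benchmark flag (which loads the model, runs a short test, and exits) - if it does not, each run would have to be stopped by hand:
for n in 5 6 7 8 9 10; do
  echo "=== gpulayers $n ==="
  ./koboldcpp-mac-arm64 --usemmap --threads 4 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers $n --flashattention --benchmark || echo "failed at $n layers"
done
The idea is just to find the highest layer count that still loads cleanly before tuning anything else.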
What I’ve Tried So Far:
Questions:
Thanks in advance for any guidance!