Request for Optimization Help: Running Large Context Models on M1 MacBook Pro (2020) #1501
Replies: 3 comments
-
Just a thought - by not offloading most layers (e.g. Configuration 2 or Configuration 3 with
-
Thank you for your insight! I wanted to provide an update based on further testing. When I increase --gpulayers beyond 5 (e.g., --gpulayers 10), the configuration fails with the following error:
It seems my system runs out of memory when attempting to offload more layers, so I’ve been sticking to configurations with --gpulayers 5 or fewer. I greatly appreciate all the work you do on this project and don’t want to send you on a wild goose chase. For now, I’m hoping to gather experiences from fellow Mac users to see if anyone has encountered a similar issue or found a workaround. I feel it in my bones that something has changed; I am just struggling to identify what. Thanks again for everything!
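In case it helps anyone compare notes, here is a rough sketch of how memory pressure could be watched while the model loads (vm_stat is built into macOS; the 5-second interval and the --gpulayers 10 run are just for illustration, not output from my tests):
vm_stat 5   # in a second terminal: print memory statistics every 5 seconds while koboldcpp loads
./koboldcpp-mac-arm64 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 10 --contextsize 128000 --threads 4 --flashattention
If "Pages compressed" and "Pageouts" climb sharply as soon as the extra layers are offloaded, that would point at genuine memory pressure rather than anything koboldcpp-specific.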
-
I see that Apple made adjustments to bounds checking with the most recent security content update, macOS 15.4.1. I have no idea to what extent, if at all, this would impact koboldcpp, but I am going to look into that tomorrow.
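For anyone who wants to confirm they are on the same update, a quick sketch using built-in macOS tools (nothing koboldcpp-specific):
softwareupdate --history   # lists recently installed updates, including security content updates
sw_vers                    # shows the current macOS product version and build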
-
Hi everyone,
I'm looking for advice on optimizing koboldcpp to run large-context models (128k context size) on my 2020 MacBook Pro M1. Below are the tests I've run, along with the results for different configurations.
System Details
Notes: I made a boo-boo and updated both my OS and koboldcpp on the same night, and something has definitely changed in how my Mac is managing memory. I am trying to get to the bottom of it, but I was hoping some other Mac users might have some insight. I understand that my machine is not a powerhouse; however, I believe that AI performance is not solely dependent on powerful hardware but also relies heavily on efficient algorithms, software optimization, and system architecture (think of how DeepSeek came to be). Prior to the updates I was able to run this exact same model with 128k context using some flags I found on Reddit, which I now understand were contradictory, but when I try to run them again post OS update, I get a segmentation fault.
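One way I could try to separate the two variables, assuming I can still download or check out the koboldcpp version I was on before the update, would be to run that older build on the new OS with a configuration that currently passes (Configuration 2 below); the binary name here is a placeholder for whatever the previous release is actually called:
./koboldcpp-mac-arm64-previous --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 5 --contextsize 128000 --threads 4 --flashattention
If the older build also misbehaves on the new OS, the OS update looks like the culprit; if it runs as before, the koboldcpp update does.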
Test Data
Configuration 0: Pre-Updates (I understand the flags are whack; I am learning and got this from a Reddit post, but that does not change the reality that it was working)
python koboldcpp.py --noblas --threads 4 --blasthreads 4 --blasbatchsize 1024 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --nommap --usemlock --quiet
Result: Ran at 'reading' speeds. Unfortunately I am unable to reproduce this: when I attempt to run with the original flags post OS update, I get a segmentation fault.
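If the crash details would help, here is a sketch for pulling them, assuming macOS wrote a crash report for the failed process (it usually does for user-space segmentation faults):
ls -t ~/Library/Logs/DiagnosticReports | head   # newest crash reports first
open ~/Library/Logs/DiagnosticReports           # or browse them in Console.app
The report should name the library the fault occurred in, which would help tell a koboldcpp issue apart from an OS-level change.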
Configuration 1: Post-Update
./koboldcpp-mac-arm64 --usemmap --threads 4 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf
Result: Fails with the following errors.
Configuration 2: Post-Update
./koboldcpp-mac-arm64 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 5 --contextsize 128000 --threads 4 --flashattention
Result: Passes
Processing Prompt: [BLAS] (113 / 113 tokens)
Generating Output: (29 / 330 tokens)
Performance:
[14:09:44] CtxLimit:142/128000, Amt:29/330, Init:0.01s, Process:16.41s (6.89T/s), Generate:400.01s (0.07T/s), Total:416.42s
Configuration 3: Post-Update
./koboldcpp-mac-arm64 --usemmap --threads 4 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 5 --flashattention
Result: Passes
Processing Prompt: [BLAS] (113 / 113 tokens)
Generating Output: (19 / 330 tokens)
Performance:
[13:44:38] CtxLimit:132/128000, Amt:19/330, Init:0.01s, Process:12.29s (9.19T/s), Generate:112.75s (0.17T/s), Total:125.04s
Configuration 4: Post-Update
./koboldcpp-mac-arm64 --usemmap --threads 4 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 10 --flashattention
Result: Fails with the following errors.
Configuration 5: Post-Update
./koboldcpp-mac-arm64 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --contextsize 128000 --threads 4 --flashattention
Result: Fails; the whole system freezes.
Configuration 6: Post-Update
./koboldcpp-mac-arm64 --usemmap --threads 6 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 5 --flashattention
Result: Passes
Processing Prompt: [BLAS] (113 / 113 tokens)
Generating Output: (29 / 330 tokens)
Performance:
[14:45:57] CtxLimit:142/128000, Amt:29/330, Init:0.01s, Process:18.37s (6.15T/s), Generate:267.75s (0.11T/s), Total:286.11s
Configuration 7: Post-Update
./koboldcpp-mac-arm64 --usemmap --threads 8 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers 5 --flashattention
Result: Passes
Processing Prompt: [BLAS] (113 / 113 tokens)
Generating Output: (33 / 330 tokens)
Performance:
[16:08:26] CtxLimit:146/128000, Amt:33/330, Init:0.01s, Process:15.74s (7.18T/s), Generate:293.36s (0.11T/s), Total:309.09s
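If it is useful, here is a rough sketch of how I could sweep --gpulayers values instead of jumping straight from 5 to 10; it assumes the build supports the --benchmark flag (which loads the model, runs a short test, and exits) - if it does not, each run would have to be stopped by hand:
for n in 5 6 7 8 9 10; do
  echo "=== gpulayers $n ==="
  ./koboldcpp-mac-arm64 --usemmap --threads 4 --contextsize 128000 --model models/Gemma-3-12b-it-MAX-HORROR-D_AU-Q6_K-imat.gguf --gpulayers $n --flashattention --benchmark || echo "failed at $n layers"
done
The idea is just to find the highest layer count that still loads cleanly before tuning anything else.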
What I’ve Tried So Far:
Questions:
Thanks in advance for any guidance!