Conversation

Contributor

@tc-mb tc-mb commented Aug 12, 2025

As stated in #14983, I have integrated Apple NPU (ANE) acceleration into llama.cpp.

Using MiniCPM-V 4.0 as an example, I will introduce a simple way to use ANE and hope we can discuss a better approach.

  1. Build llama.cpp locally. I added a -DENABLE_COREML option to control whether the ANE is used:

cmake -B build -DENABLE_COREML=ON
cmake --build build --config Release -j 8

  2. Download the ANE model from Hugging Face or ModelScope. If you downloaded the zip file, unzip it first.

  3. Pass it the same way as mmproj: I added an "--ane" flag. Its value is the path to the downloaded ane_minicpmv4_vit_f16.mlmodelc file.

./build/bin/llama-mtmd-cli -m {dir_path}/ggml-model-Q4_0.gguf --mmproj {dir_path}/mmproj-model-f16.gguf --ane {dir_path}/ane_minicpmv4_vit_f16.mlmodelc -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image {dir_path}/xx.png -p "Describe the content of the image in detail." 

I tested ANE acceleration on several devices. The benchmark results are as follows:

mac M2, q4_K_M, prefill time (ms):

  #  image size   MiniCPM-V 4.0 (ANE)   MiniCPM-V 4.0
  1  448×448       790.26                 5716.77
  2  600×600      1894.24                17961.35
  3  700×700      2954.34                27866.59
  4  800×800      2964.44                27946.48
  5  1024×625     2977.56                30111.43
  6  1024×768     2975.98                30415.11
  7  1280×960     4065.79                41889.12

mac M4, q4_K_M, prefill time (ms):

  #  image size   MiniCPM-V 4.0 (ANE)   MiniCPM-V 4.0
  1  448×448       412.57                  736.57
  2  600×600       989.44                 3365.09
  3  700×700      1564.61                 4031.90
  4  800×800      1555.85                 4124.81
  5  1024×625     1563.65                 5405.13
  6  1024×768     1567.45                 5169.05
  7  1280×960     2141.54                 7544.96

One point worth noting: the first time the ANE model is used there is a one-time loading cost, so the first run is slightly slower. After that, as long as the model is not updated, it remains loaded and ready in the system.

@github-actions github-actions bot added examples python python script changes labels Aug 12, 2025
Member

@ggerganov ggerganov left a comment


Generally looks OK. Need to improve encapsulation of the CoreML code (see comments). Would need a review from @ngxson.

Also:

  • Use "CoreML" instead of "ANE"
  • Would eventually need instructions for generating the CoreML inference code - can add those after the PR is approved

Comment on lines 115 to 117

// ANE support functions
void clip_set_ane_model_path(struct clip_ctx * ctx, const char * ane_model_path);
Member


We should find a way to avoid this. Maybe we can do something similar to whisper.cpp:

https://github.com/ggml-org/whisper.cpp/blob/f7502dca872866a310fe69d30b163fa87d256319/src/whisper.cpp#L3351-L3373

Comment on lines 3845 to 3852

static int flag = 0;
static const void* coremlEncoder = NULL;
static std::string cached_model_path = "";

// Check if we need to load a new model
if (flag == 0 || (ane_model_path && cached_model_path != ane_model_path)) {
if (coremlEncoder) {
Member


Avoid this global state. Figure out a way to move this to the clip context.

Collaborator

@ngxson ngxson left a comment


The overall idea is good. However, I think we should take the time to make sure this is useful in the long term.

The biggest issue at the moment is that many TODOs are being copied into the PR, which will make refactoring very difficult in the future. We must resolve this problem first.

Regarding UX: if we cannot have the embeddings and resampler all in one CoreML model, I think we should split it into two repos on Hugging Face or ModelScope, one with only the ggml implementation and one with CoreML. Having everything in the same place is very confusing for most users, and most of them don't even have time to look at this PR.

Comment on lines 3881 to 3883
ane_embedding(ctx, n_threads, &imgs, vit_embedding1);
clip_image_encode_ane(vit_embedding1, vit_embedding2, ctx->ane_model_path.c_str());
ane_resampler(ctx, n_threads, &imgs, vit_embedding2, vec);
Collaborator


It seems like only the ViT part is done by the ANE; the rest (embeddings, resampler) is still done by ggml. Is there any reason why we can't do the rest with the ANE too? I think it could be a cleaner approach, as we would then be able to load only the .mlmodelc file and no longer need the mmproj .gguf file.

Collaborator


Also, maybe we should try ggml_custom_4d and inject clip_image_encode_ane as a node in the ggml cgraph. If that works, it will make everything look much cleaner. Do you think this is a valid use case of ggml_custom_4d, @ggerganov?

Contributor Author


@ngxson Yes, only the ViT is currently replaced with the ANE.
Because the embed calculations aren't yet computed correctly with the ANE, I've bypassed the two embed calculations and replaced only the ViT itself.
I'm still trying other methods to see if there's a solution.

Member


Also, maybe we should try ggml_custom_4d and inject clip_image_encode_ane as a node in the ggml cgraph. If that works, it will make everything look much cleaner. Do you think this is a valid use case of ggml_custom_4d?

Haven't considered such a use case for ggml_custom_4d. Sounds worth exploring.

Contributor Author


@ngxson I'm sorry, I was delayed a bit last week while preparing for the release of V4.5.
However, your suggestion reminded me, and I've found a way to convert the entire ViT to CoreML. This will require a lot of changes, though, so I'll probably submit it next week.

Contributor Author


@ngxson I have integrated the entire ViT + resampler into CoreML for computation. The code has been updated; please review it when you have time.

Contributor Author

tc-mb commented Aug 13, 2025

@ggerganov @ngxson Yes, I understand that introducing a new feature requires more time to discuss its design, including its name, structure, and interface definition. All of this takes time, and I have plenty of time to prepare for it. I will follow the discussion and make sure this feature is incorporated into llama.cpp in a proper manner.
