Add OpenVINO backend #15307
base: master
Conversation
Hello, in this repo https://github.com/yangsu2022/GGUF-to-OpenVINO and the article https://blog.openvino.ai/blog-posts/openvino-genai-supports-gguf-models, only a small set of models is supported. Will this feature in llama.cpp offer wider GGUF coverage via something like the parameter mapping described there? A few other questions:
Thank you for your work!
Hi @SearchSavior,
Instead of converting GGUF models to PyTorch format with parameter mapping, this implementation uses OpenVINO's GGML frontend to directly translate GGML computation graphs to OpenVINO operations at runtime. The translation happens through a comprehensive operation mapping system that covers the core GGML operations. Since it works at the GGML operation level, it should support any model architecture that llama.cpp supports (assuming we map/translate all the GGML operators to OpenVINO.)
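To make that concrete, here is a rough, hypothetical sketch of what such a per-operation translation table can look like. It is not the code in this PR: the TranslateFn signature, the k_translators table, and the small main() walkthrough are illustrative assumptions; only the GGML_OP_* and ov::op::* identifiers are real.

// Illustrative sketch of a GGML -> OpenVINO operation translation table.
#include <functional>
#include <map>
#include <memory>
#include <vector>

#include <openvino/op/add.hpp>
#include <openvino/op/matmul.hpp>
#include <openvino/op/parameter.hpp>

#include "ggml.h"

using OvOutput = ov::Output<ov::Node>;
// Assumed translator signature: takes the OpenVINO outputs already built for a
// ggml node's source tensors and returns the output of the newly created node.
using TranslateFn = std::function<OvOutput(const std::vector<OvOutput> &)>;

static const std::map<ggml_op, TranslateFn> k_translators = {
    {GGML_OP_ADD, [](const std::vector<OvOutput> & in) -> OvOutput {
         return std::make_shared<ov::op::v1::Add>(in[0], in[1]);
     }},
    {GGML_OP_MUL_MAT, [](const std::vector<OvOutput> & in) -> OvOutput {
         // ggml_mul_mat(A, B) computes B * A^T; the transpose flag here is an
         // assumption about how the backend maps ggml's weight layout onto MatMul.
         return std::make_shared<ov::op::v0::MatMul>(in[1], in[0], false, true);
     }},
};

int main() {
    // Walk through one lookup: translate a GGML_OP_ADD node whose two source
    // tensors have already been turned into OpenVINO Parameter nodes.
    auto a = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{4});
    auto b = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{4});
    std::vector<OvOutput> inputs = {a, b};
    OvOutput sum = k_translators.at(GGML_OP_ADD)(inputs);
    (void) sum;
    return 0;
}

In a scheme like this, the backend would walk the ggml compute graph once, look up each node's op in the table, and stitch the returned outputs into an ov::Model that OpenVINO then compiles for the target device.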
The immediate focus is on runtime acceleration: kernel fusion, optimized graph execution, memory optimizations, and hardware scheduling on CPU, GPU, and NPU.
The scope of this PR is primarily performance enablement using OpenVINO runtime to accelerate llama.cpp inference while preserving compatibility with the GGUF ecosystem. It’s not introducing a new model conversion flow, so everything remains driven by GGUF models in llama.cpp.
We are currently reviewing this. llama.cpp already has infrastructure for pipeline parallelism, and the OpenVINO backend exposes async operations and events, so it should be possible. Further evaluation is needed to confirm integration details.
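For reference on the OpenVINO side, this is the kind of async primitive being referred to. Below is a minimal, self-contained sketch (not code from this PR) that compiles a trivial model and keeps two inference requests in flight at once; the model and the device choice are placeholders.

// Minimal sketch of OpenVINO's asynchronous inference primitives.
#include <algorithm>
#include <memory>

#include <openvino/openvino.hpp>
#include <openvino/op/parameter.hpp>
#include <openvino/op/relu.hpp>
#include <openvino/op/result.hpp>

int main() {
    // Build a trivial one-op model: y = relu(x).
    auto x   = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{1, 8});
    auto y   = std::make_shared<ov::op::v0::Relu>(x);
    auto res = std::make_shared<ov::op::v0::Result>(y);
    auto model = std::make_shared<ov::Model>(ov::ResultVector{res}, ov::ParameterVector{x});

    ov::Core core;
    ov::CompiledModel compiled = core.compile_model(model, "CPU");  // could also be "GPU" or "NPU"

    // start_async() returns immediately and wait() blocks until that request
    // finishes, so two requests can be in flight at the same time.
    ov::InferRequest r0 = compiled.create_infer_request();
    ov::InferRequest r1 = compiled.create_infer_request();

    ov::Tensor input(ov::element::f32, ov::Shape{1, 8});
    std::fill_n(input.data<float>(), 8, 1.0f);
    r0.set_input_tensor(input);
    r1.set_input_tensor(input);

    r0.start_async();
    r1.start_async();
    r0.wait();
    r1.wait();
    return 0;
}

Overlapping requests like this is the building block that would let an OpenVINO-backed stage run concurrently with other pipeline stages.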
Hey @ravi9, thanks for the detailed answer. It's nice to see more serious work bringing OpenVINO to the rest of the ecosystem.
I can't wait for OpenVINO support to get upstreamed.
Is there any way to build this on Ubuntu 25.04? OpenVINO doesn't seem to support this version of Ubuntu. I tried anyway, installing the dependencies manually, but the build fails due to a missing header.
.github/workflows/build.yml (outdated)
        sudo mkdir -p /opt/intel
        wget -O openvino_${OPENVINO_VERSION_MAJOR}.tgz https://storage.openvinotoolkit.org/repositories/openvino/packages/${OPENVINO_VERSION_MAJOR}/linux/openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64.tgz
        tar -xf openvino_${OPENVINO_VERSION_MAJOR}.tgz
        sudo mv openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64 /opt/intel/openvino_${OPENVINO_VERSION_MAJOR}
        rm openvino_${OPENVINO_VERSION_MAJOR}.tgz
        cd /opt/intel/openvino_${OPENVINO_VERSION_MAJOR}
        echo "Y" | sudo -E ./install_dependencies/install_openvino_dependencies.sh && cd -
        sudo ln -s /opt/intel/openvino_${OPENVINO_VERSION_MAJOR} /opt/intel/openvino
    - name: Build
      id: cmake_build
      run: |
        source /opt/intel/openvino/setupvars.sh
Please cache this similarly to vulkan and spacemit SDKs:
llama.cpp/.github/workflows/build.yml, lines 449 to 466 in 8415f61:
    - name: Use Vulkan SDK Cache
      uses: actions/cache@v4
      id: cache-sdk
      with:
        path: ./vulkan_sdk
        key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
    - name: Setup Vulkan SDK
      if: steps.cache-sdk.outputs.cache-hit != 'true'
      uses: ./.github/actions/linux-setup-vulkan
      with:
        path: ./vulkan_sdk
        version: ${{ env.VULKAN_SDK_VERSION }}
    - name: Build
      id: cmake_build
      run: |
        source ./vulkan_sdk/setup-env.sh
llama.cpp/.github/workflows/build-cache.yml, lines 26 to 38 in 8415f61:
    - name: Setup Cache
      uses: actions/cache@v4
      id: cache-sdk
      with:
        path: ./vulkan_sdk
        key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
    - name: Setup Vulkan SDK
      if: steps.cache-sdk.outputs.cache-hit != 'true'
      uses: ./.github/actions/linux-setup-vulkan
      with:
        path: ./vulkan_sdk
        version: ${{ env.VULKAN_SDK_VERSION }}
llama.cpp/.github/actions/linux-setup-vulkan/action.yml, lines 14 to 20 in 8415f61:
    - name: Setup Vulkan SDK
      id: setup
      uses: ./.github/actions/unarchive-tar
      with:
        url: https://sdk.lunarg.com/sdk/download/${{ inputs.version }}/linux/vulkan_sdk.tar.xz
        path: ${{ inputs.path }}
        strip: 1
(add type: z for gzip)
@slaren We have a fix to support Ubuntu 25.04, will update soon.
@slaren: Could you try again? We fixed CMakeLists.txt to resolve the TBB issue.
…e model * Add OpenVINO ADD operator to Llama.cpp. The output is somewhat abnormal and needs further debugging.
…ize, iSWA model not working
Thanks. I was able to build it now, but I get different exceptions when trying to run it.
@slaren Thanks for testing.
Overview
This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference with the existing llama.cpp GGUF model ecosystem and enables performance improvements via OpenVINO's graph compilation and kernel fusion.
Key Features:
New backend implementation in ggml/src/ggml-openvino.
Supported precisions
Supported devices
Tested Models
The following models are validated for functionality.
Accuracy and performance are WIP.
Llama-3.2-1B-Instruct-GGUF
Llama-3.1-8B-Instruct
microsoft/Phi-3-mini-4k-instruct-gguf
Qwen/Qwen2.5-1.5B-Instruct-GGUF
Qwen/Qwen3-8B
openbmb/MiniCPM-1B-sft-bf16
tencent/Hunyuan-7B-Instruct
mistralai/Mistral-7B-Instruct-v0.3
Note: llama-cli and llama-server need to run with --no-warmup for now.
Work in Progress
Notes on quantization support
NOTE: Optimum-Intel converts the FP16/BF16 token embedding tensor and the weight tensor in the last matmul to INT8 with asymmetric, per-channel quantization (config code).
CPU
GPU
NPU