Add OpenVINO backend #15307
Conversation
Hello, in this repo https://github.com/yangsu2022/GGUF-to-OpenVINO and the article https://blog.openvino.ai/blog-posts/openvino-genai-supports-gguf-models only a small set of models is supported. Will this feature in llama.cpp offer wider GGUF coverage via something like the parameter mapping described there? A few other questions:
Thank you for your work!
Hi @SearchSavior,
Instead of converting GGUF models to PyTorch format with parameter mapping, this implementation uses OpenVINO's GGML frontend to directly translate GGML computation graphs to OpenVINO operations at runtime. The translation happens through a comprehensive operation mapping system that covers the core GGML operations. Since it works at the GGML operation level, it should support any model architecture that llama.cpp supports (assuming all GGML operators are mapped/translated to OpenVINO).
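As an illustration of the idea only (the types, op names, and converter signatures below are hypothetical stand-ins, not the PR's actual interfaces), such a translation layer can be organized as a dispatch table keyed by the GGML operation type, with each entry emitting the corresponding target-graph node:

```cpp
// Sketch of a GGML-op -> OpenVINO-op translation table. All names here are
// illustrative; the real backend would build ov::Node graphs instead of
// the string-based stand-in used for this self-contained example.
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

enum class ggml_op_kind { ADD, MUL_MAT, SOFT_MAX };  // tiny subset for illustration

// Stand-in for an OpenVINO graph node.
struct target_node { std::string desc; };

using converter_fn = std::function<target_node(const target_node&, const target_node&)>;

// One converter per GGML op; full model coverage requires an entry for
// every GGML operator an architecture uses.
static const std::map<ggml_op_kind, converter_fn> converters = {
    {ggml_op_kind::ADD,      [](const target_node& a, const target_node& b) {
        return target_node{"Add(" + a.desc + ", " + b.desc + ")"}; }},
    {ggml_op_kind::MUL_MAT,  [](const target_node& a, const target_node& b) {
        return target_node{"MatMul(" + a.desc + ", " + b.desc + ")"}; }},
    {ggml_op_kind::SOFT_MAX, [](const target_node& a, const target_node&) {
        return target_node{"Softmax(" + a.desc + ")"}; }},
};

// Translate one op; an unmapped op is the unsupported-operator case noted above.
target_node translate(ggml_op_kind op, const target_node& a, const target_node& b) {
    auto it = converters.find(op);
    if (it == converters.end()) throw std::runtime_error("unmapped GGML op");
    return it->second(a, b);
}
```

Because the mapping is per-operator rather than per-architecture, adding coverage for a new model reduces to filling in whichever converters it still lacks.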
The immediate focus is runtime acceleration: kernel fusion, optimized graph execution, memory optimizations, and hardware scheduling on CPU, GPU, and NPU.
The scope of this PR is primarily performance enablement: using the OpenVINO runtime to accelerate llama.cpp inference while preserving compatibility with the GGUF ecosystem. It does not introduce a new model conversion flow, so everything remains driven by GGUF models in llama.cpp.
We are currently reviewing this. llama.cpp already has infrastructure for pipeline parallelism, and the OpenVINO backend exposes async operations and events, so it should be possible. Further evaluation is needed to confirm integration details.
Hey @ravi9, thanks for the detailed answer. It's nice to see more serious work bringing OpenVINO to the rest of the ecosystem.
I can't wait for OpenVINO support to get upstreamed.
Is there any way to build this on Ubuntu 25.04? OpenVINO doesn't seem to support this version of Ubuntu. I tried anyway, installing the dependencies manually, but the build fails due to a missing header.
.github/workflows/build.yml
```yaml
        sudo mkdir -p /opt/intel
        wget -O openvino_${OPENVINO_VERSION_MAJOR}.tgz https://storage.openvinotoolkit.org/repositories/openvino/packages/${OPENVINO_VERSION_MAJOR}/linux/openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64.tgz
        tar -xf openvino_${OPENVINO_VERSION_MAJOR}.tgz
        sudo mv openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64 /opt/intel/openvino_${OPENVINO_VERSION_MAJOR}
        rm openvino_${OPENVINO_VERSION_MAJOR}.tgz
        cd /opt/intel/openvino_${OPENVINO_VERSION_MAJOR}
        echo "Y" | sudo -E ./install_dependencies/install_openvino_dependencies.sh && cd -
        sudo ln -s /opt/intel/openvino_${OPENVINO_VERSION_MAJOR} /opt/intel/openvino
    - name: Build
      id: cmake_build
      run: |
        source /opt/intel/openvino/setupvars.sh
```
Please cache this similarly to vulkan and spacemit SDKs:
llama.cpp/.github/workflows/build.yml
Lines 449 to 466 in 8415f61
```yaml
    - name: Use Vulkan SDK Cache
      uses: actions/cache@v4
      id: cache-sdk
      with:
        path: ./vulkan_sdk
        key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
    - name: Setup Vulkan SDK
      if: steps.cache-sdk.outputs.cache-hit != 'true'
      uses: ./.github/actions/linux-setup-vulkan
      with:
        path: ./vulkan_sdk
        version: ${{ env.VULKAN_SDK_VERSION }}
    - name: Build
      id: cmake_build
      run: |
        source ./vulkan_sdk/setup-env.sh
```
llama.cpp/.github/workflows/build-cache.yml
Lines 26 to 38 in 8415f61
```yaml
    - name: Setup Cache
      uses: actions/cache@v4
      id: cache-sdk
      with:
        path: ./vulkan_sdk
        key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
    - name: Setup Vulkan SDK
      if: steps.cache-sdk.outputs.cache-hit != 'true'
      uses: ./.github/actions/linux-setup-vulkan
      with:
        path: ./vulkan_sdk
        version: ${{ env.VULKAN_SDK_VERSION }}
```
llama.cpp/.github/actions/linux-setup-vulkan/action.yml
Lines 14 to 20 in 8415f61
```yaml
    - name: Setup Vulkan SDK
      id: setup
      uses: ./.github/actions/unarchive-tar
      with:
        url: https://sdk.lunarg.com/sdk/download/${{ inputs.version }}/linux/vulkan_sdk.tar.xz
        path: ${{ inputs.path }}
        strip: 1
```
(add `type: z` for gzip)
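Applied to this PR, the cached setup might look roughly like the following. This is only a sketch mirroring the Vulkan pattern quoted above: the `OPENVINO_VERSION_FULL` variable and a `linux-setup-openvino` action are assumptions, not existing files in the repository.

```yaml
    # Hypothetical OpenVINO analogue of the Vulkan SDK cache steps above.
    - name: Use OpenVINO Cache
      uses: actions/cache@v4
      id: cache-sdk
      with:
        path: ./openvino_sdk
        key: openvino-${{ env.OPENVINO_VERSION_FULL }}-${{ runner.os }}
    - name: Setup OpenVINO
      if: steps.cache-sdk.outputs.cache-hit != 'true'
      uses: ./.github/actions/linux-setup-openvino   # hypothetical action mirroring linux-setup-vulkan
      with:
        path: ./openvino_sdk
        version: ${{ env.OPENVINO_VERSION_FULL }}
```

The hypothetical setup action would wrap `./.github/actions/unarchive-tar` with `type: z`, since the OpenVINO archive is gzip-compressed rather than xz.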
@slaren We have a fix to support Ubuntu 25.04, will update soon.
@slaren: Could you try again? Fixed CMakeLists.txt to resolve the TBB issue.
Thanks. I was able to build it now, but I get different exceptions when trying to run it.
@slaren Thanks for testing.
…ize, iSWA model not working
* Stateless. Fix llama-cli llama-server
* Simplify broadcast op in attention
* Replace get_output_tensor+memcpy with set_output_tensor
* NPU unify PD. Unify dynamic and static dims
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Hitting an error while compiling on Windows: `error C3861: 'unsetenv': identifier not found`. Reason: `unsetenv()` is a POSIX function; it doesn't exist on Windows, so Visual Studio (MSVC) won't recognize it. Proposed fix: use `_putenv_s()` (the Windows equivalent). It is supported by MSVC and achieves the same effect, removing the environment variable from the process environment, which keeps cross-platform compatibility.
Fix - unsetenv()
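A minimal sketch of the proposed portable wrapper (the helper name `llama_unsetenv` is hypothetical, not code from the PR):

```cpp
// Portable "unset environment variable" helper.
// MSVC has no unsetenv(); _putenv_s(name, "") removes the variable from
// the process environment. POSIX systems call unsetenv() directly.
#include <cstdlib>

static int llama_unsetenv(const char * name) {
#ifdef _WIN32
    return _putenv_s(name, "");  // empty value deletes the variable (MSVC CRT)
#else
    return unsetenv(name);       // POSIX
#endif
}
```

Call sites then use `llama_unsetenv("FOO")` unchanged on both platforms instead of calling `unsetenv` directly.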
Overview

This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference with the existing llama.cpp GGUF model ecosystem, enabling performance improvements via OpenVINO's graph compilation and kernel fusion.

Key Features:
- New backend implementation in ggml/src/ggml-openvino
- Supported precisions
- Supported devices

For NPU: prompt processing is currently slow; a smaller context size is recommended for better performance, e.g., -c 512. For llama-bench, -fa 1 is required.

Tested Models

The following models are validated for functionality. Accuracy and performance are WIP.
- Llama-3.2-1B-Instruct-GGUF
- Llama-3.1-8B-Instruct
- microsoft/Phi-3-mini-4k-instruct-gguf
- Qwen/Qwen2.5-1.5B-Instruct-GGUF
- Qwen/Qwen3-8B
- openbmb/MiniCPM-1B-sft-bf16
- tencent/Hunyuan-7B-Instruct
- mistralai/Mistral-7B-Instruct-v0.3

Work in Progress
Notes on quantization support
CPU
GPU
NPU
Other notes:
NOTE: Optimum-Intel converts the fp16/bf16 token embedding tensor and the weight tensor in the last matmul to int8 asymmetric channel-wise quantization (config code).