
Conversation

@wine99 commented Aug 14, 2025

Overview

This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference for the existing llama.cpp GGUF model ecosystem and enables performance improvements via OpenVINO's graph compilation and kernel fusion.
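
For orientation, a minimal build sketch is shown below. It assumes the backend is toggled by a GGML_OPENVINO CMake option (an assumption based on the naming of other ggml backends; the exact flag may differ) and that OpenVINO is installed under /opt/intel/openvino, as in the CI snippet quoted later in this thread.

# Load the OpenVINO environment (archive install assumed, as in the CI job quoted below)
source /opt/intel/openvino/setupvars.sh
# Configure and build llama.cpp with the OpenVINO backend (flag name is an assumption)
cmake -B build -DGGML_OPENVINO=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j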

Key Features:

  • New backend implementation
    • Added OpenVINO backend in ggml/src/ggml-openvino.
    • Implemented translations for core GGML operations.
  • Supported precisions
    • FP16/BF16 GGUF models supported.
    • Q4_0, Q4_1, Q4_K_M, Q6_K models partially supported (see notes below).
  • Supported devices
    • Intel CPUs
    • Intel integrated and discrete GPUs
    • Intel NPUs (requires UD32+ driver)

Tested Models

The following models are validated for functionality.

Accuracy and performance are WIP.

Note: llama-cli and llama-server need to be run with --no-warmup for now.
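
As a quick illustration (the model path is only an example, taken from the discussion later in this thread):

llama-cli -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf --no-warmup
llama-server -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf --no-warmup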

Work in Progress

  • Performance and memory optimizations
  • Broader quantization coverage.
  • Support for additional model architectures.
  • Extensive accuracy testing.

Notes on quantization support

  • Both Q4_0 and Q4_1 models use Q6_K for the token_embedding tensor and for the weight tensor in the last matmul (in most models this is the same tensor as the token embedding).
  • Q4_0 models will contain some Q4_1 tensors if an imatrix is provided when quantizing the model with the llama-quantize utility (see the example after this list).
  • Q4_K_M models additionally contain Q6_K tensors and Q5_K tensors (the latter only in Phi3 among the models validated in this PR).
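
A sketch of that quantization path (file names are placeholders; llama-quantize accepts an importance matrix via --imatrix):

# Quantize an FP16 GGUF to Q4_0; supplying an imatrix may cause some tensors to be emitted as Q4_1
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-q4_0.gguf Q4_0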

NOTE: Optimum-intel converts the fp16/bf16 token embedding tensor and the weight tensor in the last matmul to int8 asym channel-wise (config code).

CPU

  • Q4_0, Q4_1, Q4_K_M and Q6_K models are supported.
  • Q6_K tensors (6bit gs16 sym) are converted to int8 gs16 sym.
  • Q5_K tensors (5bit gs32 asym) are converted to int8 gs32 asym.

GPU

  • Q4_0, Q4_1, Q4_K_M and Q6_K models are supported.
  • Q6_K tensors (6bit gs16 sym) are requantized to int8 gs32 sym.
  • Q5_K tensors (5bit gs32 asym) are converted to int8 gs32 asym.

NPU

  • The main quantization scheme for the supported models in this PR is Q4_0.
  • Q4_0 and Q4_1 tensors are requantized to int4 gs128 sym.
  • Q6_K tensors are dequantized to fp16.

@wine99 wine99 marked this pull request as draft August 14, 2025 09:09
@github-actions bot added the documentation, testing, devops, and ggml labels Aug 14, 2025
@SearchSavior commented Aug 19, 2025

Hello,

In the repo https://github.com/yangsu2022/GGUF-to-OpenVINO and the article https://blog.openvino.ai/blog-posts/openvino-genai-supports-gguf-models, only a small set of models is supported.

Will this feature in llama.cpp offer wider GGUF coverage via something like the parameter mapping described here?

https://github.com/yangsu2022/GGUF-to-OpenVINO/blob/405a95e300f8307fb4b779a12d46cf86adf5a441/convert_llama3.1_gguf_to_torch.py#L14

A few other questions:

  • What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

  • Is this PR only trying to bring performance from the OpenVINO runtime to support the llama.cpp use case?

  • Pipeline parallelism is coming in the next release (I think); will that be implemented here for heterogeneous execution in llama.cpp?

Thank you for your work!

@ravi9 commented Aug 21, 2025

Hi @SearchSavior ,

Q: Will this feature in llama.cpp offer wider GGUF coverage via something like parameter mapping?

Instead of converting GGUF models to PyTorch format with parameter mapping, this implementation uses OpenVINO's GGML frontend to directly translate GGML computation graphs to OpenVINO operations at runtime. The translation happens through a comprehensive operation mapping system that covers the core GGML operations. Since it works at the GGML operation level, it should support any model architecture that llama.cpp supports (assuming we map/translate all the GGML operators to OpenVINO.)

Q: What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

The immediate focus is on runtime acceleration: kernel fusion, optimized graph execution, memory optimizations, and hardware scheduling on CPU, GPU, and NPU.

Q: Is this PR only trying to bring performance from the OpenVINO runtime to support the llama.cpp use case?

The scope of this PR is primarily performance enablement using OpenVINO runtime to accelerate llama.cpp inference while preserving compatibility with the GGUF ecosystem. It’s not introducing a new model conversion flow, so everything remains driven by GGUF models in llama.cpp.

Q: Will pipeline parallel / heterogeneous execution be supported here?

We are currently reviewing this. llama.cpp already has infrastructure for pipeline parallelism, and the OpenVINO backend exposes async operations and events, so it should be possible. Further evaluation is needed to confirm integration details.

@SearchSavior commented

Hey @ravi9,

Thanks for the detailed answer. It's nice to see more serious work bringing OpenVINO to the rest of the ecosystem.

@Bionic-Squash commented

I can't wait for OpenVINO support to get upstreamed.

@wine99 wine99 force-pushed the dev_backend_openvino branch 2 times, most recently from e180b86 to 80f0969 Compare September 5, 2025 08:36
@wine99 wine99 force-pushed the dev_backend_openvino branch 2 times, most recently from 76ab76e to 2e1dd8d Compare September 28, 2025 03:25
@wine99 wine99 force-pushed the dev_backend_openvino branch from e727c65 to 66e503b Compare October 11, 2025 05:45
@slaren (Member) commented Oct 11, 2025

Is there any way to build this on Ubuntu 25.04? OpenVINO doesn't seem to support this version of Ubuntu. I tried anyway, installing the dependencies manually, but the build fails due to a missing header tbb/blocked_range.h. The file exists, but the include directory does not seem to be set up correctly.

@wine99 wine99 marked this pull request as ready for review October 14, 2025 00:03
@wine99 wine99 requested review from CISC and slaren as code owners October 14, 2025 00:03
Comment on lines 691 to 703
    sudo mkdir -p /opt/intel
    wget -O openvino_${OPENVINO_VERSION_MAJOR}.tgz https://storage.openvinotoolkit.org/repositories/openvino/packages/${OPENVINO_VERSION_MAJOR}/linux/openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64.tgz
    tar -xf openvino_${OPENVINO_VERSION_MAJOR}.tgz
    sudo mv openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64 /opt/intel/openvino_${OPENVINO_VERSION_MAJOR}
    rm openvino_${OPENVINO_VERSION_MAJOR}.tgz
    cd /opt/intel/openvino_${OPENVINO_VERSION_MAJOR}
    echo "Y" | sudo -E ./install_dependencies/install_openvino_dependencies.sh && cd -
    sudo ln -s /opt/intel/openvino_${OPENVINO_VERSION_MAJOR} /opt/intel/openvino
- name: Build
  id: cmake_build
  run: |
    source /opt/intel/openvino/setupvars.sh
Collaborator review comment:

Please cache this similarly to vulkan and spacemit SDKs:

- name: Use Vulkan SDK Cache
  uses: actions/cache@v4
  id: cache-sdk
  with:
    path: ./vulkan_sdk
    key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}

- name: Setup Vulkan SDK
  if: steps.cache-sdk.outputs.cache-hit != 'true'
  uses: ./.github/actions/linux-setup-vulkan
  with:
    path: ./vulkan_sdk
    version: ${{ env.VULKAN_SDK_VERSION }}

- name: Build
  id: cmake_build
  run: |
    source ./vulkan_sdk/setup-env.sh

- name: Setup Cache
  uses: actions/cache@v4
  id: cache-sdk
  with:
    path: ./vulkan_sdk
    key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}

- name: Setup Vulkan SDK
  if: steps.cache-sdk.outputs.cache-hit != 'true'
  uses: ./.github/actions/linux-setup-vulkan
  with:
    path: ./vulkan_sdk
    version: ${{ env.VULKAN_SDK_VERSION }}

- name: Setup Vulkan SDK
  id: setup
  uses: ./.github/actions/unarchive-tar
  with:
    url: https://sdk.lunarg.com/sdk/download/${{ inputs.version }}/linux/vulkan_sdk.tar.xz
    path: ${{ inputs.path }}
    strip: 1

(add type: z for gzip)

Author reply:

@CISC Thanks for the suggestion. I have made the changes. @ravi9 Please also review this

@ggerganov (Member) commented

@wine99 Could you address the comment by @slaren earlier?

@ravi9 commented Oct 14, 2025

> Is there any way to build this on Ubuntu 25.04? OpenVINO doesn't seem to support this version of Ubuntu. I tried anyway, installing the dependencies manually, but the build fails due to a missing header tbb/blocked_range.h. The file exists, but the include directory does not seem to be set up correctly.

@slaren We have a fix to support Ubuntu 25.04, will update soon.

@ravi9 commented Oct 15, 2025

@slaren: Could you try again? Fixed CMakeLists.txt to resolve the TBB issue.
Also, created a patch to install OV-2025.3 on Ubuntu 25.04.

# Script to Install OpenVINO from archive
wget https://raw.githubusercontent.com/ravi9/misc-scripts/main/openvino/ov-archive-install/install-openvino-from-archive.sh
chmod +x install-openvino-from-archive.sh
./install-openvino-from-archive.sh

@wine99 wine99 force-pushed the dev_backend_openvino branch from ade4a2d to f89292d Compare October 15, 2025 08:06
@slaren (Member) commented Oct 15, 2025

> @slaren: Could you try again? Fixed CMakeLists.txt to resolve the TBB issue.
> Also, created a patch to install OV-2025.3 on Ubuntu 25.04.

Thanks. I was able to build it now, but I get different exceptions when trying to run it.

$ llama-bench -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007fa0509c1b63 in __internal_syscall_cancel (a1=2057930, a2=0, a3=0, a4=0, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: No such file or directory
#2  __syscall_cancel (a1=a1@entry=2057930, a2=a2@entry=0, a3=a3@entry=0, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007fa050a3de9f in __GI___wait4 (pid=pid@entry=2057930, stat_loc=stat_loc@entry=0x0, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x00007fa050a3deeb in __GI___waitpid (pid=pid@entry=2057930, stat_loc=stat_loc@entry=0x0, options=options@entry=0) at ./posix/waitpid.c:38
warning: 38     ./posix/waitpid.c: No such file or directory
#5  0x00007fa050f1cf23 in ggml_print_backtrace () at /home/diego/code/llama.cpp/ggml/src/ggml.c:196
196             waitpid(child_pid, NULL, 0);
#6  0x00007fa050f2b3af in ggml_uncaught_exception () at /home/diego/code/llama.cpp/ggml/src/ggml.cpp:9
9           ggml_print_backtrace();
#7  0x00007fa050d290aa in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fa050d12a9e in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007fa050d29361 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007fa04f9d7d38 in ov::frontend::NotImplementedFailure::create(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/intel/openvino_2025.3/runtime/lib/intel64/libopenvino.so.2530
#11 0x00007fa050797e99 in ov::frontend::ggml::op::translate_permute (context=...) at /usr/include/c++/14/bits/basic_string.tcc:242
242               ~_Guard() { if (_M_guarded) _M_guarded->_M_dispose(); }
#12 0x00007fa0507afa52 in std::__invoke_impl<std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > >, std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > > (*&)(ov::frontend::ggml::NodeContext const&), ov::frontend::ggml::NodeContext const&> (__f=<optimized out>) at /usr/include/c++/14/bits/invoke.h:60
60          __invoke_impl(__invoke_other, _Fn&& __f, _Args&&... __args)
#13 std::__invoke_r<std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > >, std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > > (*&)(ov::frontend::ggml::NodeContext const&), ov::frontend::ggml::NodeContext const&> (__fn=<optimized out>) at /usr/include/c++/14/bits/invoke.h:116
116                                               std::forward<_Args>(__args)...);
#14 std::_Function_handler<std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > >(ov::frontend::ggml::NodeContext const&), std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > > (*)(ov::frontend::ggml::NodeContext const&)>::_M_invoke (__functor=..., __args#0=...) at /usr/include/c++/14/bits/std_function.h:291
291                                          std::forward<_ArgTypes>(__args)...);
#15 0x00007fa0507c3bbb in std::function<std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > >(ov::frontend::ggml::NodeContext const&)>::operator() (this=0x560333e2cb18, __args#0=...) at /usr/include/c++/14/bits/std_function.h:591
591             return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
#16 operator() (__closure=0x7ffc0bc79240, node=std::shared_ptr<ov::frontend::ggml::GgmlDecoder> (use count 3, weak count 0) = {...}) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/openvino/translate_session.cpp:207
207             converted_outputs = it->second(node_context);
#17 0x00007fa0507c43ed in std::__invoke_impl<void, ov::frontend::ggml::TranslateSession::translate_graph(const ov::frontend::InputModel::Ptr&)::<lambda(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>)>&, std::shared_ptr<ov::frontend::ggml::GgmlDecoder> > (__f=...) at /usr/include/c++/14/bits/shared_ptr_base.h:1095
1095          _M_swap(__shared_count& __r) noexcept
#18 std::__invoke_r<void, ov::frontend::ggml::TranslateSession::translate_graph(const ov::frontend::InputModel::Ptr&)::<lambda(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>)>&, std::shared_ptr<ov::frontend::ggml::GgmlDecoder> > (__fn=...) at /usr/include/c++/14/bits/invoke.h:111
111             std::__invoke_impl<__type>(__tag{}, std::forward<_Callable>(__fn),
#19 std::_Function_handler<void(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>), ov::frontend::ggml::TranslateSession::translate_graph(const ov::frontend::InputModel::Ptr&)::<lambda(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>)> >::_M_invoke(const std::_Any_data &, std::shared_ptr<ov::frontend::ggml::GgmlDecoder> &&) (__functor=..., __args#0=...) at /usr/include/c++/14/bits/std_function.h:290
290             return std::__invoke_r<_Res>(*_Base::_M_get_pointer(__functor),
#20 0x00007fa05076935c in std::function<void(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>)>::operator() (this=0x7ffc0bc79240, __args#0=std::shared_ptr<ov::frontend::ggml::GgmlDecoder> (empty) = {...}) at /usr/include/c++/14/bits/std_function.h:591
591             return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
#21 GgmlOvDecoder::visit_subgraph (this=0x5603373a9440, node_visitor=...) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/ggml-decoder.cpp:761
761             node_visitor(decoder);
#22 0x00007fa0507c7143 in ov::frontend::ggml::TranslateSession::translate_graph (this=this@entry=0x7ffc0bc79450, input_model=std::shared_ptr<ov::frontend::InputModel> (use count 5, weak count 0) = {...}) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/openvino/translate_session.cpp:230
230         ggml_model_decoder->visit_subgraph(node_visitor);
#23 0x00007fa0507c8d46 in ov::frontend::ggml::TranslateSession::get_converted_model (this=this@entry=0x7ffc0bc79450) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/openvino/translate_session.cpp:167
167         m_ov_model = translate_graph(m_input_model);
#24 0x00007fa05078127b in ov::frontend::ggml::FrontEnd::convert (model=std::shared_ptr<ov::frontend::InputModel> (use count 5, weak count 0) = {...}, naive=naive@entry=false) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/openvino/frontend.cpp:20
20              converted_model = translate_session.get_converted_model();
#25 0x00007fa0507db371 in openvino_frontend_compute (backend=<optimized out>, cgraph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/utils.cpp:173
173                     model = ov::frontend::ggml::FrontEnd::convert(input_model);
#26 0x00007fa05076d75d in ggml_backend_openvino_graph_compute (backend=<optimized out>, cgraph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/ggml-openvino.cpp:54
54          openvino_frontend_compute(backend, cgraph);
#27 0x00007fa050f32e60 in ggml_backend_sched_compute_splits (sched=0x560333e04ec0) at /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1553
1553                enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);
#28 ggml_backend_sched_graph_compute_async (sched=0x560333e04ec0, graph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1753
1753        return ggml_backend_sched_compute_splits(sched);
#29 0x00007fa05102bf31 in llama_context::graph_compute (this=this@entry=0x5603373ab970, gf=0x5603342642f0, batched=<optimized out>) at /usr/include/c++/14/bits/unique_ptr.h:193
193           pointer    _M_ptr() const noexcept { return std::get<0>(_M_t); }
#30 0x00007fa05102cf8a in llama_context::process_ubatch (this=this@entry=0x5603373ab970, ubatch=..., gtype=gtype@entry=LLM_GRAPH_TYPE_DECODER, mctx=mctx@entry=0x560333e03000, ret=@0x7ffc0bc7db58: 1360825900) at /home/diego/code/llama.cpp/src/llama-context.cpp:784
784         const auto status = graph_compute(res->get_gf(), ubatch.n_tokens > 1);
#31 0x00007fa0510303bf in llama_context::decode (this=0x5603373ab970, batch_inp=...) at /home/diego/code/llama.cpp/src/llama-context.cpp:1088
1088            const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx.get(), status);
#32 0x00007fa05103126f in llama_decode (ctx=<optimized out>, batch=...) at /home/diego/code/llama.cpp/src/llama-context.cpp:2747
2747        const int ret = ctx->decode(batch);
#33 0x0000560326cc51c1 in test_prompt (ctx=ctx@entry=0x5603373ab970, n_prompt=512, n_batch=2048, n_threads=<optimized out>) at /home/diego/code/llama.cpp/tools/llama-bench/llama-bench.cpp:1939
1939            int res = llama_decode(ctx, llama_batch_get_one(tokens.data(), n_tokens));
#34 0x0000560326cc0131 in main (argc=<optimized out>, argv=<optimized out>) at /home/diego/code/llama.cpp/tools/llama-bench/llama-bench.cpp:2115
2115                    bool res = test_prompt(ctx, t.n_prompt, t.n_batch, t.n_threads);
[Inferior 1 (process 2057893) detached]
terminate called after throwing an instance of 'ov::frontend::NotImplementedFailure'
  what():  Check '(op_case == 1 || op_case == 2 || op_case == 3)' failed at openvino/op/permute.cpp:25:
FrontEnd API failed with NotImplementedFailure:
"Unsupported PERMUTE case" is not implemented for this FrontEnd class


$ llama-cli -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf
[...]
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007ff5bae77b63 in __internal_syscall_cancel (a1=2058374, a2=0, a3=0, a4=0, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: No such file or directory
#2  __syscall_cancel (a1=a1@entry=2058374, a2=a2@entry=0, a3=a3@entry=0, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007ff5baef3e9f in __GI___wait4 (pid=pid@entry=2058374, stat_loc=stat_loc@entry=0x0, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x00007ff5baef3eeb in __GI___waitpid (pid=pid@entry=2058374, stat_loc=stat_loc@entry=0x0, options=options@entry=0) at ./posix/waitpid.c:38
warning: 38     ./posix/waitpid.c: No such file or directory
#5  0x00007ff5bb4cff23 in ggml_print_backtrace () at /home/diego/code/llama.cpp/ggml/src/ggml.c:196
196             waitpid(child_pid, NULL, 0);
#6  0x00007ff5bb4de3af in ggml_uncaught_exception () at /home/diego/code/llama.cpp/ggml/src/ggml.cpp:9
9           ggml_print_backtrace();
#7  0x00007ff5bb1df0aa in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ff5bb1c8a9e in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ff5bb1df361 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ff5b936f710 in ov::Exception::create(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/intel/openvino_2025.3/runtime/lib/intel64/libopenvino.so.2530
#11 0x00007ff5b93dee79 in ?? () from /opt/intel/openvino_2025.3/runtime/lib/intel64/libopenvino.so.2530
#12 0x00007ff5bac8fde7 in openvino_frontend_compute (backend=<optimized out>, cgraph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/utils.cpp:223
223         infer_request.infer();
#13 0x00007ff5bac2375d in ggml_backend_openvino_graph_compute (backend=<optimized out>, cgraph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/ggml-openvino.cpp:54
54          openvino_frontend_compute(backend, cgraph);
#14 0x00007ff5bb4e5e60 in ggml_backend_sched_compute_splits (sched=0x56052ad349e0) at /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1553
1553                enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);
#15 ggml_backend_sched_graph_compute_async (sched=0x56052ad349e0, graph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1753
1753        return ggml_backend_sched_compute_splits(sched);
#16 0x00007ff5bb5def31 in llama_context::graph_compute (this=this@entry=0x56052aef8b60, gf=0x56052b1fabb0, batched=<optimized out>) at /usr/include/c++/14/bits/unique_ptr.h:193
193           pointer    _M_ptr() const noexcept { return std::get<0>(_M_t); }
#17 0x00007ff5bb5dff8a in llama_context::process_ubatch (this=this@entry=0x56052aef8b60, ubatch=..., gtype=gtype@entry=LLM_GRAPH_TYPE_DECODER, mctx=mctx@entry=0x560532946fc0, ret=@0x7ffdfd4892e8: -45570912) at /home/diego/code/llama.cpp/src/llama-context.cpp:784
784         const auto status = graph_compute(res->get_gf(), ubatch.n_tokens > 1);
#18 0x00007ff5bb5e33bf in llama_context::decode (this=0x56052aef8b60, batch_inp=...) at /home/diego/code/llama.cpp/src/llama-context.cpp:1088
1088            const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx.get(), status);
#19 0x00007ff5bb5e426f in llama_decode (ctx=<optimized out>, batch=...) at /home/diego/code/llama.cpp/src/llama-context.cpp:2747
2747        const int ret = ctx->decode(batch);
#20 0x00005604f48c2e16 in main (argc=<optimized out>, argv=<optimized out>) at /home/diego/code/llama.cpp/tools/main/main.cpp:671
671                     if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval))) {
[Inferior 1 (process 2058328) detached]
terminate called after throwing an instance of 'ov::Exception'
  what():  Exception from src/inference/src/cpp/infer_request.cpp:223:
Exception from src/plugins/intel_cpu/src/node.cpp:792:
[CPU] Add node with name 'Add_19058' Check 'input_shape[j] == 1' failed at src/plugins/intel_cpu/src/shape_inference/custom/eltwise.cpp:52:
Eltwise shape infer input shapes dim index: 3 mismatch

@ravi9 commented Oct 15, 2025

@slaren Thanks for testing.

  • We are working on fixing llama-bench.
  • For llama-cli and llama-server, please run with --no-warmup for now. The input shapes for the warmup need to be fixed; we are working on a solution for it. For example:
    llama-cli -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf --no-warmup
  • llama-simple should work fine.
