
Conversation

@wine99 commented Aug 14, 2025

Overview

This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference for the existing llama.cpp GGUF model ecosystem and enables performance improvements via OpenVINO's graph compilation and kernel fusion.
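
For orientation, a minimal build sketch is shown below. It assumes the backend is toggled by a GGML_OPENVINO CMake option (an assumption based on the naming of other ggml backends; the exact flag may differ) and that OpenVINO is installed under /opt/intel/openvino, as in the CI snippet quoted later in this thread.

# Load the OpenVINO environment (archive install assumed, as in the CI job quoted below)
source /opt/intel/openvino/setupvars.sh
# Configure and build llama.cpp with the OpenVINO backend (flag name is an assumption)
cmake -B build -DGGML_OPENVINO=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j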

Key Features:

  • New backend implementation
    • Added OpenVINO backend in ggml/src/ggml-openvino.
    • Implemented translations for core GGML operations.
  • Supported precisions
    • FP16/BF16 GGUF models supported.
    • Q4_0, Q4_1, Q4_K_M, Q6_K models partially supported (see notes below).
  • Supported devices
    • Intel CPUs
    • Intel integrated and discrete GPUs
    • Intel NPUs (requires UD32+ driver)

Tested Models

The following models are validated for functionality.

Accuracy and performance are WIP.

Note: llama-cli and llama-server need to be run with --no-warmup for now.
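
As a quick illustration (the model path is only an example, taken from the discussion later in this thread):

llama-cli -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf --no-warmup
llama-server -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf --no-warmup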

Work in Progress

  • Performance and memory optimizations
  • Broader quantization coverage.
  • Support for additional model architectures.
  • Extensive accuracy testing.

Notes on quantization support

  • Both Q4_0 and Q4_1 models use Q6_K for the token_embedding tensor and for the weight tensor in the last matmul (in most models this is the same tensor as the token embedding).
  • Q4_0 models will contain some Q4_1 tensors if an imatrix is provided when quantizing the model with the llama-quantize utility (see the example after this list).
  • Q4_K_M models additionally contain Q6_K tensors and Q5_K tensors (the latter only in Phi3 among the models validated in this PR).
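
A sketch of that quantization path (file names are placeholders; llama-quantize accepts an importance matrix via --imatrix):

# Quantize an FP16 GGUF to Q4_0; supplying an imatrix may cause some tensors to be emitted as Q4_1
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-q4_0.gguf Q4_0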

NOTE: Optimum-intel converts the fp16/bf16 token embedding tensor and the weight tensor in the last matmul to int8 asym channel-wise (config code).

CPU

  • Q4_0, Q4_1, Q4_K_M and Q6_K models are supported.
  • Q6_K tensors (6bit gs16 sym) are converted to int8 gs16 sym.
  • Q5_K tensors (5bit gs32 asym) are converted to int8 gs32 asym.

GPU

  • Q4_0, Q4_1, Q4_K_M and Q6_K models are supported.
  • Q6_K tensors (6bit gs16 sym) are requantized to int8 gs32 sym.
  • Q5_K tensors (5bit gs32 asym) are converted to int8 gs32 asym.

NPU

  • The main quantization scheme for the supported models in this PR is Q4_0.
  • Q4_0 and Q4_1 tensors are requantized to int4 gs128 sym.
  • Q6_K tensors are dequantized to fp16.

@wine99 wine99 marked this pull request as draft August 14, 2025 09:09
@github-actions bot added the documentation, testing, devops, and ggml labels Aug 14, 2025
@SearchSavior commented Aug 19, 2025

Hello,

In the repo https://github.com/yangsu2022/GGUF-to-OpenVINO and the article https://blog.openvino.ai/blog-posts/openvino-genai-supports-gguf-models, only a small set of models is supported.

Will this feature in llama.cpp offer wider GGUF coverage via something like the parameter mapping described here?

https://github.com/yangsu2022/GGUF-to-OpenVINO/blob/405a95e300f8307fb4b779a12d46cf86adf5a441/convert_llama3.1_gguf_to_torch.py#L14

A few other questions:

  • What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

  • Is this PR only trying to bring performance from the OpenVINO runtime to support the llama.cpp use case?

  • Pipeline parallelism is coming in the next release (I think); will that be implemented here for heterogeneous execution in llama.cpp?

Thank you for your work!

@ravi9 commented Aug 21, 2025

Hi @SearchSavior ,

Q: Will this feature in llama.cpp offer wider GGUF coverage via something like parameter mapping?

Instead of converting GGUF models to PyTorch format with parameter mapping, this implementation uses OpenVINO's GGML frontend to directly translate GGML computation graphs to OpenVINO operations at runtime. The translation happens through a comprehensive operation mapping system that covers the core GGML operations. Since it works at the GGML operation level, it should support any model architecture that llama.cpp supports (assuming we map/translate all the GGML operators to OpenVINO.)

Q: What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

The immediate focus is on runtime acceleration: kernel fusion, optimized graph execution, memory optimizations, and hardware scheduling on CPU, GPU, and NPU.

Q: Is this PR only trying to bring performance from the OpenVINO runtime to support the llama.cpp use case?

The scope of this PR is primarily performance enablement using OpenVINO runtime to accelerate llama.cpp inference while preserving compatibility with the GGUF ecosystem. It’s not introducing a new model conversion flow, so everything remains driven by GGUF models in llama.cpp.

Q: Will pipeline parallel / heterogeneous execution be supported here?

We are currently reviewing this. llama.cpp already has infrastructure for pipeline parallelism, and the OpenVINO backend exposes async operations and events, so it should be possible. Further evaluation is needed to confirm integration details.

@SearchSavior commented

Hey @ravi9,

Thanks for the detailed answer. It's nice to see more serious work bringing OpenVINO to the rest of the ecosystem.

@Bionic-Squash commented

I can't wait for OpenVINO support to get upstreamed.

@wine99 wine99 force-pushed the dev_backend_openvino branch 2 times, most recently from e180b86 to 80f0969 Compare September 5, 2025 08:36
@wine99 wine99 force-pushed the dev_backend_openvino branch 2 times, most recently from 76ab76e to 2e1dd8d Compare September 28, 2025 03:25
@wine99 wine99 force-pushed the dev_backend_openvino branch from e727c65 to 66e503b Compare October 11, 2025 05:45
@slaren (Member) commented Oct 11, 2025

Is there any way to build this on Ubuntu 25.04? OpenVINO doesn't seem to support this version of Ubuntu. I tried anyway, installing the dependencies manually, but the build fails due to a missing header tbb/blocked_range.h. The file exists, but the include directory does not seem to be set up correctly.

@wine99 wine99 marked this pull request as ready for review October 14, 2025 00:03
@wine99 wine99 requested review from CISC and slaren as code owners October 14, 2025 00:03
Comment on lines 691 to 703
    sudo mkdir -p /opt/intel
    wget -O openvino_${OPENVINO_VERSION_MAJOR}.tgz https://storage.openvinotoolkit.org/repositories/openvino/packages/${OPENVINO_VERSION_MAJOR}/linux/openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64.tgz
    tar -xf openvino_${OPENVINO_VERSION_MAJOR}.tgz
    sudo mv openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64 /opt/intel/openvino_${OPENVINO_VERSION_MAJOR}
    rm openvino_${OPENVINO_VERSION_MAJOR}.tgz
    cd /opt/intel/openvino_${OPENVINO_VERSION_MAJOR}
    echo "Y" | sudo -E ./install_dependencies/install_openvino_dependencies.sh && cd -
    sudo ln -s /opt/intel/openvino_${OPENVINO_VERSION_MAJOR} /opt/intel/openvino
- name: Build
  id: cmake_build
  run: |
    source /opt/intel/openvino/setupvars.sh
Collaborator review comment:

Please cache this similarly to vulkan and spacemit SDKs:

- name: Use Vulkan SDK Cache
  uses: actions/cache@v4
  id: cache-sdk
  with:
    path: ./vulkan_sdk
    key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}

- name: Setup Vulkan SDK
  if: steps.cache-sdk.outputs.cache-hit != 'true'
  uses: ./.github/actions/linux-setup-vulkan
  with:
    path: ./vulkan_sdk
    version: ${{ env.VULKAN_SDK_VERSION }}

- name: Build
  id: cmake_build
  run: |
    source ./vulkan_sdk/setup-env.sh

- name: Setup Cache
  uses: actions/cache@v4
  id: cache-sdk
  with:
    path: ./vulkan_sdk
    key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}

- name: Setup Vulkan SDK
  if: steps.cache-sdk.outputs.cache-hit != 'true'
  uses: ./.github/actions/linux-setup-vulkan
  with:
    path: ./vulkan_sdk
    version: ${{ env.VULKAN_SDK_VERSION }}

- name: Setup Vulkan SDK
  id: setup
  uses: ./.github/actions/unarchive-tar
  with:
    url: https://sdk.lunarg.com/sdk/download/${{ inputs.version }}/linux/vulkan_sdk.tar.xz
    path: ${{ inputs.path }}
    strip: 1

(add type: z for gzip)

Author reply:

@CISC Thanks for the suggestion. I have made the changes. @ravi9 Please also review this

@ggerganov (Member) commented

@wine99 Could you address the comment by @slaren earlier?

@ravi9 commented Oct 14, 2025

> Is there any way to build this on Ubuntu 25.04? OpenVINO doesn't seem to support this version of Ubuntu. I tried anyway, installing the dependencies manually, but the build fails due to a missing header tbb/blocked_range.h. The file exists, but the include directory does not seem to be set up correctly.

@slaren We have a fix to support Ubuntu 25.04, will update soon.

@ravi9 commented Oct 15, 2025

@slaren: Could you try again? Fixed CMakeLists.txt to resolve the TBB issue.
Also, created a patch to install OV-2025.3 on Ubuntu 25.04.

# Script to Install OpenVINO from archive
wget https://raw.githubusercontent.com/ravi9/misc-scripts/main/openvino/ov-archive-install/install-openvino-from-archive.sh
chmod +x install-openvino-from-archive.sh
./install-openvino-from-archive.sh

@wine99 wine99 force-pushed the dev_backend_openvino branch from ade4a2d to f89292d Compare October 15, 2025 08:06
@slaren (Member) commented Oct 15, 2025

> @slaren: Could you try again? Fixed CMakeLists.txt to resolve the TBB issue.
> Also, created a patch to install OV-2025.3 on Ubuntu 25.04.

Thanks. I was able to build it now, but I get different exceptions when trying to run it.

$ llama-bench -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007fa0509c1b63 in __internal_syscall_cancel (a1=2057930, a2=0, a3=0, a4=0, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: No such file or directory
#2  __syscall_cancel (a1=a1@entry=2057930, a2=a2@entry=0, a3=a3@entry=0, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007fa050a3de9f in __GI___wait4 (pid=pid@entry=2057930, stat_loc=stat_loc@entry=0x0, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x00007fa050a3deeb in __GI___waitpid (pid=pid@entry=2057930, stat_loc=stat_loc@entry=0x0, options=options@entry=0) at ./posix/waitpid.c:38
warning: 38     ./posix/waitpid.c: No such file or directory
#5  0x00007fa050f1cf23 in ggml_print_backtrace () at /home/diego/code/llama.cpp/ggml/src/ggml.c:196
196             waitpid(child_pid, NULL, 0);
#6  0x00007fa050f2b3af in ggml_uncaught_exception () at /home/diego/code/llama.cpp/ggml/src/ggml.cpp:9
9           ggml_print_backtrace();
#7  0x00007fa050d290aa in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fa050d12a9e in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007fa050d29361 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007fa04f9d7d38 in ov::frontend::NotImplementedFailure::create(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/intel/openvino_2025.3/runtime/lib/intel64/libopenvino.so.2530
#11 0x00007fa050797e99 in ov::frontend::ggml::op::translate_permute (context=...) at /usr/include/c++/14/bits/basic_string.tcc:242
242               ~_Guard() { if (_M_guarded) _M_guarded->_M_dispose(); }
#12 0x00007fa0507afa52 in std::__invoke_impl<std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > >, std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > > (*&)(ov::frontend::ggml::NodeContext const&), ov::frontend::ggml::NodeContext const&> (__f=<optimized out>) at /usr/include/c++/14/bits/invoke.h:60
60          __invoke_impl(__invoke_other, _Fn&& __f, _Args&&... __args)
#13 std::__invoke_r<std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > >, std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > > (*&)(ov::frontend::ggml::NodeContext const&), ov::frontend::ggml::NodeContext const&> (__fn=<optimized out>) at /usr/include/c++/14/bits/invoke.h:116
116                                               std::forward<_Args>(__args)...);
#14 std::_Function_handler<std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > >(ov::frontend::ggml::NodeContext const&), std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > > (*)(ov::frontend::ggml::NodeContext const&)>::_M_invoke (__functor=..., __args#0=...) at /usr/include/c++/14/bits/std_function.h:291
291                                          std::forward<_ArgTypes>(__args)...);
#15 0x00007fa0507c3bbb in std::function<std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > >(ov::frontend::ggml::NodeContext const&)>::operator() (this=0x560333e2cb18, __args#0=...) at /usr/include/c++/14/bits/std_function.h:591
591             return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
#16 operator() (__closure=0x7ffc0bc79240, node=std::shared_ptr<ov::frontend::ggml::GgmlDecoder> (use count 3, weak count 0) = {...}) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/openvino/translate_session.cpp:207
207             converted_outputs = it->second(node_context);
#17 0x00007fa0507c43ed in std::__invoke_impl<void, ov::frontend::ggml::TranslateSession::translate_graph(const ov::frontend::InputModel::Ptr&)::<lambda(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>)>&, std::shared_ptr<ov::frontend::ggml::GgmlDecoder> > (__f=...) at /usr/include/c++/14/bits/shared_ptr_base.h:1095
1095          _M_swap(__shared_count& __r) noexcept
#18 std::__invoke_r<void, ov::frontend::ggml::TranslateSession::translate_graph(const ov::frontend::InputModel::Ptr&)::<lambda(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>)>&, std::shared_ptr<ov::frontend::ggml::GgmlDecoder> > (__fn=...) at /usr/include/c++/14/bits/invoke.h:111
111             std::__invoke_impl<__type>(__tag{}, std::forward<_Callable>(__fn),
#19 std::_Function_handler<void(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>), ov::frontend::ggml::TranslateSession::translate_graph(const ov::frontend::InputModel::Ptr&)::<lambda(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>)> >::_M_invoke(const std::_Any_data &, std::shared_ptr<ov::frontend::ggml::GgmlDecoder> &&) (__functor=..., __args#0=...) at /usr/include/c++/14/bits/std_function.h:290
290             return std::__invoke_r<_Res>(*_Base::_M_get_pointer(__functor),
#20 0x00007fa05076935c in std::function<void(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>)>::operator() (this=0x7ffc0bc79240, __args#0=std::shared_ptr<ov::frontend::ggml::GgmlDecoder> (empty) = {...}) at /usr/include/c++/14/bits/std_function.h:591
591             return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
#21 GgmlOvDecoder::visit_subgraph (this=0x5603373a9440, node_visitor=...) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/ggml-decoder.cpp:761
761             node_visitor(decoder);
#22 0x00007fa0507c7143 in ov::frontend::ggml::TranslateSession::translate_graph (this=this@entry=0x7ffc0bc79450, input_model=std::shared_ptr<ov::frontend::InputModel> (use count 5, weak count 0) = {...}) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/openvino/translate_session.cpp:230
230         ggml_model_decoder->visit_subgraph(node_visitor);
#23 0x00007fa0507c8d46 in ov::frontend::ggml::TranslateSession::get_converted_model (this=this@entry=0x7ffc0bc79450) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/openvino/translate_session.cpp:167
167         m_ov_model = translate_graph(m_input_model);
#24 0x00007fa05078127b in ov::frontend::ggml::FrontEnd::convert (model=std::shared_ptr<ov::frontend::InputModel> (use count 5, weak count 0) = {...}, naive=naive@entry=false) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/openvino/frontend.cpp:20
20              converted_model = translate_session.get_converted_model();
#25 0x00007fa0507db371 in openvino_frontend_compute (backend=<optimized out>, cgraph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/utils.cpp:173
173                     model = ov::frontend::ggml::FrontEnd::convert(input_model);
#26 0x00007fa05076d75d in ggml_backend_openvino_graph_compute (backend=<optimized out>, cgraph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/ggml-openvino.cpp:54
54          openvino_frontend_compute(backend, cgraph);
#27 0x00007fa050f32e60 in ggml_backend_sched_compute_splits (sched=0x560333e04ec0) at /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1553
1553                enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);
#28 ggml_backend_sched_graph_compute_async (sched=0x560333e04ec0, graph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1753
1753        return ggml_backend_sched_compute_splits(sched);
#29 0x00007fa05102bf31 in llama_context::graph_compute (this=this@entry=0x5603373ab970, gf=0x5603342642f0, batched=<optimized out>) at /usr/include/c++/14/bits/unique_ptr.h:193
193           pointer    _M_ptr() const noexcept { return std::get<0>(_M_t); }
#30 0x00007fa05102cf8a in llama_context::process_ubatch (this=this@entry=0x5603373ab970, ubatch=..., gtype=gtype@entry=LLM_GRAPH_TYPE_DECODER, mctx=mctx@entry=0x560333e03000, ret=@0x7ffc0bc7db58: 1360825900) at /home/diego/code/llama.cpp/src/llama-context.cpp:784
784         const auto status = graph_compute(res->get_gf(), ubatch.n_tokens > 1);
#31 0x00007fa0510303bf in llama_context::decode (this=0x5603373ab970, batch_inp=...) at /home/diego/code/llama.cpp/src/llama-context.cpp:1088
1088            const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx.get(), status);
#32 0x00007fa05103126f in llama_decode (ctx=<optimized out>, batch=...) at /home/diego/code/llama.cpp/src/llama-context.cpp:2747
2747        const int ret = ctx->decode(batch);
#33 0x0000560326cc51c1 in test_prompt (ctx=ctx@entry=0x5603373ab970, n_prompt=512, n_batch=2048, n_threads=<optimized out>) at /home/diego/code/llama.cpp/tools/llama-bench/llama-bench.cpp:1939
1939            int res = llama_decode(ctx, llama_batch_get_one(tokens.data(), n_tokens));
#34 0x0000560326cc0131 in main (argc=<optimized out>, argv=<optimized out>) at /home/diego/code/llama.cpp/tools/llama-bench/llama-bench.cpp:2115
2115                    bool res = test_prompt(ctx, t.n_prompt, t.n_batch, t.n_threads);
[Inferior 1 (process 2057893) detached]
terminate called after throwing an instance of 'ov::frontend::NotImplementedFailure'
  what():  Check '(op_case == 1 || op_case == 2 || op_case == 3)' failed at openvino/op/permute.cpp:25:
FrontEnd API failed with NotImplementedFailure:
"Unsupported PERMUTE case" is not implemented for this FrontEnd class


$ llama-cli -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf
[...]
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007ff5bae77b63 in __internal_syscall_cancel (a1=2058374, a2=0, a3=0, a4=0, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: No such file or directory
#2  __syscall_cancel (a1=a1@entry=2058374, a2=a2@entry=0, a3=a3@entry=0, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007ff5baef3e9f in __GI___wait4 (pid=pid@entry=2058374, stat_loc=stat_loc@entry=0x0, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x00007ff5baef3eeb in __GI___waitpid (pid=pid@entry=2058374, stat_loc=stat_loc@entry=0x0, options=options@entry=0) at ./posix/waitpid.c:38
warning: 38     ./posix/waitpid.c: No such file or directory
#5  0x00007ff5bb4cff23 in ggml_print_backtrace () at /home/diego/code/llama.cpp/ggml/src/ggml.c:196
196             waitpid(child_pid, NULL, 0);
#6  0x00007ff5bb4de3af in ggml_uncaught_exception () at /home/diego/code/llama.cpp/ggml/src/ggml.cpp:9
9           ggml_print_backtrace();
#7  0x00007ff5bb1df0aa in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ff5bb1c8a9e in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ff5bb1df361 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ff5b936f710 in ov::Exception::create(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/intel/openvino_2025.3/runtime/lib/intel64/libopenvino.so.2530
#11 0x00007ff5b93dee79 in ?? () from /opt/intel/openvino_2025.3/runtime/lib/intel64/libopenvino.so.2530
#12 0x00007ff5bac8fde7 in openvino_frontend_compute (backend=<optimized out>, cgraph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/utils.cpp:223
223         infer_request.infer();
#13 0x00007ff5bac2375d in ggml_backend_openvino_graph_compute (backend=<optimized out>, cgraph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/ggml-openvino.cpp:54
54          openvino_frontend_compute(backend, cgraph);
#14 0x00007ff5bb4e5e60 in ggml_backend_sched_compute_splits (sched=0x56052ad349e0) at /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1553
1553                enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);
#15 ggml_backend_sched_graph_compute_async (sched=0x56052ad349e0, graph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1753
1753        return ggml_backend_sched_compute_splits(sched);
#16 0x00007ff5bb5def31 in llama_context::graph_compute (this=this@entry=0x56052aef8b60, gf=0x56052b1fabb0, batched=<optimized out>) at /usr/include/c++/14/bits/unique_ptr.h:193
193           pointer    _M_ptr() const noexcept { return std::get<0>(_M_t); }
#17 0x00007ff5bb5dff8a in llama_context::process_ubatch (this=this@entry=0x56052aef8b60, ubatch=..., gtype=gtype@entry=LLM_GRAPH_TYPE_DECODER, mctx=mctx@entry=0x560532946fc0, ret=@0x7ffdfd4892e8: -45570912) at /home/diego/code/llama.cpp/src/llama-context.cpp:784
784         const auto status = graph_compute(res->get_gf(), ubatch.n_tokens > 1);
#18 0x00007ff5bb5e33bf in llama_context::decode (this=0x56052aef8b60, batch_inp=...) at /home/diego/code/llama.cpp/src/llama-context.cpp:1088
1088            const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx.get(), status);
#19 0x00007ff5bb5e426f in llama_decode (ctx=<optimized out>, batch=...) at /home/diego/code/llama.cpp/src/llama-context.cpp:2747
2747        const int ret = ctx->decode(batch);
#20 0x00005604f48c2e16 in main (argc=<optimized out>, argv=<optimized out>) at /home/diego/code/llama.cpp/tools/main/main.cpp:671
671                     if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval))) {
[Inferior 1 (process 2058328) detached]
terminate called after throwing an instance of 'ov::Exception'
  what():  Exception from src/inference/src/cpp/infer_request.cpp:223:
Exception from src/plugins/intel_cpu/src/node.cpp:792:
[CPU] Add node with name 'Add_19058' Check 'input_shape[j] == 1' failed at src/plugins/intel_cpu/src/shape_inference/custom/eltwise.cpp:52:
Eltwise shape infer input shapes dim index: 3 mismatch

@ravi9 commented Oct 15, 2025

@slaren Thanks for testing.

  • We are working on fixing llama-bench.
  • For llama-cli and llama-server, please run with --no-warmup for now. The input shapes for the warmup need to be fixed; we are working on a solution for it. For example:
    llama-cli -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf --no-warmup
  • llama-simple should work fine.
