
Conversation

DajanaV (Collaborator) commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#15307

Overview

This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference for the existing llama.cpp GGUF model ecosystem, with performance improvements coming from OpenVINO's graph compilation and kernel fusion.

Key Features:

  • New backend implementation

    • Added OpenVINO backend in ggml/src/ggml-openvino (see the build sketch after this list).
    • Implemented translations for core GGML operations.
  • Supported precisions

    • FP16/BF16 GGUF models supported.
    • Q4_0, Q4_1, Q4_K_M and Q6_K models partially supported (see notes below).
  • Supported devices

    • Intel CPUs
    • Intel integrated and discrete GPUs
    • Intel NPUs (requires UD32+ driver).
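
For reference, a minimal build sketch, assuming the backend follows the usual GGML_<BACKEND> CMake option pattern and that OpenVINO's environment script has been sourced; the option name below is an assumption, not confirmed by this PR text:

```
# Assumed build steps; -DGGML_OPENVINO=ON is a hypothetical option name based on
# the GGML_<BACKEND> convention used by other ggml backends.
source /opt/intel/openvino/setupvars.sh   # typical OpenVINO environment setup
cmake -B build -DGGML_OPENVINO=ON
cmake --build build --config Release -j
```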

For NPUs, -ub 1 is required for llama-cli and llama-server; a smaller context size (e.g., -c 512) is also recommended for better performance.

For llama-bench, -fa 1 is required on all devices (example invocations below).
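
A minimal sketch of the corresponding invocations, with model-q4_0.gguf as a placeholder path; how the CPU/GPU/NPU device is selected is not covered by these flags and follows the backend's own configuration:

```
# NPU: unit micro-batch size is required; a small context (e.g. 512) is recommended.
./llama-cli    -m model-q4_0.gguf -ub 1 -c 512 -p "Hello"
./llama-server -m model-q4_0.gguf -ub 1 -c 512

# All devices: llama-bench requires flash attention to be enabled.
./llama-bench  -m model-q4_0.gguf -fa 1
```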

Tested Models

The following models are validated for functionality; accuracy and performance evaluation is still a work in progress.

Work in Progress

  • Performance and memory optimizations.
  • Broader quantization coverage.
  • Support for additional model architectures.
  • Extensive accuracy testing.

Notes on quantization support

CPU

  • Q4_0, Q4_1, Q4_K_M and Q6_K models are supported.
  • Q6_K tensors (6-bit gs16 sym; see the notation note after the NPU list) are converted to int8 gs16 sym.
  • Q5_K tensors (5-bit gs32 asym) are converted to int8 gs32 asym.

GPU

  • Q4_0, Q4_1, Q4_K_M and Q6_K models are supported.
  • Q6_K tensors (6-bit gs16 sym) are requantized to int8 gs32 sym.
  • Q5_K tensors (5-bit gs32 asym) are converted to int8 gs32 asym.

NPU

  • The main quantization scheme for the supported models in this PR is Q4_0.
  • Q4_0 and Q4_1 tensors are requantized to int4 gs128 sym.
  • Q6_K tensors are dequantized to fp16.
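
As a reading aid for the gsN sym/asym notation above (general background, not taken from this PR): a tensor quantized with group size N stores one scale per group of N consecutive weights, plus a per-group zero-point in the asymmetric case, so that

$$x_i \approx s_g \, q_i \ \ \text{(sym)}, \qquad x_i \approx s_g \, (q_i - z_g) \ \ \text{(asym)},$$

where $q_i$ is the stored int4/int8 value, $s_g$ the scale of group $g$, and $z_g$ its zero-point. For example, int8 gs32 sym keeps one scale per 32 weights and reconstructs each weight as $s_g \, q_i$.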

Other notes:

  • Both Q4_0 and Q4_1 models use Q6_K for the token_embedding tensor and for the weight tensor in the last matmul (in most models this is the same tensor as token_embedding).
  • Q4_0 models will contain some Q4_1 tensors if an imatrix is provided when quantizing the model with the llama-quantize utility.
  • Q4_K_M models additionally contain Q6_K tensors and, only for Phi3 in the validated model list of this PR, Q5_K tensors.

NOTE: Optimum-intel converts the fp16/bf16 token embedding tensor and the weight tensor in the last matmul to int8 asym channel-wise (config code).
