
Conversation

DajanaV (Collaborator) commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#15307

Overview

This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference for the existing llama.cpp GGUF model ecosystem, with performance improvements coming from OpenVINO's graph compilation and kernel fusion.

Key Features:

  • New backend implementation

    • Added OpenVINO backend in ggml/src/ggml-openvino (see the build sketch after this list).
    • Implemented translations for core GGML operations.
  • Supported precisions

    • FP16/BF16 GGUF models supported.
    • Q4_0, Q4_1, Q4_K_M and Q6_K models partially supported (see notes below).
  • Supported devices

    • Intel CPUs
    • Intel integrated and discrete GPUs
    • Intel NPUs (requires UD32+ driver).
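
For reference, a minimal build sketch, assuming the backend follows the usual GGML_<BACKEND> CMake option pattern and that OpenVINO's environment script has been sourced; the option name below is an assumption, not confirmed by this PR text:

```
# Assumed build steps; -DGGML_OPENVINO=ON is a hypothetical option name based on
# the GGML_<BACKEND> convention used by other ggml backends.
source /opt/intel/openvino/setupvars.sh   # typical OpenVINO environment setup
cmake -B build -DGGML_OPENVINO=ON
cmake --build build --config Release -j
```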

For NPUs, -ub 1 is required for llama-cli and llama-server; a smaller context size (e.g., -c 512) is also recommended for better performance.

For llama-bench, -fa 1 is required on all devices (example invocations below).
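
A minimal sketch of the corresponding invocations, with model-q4_0.gguf as a placeholder path; how the CPU/GPU/NPU device is selected is not covered by these flags and follows the backend's own configuration:

```
# NPU: unit micro-batch size is required; a small context (e.g. 512) is recommended.
./llama-cli    -m model-q4_0.gguf -ub 1 -c 512 -p "Hello"
./llama-server -m model-q4_0.gguf -ub 1 -c 512

# All devices: llama-bench requires flash attention to be enabled.
./llama-bench  -m model-q4_0.gguf -fa 1
```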

Tested Models

The following models are validated for functionality; accuracy and performance evaluation is still a work in progress.

Work in Progress

  • Performance and memory optimizations.
  • Broader quantization coverage.
  • Support for additional model architectures.
  • Extensive accuracy testing.

Notes on quantization support

CPU

  • Q4_0, Q4_1, Q4_K_M and Q6_K models are supported.
  • Q6_K tensors (6-bit gs16 sym; see the notation note after the NPU list) are converted to int8 gs16 sym.
  • Q5_K tensors (5-bit gs32 asym) are converted to int8 gs32 asym.

GPU

  • Q4_0, Q4_1, Q4_K_M and Q6_K models are supported.
  • Q6_K tensors (6-bit gs16 sym) are requantized to int8 gs32 sym.
  • Q5_K tensors (5-bit gs32 asym) are converted to int8 gs32 asym.

NPU

  • The main quantization scheme for the supported models in this PR is Q4_0.
  • Q4_0 and Q4_1 tensors are requantized to int4 gs128 sym.
  • Q6_K tensors are dequantized to fp16.
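
As a reading aid for the gsN sym/asym notation above (general background, not taken from this PR): a tensor quantized with group size N stores one scale per group of N consecutive weights, plus a per-group zero-point in the asymmetric case, so that

$$x_i \approx s_g \, q_i \ \ \text{(sym)}, \qquad x_i \approx s_g \, (q_i - z_g) \ \ \text{(asym)},$$

where $q_i$ is the stored int4/int8 value, $s_g$ the scale of group $g$, and $z_g$ its zero-point. For example, int8 gs32 sym keeps one scale per 32 weights and reconstructs each weight as $s_g \, q_i$.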

Other notes:

  • Both Q4_0 and Q4_1 models use Q6_K for the token_embedding tensor and for the weight tensor in the last matmul (in most models this is the same tensor as token_embedding).
  • Q4_0 models will contain some Q4_1 tensors if an imatrix is provided when quantizing the model with the llama-quantize utility.
  • Q4_K_M models additionally contain Q6_K tensors and, only for Phi3 in the validated model list of this PR, Q5_K tensors.

NOTE: Optimum-intel converts the fp16/bf16 token embedding tensor and the weight tensor in the last matmul to int8 asym channel-wise (config code).
