
Conversation

@cavusmustafa (Contributor) commented Sep 5, 2025

Summary

This PR includes the changes below:

Release notes: backends

Test plan

The features provided in this PR are tested using the instructions in the README files below. The Llama export functionality with the OpenVINO backend is not part of the backend test suite for the time being due to model file dependencies.
OpenVINO Backend Setup:
https://github.com/cavusmustafa/executorch/blob/openvino_llama_support/backends/openvino/README.md
OpenVINO Backend Example for Llama:
https://github.com/cavusmustafa/executorch/blob/openvino_llama_support/examples/openvino/llama/README.md

Performance Results

Model: meta-llama/Llama-3.2-1B-Instruct

| Configuration | Tokens per second | ms per token |
| --- | --- | --- |
| XNNPACK CPU FP32 | 14.1 | 68 |
| OpenVINO CPU FP32 | 16.6 | 60.3 |
| OpenVINO CPU INT4 | 63.6 | 15.7 |
| OpenVINO NPU INT4 | 42.9 | 23.3 |
| OpenVINO NPU INT4 Channel Wise Quant | 50.4 | 19.8 |
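
For reference, the two result columns are approximately reciprocal: ms per token is 1000 divided by tokens per second. A minimal sketch of the conversion, reusing the tokens-per-second values from the table above (the reported ms-per-token figures were measured separately, so they do not always match the exact reciprocal):

```python
# Tokens/s <-> ms/token conversion for the configurations reported above.
# Values are copied from the table; ms/token here is the reciprocal, which
# may differ slightly from the independently measured numbers in the table.
results_tok_s = {
    "XNNPACK CPU FP32": 14.1,
    "OpenVINO CPU FP32": 16.6,
    "OpenVINO CPU INT4": 63.6,
    "OpenVINO NPU INT4": 42.9,
    "OpenVINO NPU INT4 Channel Wise Quant": 50.4,
}

for config, tok_s in results_tok_s.items():
    ms_per_token = 1000.0 / tok_s
    print(f"{config:40s} {tok_s:6.1f} tok/s  ~{ms_per_token:5.1f} ms/token")
```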

System Config:
Hardware: Intel(R) Core(TM) Ultra 7 258V (Intel Lunar Lake)
Architecture: x86_64, 8 cores, 4.8 GHz max
Memory: 32GB RAM
OS: Ubuntu 24.04

CC: @ynimmaga @suryasidd @anzr299 @daniil-lyakhov @MaximProshin @kimishpatel @cbilgin

@kimishpatel (Contributor) commented:

Thanks for the PR, I will do a review. I do see some CI failures.
Also, why is NPU INT4 actually slower? Is there a config where the NPU is actually faster than the CPU?

Hey Kimish, for NPU we have updated the numbers. We are seeing much better performance than previously reported with different config options. With channel-wise quantization we are getting about 50 tok/s. We're currently working on additional optimizations to bring NPU performance closer to CPU.

Ok, let me review the diff. However, I would expect the NPU to significantly outperform the CPU.

@swolchok (Contributor) commented Oct 3, 2025

> However, I would expect the NPU to significantly outperform the CPU.

For token generation, it depends on how the NPU accesses memory and thus on memory bandwidth, doesn't it? Wouldn't the expected gain be in prompt processing? I only see one set of performance numbers, so I assume the numbers given are for token generation rather than processing time for a longer prompt.

@kimishpatel (Contributor) commented:

> However, I would expect the NPU to significantly outperform the CPU.
>
> For token generation, it depends on how the NPU accesses memory and thus on memory bandwidth, doesn't it? Wouldn't the expected gain be in prompt processing? I only see one set of performance numbers, so I assume the numbers given are for token generation rather than processing time for a longer prompt.

Yes Scott, that's right. But besides the issue you mentioned about the prefill numbers not being reported, I also see that the NPU is slower than its CPU counterpart.
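
To make the bandwidth argument concrete, here is a rough back-of-envelope sketch (not from the PR) of the memory-bandwidth ceiling on decode throughput; the effective DRAM bandwidth figure is an assumption purely for illustration. Since the CPU and NPU on the same SoC stream weights from the same DRAM, decode speed does not automatically favor the NPU, whereas prompt processing is compute-bound and is where the NPU would be expected to help most.

```python
# Rough ceiling on autoregressive decode speed: each generated token streams
# (roughly) all weights once, so max tokens/s ~ memory bandwidth / weight bytes.
# The bandwidth value is an assumed, illustrative number, not a measurement.
params = 1.2e9                   # ~1.2B parameters (Llama-3.2-1B-Instruct)
assumed_bandwidth_gb_s = 100.0   # assumed effective DRAM bandwidth (illustrative)

for name, bytes_per_param in [("FP32", 4.0), ("INT4", 0.5)]:
    weight_gb = params * bytes_per_param / 1e9
    ceiling_tok_s = assumed_bandwidth_gb_s / weight_gb
    print(f"{name}: ~{weight_gb:.1f} GB of weights -> decode ceiling ~{ceiling_tok_s:.0f} tok/s")
```

Under these assumptions the FP32 ceiling sits close to the measured CPU FP32 numbers, while the INT4 ceiling is far above both the CPU and NPU INT4 measurements, which is consistent with something other than raw DRAM bandwidth limiting those runs.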

-DEXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL=ON \
-DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON \
-DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
-DEXECUTORCH_BUILD_OPENVINO_EXECUTOR_RUNNER=ON \

If we remove the need for separate runners it will also be easier, since you won't have to maintain your own build scripts. You should really only require installing the deps.

Yes, in openvino_build.sh we are just adding the normal executor now. This build script is just to make it easier for users to build the stack with OpenVINO instead of manually running the cmake commands; this is the feedback we got from other users as well when they tried out the OpenVINO build instructions.

@kimishpatel (Contributor) left a review comment:

Overall I have two major comments:

  1. Quantization: can we leverage the existing quantization API instead of doing something custom?
  2. Runner: why do we need a separate runner? Can we not use the existing LLM runner?

cc: @jackzhxng
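
For context, a minimal sketch of the existing PT2E quantization flow that comment 1 refers to (annotate with a backend quantizer, calibrate, convert). The quantizer is left as a parameter because the exact OpenVINO quantizer class used in this PR is not shown in this thread, and the export entry point varies a bit across PyTorch versions:

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

def quantize_with_pt2e(model: torch.nn.Module, example_inputs: tuple, quantizer):
    # Export to an FX graph (entry point varies by PyTorch version),
    # annotate it with the backend-provided quantizer, run a calibration
    # pass on sample inputs, then convert to the quantized graph that the
    # backend partitioner consumes.
    exported = torch.export.export(model, example_inputs).module()
    prepared = prepare_pt2e(exported, quantizer)
    prepared(*example_inputs)   # calibration pass
    return convert_pt2e(prepared)
```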

@kimishpatel (Contributor) commented:

@cavusmustafa please request review again once the comments are addressed

@cavusmustafa (Contributor, Author) commented:

> @cavusmustafa please request review again once the comments are addressed

@kimishpatel, thank you for taking the time to review and share your feedback! We’ll revise and update based on your suggestions and let you know once the updates are ready.

CC: @suryasidd

@jackzhxng (Contributor) left a review comment:

Everything under examples/ looks good to me

@cavusmustafa (Contributor, Author) commented:

> @cavusmustafa please request review again once the comments are addressed

@kimishpatel, we have updated the runner code, and it is ready for further review. Please let us know if any additional changes are needed. In the meantime, we are running further experiments with NPU and will share our findings soon. Thank you.

cc: @suryasidd

@suryasidd (Contributor) commented:

> @kimishpatel, we have updated the runner code, and it is ready for further review. Please let us know if any additional changes are needed. In the meantime, we are running further experiments with NPU and will share our findings soon. Thank you.

Added one more commit with a very minor change: switching the 4-bit quantization scheme to symmetric to extract more performance for Llama models on NPU, and setting the ratio to 1 to compress all the nodes.
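
For readers less familiar with the NNCF terminology used here, a minimal illustrative sketch of what a symmetric 4-bit scheme with ratio 1 (and the channel-wise variant from the performance table) looks like through NNCF weight compression; the exact parameters used in the commit are an assumption:

```python
import nncf

def compress_llama_weights(model, channel_wise: bool = True):
    # INT4_SYM selects the symmetric 4-bit scheme; ratio=1.0 applies it to all
    # eligible weight nodes rather than a mixed INT4/INT8 split; group_size=-1
    # means per-channel (channel-wise) quantization, otherwise grouped.
    return nncf.compress_weights(
        model,
        mode=nncf.CompressWeightsMode.INT4_SYM,
        ratio=1.0,
        group_size=-1 if channel_wise else 128,
    )
```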

@mergennachin (Contributor) commented:

@cavusmustafa, don't merge just yet; let me import and run some sanity checks on our internal unit tests.

@cavusmustafa (Contributor, Author) commented:

> @cavusmustafa, don't merge just yet; let me import and run some sanity checks on our internal unit tests.

Sure @kimishpatel, thank you for reviewing the PR. Let us know of any issues

@meta-codesync bot commented Oct 13, 2025

@mergennachin has imported this pull request. If you are a Meta employee, you can view this in D84544001.

@ynimmaga added the "partner: intel" label (for backend delegation, kernels, demos, etc. from the 3rd-party partner, Intel) on Oct 13, 2025.
The meta-codesync bot merged commit e731449 into pytorch:main on Oct 14, 2025 (258 of 315 checks passed).