OpenVINO Export Llama Support #14022
Conversation
[NNCF] NNCF WC Support
[NNCF] WC Support in OVQuantizer
Update nncf_observers.py
Ok let me review the diff. However, I would expect NPU to significantly outperform CPU

For token generation, it depends on how the NPU accesses memory and thus on the memory bandwidth, doesn't it? Wouldn't the expected gain be in prompt processing? I only see one set of performance numbers, so I assume the numbers given are for token generation rather than processing time for a longer prompt.

Yes, Scott, that's right. But besides the issue you mentioned about the missing prefill numbers, I also see that the NPU is slower than its CPU counterpart.
-DEXECUTORCH_BUILD_EXTENSION_RUNNER_UTIL=ON \
-DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON \
-DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
-DEXECUTORCH_BUILD_OPENVINO_EXECUTOR_RUNNER=ON
If we remove the need for separate runners it will also be easier, since you won't have to maintain your own build scripts. You should really only need to install the deps.
Yes, in openvino_build.sh we are just adding the normal executor now. The build script just makes it easier for users to build the stack with OpenVINO instead of running the cmake commands manually. This is the feedback we got from other users as well when they tried out the OpenVINO build instructions.
Overall I have two major comments:
- Quantization: can we leverage the existing quantization API instead of doing something custom? (see the sketch after this comment)
- Runner: why do we need a separate runner? Can we not use the existing LLM runner?
cc: @jackzhxng
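For context, the "existing quantization API" presumably refers to the stock PT2E flow that ExecuTorch backends plug into. A minimal sketch, assuming a backend-provided quantizer object (e.g. whatever NNCF/OpenVINO quantizer this PR wires up) is in scope; note that the graph-capture entry point has shifted across recent PyTorch releases:

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

def quantize_with_pt2e(model, quantizer, example_inputs):
    # Capture the eager model into an FX graph module first
    # (export_for_training is one of several capture entry points).
    captured = torch.export.export_for_training(model, example_inputs).module()
    prepared = prepare_pt2e(captured, quantizer)  # insert observers
    prepared(*example_inputs)                     # run calibration samples
    return convert_pt2e(prepared)                 # fold observers into q/dq ops
```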
@cavusmustafa please request review again once the comments are addressed

@kimishpatel, thank you for taking the time to review and share your feedback! We'll revise and update based on your suggestions and let you know once the updates are ready. CC: @suryasidd
Everything under examples/ looks good to me
Use default runner for OpenVINO backend as well
@kimishpatel, we have updated the runner code, and it is ready for further review. Please let us know if any additional changes are needed. In the meantime, we are running further experiments with NPU and will share our findings soon. Thank you. cc: @suryasidd
Added one more commit with a very minor change: the 4-bit quantization scheme is now symmetric, to extract more performance for Llama models on NPU, and the ratio is set to 1 to compress all the nodes.
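For reference, a minimal sketch of what those settings correspond to in NNCF's public weight-compression API (the exact integration point inside this PR's quantizer config may differ, so treat this as illustrative):

```python
import nncf

# `model` is assumed to be a model type supported by NNCF weight compression.
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,  # symmetric 4-bit scheme
    ratio=1.0,                               # compress all eligible layers
)
```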
@cavusmustafa don't merge just yet, let me import and do some sanity checks on our internal unit tests

Sure @kimishpatel, thank you for reviewing the PR. Let us know if there are any issues.
@mergennachin has imported this pull request. If you are a Meta employee, you can view this in D84544001.
Summary
This PR includes the changes below:
- Export llama support with the OpenVINO backend
- OpenVINO build script (backends/openvino/scripts/openvino_build.sh)

Release notes: backends
Test plan
The features provided in this PR are tested using the instructions in the readme files below. Export llama functionality with the OpenVINO backend is not part of the backend test suite for the time being due to model file dependencies.
OpenVINO Backend Setup:
https://github.com/cavusmustafa/executorch/blob/openvino_llama_support/backends/openvino/README.md
OpenVINO Backend Example for Llama:
https://github.com/cavusmustafa/executorch/blob/openvino_llama_support/examples/openvino/llama/README.md
Performance Results
Model: meta-llama/Llama-3.2-1B-Instruct
System Config:
Hardware: Intel(R) Core(TM) Ultra 7 258V (Intel Lunar Lake)
Architecture: x86_64, 8 cores, 4.8 GHz max
Memory: 32GB RAM
OS: Ubuntu 24.04
CC: @ynimmaga @suryasidd @anzr299 @daniil-lyakhov @MaximProshin @kimishpatel @cbilgin