Commit cf0e71c

Merge pull request #10 from anzr299/patch-3
Update README.md with Quantization Docs

2 parents: bba4a01 + 1421921

File tree: 1 file changed, +18 −0 lines changed

examples/openvino/llama/README.md

Lines changed: 18 additions & 0 deletions
@@ -24,6 +24,24 @@ python -m executorch.extension.llm.export.export_llm \
  +base.tokenizer_path="${LLAMA_TOKENIZER:?}"
```

### Compress Model Weights and Export

The OpenVINO backend also offers quantization support for Llama models at export time. The available modes are INT4 group-wise and per-channel weight compression, and INT8 per-channel weight compression. INT4 weight compression is enabled by setting the `pt2e_quantize` option to `openvino_4wo`, as in the command below. The group size can be modified with the `group_size` option; the default group size of 128 is chosen for optimal performance with the NPU.

```
LLAMA_CHECKPOINT=<path/to/model/folder>/consolidated.00.pth
LLAMA_PARAMS=<path/to/model/folder>/params.json
LLAMA_TOKENIZER=<path/to/model/folder>/tokenizer.model

python -m executorch.extension.llm.export.export_llm \
    --config llama3_2_ov_4wo.yaml \
    +backend.openvino.device="CPU" \
    +base.model_class="llama3_2" \
    +pt2e_quantize="openvino_4wo" \
    +base.checkpoint="${LLAMA_CHECKPOINT:?}" \
    +base.params="${LLAMA_PARAMS:?}" \
    +base.tokenizer_path="${LLAMA_TOKENIZER:?}"
```
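
As a sketch of the `group_size` option mentioned above: assuming it is passed as a hydra-style override like the other options (the exact `+group_size` key is an assumption here, not confirmed by this README), an INT4 export with a smaller group size might look like this:

```
# Sketch only: "+group_size=64" is an assumed override key for the
# group-size option described above. Smaller groups can improve
# accuracy at some cost in performance relative to the default of 128.
python -m executorch.extension.llm.export.export_llm \
    --config llama3_2_ov_4wo.yaml \
    +backend.openvino.device="CPU" \
    +base.model_class="llama3_2" \
    +pt2e_quantize="openvino_4wo" \
    +group_size=64 \
    +base.checkpoint="${LLAMA_CHECKPOINT:?}" \
    +base.params="${LLAMA_PARAMS:?}" \
    +base.tokenizer_path="${LLAMA_TOKENIZER:?}"
```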

## Build OpenVINO C++ Runtime with Llama Runner:

First, build the backend libraries by executing the script below in `<executorch_root>/backends/openvino/scripts` folder:

```bash
