Commit cf0e71c

Merge pull request #10 from anzr299/patch-3
Update README.md with Quantization Docs

2 parents: bba4a01 + 1421921

File tree: 1 file changed, +18 −0 lines changed

examples/openvino/llama/README.md

Lines changed: 18 additions & 0 deletions
@@ -24,6 +24,24 @@ python -m executorch.extension.llm.export.export_llm \
  +base.tokenizer_path="${LLAMA_TOKENIZER:?}"
```

### Compress Model Weights and Export

The OpenVINO backend also offers quantization support for Llama models at export time. The available modes are INT4 group-wise and per-channel weight compression, and INT8 per-channel weight compression. INT4 weight compression is enabled by setting the `pt2e_quantize` option to `openvino_4wo`, as in the command below. The group size can be modified with the `group_size` option; the default group size of 128 is chosen for optimal performance with the NPU.

```
LLAMA_CHECKPOINT=<path/to/model/folder>/consolidated.00.pth
LLAMA_PARAMS=<path/to/model/folder>/params.json
LLAMA_TOKENIZER=<path/to/model/folder>/tokenizer.model

python -m executorch.extension.llm.export.export_llm \
    --config llama3_2_ov_4wo.yaml \
    +backend.openvino.device="CPU" \
    +base.model_class="llama3_2" \
    +pt2e_quantize="openvino_4wo" \
    +base.checkpoint="${LLAMA_CHECKPOINT:?}" \
    +base.params="${LLAMA_PARAMS:?}" \
    +base.tokenizer_path="${LLAMA_TOKENIZER:?}"
```
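
As a sketch of the `group_size` option mentioned above: assuming it is passed as a hydra-style override like the other options (the exact `+group_size` key is an assumption here, not confirmed by this README), an INT4 export with a smaller group size might look like this:

```
# Sketch only: "+group_size=64" is an assumed override key for the
# group-size option described above. Smaller groups can improve
# accuracy at some cost in performance relative to the default of 128.
python -m executorch.extension.llm.export.export_llm \
    --config llama3_2_ov_4wo.yaml \
    +backend.openvino.device="CPU" \
    +base.model_class="llama3_2" \
    +pt2e_quantize="openvino_4wo" \
    +group_size=64 \
    +base.checkpoint="${LLAMA_CHECKPOINT:?}" \
    +base.params="${LLAMA_PARAMS:?}" \
    +base.tokenizer_path="${LLAMA_TOKENIZER:?}"
```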

## Build OpenVINO C++ Runtime with Llama Runner:

First, build the backend libraries by executing the script below in `<executorch_root>/backends/openvino/scripts` folder:

```bash
