This folder contains examples of Olive recipes for Phi-4-mini-instruct optimization.

The Olive recipe `microsoft-Phi-4-mini-instruct_nvmo_ptq_mixed_precision_awq_lite.json` produces an INT4 + INT8 mixed-precision quantized model using NVIDIA's TensorRT Model Optimizer toolkit with the AWQ algorithm.
## Install Olive with NVIDIA TensorRT Model Optimizer toolkit

- Run the following command to install Olive with TensorRT Model Optimizer:

  ```bash
  pip install olive-ai[nvmo]
  ```
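After installing, a quick sanity check can confirm the `olive-ai` package is importable. This is a minimal sketch using only a plain import; it prints a diagnostic instead of raising if the package is missing:

```python
# Sanity check (sketch): confirm the olive-ai package is importable and
# report its version; fall back to a diagnostic message if it is missing.
try:
    import olive
    olive_ok = True
    olive_version = getattr(olive, "__version__", "unknown")
except ImportError:
    olive_ok = False
    olive_version = None

print("olive importable:", olive_ok, "version:", olive_version)
```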
- If TensorRT Model Optimizer needs to be installed from a local wheel, follow these steps instead:

  ```bash
  pip install olive-ai
  pip install <modelopt-wheel>[onnx]
  ```
- Make sure that TensorRT Model Optimizer is installed correctly:

  ```bash
  python -c "from modelopt.onnx.quantization.int4 import quantize as quantize_int4"
  ```

- Refer to the TensorRT Model Optimizer documentation for detailed installation instructions and dependency setup.
## Install suitable onnxruntime and onnxruntime-genai packages

- Install the onnxruntime and onnxruntime-genai packages that have NvTensorRTRTXExecutionProvider support. Refer to the documentation for the NvTensorRtRtx execution provider to set up its dependencies and requirements.
- Note that, by default, TensorRT Model Optimizer comes with onnxruntime-directml, and the onnxruntime-genai-cuda package comes with onnxruntime-gpu. So, in order to use an onnxruntime package with NvTensorRTRTXExecutionProvider support, you might need to uninstall the other installed onnxruntime packages.
- Make sure that, at the end, only one onnxruntime package is installed. Use a command like the following to validate the onnxruntime installation:

  ```bash
  python -c "import onnxruntime as ort; print(ort.get_available_providers())"
  ```
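To double-check that only a single onnxruntime package remains, the installed distributions can be listed directly. This is a minimal sketch using only the Python standard library; it enumerates installed packages whose names start with `onnxruntime`:

```python
# List installed distributions whose name starts with "onnxruntime".
# After removing conflicting packages, exactly one entry should remain.
import importlib.metadata

ort_packages = sorted(
    dist.metadata["Name"]
    for dist in importlib.metadata.distributions()
    if (dist.metadata["Name"] or "").lower().startswith("onnxruntime")
)
print("installed onnxruntime packages:", ort_packages)
```

If more than one name is printed (e.g. both onnxruntime-directml and onnxruntime-gpu), uninstall the extras before proceeding.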
## Install additional requirements

- Install the packages listed in the requirements text file:

  ```bash
  pip install -r requirements-nvmo.txt
  ```
## Run the recipe

```bash
olive run --config microsoft-Phi-4-mini-instruct_nvmo_ptq_mixed_precision_awq_lite.json
```

The Olive recipe `microsoft-Phi-4-mini-instruct_nvmo_ptq_mixed_precision_awq_lite.json` has two passes: (a) ModelBuilder and (b) NVModelOptQuantization. The ModelBuilder pass generates the FP16 model for NvTensorRTRTXExecutionProvider (aka NvTensorRtRtx EP). Subsequently, the NVModelOptQuantization pass performs INT4 + INT8 mixed-precision quantization using the AWQ algorithm with the AWQ Lite calibration method to produce the optimized model.
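For orientation, the two passes described above correspond to a `passes` section in the recipe JSON roughly like the following. This is a hedged sketch only: the pass type names come from the description above, but the pass keys and option names here are illustrative, not copied from the actual recipe file:

```json
{
  "passes": {
    "builder": {
      "type": "ModelBuilder",
      "precision": "fp16"
    },
    "quantization": {
      "type": "NVModelOptQuantization",
      "algorithm": "awq",
      "precision": "int4"
    }
  }
}
```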
In case of any issues related to quantization using the TensorRT Model Optimizer toolkit, refer to its FAQs for potential help or suggestions.