Large Language Models Weight Compression Example

This example demonstrates how to optimize Large Language Models (LLMs) using the NNCF weight compression API. It applies 4/8-bit mixed-precision quantization to the weights of the Linear (fully connected) layers of the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model after converting it into a TorchFX representation. This significantly reduces the model footprint and improves performance with OpenVINO.
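To give a feel for what weight quantization does to a layer's weights, here is a minimal pure-Python sketch of symmetric quantization applied to one small group of values at 4 and 8 bits. It is an illustration only, not NNCF's implementation: the function names (`quantize_group`, `dequantize_group`, `max_error`), the sample weights, and the choice of a symmetric scheme are all assumptions made for this sketch.

```python
# Toy illustration only: symmetric weight quantization in pure Python.
# Not NNCF's implementation; all names and sample values are hypothetical.


def quantize_group(weights, bits):
    """Map one group of floats to signed integers using a shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the symmetric range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate float weights from integers and the shared scale."""
    return [v * scale for v in q]


def max_error(weights, bits):
    """Largest absolute round-trip error for the given bit width."""
    q, scale = quantize_group(weights, bits)
    restored = dequantize_group(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.09, 0.02]
err4 = max_error(weights, 4)
err8 = max_error(weights, 8)
print(f"4-bit max round-trip error: {err4:.4f}")
print(f"8-bit max round-trip error: {err8:.4f}")
```

The 4-bit round trip is coarser than the 8-bit one, which is the usual motivation for a mixed-precision scheme: most layers can take the smaller format while the remaining ones stay at 8 bits.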

Prerequisites

To use this example:

Create a separate Python* environment and activate it:

python3 -m venv nncf_env && source nncf_env/bin/activate

Install dependencies:

pip install -U pip
pip install -r requirements.txt
pip install ../../../../

Run Example

To run the example:

python main.py

The script automatically downloads the dataset and the baseline model, then runs the model with a sample prompt.
