This repository contains materials and code examples from the DeepLearning.AI short course "Quantization in Depth", instructed by Marc Sun and Younes Belkada from Hugging Face.
"Quantization in Depth" teaches advanced techniques to compress neural network models, reducing their size to a fraction of the original while maintaining performance. By implementing customized quantization methods from scratch, you'll gain deep insights into the tradeoffs between model size and accuracy, enabling faster inference and broader deployment of AI models.
- ⚙️ Implement and compare different variants of linear quantization (symmetric vs. asymmetric mode)
- 🔍 Apply varying granularity levels: per-tensor, per-channel, and per-group quantization
- 🛠️ Build a general-purpose quantizer in PyTorch that can compress any open source model's dense layers by up to 4x
- 📦 Implement weights packing techniques to compress weights from 32 bits to as low as 2 bits
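The first objective above — linear quantization in asymmetric mode — can be sketched in a few lines. The following is a minimal pure-Python illustration (the course implements these operations on PyTorch tensors; plain lists are used here only to keep the example self-contained). In symmetric mode the zero point is simply fixed at 0 and the scale is set by the largest absolute value.

```python
# Minimal sketch of asymmetric linear quantization to int8.
# Asymmetric mode maps the observed float range [r_min, r_max]
# onto the integer range [q_min, q_max] via a scale and a zero point.

def get_scale_and_zero_point(values, q_min=-128, q_max=127):
    """Compute the scale and zero point from the observed value range."""
    r_min, r_max = min(values), max(values)
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = round(q_min - r_min / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, q_min=-128, q_max=127):
    """Round to the nearest integer and clamp to the int8 range."""
    return [max(q_min, min(q_max, round(v / scale + zero_point)))
            for v in values]

def dequantize(q_values, scale, zero_point):
    """Recover approximate float values from the quantized integers."""
    return [scale * (q - zero_point) for q in q_values]

weights = [0.5, -1.2, 3.4, 0.0]
scale, zp = get_scale_and_zero_point(weights)
q = quantize(weights, scale, zp)
recovered = dequantize(q, scale, zp)  # close to the original weights
```

The round-trip error of each value is bounded by the scale, which is why narrowing the range a scale must cover (the motivation for finer granularity, below) improves precision.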
This course includes 18 lessons with 13 code examples:
- 🧠 Introduction - Overview of quantization importance and applications
- 📖 Overview - Core concepts and techniques covered in the course
- 🔢 Quantize and De-quantize a Tensor - Fundamental operations in quantization
- 📏 Get the Scale and Zero Point - Understanding key quantization parameters
- ⚖️ Symmetric vs Asymmetric Mode - Comparing different quantization approaches
- 🎯 Finer Granularity for more Precision - Introduction to granularity concepts
- 📊 Per Channel Quantization - Implementing channel-wise quantization
- 🧩 Per Group Quantization - Implementing group-wise quantization
- 🚀 Quantizing Weights & Activations for Inference - Practical application to inference
- 🛠️ Custom Build an 8-Bit Quantizer - Developing a custom quantization solution
- 🔄 Replace PyTorch layers with Quantized Layers - Practical integration with PyTorch
- 🌐 Quantize any Open Source PyTorch Model - Building a general-purpose solution
- 🤝 Load your Quantized Weights from HuggingFace Hub - Working with the Hugging Face ecosystem
- 📦 Weights Packing - Theory behind extreme compression
- 🧮 Packing 2-bit Weights - Implementation of 2-bit weight compression
- 🔓 Unpacking 2-Bit Weights - Recovering usable weights from compressed format
- 🚧 Beyond Linear Quantization - Introduction to advanced quantization methods
- 🏁 Conclusion - Summary and future directions
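To make the granularity lessons concrete, here is a hedged pure-Python sketch of symmetric per-channel quantization (nested lists stand in for a 2-D weight tensor; the course notebooks use PyTorch). The point: one scale per output channel, instead of one scale for the whole tensor, keeps small-magnitude channels precise even when another channel has large values.

```python
# Sketch of symmetric per-channel quantization of a 2-D weight matrix.
# Symmetric mode: zero_point is 0; the scale is set by the largest
# magnitude in each channel (row).

def symmetric_scale(values, q_max=127):
    return max(abs(v) for v in values) / q_max

def quantize_per_channel(weight_rows, q_max=127):
    """One scale per output channel (row) rather than per tensor."""
    scales = [symmetric_scale(row, q_max) for row in weight_rows]
    q_rows = [[round(v / s) for v in row]
              for row, s in zip(weight_rows, scales)]
    return q_rows, scales

def dequantize_per_channel(q_rows, scales):
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

W = [[0.1, -0.2, 0.05],    # small-magnitude channel
     [10.0, -20.0, 5.0]]   # large-magnitude channel
q, scales = quantize_per_channel(W)
W_hat = dequantize_per_channel(q, scales)
# With a single per-tensor scale, the first row would collapse to zeros;
# per-channel scales preserve it.
```

Per-group quantization takes the same idea one step further, computing a scale for each fixed-size group of values within a row.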
This repository contains 13 code examples that correspond to the course lessons:
- 🔢 Tensor Quantization - Basic quantize/dequantize operations
- 📏 Computing Scale and Zero Point - Determining quantization parameters
- ⚖️ Symmetric vs. Asymmetric Modes - Implementing both quantization modes
- 🎯 Per-Tensor Quantization - Basic granularity implementation
- 📊 Per-Channel Quantization - Channel-wise implementation
- 🧩 Per-Group Quantization - Group-wise implementation
- 🚀 Weights & Activations Quantization - Full inference-ready quantization
- 🛠️ 8-Bit Quantizer Implementation - Complete 8-bit solution
- 🔄 Quantized Layer Replacement - Integration with PyTorch
- 🌐 General-Purpose Model Quantizer - Quantizing any PyTorch model
- 🤝 Hugging Face Integration - Loading and saving quantized weights
- 📦 2-Bit Weight Packing - Ultra-low bit compression implementation
- 🔓 2-Bit Weight Unpacking - Efficient decompression implementation
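The 2-bit packing and unpacking examples boil down to bit manipulation: four 2-bit values fit in one uint8. The sketch below is an illustration of the idea, not the notebooks' exact implementation, and assumes the values are already quantized to the range [0, 3].

```python
# Pack four 2-bit values into one byte (low bits first within each byte),
# then recover them by shifting and masking.

def pack_2bit(values):
    """Pack groups of four 2-bit integers into single bytes."""
    assert len(values) % 4 == 0 and all(0 <= v <= 3 for v in values)
    packed = []
    for i in range(0, len(values), 4):
        byte = 0
        for j, v in enumerate(values[i:i + 4]):
            byte |= v << (2 * j)   # each value occupies its own 2-bit slot
        packed.append(byte)
    return packed

def unpack_2bit(packed):
    """Recover the original 2-bit values by masking out each slot."""
    return [(byte >> (2 * j)) & 0b11 for byte in packed for j in range(4)]

vals = [1, 0, 3, 2, 2, 2, 1, 0]
packed = pack_2bit(vals)          # 8 values -> 2 bytes (4x smaller than int8)
assert unpack_2bit(packed) == vals
```

Storing each value in its own byte would waste 6 of its 8 bits; packing is what turns 2-bit quantization into an actual 16x size reduction relative to 32-bit floats.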
```bash
# Clone the repository
git clone https://github.com/duybaohuynhtan/Quantization-in-Depth.git
cd Quantization-in-Depth

# Install required packages
pip install -r requirements.txt
```

Each code example is presented as a Jupyter notebook:

```bash
jupyter notebook
```

Navigate to the `notebooks/` directory and open the desired example.
The examples use the following package versions:

- 🐍 Python 3.9.18
- ⚡ Accelerate 0.26.1
- 📊 Seaborn 0.13.1
- 🔥 Torch 2.1.1
- 🤗 Transformers 4.35.0
The course is taught by:

- Marc Sun - Machine Learning Engineer at Hugging Face
- Younes Belkada - Machine Learning Engineer at Hugging Face
Special thanks to DeepLearning.AI and Hugging Face for creating such comprehensive learning materials.