A comprehensive learning path for building, compressing, evaluating, and deploying efficient AI models. From fundamentals to advanced techniques, this course combines theoretical knowledge with practical exercises. Perfect for students, engineers, and researchers looking to master efficient AI development.
- 📖 Overview - Course overview
- 📚 Lectures - Comprehensive materials
- 💻 Exercises - Hands-on coding practice
- ⚙️ Setup - Environment configuration
- 🤝 Community - Connect with other learners
- 🗂️ Resources - Detailed references
| Introduction to Efficient AI | |
|---|---|
| 📖 Slides | Introduction to the course concepts |

🎯 Learning Outcomes:
- How does the course work?
- Who is the target audience of the course?
- What are the references for the course?
| Language Model Architectures | |
|---|---|
| 📖 Slides | Learn about LLM building blocks and architectures |
| 🎥 Video | Coming soon |
| 💻 Exercise | Analyze LLM architectures |
🎯 Learning Outcomes: In this chapter, you will learn about the building blocks, variations, and recent advancements in language models (a short KV-cache timing sketch follows this list).
- Foundations of language models: tokens, embeddings,...
- Autoregressive language models: transformer, (flash, multi-head, paged) attention, KV cache,...
- State space language models: continuous, recurrent, convolutional,...
- Diffusion language models: discrete diffusion,...
- Advancements in language models: encoder/decoder, mixture-of-experts,...
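To see why the KV cache matters for efficiency, here is a minimal sketch (it assumes the Hugging Face `transformers` library and the small `gpt2` checkpoint, neither of which is prescribed by the chapter) that times greedy decoding with and without the cache:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tokenizer("Efficient attention matters because", return_tensors="pt")

for use_cache in (False, True):
    start = time.perf_counter()
    with torch.no_grad():
        # greedy decoding; use_cache toggles reuse of past attention keys/values
        model.generate(**inputs, max_new_tokens=64, do_sample=False, use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```

With the cache enabled, each decoding step reuses the attention keys/values of previous tokens instead of recomputing them, so per-token cost stays roughly constant instead of growing with sequence length.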
| Compression of Language Models | |
|---|---|
| 📖 Slides | Learn about model compression techniques |
| 🎥 Video | Coming soon |
| 💻 Exercise | Run LLM on CPU vs GPU |
🎯 Learning Outcomes: In this chapter, you will learn why model compression is needed and get an overview of the main techniques (a short memory-reduction sketch follows this list).
- Why do we need efficient models? Money, time, memory, energy/CO2,...
- How do we compress models? Quantization, pruning, distillation, compilation,...
- How do compression methods help efficiency? Memory reduction, latency reduction,...
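As a concrete illustration of the memory-reduction point, here is a minimal sketch (plain PyTorch with illustrative layer sizes, not one of the course exercises) that applies 8-bit dynamic quantization to linear layers and compares serialized model sizes:

```python
import io
import torch
import torch.nn as nn

# illustrative FP32 model: two large linear layers
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# store Linear weights in int8; activations are quantized on the fly
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def serialized_mb(m: nn.Module) -> float:
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"FP32: {serialized_mb(model):.1f} MB, int8: {serialized_mb(quantized):.1f} MB")
```

Expect roughly a 4x reduction, since each weight goes from 32 bits to 8 bits (plus a small overhead for scales and zero-points).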
| Evaluation of Language Models | |
|---|---|
| 📖 Slides | Learn how to evaluate LLM efficiency |
| 🎥 Video | Coming soon |
| 💻 Exercise | Measure LLM efficiency |
🎯 Learning Outcomes: In this chapter, you will learn how to evaluate the different efficiency aspects of language models (a short measurement sketch follows this list).
- Quality evaluation: perplexity, accuracy,...
- Memory evaluation: #Parameters/#Activations, disk/inference/training memory, scaling laws,...
- Compute evaluation: MAC, FLOP, OP, scaling laws...
- Real-world evaluation: latency, throughput, money, energy,...
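Here is a minimal sketch (plain PyTorch, with an illustrative model and sizes) of two of these measurements: parameter counting with an FP16 memory estimate, and wall-clock latency/throughput:

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).eval()
x = torch.randn(8, 1024)

num_params = sum(p.numel() for p in model.parameters())
print(f"#Parameters: {num_params / 1e6:.1f}M (~{num_params * 2 / 1e6:.1f} MB in FP16)")

with torch.no_grad():
    for _ in range(10):  # warmup runs to exclude one-time setup costs
        model(x)
    start = time.perf_counter()
    for _ in range(100):
        model(x)
latency = (time.perf_counter() - start) / 100
print(f"Latency: {latency * 1e3:.2f} ms/batch, throughput: {x.shape[0] / latency:.0f} samples/s")
```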
| Quantization of Language Models | |
|---|---|
| 📖 Slides | Learn about model quantization methods |
| 🎥 Video | Coming soon |
| 💻 Exercise 1 | Benchmark LLM quantization methods |
| 💻 Exercise 2 | Benchmark LLM bit precision |
| 💻 Exercise 3 | Use data during quantization |
🎯 Learning Outcomes: In this chapter, you will learn how to quantize models, from basic to advanced quantization methods (a sketch of the basic procedure follows this list).
- Foundations of quantization: data types, quantization procedure, static/dynamic, linear/codebook, tensor/channel/group,...
- Advancements in quantization: post-training/quantization-aware training, outlier handling, iterative methods, usage of data,...
- Overview of SOTA quantization: GPTQ, AWQ, HQQ, AQLM, Higgs, Quanto,...
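Here is a minimal sketch of the basic linear (affine) quantization procedure in plain PyTorch; it is a didactic illustration of the quantize/dequantize round trip, not an implementation of any of the SOTA methods listed above:

```python
import torch

def linear_quantize(w: torch.Tensor, bits: int = 8):
    # derive scale and zero-point from the tensor's min/max range
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(w.min() / scale)
    # map to integers and clamp to the representable range
    q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax)
    return q.to(torch.int8), scale, zero_point

def dequantize(q: torch.Tensor, scale, zero_point):
    return scale * (q.float() - zero_point)

w = torch.randn(256, 256)
q, s, z = linear_quantize(w)
print(f"max abs error: {(w - dequantize(q, s, z)).abs().max():.4f}")
```

The reconstruction error is bounded by half a quantization step (scale / 2); advanced methods like GPTQ or AWQ reduce the *task-relevant* error further by using calibration data.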
| Finetuning of Language Models | |
|---|---|
| 📖 Slides | Learn how to finetune models to improve or recover performance |
| 🎥 Video | Coming soon |
| 💻 Exercise | Finetune compressed models |
🎯 Learning Outcomes: In this chapter, you will learn how to finetune models to improve or recover performance (a short LoRA sketch follows this list).
- Foundations of finetuning: finetuning procedure,...
- Advancements in finetuning: finetuning of all parameters, new parameters, selected parameters, quantized parameters,...
- Overview of SOTA finetuning: LoRA, QLoRA, Perp, P-tuning, DiffPruning,...
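Here is a minimal LoRA sketch (it assumes the `peft` and `transformers` libraries and the small `gpt2` checkpoint, which the chapter does not prescribe) showing how adapters shrink the trainable-parameter count:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# attach low-rank adapters to the attention projections ("c_attn" in GPT-2);
# the base weights stay frozen and only the adapters are trained
config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"],
                    lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Because only the small adapter matrices receive gradients, optimizer state and gradient memory shrink accordingly, which is what makes finetuning quantized models (QLoRA-style) practical on modest hardware.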
The lecture content is based on multiple sources (incl. papers, books, and lectures). If you find it helpful, please ⭐ star the repository!
| Topic | Description | Slides |
|---|---|---|
| Introduction | Introduction to efficient AI | slides |
| Architectures for LLMs | Model design and optimization | slides |
| Evaluation for LLMs | Performance metrics and analysis | slides |
| Compression for LLMs | Model size reduction techniques | slides |
| Quantization for LLMs | Precision optimization | slides |
| Finetuning for LLMs | Model adaptation strategies | slides |
💡 Tip: Access the most recent version of the lecture materials through this URL.
Located in the `exercises/` and `solutions/` directories, our hands-on modules include:
| Exercise | Description | Exercise Notebook | Solution Notebook | Difficulty | Hardware |
|---|---|---|---|---|---|
| Core Exercises | | | | | |
| 🔍 Analyze LLM architectures | Study model design patterns and optimization techniques | notebook | solution | 🟢 | CPU |
| 📊 Measure LLM efficiency | Evaluate model performance and resource usage | notebook | solution | 🟢 | CPU |
| ⚖️ Run LLM on CPU vs GPU | Compare usage of CPU and GPU for LLM inference | notebook | solution | 🟡 | CPU+GPU |
| 🔢 Benchmark LLM quantization methods | Analyze impact of different quantization methods | notebook | solution | 🟡 | GPU |
| Advanced Topics | | | | | |
| 📏 Benchmark LLM bit precision | Analyze impact of different bit precisions | notebook | solution | 🔴 | GPU |
| 📊 Use data during quantization | Leverage calibration data for better quantization | notebook | solution | 🔴 | GPU |
| 🎯 Finetune compressed models | Adapt quantized models for specific tasks | notebook | solution | 🔴 | GPU |
You can easily set up your coding environment with the options below. Dependencies are specified in the `pyproject.toml`. More specifically, you can complete the exercises with the `pruna` package and go further with `pruna_pro`: while `pruna` enables productive exploration of efficient AI topics, the `pruna_pro` package lets you address more advanced topics.
```bash
bash setup_exercises.sh
```
```bash
# Install UV if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.cargo/env

# Setup the project
uv python install 3.10
uv sync
uv add pruna_pro==0.2.2.post1 --index-url https://prunaai.pythonanywhere.com/simple/

# Activate the environment
source .venv/bin/activate
```
- Hugging Face Token:
  - Set your Hugging Face access token as an environment variable so you can download models and datasets:

    ```bash
    export HF_TOKEN=your_huggingface_token
    ```

    You can find or create your token at https://huggingface.co/settings/tokens.
  - Do not forget to log in to Hugging Face and accept the model terms if you want to access gated models:

    ```bash
    hf auth login --token $HF_TOKEN --add-to-git-credential
    ```

  - Downloaded models can take significant disk space. We recommend updating your cache directory so that they do not fill your disk:

    ```bash
    export CACHE_PATH="<path_to_cache>"
    export TORCH_HOME="$CACHE_PATH"
    export HF_HOME="$CACHE_PATH"
    export HUGGINGFACE_HUB_CACHE="$CACHE_PATH"
    export HUGGINGFACE_ASSETS_CACHE="$CACHE_PATH"
    export TRANSFORMERS_CACHE="$CACHE_PATH"
    ```
- Pruna Token (optional): If you want to use advanced features from the `pruna_pro` package, set your Pruna token as an environment variable:

  ```bash
  export PRUNA_TOKEN=your_pruna_token
  ```

  You can obtain a token by signing up at https://pruna.ai.
- Google Colab Integration (optional): All notebooks include Google Colab buttons for free GPU access. Click the "Open in Colab" button on any notebook to get started.
- Free Tier: Tesla T4/K80/P100 GPUs, 12GB RAM, limited hours/day
- Colab Pro ($9.99/month): Priority GPU access, longer runtime, 32GB RAM
- Colab Pro+ ($49.99/month): A100 GPUs, maximum runtime, 52GB RAM
💡 Tip: Use Runtime → Change runtime type → GPU for best performance
- Minimum: Modest GPU (1080Ti, 2080Ti)
- Ideal: High-end GPU (V100, A100)
- Note: Exercises are optimized for accessibility, with 20+ selected small models that work on modest setups.
Note:
- Exercises have been tested with these models but might work with models which are not listed in this table.
- Gated models require authentication with Hugging Face token (HF_TOKEN).
- Estimated memory assumes FP16 precision. Actual memory usage may vary based on implementation and overhead.
- Memory can be further reduced using the quantization techniques covered in the exercises (a sketch of this memory estimate follows this note).
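Here is a minimal sketch of the memory estimate behind this note; the 7B parameter count is illustrative, not a row from the model table:

```python
def estimated_memory_gb(num_params: float, bits_per_param: int) -> float:
    # memory ≈ number of parameters × bytes per parameter
    return num_params * bits_per_param / 8 / 1e9

# hypothetical 7B-parameter model (illustrative only)
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {estimated_memory_gb(7e9, bits):.1f} GB")
# 16-bit (FP16) -> ~14 GB, which is why 8-/4-bit quantization matters on modest GPUs
```

Actual usage is higher than this floor because of activations, the KV cache, and framework overhead.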
Connect with us across platforms:
You can find the main resources in the Awesome AI efficiency repository. It includes a complete set of references, including:
- Facts 📊
- Tools 🛠️
- News Articles 📰
- Reports 📈
- Research Articles 📄
- Blogs 📰
- Books 📚
- Lectures 🎓
- People 🧑‍💻
- Organizations 🏢
⭐ Support the Project: If you find these resources valuable, please star this repository and the Awesome AI efficiency collection!