
Efficient AI - Courses

A comprehensive learning path for building, compressing, evaluating, and deploying efficient AI models. From fundamentals to advanced techniques, this course combines theoretical knowledge with practical exercises. Perfect for students, engineers, and researchers looking to master efficient AI development.

Overview

0. Introduction to Efficient AI

📊 Slides: Introduction to the course concepts

🎯 Learning Outcomes:

  • How does the course work?
  • Who is the target audience of the course?
  • What are the references for the course?

1. Language Model Architectures

📊 Slides: Learn about LLM building blocks and architectures
🎥 Video: Coming soon
💻 Exercise: Analyze LLM architectures

🎯 Learning Outcomes: In this chapter, you will learn about the building blocks, variations, and recent advancements in language models (a code sketch follows this list).

  • Foundations of language models: tokens, embeddings,...
  • Autoregressive language models: transformer, (flash, multi-head, paged) attention, KV cache,...
  • State space language models: continuous, recurrent, convolutional,...
  • Diffusion language models: discrete diffusion,...
  • Advancements in language models: encoder/decoder, mixture-of-experts,...
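
As a small illustration of these building blocks, the sketch below runs an autoregressive decoding loop and manages the KV cache by hand, so you can see which tensors are reused between steps. The model choice (facebook/opt-125m, one of the supported models) and greedy decoding are illustrative assumptions, not course requirements.

# A minimal autoregressive decoding loop with an explicit KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

input_ids = tokenizer("Efficient AI is", return_tensors="pt").input_ids
past_key_values = None  # the KV cache: keys/values of tokens already processed

with torch.no_grad():
    for _ in range(20):
        # With a warm cache, only the newest token goes through the model.
        current = input_ids if past_key_values is None else input_ids[:, -1:]
        outputs = model(current, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))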

2. Compression of Language Models

📊 Slides: Learn about model compression techniques
🎥 Video: Coming soon
💻 Exercise: Run LLM on CPU vs GPU

🎯 Learning Outcomes: In this chapter, you will learn why compression is needed and get an overview of the main compression techniques (a short example follows this list).

  • Why do we need efficient models? Money, time, memory, energy/CO2,...
  • How do we compress models? Quantization, pruning, distillation, compilation,...
  • How do compression methods help efficiency? Memory reduction, latency reduction,...
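
To make the quantization bullet concrete, here is a sketch using PyTorch's built-in post-training dynamic quantization to shrink the linear layers of a small LLM to int8 and compare the serialized checkpoint sizes. This is a generic PyTorch example, not the course's reference implementation; the model choice is an assumption.

# Post-training dynamic quantization of Linear layers with plain PyTorch.
import io
import torch
from transformers import AutoModelForCausalLM

def serialized_megabytes(m):
    # Size of the saved state_dict in MB, a simple proxy for disk memory.
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # int8 weights, fp32 activations
)

print(f"fp32 checkpoint: {serialized_megabytes(model):.0f} MB")
print(f"int8 checkpoint: {serialized_megabytes(quantized):.0f} MB")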

3. Evaluation of Language Models

📊 Slides: Learn how to evaluate LLM efficiency
🎥 Video: Coming soon
💻 Exercise: Measure LLM efficiency

🎯 Learning Outcomes: In this chapter, you will learn how to evaluate the different efficiency aspects of language models (a measurement sketch follows this list).

  • Quality evaluation: perplexity, accuracy,...
  • Memory evaluation: #Parameters/#Activations, disk/inference/training memory, scaling laws,...
  • Compute evaluation: MAC, FLOP, OP, scaling laws,...
  • Real-world evaluation: latency, throughput, money, energy,...
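
The real-world bullet is easy to try yourself: the sketch below measures wall-clock latency and tokens per second for greedy generation on CPU. Model, prompt, and token budget are illustrative assumptions; on a GPU you would additionally synchronize the device before reading the clock.

# Measure generation latency and throughput for a small model.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

inputs = tokenizer("Efficiency matters because", return_tensors="pt")
new_tokens = 50

with torch.no_grad():
    model.generate(**inputs, max_new_tokens=5)  # warm-up run
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens,
                   min_new_tokens=new_tokens, do_sample=False)
    latency = time.perf_counter() - start

print(f"latency: {latency:.2f}s, throughput: {new_tokens / latency:.1f} tokens/s")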

4. Quantization of Language Models

📊 Slides: Learn about model quantization methods
🎥 Video: Coming soon
💻 Exercise 1: Benchmark LLM quantization methods
💻 Exercise 2: Benchmark LLM bit precision
💻 Exercise 3: Use data during quantization

🎯 Learning Outcomes: In this chapter, you will learn how to quantize models, from basic to advanced quantization methods (a from-scratch example follows this list).

  • Foundations of quantization: data types, quantization procedure, static/dynamic, linear/codebook, tensor/channel/group,...
  • Advancements in quantization: post-training/quantization-aware training, outlier handling, iterative methods, usage of data,...
  • Overview of SOTA quantization: GPTQ, AWQ, HQQ, AQLM, Higgs, Quanto,...
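
The foundations above fit in a few lines: the sketch below implements plain asymmetric linear quantization with a per-tensor scale and zero-point, then measures the rounding error after dequantization. Per-channel and per-group variants follow the same recipe applied to slices of the tensor.

# From-scratch linear (affine) quantization of a weight tensor to uint8.
import torch

def quantize(x, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)  # step size of the integer grid
    zero_point = torch.round(-x.min() / scale)   # integer that maps back to 0.0
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return q.to(torch.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.float() - zero_point)

w = torch.randn(4, 4)
q, scale, zp = quantize(w)
w_hat = dequantize(q, scale, zp)
print("max abs rounding error:", (w - w_hat).abs().max().item())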

5. Finetuning of Language Models

📊 Slides: Learn how to finetune models to improve or recover performance
🎥 Video: Coming soon
💻 Exercise: Finetune compressed models

🎯 Learning Outcomes: In this chapter, you will learn how to finetune models to improve or recover performance (a LoRA sketch follows this list).

  • Foundations of finetuning: finetuning procedure,...
  • Advancements in finetuning: finetuning of all parameters, new parameters, selected parameters, quantized parameters,...
  • Overview of SOTA finetuning: LoRA, QLoRA, Perp, P-tuning, DiffPruning,...
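
As a taste of the SOTA methods listed above, here is a LoRA sketch using the peft library: wrap a base model with low-rank adapters and check how few parameters remain trainable. The rank, alpha, and target modules are illustrative choices, not course-mandated values.

# Parameter-efficient finetuning: attach LoRA adapters with peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank updates
    lora_alpha=16,                        # scaling applied to the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable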

Lectures

The lecture content is based on multiple sources (incl. papers, books, and lectures). If you find it helpful, please ⭐ star the repository!

| Topic | Description | Slides |
| --- | --- | --- |
| Introduction | Introduction to efficient AI | slides |
| Architectures for LLMs | Model design and optimization | slides |
| Evaluation for LLMs | Performance metrics and analysis | slides |
| Compression for LLMs | Model size reduction techniques | slides |
| Quantization for LLMs | Precision optimization | slides |
| Finetuning for LLMs | Model adaptation strategies | slides |

πŸ’‘ Tip: Access the most recent version of the lecture materials through this URL.

Exercises

Located in exercises/ and solutions/ directories, our hands-on modules include:

| Exercise | Description | Exercise Notebook | Solution Notebook | Difficulty | Hardware |
| --- | --- | --- | --- | --- | --- |
| Core Exercises | | | | | |
| 🔍 Analyze LLM architectures | Study model design patterns and optimization techniques | notebook | solution | 🟢 | CPU |
| 📊 Measure LLM efficiency | Evaluate model performance and resource usage | notebook | solution | 🟢 | CPU |
| ⚖️ Run LLM on CPU vs GPU | Compare usage of CPU and GPU for LLM inference | notebook | solution | 🟡 | CPU+GPU |
| 🔢 Benchmark LLM quantization methods | Analyze impact of different quantization methods | notebook | solution | 🟡 | GPU |
| Advanced Topics | | | | | |
| 🚀 Benchmark LLM bit precision | Analyze impact of different bit precisions | notebook | solution | 🔴 | GPU |
| 📈 Use data during quantization | Leverage calibration data for better quantization | notebook | solution | 🔴 | GPU |
| 🎯 Finetune compressed models | Adapt quantized models for specific tasks | notebook | solution | 🔴 | GPU |

Setup

You can easily set up your coding environment with the options below. Dependencies are specified in the pyproject.toml. More specifically, you can complete the exercises with the pruna package and go further with pruna_pro: while pruna enables productive exploration of efficient AI topics, the pruna_pro package lets you address more advanced ones.

Option 1: Automated Setup (Recommended)

bash setup_exercises.sh

Option 2: Manual Setup with UV

# Install UV if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.cargo/env

# Setup the project
uv python install 3.10
uv sync
uv add pruna_pro==0.2.2.post1 --index-url https://prunaai.pythonanywhere.com/simple/

# Activate the environment
source .venv/bin/activate
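
Once the environment is active, a typical exercise workflow compresses a model in a few lines. The sketch below assumes pruna's SmashConfig/smash interface and the "hqq" quantizer name; check the pruna documentation for the exact API of the version you installed.

# Compress a small LLM with pruna (interface assumed; see the pruna docs).
from transformers import AutoModelForCausalLM
from pruna import SmashConfig, smash

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

smash_config = SmashConfig()
smash_config["quantizer"] = "hqq"  # pick one of the available quantizers

smashed_model = smash(model=model, smash_config=smash_config)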

Configuration

  • Hugging Face Token:

    • Set your Hugging Face access token as an environment variable so you can download models and datasets.
      export HF_TOKEN=your_huggingface_token

    You can find or create your token at https://huggingface.co/settings/tokens.

    • Do not forget to log in to Hugging Face and accept the model terms if you want to access gated models.
      hf auth login --token $HF_TOKEN --add-to-git-credential
    • Downloaded models can take up significant disk space. We recommend updating the cache directories so downloads do not fill your disk:
      export CACHE_PATH="<path_to_cache>"
      export TORCH_HOME="$CACHE_PATH"
      export HF_HOME="$CACHE_PATH" 
      export HUGGINGFACE_HUB_CACHE="$CACHE_PATH"
      export HUGGINGFACE_ASSETS_CACHE="$CACHE_PATH"
      export TRANSFORMERS_CACHE="$CACHE_PATH"
  • Pruna Token (optional): If you want to use advanced features from the pruna_pro package, set your Pruna token as an environment variable:

    export PRUNA_TOKEN=your_pruna_token

    You can obtain a token by signing up at https://pruna.ai.

  • Google Colab Integration (optional): All notebooks include Google Colab buttons for free GPU access. Click the "Open in Colab" button on any notebook to get started.

    • Free Tier: Tesla T4/K80/P100 GPUs, 12GB RAM, limited hours/day
    • Colab Pro ($9.99/month): Priority GPU access, longer runtime, 32GB RAM
    • Colab Pro+ ($49.99/month): A100 GPUs, maximum runtime, 52GB RAM

    πŸ’‘ Tip: Use Runtime β†’ Change runtime type β†’ GPU for best performance

Hardware Requirements

  • Minimum: Modest GPU (1080Ti, 2080Ti)
  • Ideal: High-end GPU (V100, A100)
  • Note: Exercises are designed for accessibility, with 20+ selected small models that run on modest setups.

Supported Models

| Model Name | Parameters | Est. Memory | Access |
| --- | --- | --- | --- |
| facebook/opt-125m | 125M | 250MB | Public |
| facebook/opt-350m | 350M | 700MB | Public |
| facebook/opt-1.3b | 1.3B | 2.6GB | Public |
| facebook/opt-2.7b | 2.7B | 5.4GB | Public |
| meta-llama/Llama-3.2-1B | 1B | 2GB | Gated |
| meta-llama/Llama-3.2-1B-Instruct | 1B | 2GB | Gated |
| meta-llama/Llama-3.2-3B-Instruct | 3B | 6GB | Gated |
| google/gemma-3-1b-it | 1B | 2GB | Gated |
| google/gemma-3-4b-it | 4B | 8GB | Gated |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | 3GB | Public |
| microsoft/Phi-4-mini-instruct | 3.8B | 7.6GB | Public |
| HuggingFaceTB/SmolLM-135M | 135M | 270MB | Public |
| HuggingFaceTB/SmolLM-135M-instruct | 135M | 270MB | Public |
| HuggingFaceTB/SmolLM-360M | 360M | 720MB | Public |
| HuggingFaceTB/SmolLM-360M-Instruct | 360M | 720MB | Public |
| HuggingFaceTB/SmolLM-1.7B | 1.7B | 3.4GB | Public |
| HuggingFaceTB/SmolLM-1.7B-Instruct | 1.7B | 3.4GB | Public |
| HuggingFaceTB/SmolLM2-135M | 135M | 270MB | Public |
| HuggingFaceTB/SmolLM2-135M-Instruct | 135M | 270MB | Public |
| HuggingFaceTB/SmolLM2-360M | 360M | 720MB | Public |
| HuggingFaceTB/SmolLM2-360M-Instruct | 360M | 720MB | Public |
| HuggingFaceTB/SmolLM2-1.7B | 1.7B | 3.4GB | Public |
| HuggingFaceTB/SmolLM2-1.7B-Instruct | 1.7B | 3.4GB | Public |
| PleIAs/Pleias-350m-Preview | 350M | 700MB | Public |
| PleIAs/Pleias-Pico | 350M | 700MB | Public |
| PleIAs/Pleias-1.2b-Preview | 1.2B | 2.4GB | Public |
| PleIAs/Pleias-Nano | 1.2B | 2.4GB | Public |
| PleIAs/Pleias-3b-Preview | 3B | 6GB | Public |

Note:

  • Exercises have been tested with these models but may also work with models not listed in this table.
  • Gated models require authentication with a Hugging Face token (HF_TOKEN).
  • Estimated memory assumes FP16 precision; actual usage may vary with implementation and overhead (see the sketch after this list).
  • Memory can be further reduced using quantization techniques covered in the exercises.
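
The estimates in the table come from a back-of-the-envelope rule: parameters times bytes per parameter, ignoring activations and runtime overhead. A tiny helper (parameter counts are approximate) makes the rule explicit:

# Rough weight-memory estimate: #parameters x bytes per parameter.
def estimated_memory_gb(num_params, bytes_per_param=2):  # 2 bytes = FP16
    return num_params * bytes_per_param / 1e9

for name, params in [("SmolLM2-1.7B", 1.7e9), ("facebook/opt-350m", 0.35e9)]:
    print(f"{name}: ~{estimated_memory_gb(params):.1f} GB in FP16, "
          f"~{estimated_memory_gb(params, bytes_per_param=1):.2f} GB in INT8")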

Community

Connect with us across platforms:

Website X/Twitter Dev.to Reddit Discord Hugging Face Replicate

Resources

You can find the main resources in the Awesome AI efficiency repository, a complete reference collection that includes:

  • Facts πŸ“Š
  • Tools πŸ› οΈ
  • News Articles πŸ“°
  • Reports πŸ“ˆ
  • Research Articles πŸ“„
  • Blogs πŸ“°
  • Books πŸ“š
  • Lectures πŸŽ“
  • People πŸ§‘β€πŸ’»
  • Organizations 🌍

⭐ Support the Project: If you find these resources valuable, please star this repository and the Awesome AI efficiency collection!
