The "Large Scale Deep Learning Models" course focuses on the methodologies and techniques used to train large models on extensive datasets across various data domains, including images, text, and audio. The course provides in-depth coverage of self-supervised learning approaches, which have become crucial for leveraging vast amounts of unlabeled data. Topics include data preprocessing and augmentation for different modalities, architectural considerations for scaling deep learning models, and strategies for distributed and parallel training.
Instructors: Alexander Shabalin, Ildus Sadrtdinov
Classes: Mondays, offline in classroom EH-4, time slots 14:15 - 15:30 and 15:45 - 17:00
Telegram chat for questions and discussion: link
Practical assignments: all assignments are given and checked in the corresponding Teams space. If you don't have access to the Teams space, please write directly to one of the instructors or in the course chat.
Assessment Component 1: written examination, Duration: 60 min, Weight: 50 %
Assessment Component 2: programming assignments, Weight: 50 %
Completion: To pass this module, each assessment component must be passed with a score of at least 45%.
| Date | Number | Topic | Materials |
|---|---|---|---|
| 08.09.25 | 01 | Introduction to the course. Large models, large datasets and self-supervised learning. What to do with a pretrained model? Linear probing, Fine-tuning, in-distribution (ID) and out-of-distribution (OOD) performance. CLIP model, Zero-shot and WiSE-FT (robust weights ensemble; see the interpolation sketch after the schedule). | Fine-tuning distorts features, Comparing pre-training algorithms, CLIP, WiSE-FT, Do ImageNet Classifiers Generalize to ImageNet? |
| 15.09.25 | 02 | Classical pretext tasks for images: inpainting, colorization, jigsaw puzzles | Exemplar, Context Prediction, Inpainting, Jigsaw Puzzles, Colorization, Rotations, Damaged Jigsaw Puzzles, Task Ensemble |
| 22.09.25 | 03 | Modern architectures for images: ViT, DeiT, MLP Mixer, Swin, ConvNeXt, Neighborhood Attention Transformer (NAT). Efficient training & inference: Automatic Mixed Precision (AMP), Data-Parallel and Model-Parallel training (see the AMP sketch after the schedule) | Big Transfer, ViT, DeiT, MLP Mixer, Swin, ConvNeXt, NAT |
| 06.10.25 | 04 | Contrastive learning for images. Mutual information, SimCLR, MoCo, BYOL, SimSiam, DeepCluster, SwAV. Deriving contrastive loss (see the NT-Xent sketch after the schedule) | SimCLR, MoCo, BYOL, SimSiam, DeepCluster, SwAV |
| 13.10.25 | 05 | Self-supervised learning for ViT. Masked image modeling. DINO, BEiT, MAE, MaskFeat. Different approaches for improving contrastive learning. | DINO, BEiT, MAE, MaskFeat, Dense CL, Supervised CL, DiLo, LooC |
| 01.12.25 | 12 | Introduction to audio processing. Text-to-speech (TTS): WaveNet, Tacotron 2, WaveGlow, HiFi-GAN. Automatic Speech Recognition (ASR): CTC Loss, Jasper, Whisper. Self-supervised learning for audio: CPC, Wav2Vec 2.0, HuBERT, Multi-format contrastive learning, BYOL-A, CLAP | WaveNet, Tacotron 2, WaveGlow, HiFi-GAN, CTC Loss, Jasper, Whisper, CPC, Wav2Vec 2.0, HuBERT, Multi-format CL, BYOL-A, CLAP |
| 01.12.25 | 13 | Mode connectivity and Linear mode connectivity (LMC). Ensembling: Deep Ensemble (DE), SSE, FGE, cSGLD, KFAC-Laplace, SWAG, SPRO, StarSSE. Model averaging: SWA, model soups (see the weight-averaging sketch after the schedule). | LMC, LMC in transfer learning, DE, DE and loss landscape, DE and distribution shifts, SSE, FGE, cSGLD, KFAC-Laplace, SWAG, SPRO, DE Equivalent, StarSSE, SWA, model soups |
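To make the WiSE-FT topic from lecture 01 concrete, here is a minimal PyTorch sketch of robust weight interpolation between a zero-shot and a fine-tuned model. It assumes both models share the same architecture and is only an illustration, not the reference implementation used in the assignment.

```python
import copy
import torch

def wise_ft(zero_shot: torch.nn.Module, finetuned: torch.nn.Module,
            alpha: float = 0.5) -> torch.nn.Module:
    """Interpolate zero-shot and fine-tuned weights (WiSE-FT-style ensemble).

    alpha = 0.0 keeps the zero-shot model, alpha = 1.0 the fine-tuned one;
    intermediate values typically trade ID accuracy against OOD robustness.
    """
    merged = copy.deepcopy(zero_shot)
    zs_state, ft_state = zero_shot.state_dict(), finetuned.state_dict()
    merged_state = {}
    for key, zs_param in zs_state.items():
        if zs_param.is_floating_point():
            merged_state[key] = (1.0 - alpha) * zs_param + alpha * ft_state[key]
        else:
            merged_state[key] = ft_state[key]  # integer buffers are copied as-is
    merged.load_state_dict(merged_state)
    return merged
```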
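Lecture 03's Automatic Mixed Precision topic amounts to a small change in the training loop. The sketch below uses PyTorch's `torch.cuda.amp` with a toy model and synthetic data standing in for a real data loader; it assumes a CUDA device and omits data-parallel training for brevity.

```python
import torch
import torch.nn as nn

device = "cuda"  # GradScaler-based AMP assumes a CUDA device
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 underflow

for step in range(10):
    # synthetic batch as a stand-in for a real data loader
    images = torch.randn(8, 3, 224, 224, device=device)
    labels = torch.randint(0, 10, (8,), device=device)

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # forward pass in mixed precision
        loss = nn.functional.cross_entropy(model(images), labels)
    scaler.scale(loss).backward()    # backward on the scaled loss
    scaler.step(optimizer)           # unscales gradients, then optimizer.step()
    scaler.update()                  # adapts the loss scale for the next step
```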
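The contrastive loss derived in lecture 04 fits in a few lines. Below is a minimal sketch of the SimCLR-style NT-Xent loss for two batches of embeddings of the same images under different augmentations; the random tensors in the usage line are placeholders for the output of an encoder with a projection head.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent loss for embeddings z1, z2 of shape (N, d), where z1[i] and z2[i]
    come from two augmentations of the same image."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / temperature                       # (2N, 2N) scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    # the positive for sample i is its other view: i + N (mod 2N)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# usage with random placeholder embeddings (batch of 32, dimension 128)
loss = nt_xent(torch.randn(32, 128), torch.randn(32, 128))
```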
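For lecture 13's model averaging topic, a uniform model soup is simply a parameter-wise mean over several checkpoints of the same architecture. The sketch below illustrates the idea; the checkpoint file names in the comment are hypothetical.

```python
import torch

def uniform_soup(checkpoint_paths: list[str]) -> dict[str, torch.Tensor]:
    """Average the state dicts of several checkpoints with identical architecture."""
    soup: dict[str, torch.Tensor] = {}
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        for key, param in state.items():
            if not param.is_floating_point():
                soup[key] = param  # integer buffers (e.g. counters) are kept as-is
            elif key not in soup:
                soup[key] = param / len(checkpoint_paths)
            else:
                soup[key] = soup[key] + param / len(checkpoint_paths)
    return soup

# hypothetical checkpoints from fine-tuning runs with different hyperparameters:
# model.load_state_dict(uniform_soup(["run1.pt", "run2.pt", "run3.pt"]))
```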
| Number | Release date | Deadline | Topic |
|---|---|---|---|
| 01 | 11.09.2025 | 28.09.2025 23:59 | Robust fine-tuning of CLIP |
| 02 | 01.10.2025 | 18.10.2025 23:59 | Pre-text tasks for images |
| 03 | 28.10.2025 | 16.11.2025 23:59 | Contrastive learning |
| bonus | 02.12.2025 | 09.12.2025 23:59 | Self-supervised learning for audio |