This repository reimplements classic, cutting-edge, and otherwise interesting Large Language Models (LLMs) and multimodal models from scratch, with hands-on tutorials and unit tests for each model. The goal is to help beginners understand the key concepts behind these models while learning by doing.
- ViT (Vision Transformer) - Applies the Transformer architecture to image patches for vision tasks.
- CLIP (Contrastive Language-Image Pre-training) - Learns joint image-text representations using contrastive pretraining.
- Stable Diffusion - Generates high-quality images from text prompts using Latent Diffusion Models (LDMs).
- Qwen3MoE - Routes each input token to a small subset of expert networks (Mixture-of-Experts) for faster, more efficient inference.
- Z-Image - Generates images using the Scalable Single-Stream Diffusion Transformer (S3-DiT), which processes text and image tokens together in one transformer, avoiding cross-attention.
- Wan - An open video generation model built on diffusion transformers with a spatio-temporal VAE.
- DeepSeek-OCR - A vision-based OCR model that compresses high-resolution pages into compact vision tokens and decodes them to recover text with high precision, enabling efficient long-context document understanding.
- ...
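To give a flavor of the kind of mechanism these reimplementations cover, here is a minimal sketch of the top-k expert routing idea behind MoE models like Qwen3MoE. This is an illustrative toy in NumPy, not code from this repository: the function and parameter names (`moe_forward`, `gate_w`, `expert_ws`, `k`) are made up for the example, and real implementations add load balancing, batching, and shared experts.

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, k=2):
    """Route one token to its top-k experts (toy sketch, names are illustrative).

    x: (d,) token embedding
    gate_w: (d, n_experts) router weights
    expert_ws: list of n_experts weight matrices, each (d, d)
    """
    logits = x @ gate_w                        # router score for each expert
    topk = np.argsort(logits)[-k:]             # indices of the k highest-scoring experts
    probs = np.exp(logits[topk] - logits[topk].max())
    probs /= probs.sum()                       # softmax over the selected experts only
    # Weighted sum of the chosen experts' outputs; the other experts are skipped,
    # which is where the inference savings come from.
    return sum(p * (x @ expert_ws[i]) for p, i in zip(probs, topk))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.normal(size=d)
gate_w = rng.normal(size=(d, n_experts))
expert_ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, expert_ws, k=2)
print(y.shape)
```

Only `k` of the `n_experts` expert matrices are ever multiplied per token, so compute grows with `k` rather than with the total expert count.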
Step-by-step tutorials and unit tests for each model are coming soon.