Cross-referenced against latest academic research (2023-2025).
[EXISTS]= already implemented,[ ]= not yet implemented.
- [EXISTS] DQN (Deep Q-Network)
- [EXISTS] Double DQN (DDQN)
- [EXISTS] Dueling DQN
- [EXISTS] Rainbow DQN — Combines 6 DQN improvements (prioritized replay, n-step, distributional, noisy nets, dueling, double)
- [EXISTS] C51 (Categorical DQN) — Distributional RL with categorical value distribution
- [EXISTS] QR-DQN (Quantile Regression DQN) — Distributional RL with quantile-based value estimation
- [EXISTS] REINFORCE (with baseline)
- [EXISTS] A2C (Advantage Actor-Critic)
- [EXISTS] PPO (Proximal Policy Optimization)
- [EXISTS] DDPG (Deep Deterministic Policy Gradient)
- [EXISTS] TD3 (Twin Delayed DDPG)
- [EXISTS] SAC (Soft Actor-Critic)
- [EXISTS] TRPO (Trust Region Policy Optimization) — Predecessor to PPO with guaranteed monotonic improvement
- EPO (Evolutionary Policy Optimization) — Hybrid evolutionary + policy gradient (2025)
- [EXISTS] IQL (Implicit Q-Learning)
- [EXISTS] CQL (Conservative Q-Learning)
- [EXISTS] Cal-QL (Calibrated Q-Learning) — Conservative but calibrated values for offline-to-online (NeurIPS 2023)
- [EXISTS] IDQL (Implicit Diffusion Q-Learning) — Diffusion policies + IQL (2023)
- [EXISTS] ReBRAC — Minimalist TD3+BC with design improvements (NeurIPS 2023)
- [EXISTS] EDAC (Ensemble-Diversified Actor Critic) — Diversified Q-ensemble for offline RL
- [EXISTS] TD3+BC — Simple offline RL baseline (TD3 + behavior cloning regularization)
- [EXISTS] AWR (Advantage Weighted Regression) — Simple weighted behavior cloning for offline RL
- [EXISTS] DPO (Direct Preference Optimization)
- [EXISTS] GRPO (Group Relative Policy Optimization)
- [EXISTS] Reward Model (Ensemble-based)
- [EXISTS] KTO (Kahneman-Tversky Optimization) — Binary labels only, prospect theory-based (ICML 2024)
- [EXISTS] SimPO (Simple Preference Optimization) — Reference-free, length-normalized reward (NeurIPS 2024)
- [EXISTS] ORPO (Odds Ratio Preference Optimization) — Monolithic SFT+alignment in one stage (EMNLP 2024)
- [EXISTS] IPO (Identity Preference Optimization) — Bounded DPO with explicit regularization
- [EXISTS] SPIN (Self-Play Fine-Tuning) — Self-play without additional human data (ICML 2024)
- [EXISTS] RLOO (REINFORCE Leave-One-Out) — Simple REINFORCE with leave-one-out baseline (ACL 2024)
- [EXISTS] ReMax — Greedy-decoding baseline for RLHF, 46% memory savings (ICML 2024)
- [EXISTS] RLVR (RL with Verifiable Rewards) — Binary correctness from symbolic verifiers (DeepSeek-R1, 2025)
- [EXISTS] MAPPO (Multi-Agent PPO) — PPO with centralized critic for cooperative MARL
- [EXISTS] QMIX — Monotonic value factorization for cooperative MARL
- [EXISTS] WQMIX (Weighted QMIX) — Relaxed monotonicity via importance weighting
- [EXISTS] MADDPG (Multi-Agent DDPG) — Centralized training, decentralized execution for mixed cooperative-competitive
- [EXISTS] QPLEX — Duplex dueling for complete IGM decomposition
- [EXISTS] DreamerV3 — Universal world model, learns latent dynamics, 150+ tasks (Nature 2025)
- [EXISTS] TD-MPC2 — Scalable MPC with learned latent dynamics (ICLR 2024)
- [EXISTS] MBPO (Model-Based Policy Optimization) — Short branched rollouts for sample efficiency
- [EXISTS] ICM (Intrinsic Curiosity Module)
- [EXISTS] RND (Random Network Distillation) — Exploration via prediction error on random network
- [EXISTS] Go-Explore — Archive-based exploration for hard-exploration tasks
- [EXISTS] Decision Transformer
- [EXISTS] Elastic Decision Transformer (EDT) — Adaptive history length for trajectory stitching (NeurIPS 2023)
- [EXISTS] Online Decision Transformer — DT fine-tuned with online interaction
- [EXISTS] Enhanced Reward Model (ensemble)
- [EXISTS] PRM (Probabilistic Roadmap)
- [EXISTS] Process Reward Model (PRM) — Step-level verification for reasoning (OpenAI, 2023)
- [EXISTS] Outcome Reward Model (ORM) — Final-answer-only reward evaluation
- [EXISTS] Generative Reward Model — LLM-as-judge reward signal (RLAIF)
- [EXISTS] MCTS (Monte Carlo Tree Search)
- [EXISTS] PRM Agent (A* planning)
- [EXISTS] AlphaZero-style Self-Play — Combines MCTS with neural evaluation
- [EXISTS] Multi-Head Attention (with RoPE)
- [EXISTS] Flash Attention
- [EXISTS] Cross Attention
- [EXISTS] Self Attention (with caching)
- [EXISTS] Grouped Query Attention (GQA)
- [EXISTS] Sparse Attention
- [EXISTS] Linear Attention
- [EXISTS] Efficient Attention
- [EXISTS] Sliding Window Attention
- [EXISTS] Ring Attention
- [EXISTS] Differential Attention
- [EXISTS] Latent Attention
- [EXISTS] Context Compression
- [EXISTS] Path Attention
- [EXISTS] Spatial Attention
- [EXISTS] Temporal Attention
- [EXISTS] Unified Attention
- [EXISTS] FlashAttention-3 — Hopper GPU asynchrony + FP8 (2024)
- [EXISTS] PagedAttention — OS-inspired virtual memory for KV cache (vLLM, 2023)
- [EXISTS] Multi-Head Latent Attention (MLA) — Low-rank KV compression, 93% cache reduction (DeepSeek-V2, 2024)
- [EXISTS] Neighborhood Attention (NATTEN) — Sliding-window self-attention for vision (CVPR 2023)
- [EXISTS] SwitchHead (MoE Attention) — Expert routing for attention heads (NeurIPS 2024)
- [EXISTS] Chunked Prefill
- [EXISTS] Mamba (Selective SSM / S6)
- [EXISTS] HGRN (Hierarchical GRN)
- [EXISTS] DeltaNet
- [EXISTS] RetNet
- [EXISTS] Linear RNN
- [EXISTS] Mamba-2 (SSD) — State Space Duality, 2-8x faster than Mamba-1 (ICML 2024)
- [EXISTS] S4 (Structured State Space) — Foundational SSM with HiPPO initialization
- [EXISTS] S4D (Diagonal State Spaces) — Simplified S4 with diagonal matrices
- [EXISTS] S5 (Simplified State Space) — MIMO SSM with parallel scans (ICLR 2023 Oral)
- [EXISTS] Liquid-S4 — Input-dependent state transitions (ICLR 2023)
- [EXISTS] Gated Delta Networks (GDN) — Combines Mamba2 gating + DeltaNet delta rule (ICLR 2025)
- [EXISTS] RWKV-6 (Finch) — RNN with matrix-valued states and dynamic recurrence (COLM 2024)
- [EXISTS] RWKV-7 (Goose) — Generalized delta rule with vector-valued gating (COLM 2025)
- [EXISTS] RWKV (basic WKV component)
- [EXISTS] Griffin — Gated linear recurrences + local attention hybrid (Google DeepMind, 2024)
- [EXISTS] Hyena — Subquadratic attention replacement via long convolutions (ICML 2023)
- [EXISTS] Based — Linear attention + sliding window, 24x throughput vs FlashAttention-2 (ICML 2024)
- [EXISTS] Jamba — Transformer-Mamba-MoE hybrid, 256K context (AI21, ICLR 2025)
- [EXISTS] StripedHyena — Attention-Hyena hybrid, 128K context (Together AI, 2023)
- [EXISTS] Zamba — Mamba backbone + shared attention block (Zyphra, 2024)
- [EXISTS] GoldFinch — RWKV-Transformer hybrid, 756-2550x KV cache compression (2024)
- [EXISTS] RecurrentGemma — Griffin-based open model (Google, 2024)
- [EXISTS] Hawk — Pure gated linear recurrence model (Google, 2024)
- [EXISTS] Sinusoidal PE
- [EXISTS] Learned PE
- [EXISTS] Rotary Embeddings (RoPE)
- [EXISTS] Relative Bias
- [EXISTS] ALiBi
- [EXISTS] Multiscale RoPE
- [EXISTS] YaRN
- [EXISTS] CoPE (Contextual Position Encoding)
- [EXISTS] NTK-Aware RoPE Scaling — Non-uniform frequency scaling for context extension
- [EXISTS] LongRoPE — Evolutionary search for 2M+ token context (Microsoft, 2024)
- [EXISTS] FIRE — Functional interpolation for relative positions (CMU/Google, 2023)
- [EXISTS] Resonance RoPE — Integer-wavelength snapping for better extrapolation (ACL 2024)
- [EXISTS] CLEX — Continuous length extrapolation (EMNLP 2024)
- [EXISTS] MoE Layer / Router / Experts / Gating
- [EXISTS] Enhanced MoE / MoE Transformer
- [EXISTS] Mixture-of-Depths (MoD) — Skip layers per-token via routing (Google DeepMind, 2024)
- [EXISTS] Mixture-of-Agents (MoA) — Multi-LLM layered collaboration (Together AI, 2024)
- [EXISTS] DeepSeek MoE — Shared + routed experts with fine-grained segmentation (2024)
- [EXISTS] Switch Transformer — Top-1 expert routing with load balancing
- [EXISTS] SwitchAll — Fully MoE attention + FFN (NeurIPS 2024)
- [EXISTS] Normalization layers (basic)
- [EXISTS] RMSNorm — Root mean square normalization (standard for modern LLMs)
- [EXISTS] DeepNorm — Scaling to 1000-layer transformers (Microsoft, 2022)
- [EXISTS] QK-Norm — Query-Key normalization for attention stability
- [EXISTS] HybridNorm — Pre-Norm attention + Post-Norm FFN (2025)
- [EXISTS] DyT (Dynamic Tanh) — Normalization-free transformers (CVPR 2025)
- [EXISTS] Custom activations (basic)
- [EXISTS] SwiGLU — Swish-gated linear unit (dominant in modern LLMs)
- [EXISTS] GeGLU — GELU-gated linear unit (used in Gemma)
- [EXISTS] ReGLU — ReLU-gated linear unit
- [EXISTS] Depthwise Separable Conv
- [EXISTS] SE Block
- [EXISTS] Drop Path
- [EXISTS] KV Cache (with prefix caching)
- [EXISTS] Speculative Decoding
- [EXISTS] Continuous Batching
- [EXISTS] Multi-Token Prediction
- [EXISTS] EAGLE-3 — Feature-based speculative decoding, 3-6.5x speedup (NeurIPS 2025)
- [EXISTS] Medusa — Multi-head parallel token prediction (ICML 2024)
- [EXISTS] Lookahead Decoding — Jacobi iteration for parallel n-gram generation (ICML 2024)
- [EXISTS] PagedAttention — Virtual memory for KV cache management (vLLM)
- [EXISTS] Quantized KV Cache — INT8/FP8 KV cache for memory reduction
- [EXISTS] ViT
- [EXISTS] Swin Transformer
- [EXISTS] HiViT
- [EXISTS] EfficientNet
- [EXISTS] Compact CNN
- [EXISTS] ResNet / VGG
- [EXISTS] DINOv2 — Self-supervised ViT with self-distillation (Meta, 2023)
- [EXISTS] SigLIP — Sigmoid loss CLIP, efficient vision-language pre-training (Google, 2023)
- [EXISTS] EVA-02 — MIM pre-trained ViT with RoPE + SwiGLU (BAAI, 2023)
- [EXISTS] InternVL — Open-source multimodal ViT family (CVPR 2024)
- [EXISTS] DETR
- [EXISTS] Faster R-CNN / Cascade R-CNN / Mask R-CNN / Keypoint R-CNN
- [EXISTS] RT-DETR — Real-time DETR beating YOLOs (CVPR 2024)
- [EXISTS] YOLO-World — Open-vocabulary YOLO with text input (CVPR 2024)
- [EXISTS] Grounding DINO — Open-set detection with text queries (ECCV 2024)
- [EXISTS] YOLOv10 — NMS-free end-to-end detection (2024)
- [EXISTS] SAM (Segment Anything Model) — Promptable segmentation foundation model (Meta, 2023)
- [EXISTS] SAM 2 — Video segmentation with streaming memory (Meta, 2024)
- [EXISTS] MedSAM — Universal medical image segmentation (Nature Comms, 2024)
- [EXISTS] NeRF / Fast NeRF / Mip-NeRF / NeRF++
- [EXISTS] Gaussian Splatting (with loss, covariance, renderer)
- [EXISTS] Zip-NeRF — Anti-aliased grid-based NeRF, 24x faster (ICCV 2023)
- [EXISTS] DreamGaussian — Single-image to 3D in 2 minutes (ICLR 2024)
- [EXISTS] SuGaR — Surface-aligned Gaussian Splatting for mesh extraction (CVPR 2024)
- [EXISTS] GaussianEditor — Text-guided 3D scene editing (CVPR 2024)
- [EXISTS] LRM (Large Reconstruction Model) — Single-image to NeRF in 5 seconds (ICLR 2024)
- [EXISTS] ProlificDreamer — VSD for high-fidelity text-to-3D (NeurIPS 2023)
- [EXISTS] Base Diffusion / Conditional Diffusion / Stable Diffusion / UNet
- [EXISTS] DiT (Diffusion Transformer) — Transformer replaces U-Net in diffusion (ICCV 2023)
- [EXISTS] MMDiT (Multimodal Diffusion Transformer) — Dual-stream text-image transformer (SD3, 2024)
- [EXISTS] Consistency Models — Single-step generation from diffusion ODE (ICML 2023)
- [EXISTS] Latent Consistency Models (LCM) — 2-4 step generation from SD (2023)
- [EXISTS] Rectified Flow — Straight-path ODE transport (ICLR 2023, powers SD3/FLUX)
- [EXISTS] Flow Matching — Simulation-free CNF training (ICLR 2023)
- [EXISTS] PixArt-alpha — Efficient DiT training at 10% cost (2023)
- [EXISTS] Base GAN / Conditional GAN / CycleGAN / WGAN
- [EXISTS] VAE (encoder/decoder)
- [EXISTS] CogVideoX — Expert transformer for text-to-video (ICLR 2025)
- [EXISTS] VideoPoet — Autoregressive multimodal video LLM (Google, ICML 2024)
- [EXISTS] VALL-E — Neural codec language model for zero-shot TTS (Microsoft, 2023)
- [EXISTS] Voicebox — Flow-matching speech generation (Meta, ICLR 2024)
- [EXISTS] SoundStorm — Non-autoregressive parallel audio generation (Google, 2023)
- [EXISTS] MusicGen — Controllable music generation (Meta, NeurIPS 2023)
- [EXISTS] NaturalSpeech 3 — Factorized codec + diffusion TTS (ICML 2024)
- [EXISTS] GPT / GPT-4o / LLaMA / Falcon / Bloom / Qwen / Edge LLM / Base LLM
- [EXISTS] T5 / Longformer
- [EXISTS] LSTM / GRU / Bidirectional RNN
- [EXISTS] Chain-of-Thought
- [EXISTS] Tree of Thoughts (ToT) — BFS/DFS search over reasoning paths (NeurIPS 2023)
- [EXISTS] Graph of Thoughts (GoT) — DAG-based reasoning with merging/refining (2023)
- [EXISTS] Self-Consistency (CoT-SC) — Majority vote over multiple reasoning chains (ICLR 2023)
- [EXISTS] ReAct — Interleaved reasoning + action loop (ICLR 2023)
- [EXISTS] RAG Module / Document Encoder / Retriever
- [EXISTS] Self-RAG — Self-reflective retrieval with reflection tokens (ICLR 2024)
- [EXISTS] CRAG (Corrective RAG) — Confidence-gated retrieval actions (ICML 2024)
- [EXISTS] GraphRAG — Knowledge graph + community-based retrieval (Microsoft, 2024)
- [EXISTS] RAPTOR — Recursive tree-structured summarization for retrieval (ICLR 2024)
- [EXISTS] Adaptive RAG — Complexity-based retrieval routing (2024)
- [EXISTS] SFT (Supervised Fine-Tuning)
- [EXISTS] LoRA — Low-rank adaptation (ICLR 2022, foundational)
- [EXISTS] QLoRA — 4-bit quantized LoRA (NeurIPS 2023)
- [EXISTS] DoRA — Weight-decomposed LoRA (ICML 2024 Oral)
- [EXISTS] LoRA+ — Asymmetric learning rates (ICML 2024)
- [EXISTS] GaLore — Gradient low-rank projection, 7B on single 24GB GPU (ICML 2024)
- [EXISTS] LISA — Layerwise importance sampled AdamW (NeurIPS 2024)
- [EXISTS] AdaLoRA — Adaptive rank allocation (ICLR 2023)
- [EXISTS] rsLoRA — Rank-stabilized scaling (2023)
- [EXISTS] GPTQ — Second-order weight quantization (ICLR 2023)
- [EXISTS] AWQ — Activation-aware weight quantization (MLSys 2024 Best Paper)
- [EXISTS] QuIP# — 2-bit quantization with E8 lattice codebooks (ICML 2024)
- [EXISTS] SqueezeLLM — Dense-and-sparse hybrid quantization (ICML 2024)
- [EXISTS] AQLM — Additive multi-codebook quantization (ICML 2024)
- [EXISTS] SparseGPT — One-shot 50-60% pruning (ICML 2023)
- [EXISTS] Wanda — Prune by weight x activation magnitude (ICLR 2024)
- [EXISTS] SliceGPT — Delete rows/columns via orthogonal transforms (ICLR 2024)
- [EXISTS] ShortGPT — Layer removal by block influence score (2024)
- [EXISTS] Knowledge Distiller (soft/hard targets)
- [EXISTS] Rationale-based KD — Distill reasoning chains, not just outputs (2023)
- [EXISTS] Minitron — Pruning + KD for compact LLMs (NVIDIA, NeurIPS 2024)
- [EXISTS] Grammar-Constrained Decoding — Formal grammar constraints during generation
- [EXISTS] JSON Schema Decoder — FSM-based structured output (Outlines-style)
- [EXISTS] Matryoshka Representation Learning — Nested multi-granularity embeddings (NeurIPS 2022)
- [EXISTS] BGE-M3 — Dense + sparse + ColBERT in one model (BAAI, 2024)
- [EXISTS] Byte Latent Transformer (BLT) — Tokenizer-free, entropy-based dynamic patching (Meta, 2024)
- [EXISTS] MambaByte — Byte-level Mamba without tokenization (COLM 2024)
- [EXISTS] Adam / AdamW (used in agents)
- [EXISTS] Lion — Sign-based momentum, halves optimizer memory (NeurIPS 2023)
- [EXISTS] Sophia — Second-order with diagonal Hessian, 2x speedup (ICLR 2024)
- [EXISTS] Prodigy — Learning-rate-free optimizer (ICML 2024)
- [EXISTS] Schedule-Free AdamW — Eliminates LR schedules entirely (2024)
- [EXISTS] SOAP — Adam + Shampoo preconditioner (2024)
- [EXISTS] Muon — Momentum orthogonalized by Newton-Schulz (2024-2025)
- [EXISTS] Cosine Annealing (in PPO)
- [EXISTS] WSD (Warmup-Stable-Decay) — No fixed compute budget needed (DeepSeek-V3, 2024)
- [EXISTS] Cosine with Warm Restarts (SGDR) — Periodic LR resets
- [EXISTS] FP8 / Gradient Scaling / Config
- [EXISTS] MXFP8 (Microscaling FP8) — Block-level scaling for better FP8 (OCP, 2024)
- [EXISTS] FP4 / MXFP4 Training — 4-bit training with microscaling (2025)
- [EXISTS] Distributed training support
- [EXISTS] FSDP2 — Next-gen PyTorch fully sharded data parallelism
- [EXISTS] ZeRO++ — 4x communication reduction (Microsoft, 2023)
- [EXISTS] Context Parallelism — Sequence-length parallelism for long context
- [EXISTS] Cross-entropy / KL / Distillation losses
- [EXISTS] InfoNCE — Foundational contrastive loss
- [EXISTS] SigLIP Loss — Sigmoid contrastive (no global softmax)
- [EXISTS] VICReg — Variance-invariance-covariance regularization
- [EXISTS] Selective Activation Checkpointing — Per-operation save/recompute (PyTorch 2.4+)
- [EXISTS] Activation Offloading — CPU offload during forward, prefetch for backward
- [EXISTS] DINOv2 — Self-distillation ViT (Meta, 2023)
- [EXISTS] I-JEPA — Joint-embedding prediction in representation space (Meta, CVPR 2023)
- [EXISTS] V-JEPA 2 — Video world model from 1M+ hours (Meta, 2025)
- [EXISTS] data2vec 2.0 — Efficient multimodal SSL (Meta, ICML 2023)
- [EXISTS] MAE (Masked Autoencoder) — Masked image/video/audio modeling
- [EXISTS] Barlow Twins — Self-supervised via redundancy reduction
- [EXISTS] VICReg — Non-contrastive SSL with variance-covariance regularization
- [EXISTS] LLaVA-RLHF / PaLM-E / HiViLT / Enhanced Transformer
- [EXISTS] NVLM (base + processor)
- [EXISTS] LLaVA-NeXT / LLaVA-OneVision — Open-source multimodal LLM family (2024)
- [EXISTS] Qwen2-VL — Dynamic resolution + M-RoPE (Alibaba, 2024)
- [EXISTS] Molmo — Fully open VLM (AI2, 2024)
- [EXISTS] Phi-3-Vision — Lightweight multimodal with 128K context (Microsoft, 2024)
- [EXISTS] BiomedCLIP — Biomedical vision-language (Microsoft, 2023)
- [EXISTS] Base GNN / Message Passing / Hierarchical / Attention-based
- [EXISTS] GPS (Graph Transformer) — Modular graph transformer framework (NeurIPS 2022)
- [EXISTS] Exphormer — Sparse graph transformer with expander graphs (ICML 2023)
- [EXISTS] Graph Attention Network v2 (GATv2) — Dynamic attention for graphs
- [EXISTS] GraphSAGE — Inductive representation learning on large graphs
- [EXISTS] DreamerV3 — Latent world model, 150+ tasks (Nature, 2025)
- [EXISTS] I-JEPA — Image joint-embedding predictive architecture (Meta, 2023)
- [EXISTS] V-JEPA 2 — Video world model for robot planning (Meta, 2025)
- [EXISTS] Genie / Genie 2 — Interactive environment generation from video (Google, 2024)
- [EXISTS] EVCL — Elastic variational continual learning (ICML 2024)
- [EXISTS] Self-Synthesized Rehearsal — LLM self-generates replay data (ACL 2024)
- [EXISTS] Prompt-Based CL (L2P, DualPrompt, CODA-Prompt) — Learnable prompts per task
- [EXISTS] EWC (Elastic Weight Consolidation) — Classic regularization approach
- [EXISTS] Perception / Motion Prediction / Behavior / Planning / Decision / Sensor Fusion
- [EXISTS] UniAD — Unified end-to-end driving (CVPR 2023 Best Paper)
- [EXISTS] VAD — Vectorized autonomous driving (ICCV 2023)
- [EXISTS] DriveTransformer — Unified transformer for scalable E2E driving (2025)
- [EXISTS] GAIL (Generative Adversarial Imitation Learning) — Adversarial policy matching
- [EXISTS] DAgger — Dataset aggregation with expert queries
- [EXISTS] MEGA-DAgger — Multi-expert DAgger for imperfect demos (2024)
- [EXISTS] AIRL — Adversarial inverse RL with transferable rewards
- [EXISTS] Test-Time Training (TTT) Layers — Hidden state as ML model updated at inference (2024)
- [EXISTS] Compute-Optimal Scaling — Adaptive per-prompt test-time compute allocation (ICLR 2025)
- [EXISTS] Best-of-N with PRM — Sample multiple, verify with process reward model
- [EXISTS] Flow Matching — Simulation-free CNF training (ICLR 2023)
- [EXISTS] Rectified Flow — Straight-path ODE (ICLR 2023)
- [EXISTS] Consistency Training (iCT) — Improved standalone consistency training (2023)
- [EXISTS] Easy Consistency Tuning (ECT) — Bootstrap from diffusion models (ICLR 2025)
- LoRA / QLoRA / DoRA (PEFT methods)
- Mamba-2 (SSD)
- SwiGLU / GeGLU activations
- RMSNorm
- DiT (Diffusion Transformer)
- SAM (Segment Anything)
- Flow Matching / Rectified Flow
- DreamerV3 (world model)
- KTO / SimPO / ORPO (alignment)
- Lion / Sophia optimizers
- RWKV-6/7
- Griffin / Jamba hybrid architectures
- Self-RAG / GraphRAG
- GPTQ / AWQ quantization
- SparseGPT / Wanda pruning
- Tree of Thoughts / ReAct
- MAPPO / QMIX (multi-agent RL)
- Consistency Models
- PagedAttention / EAGLE speculative decoding
- GPS graph transformer
- V-JEPA / I-JEPA
- MLA (Multi-Head Latent Attention)
- GaLore / LISA (memory-efficient training)
- Byte Latent Transformer
- Mixture-of-Depths
- Gated Delta Networks
- Cal-QL / IDQL (offline RL)
- RT-DETR / YOLO-World
- DreamGaussian / SuGaR
- VALL-E / Voicebox (speech)