We are not bound to any fixed time duration; we go by vibes.
- CUDA:
- Programming Massively Parallel Processors
- CUDA Core Compute Libraries (CCCL): Thrust, CUB, libcudacxx
- Multi-GPU programming, NCCL
- CUTLASS & CuTe
- FlashAttention (1 & 2); the blockwise online-softmax core behind it is sketched after this list
- Distributed Data Parallelism (gradient all-reduce sketch after this list)
- Tensor Parallelism (column-parallel linear sketch after this list)
- Pipeline Parallelism (micro-batch schedule sketch after this list)
- Context Parallelism
- Fully Sharded Data Parallelism
- DeepSpeed ZeRO (stages 1, 2, and 3)
- Sequence Parallelism: Long Sequence Training from System Perspective
- Blockwise Parallel Transformer for Large Context Models
- Ring Attention with Blockwise Transformers for Near-Infinite Context Length
- Efficient Memory Management for Large Language Model Serving with PagedAttention (block-table KV-cache sketch after this list)
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- PipeDream: Fast and Efficient Pipeline Parallel DNN Training
- Zero Bubble Pipeline Parallelism
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
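
FlashAttention, the Blockwise Parallel Transformer, and Ring Attention all lean on the same identity: attention can be computed over key/value blocks one at a time while carrying a running max and normalizer, so the full score matrix is never materialized. A minimal single-process sketch of that recurrence; shapes, block size, and names are illustrative assumptions, not taken from any of the papers' code.

```python
# Blockwise (online-softmax) attention over K/V chunks, the recurrence behind
# FlashAttention and Ring Attention. Shapes and block size are illustrative.
import torch

torch.manual_seed(0)
seq_len, block, d = 64, 16, 8
q = torch.randn(seq_len, d)
k = torch.randn(seq_len, d)
v = torch.randn(seq_len, d)
scale = d ** -0.5

# Running statistics per query row: max score m, normalizer l, unnormalized output acc.
m = torch.full((seq_len, 1), float("-inf"))
l = torch.zeros(seq_len, 1)
acc = torch.zeros(seq_len, d)

for start in range(0, seq_len, block):
    k_blk, v_blk = k[start:start + block], v[start:start + block]
    s = (q @ k_blk.T) * scale                 # scores against this K block only
    m_new = torch.maximum(m, s.max(dim=1, keepdim=True).values)
    # Rescale previous statistics to the new running max, then fold in this block.
    correction = torch.exp(m - m_new)
    p = torch.exp(s - m_new)
    l = l * correction + p.sum(dim=1, keepdim=True)
    acc = acc * correction + p @ v_blk
    m = m_new

out_blockwise = acc / l
out_ref = torch.softmax((q @ k.T) * scale, dim=1) @ v
print(torch.allclose(out_blockwise, out_ref, atol=1e-5))  # True: blockwise == full attention
```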
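
For distributed data parallelism, the essential mechanic is that every rank holds a full model replica, computes gradients on its own slice of data, and all-reduces them before the optimizer step. A minimal sketch using torch.distributed; the tiny model, the random data, and the gloo backend are assumptions for illustration, and the script expects a torchrun-style multi-process launch.

```python
# Minimal data-parallel gradient synchronization with torch.distributed.
# Launch with e.g.: torchrun --nproc_per_node=2 ddp_sketch.py
import torch
import torch.distributed as dist
import torch.nn as nn

def main():
    dist.init_process_group(backend="gloo")   # use "nccl" on GPUs
    rank, world_size = dist.get_rank(), dist.get_world_size()

    torch.manual_seed(0)                      # same seed -> identical initial replica on every rank
    model = nn.Linear(16, 4)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # Each rank works on a different slice of the data (the "data parallel" part).
    x = torch.randn(8, 16) + rank
    y = torch.randn(8, 4)

    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()

    # The core of DDP: average gradients so every replica takes the same step.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size

    opt.step()
    if rank == 0:
        print(f"rank 0 loss: {loss.item():.4f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```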
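
Megatron-style tensor parallelism shards a linear layer's weight across devices; with a column split, each shard produces a slice of the output, and the slices are concatenated (or handed, still sharded, to the following row-parallel layer). The sketch below simulates the shards inside one process just to show the numerical equivalence; the sizes are arbitrary.

```python
# Single-process simulation of a Megatron-style column-parallel linear layer.
# Each "rank" owns a column slice of W; concatenating the partial outputs
# reproduces the full layer's output. Illustrative only.
import torch

torch.manual_seed(0)
world_size = 4
d_in, d_out = 16, 32

x = torch.randn(8, d_in)          # activations, replicated on every rank
W = torch.randn(d_in, d_out)      # full weight of the layer y = x @ W

# Shard W along its output (column) dimension, one shard per rank.
shards = torch.chunk(W, world_size, dim=1)

# Each rank computes its slice of the output independently...
partial_outputs = [x @ W_shard for W_shard in shards]

# ...and an all-gather along the feature dimension rebuilds the full output.
y_tp = torch.cat(partial_outputs, dim=1)
y_ref = x @ W

print(torch.allclose(y_tp, y_ref, atol=1e-6))  # True: sharded == unsharded
```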
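
GPipe-style pipeline parallelism splits the model into consecutive stages and streams micro-batches through them so the stages overlap work. A toy schedule simulation (stage and micro-batch counts are made up) that also makes the fill/drain bubble visible, the inefficiency that 1F1B and zero-bubble schedules attack.

```python
# Toy simulation of a GPipe-style forward schedule: micro-batches enter the
# pipeline one tick apart, so stages overlap instead of idling on one big batch.
num_stages = 4        # pipeline depth (model split into 4 consecutive chunks)
num_microbatches = 8  # the global batch is split into 8 micro-batches

for tick in range(num_stages + num_microbatches - 1):
    active = []
    for stage in range(num_stages):
        mb = tick - stage                 # micro-batch this stage works on now
        if 0 <= mb < num_microbatches:
            active.append(f"stage{stage}:mb{mb}")
    print(f"t={tick:2d}  " + "  ".join(active))
# The first and last ticks, where few stages are busy, are the "pipeline bubble".
```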
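
PagedAttention treats the KV cache like paged virtual memory: fixed-size blocks drawn from a shared pool, with a per-sequence block table mapping logical token positions to physical blocks. A toy single-sequence sketch; block size, pool size, and class names are invented for illustration and are not vLLM's API.

```python
# Toy paged KV cache: a shared pool of fixed-size blocks plus a per-sequence
# block table, in the spirit of PagedAttention. Sizes and names are made up.
import torch

BLOCK_SIZE, NUM_BLOCKS, HEAD_DIM = 4, 16, 8
k_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM)   # shared physical K blocks
v_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM)   # shared physical V blocks
free_blocks = list(range(NUM_BLOCKS))                    # simple free list

class Sequence:
    def __init__(self):
        self.block_table = []   # logical block index -> physical block id
        self.length = 0         # number of tokens cached so far

    def append_kv(self, k, v):
        """Write one token's key/value into the paged cache."""
        if self.length % BLOCK_SIZE == 0:                # current block full (or none yet)
            self.block_table.append(free_blocks.pop())   # allocate a new physical block
        blk = self.block_table[-1]
        slot = self.length % BLOCK_SIZE
        k_pool[blk, slot] = k
        v_pool[blk, slot] = v
        self.length += 1

    def gather_kv(self):
        """Reassemble contiguous K/V for attention from the scattered blocks."""
        k = torch.cat([k_pool[b] for b in self.block_table])[: self.length]
        v = torch.cat([v_pool[b] for b in self.block_table])[: self.length]
        return k, v

seq = Sequence()
for _ in range(10):                                      # cache 10 tokens
    seq.append_kv(torch.randn(HEAD_DIM), torch.randn(HEAD_DIM))
print(seq.block_table, seq.gather_kv()[0].shape)         # 3 blocks used, K shape (10, 8)
```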