This repo contains code from my GenAI learning journey. I'm following the tasks below and updating each one in its own Python file.
Code. Plot. Break. Repeat.
- Build a byte-pair encoder (BPE) and train your own subword vocab
- Write a token visualizer to map words/chunks → token IDs
- Compare one-hot vs learned embeddings
- Plot cosine distances between token vectors
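The heart of the BPE task is a two-step loop: count adjacent symbol pairs, then merge the most frequent pair into a new symbol. A minimal sketch (function names are my own, not from any library):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs across the token stream
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace every occurrence of `pair` with a single merged symbol
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy "training": repeat until the vocab budget is reached
tokens = list("low lower lowest")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
```

The learned merge rules, applied in order, are exactly the subword vocab; the token visualizer then just maps each merged chunk to its index in that vocab.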
- Implement & compare positional encodings:
  - Sinusoidal
  - Learned
  - RoPE
  - ALiBi
- Animate a toy sequence being position-encoded in 3D
- Ablate positions — watch attention collapse
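Of the four encodings above, sinusoidal is the easiest to hand-roll, so it makes a good starting point before the 3D animation. A NumPy sketch, assuming an even `d_model`:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # Even dims get sin, odd dims get cos, with geometrically
    # spaced frequencies (the 10000 base from the original paper)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(32, 16)  # shape (32, 16)
```

Ablating positions is then just zeroing `pe` before adding it to the embeddings.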
- Hand-wire dot-product attention for one token
- Scale to multi-head attention, plot per-head weight heatmaps
- Mask out future tokens, verify causal property
- Stack your Attention + LayerNorm + Residual → single-block transformer
- Generalize to n-block “mini-former” on toy data
- Dissect Q, K, V — swap them, break them, see what explodes
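For the hand-wired attention and the causal-mask check, a single-head NumPy sketch is enough (multi-head just adds a head dimension and a reshape):

```python
import numpy as np

def causal_attention(Q, K, V):
    # Scaled dot-product attention with an upper-triangular future mask
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf  # future positions get zero weight after softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out, w = causal_attention(Q, K, V)
```

Verifying the causal property is then one assert: every weight above the diagonal is exactly zero, and each row still sums to one.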
- Build a sampler dashboard — interactively tune temp/k/p
- Plot entropy vs output diversity
- Set temp = 0 (argmax) — watch repetition set in
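A sampler dashboard needs one function that takes logits plus the knobs being tuned. A sketch covering temperature, top-k, and the argmax edge case (top-p follows the same pattern on the sorted cumulative probabilities):

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, rng=None):
    rng = rng or np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))  # greedy decoding, repetition-prone
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        # Keep only the k largest logits (ties at the cutoff survive)
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

The entropy-vs-diversity plot falls out of calling this many times per setting and measuring unique outputs against `-(probs * np.log(probs)).sum()`.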
- Record & reuse KV states; measure speedup vs no-cache
- Build a cache hit/miss visualizer for token streams
- Profile cache memory cost for long vs short sequences
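The KV cache itself is just an append-only store per layer; the speedup comes from never re-running attention over past positions. A minimal sketch (class name is my own):

```python
import numpy as np

class KVCache:
    """Append-only key/value store for one attention layer."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        # Each decode step contributes exactly one new K row and V row
        self.keys.append(k)
        self.values.append(v)

    def stacked(self):
        return np.stack(self.keys), np.stack(self.values)

cache = KVCache()
for step in range(5):
    k = v = np.random.randn(16)
    cache.append(k, v)
K, V = cache.stacked()  # past keys/values, reused instead of recomputed
```

Memory profiling is then arithmetic: `seq_len * n_layers * 2 * d_model * bytes_per_element`, which is why long sequences dominate the cost.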
- Implement sliding-window attention; measure loss on long docs
- Benchmark memory-efficient attention (recompute, flash)
- Plot perplexity vs context length, find the context collapse point
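Sliding-window attention only changes the mask: position i sees the last `window` positions instead of all of them. A sketch of that mask:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # Position i may attend to positions max(0, i - window + 1) .. i
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

m = sliding_window_mask(6, window=3)
```

Swapping this in for the full causal mask and sweeping `window` gives the loss-vs-window curve on long documents.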
- Code a 2-expert router layer; route tokens dynamically
- Plot expert utilization histograms
- Simulate sparse vs dense routing, measure FLOP savings
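A top-1 router for the 2-expert layer can be a few lines: score each token against each expert, then dispatch by argmax. A sketch with toy experts (names are my own):

```python
import numpy as np

def route_top1(x, W_router, experts):
    # x: (n_tokens, d). Each token goes to its argmax-scoring expert.
    logits = x @ W_router              # (n_tokens, n_experts)
    choice = logits.argmax(axis=-1)
    out = np.empty_like(x)
    for e, expert in enumerate(experts):
        sel = choice == e
        if sel.any():
            out[sel] = expert(x[sel])  # compute only on routed tokens
    return out, choice

# Two toy "experts": doubling and negation
experts = [lambda t: 2 * t, lambda t: -t]
x = np.array([[1.0, 0.0], [0.0, 1.0]])
out, choice = route_top1(x, np.eye(2), experts)
```

`np.bincount(choice)` gives the utilization histogram directly, and the FLOP comparison is sparse (each token through one expert) vs dense (every token through every expert).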
- Convert your mini-former to grouped query layout
- Measure speed vs vanilla multi-head on large batch
- Ablate number of groups, plot latency
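The grouped-query conversion boils down to keeping fewer KV heads than query heads and sharing each KV head across a group. A shape-level sketch of the score computation:

```python
import numpy as np

def gqa_scores(Q, K, n_groups):
    # Q: (n_q_heads, seq, d); K: (n_kv_heads, seq, d).
    # Each group of query heads shares one KV head.
    n_q_heads = Q.shape[0]
    heads_per_group = n_q_heads // n_groups
    K_expanded = np.repeat(K, heads_per_group, axis=0)  # broadcast KV heads
    return Q @ K_expanded.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])

Q = np.zeros((8, 4, 16))   # 8 query heads
K = np.zeros((2, 4, 16))   # only 2 KV heads
scores = gqa_scores(Q, K, n_groups=2)  # (8, 4, 4)
```

`n_groups == n_q_heads` recovers vanilla multi-head; `n_groups == 1` is multi-query. The latency ablation sweeps between those extremes; the KV-cache shrinkage is the main win at large batch.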
- Hand-implement:
  - LayerNorm
  - RMSNorm
  - SwiGLU
  - GELU
- Ablate each — observe train/test loss impact
- Plot activation distributions layer-wise
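The two norms make the cleanest side-by-side, since RMSNorm is LayerNorm minus the mean subtraction and bias. A NumPy sketch:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Center and scale per feature vector, then affine transform
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-6):
    # Scale by root-mean-square only; no centering, no bias
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms
```

The ablation swaps one for the other inside the mini-former block and compares train/test loss; activation histograms per layer show how each keeps magnitudes in range.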
- Train:
  - Masked LM
  - Causal LM
  - Prefix LM
- Plot loss curves; compare which objective learns "English" fastest
- Generate samples from each — note quirks
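The three objectives differ mainly in what each position is allowed to see. A sketch of the three visibility masks (the loss target changes too: MLM predicts only masked slots, the others predict next tokens):

```python
import numpy as np

def lm_masks(seq_len, prefix_len=2):
    # Causal LM: strictly lower-triangular visibility
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Prefix LM: bidirectional over the prefix, causal afterwards
    prefix = causal.copy()
    prefix[:, :prefix_len] = True
    # Masked LM: fully bidirectional
    masked = np.ones((seq_len, seq_len), dtype=bool)
    return causal, prefix, masked

causal, prefix, masked = lm_masks(4)
```

Training the same mini-former under each mask, on the same data, makes the loss-curve comparison fair.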
- Fine-tune on a small custom dataset
- Instruction-tune by prepending tasks (e.g., “Summarize: …”)
- RLHF:
  - Hack a reward model
  - Run PPO for 10 steps
  - Plot reward curve
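For the hacked-together reward model, the standard fitting objective is a pairwise Bradley-Terry loss on (chosen, rejected) completions; PPO then maximizes the learned reward. A sketch of just the loss:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise objective: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes preferred completions' rewards above rejected ones.
    margin = r_chosen - r_rejected
    return np.log1p(np.exp(-margin))
```

At equal rewards the loss is log 2; it shrinks as the margin grows, which is what the 10-step PPO reward curve should show climbing.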
- Train tiny / small / medium models
- Plot loss vs size
- Benchmark time, VRAM, throughput
- Extrapolate scaling curve — how “dumb” can you go?
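Extrapolating the scaling curve amounts to fitting a power law, loss ≈ a · size^b, which is a straight line in log-log space. A sketch with hypothetical numbers (not real measurements):

```python
import numpy as np

# Hypothetical (param count, eval loss) points from the tiny/small/medium runs
sizes = np.array([1e6, 1e7, 1e8])
losses = np.array([4.0, 3.2, 2.6])

# Linear fit in log-log space: slope b is the scaling exponent
b, log_a = np.polyfit(np.log(sizes), np.log(losses), 1)
predicted_loss = np.exp(log_a) * (1e9 ** b)  # extrapolate one decade up
```

A slope near zero means going smaller is nearly free; a steep slope means "dumb" gets expensive fast.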
- Implement PTQ (Post-Training Quantization)
- Implement QAT (Quantization-Aware Training)
- Export to GGUF / AWQ
- Plot accuracy drop vs compression ratio
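The simplest PTQ baseline is symmetric per-tensor int8: pick a scale so the largest weight maps to 127, round, and dequantize at use time. A sketch:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor PTQ: max |w| maps to 127
    scale = np.abs(w).max() / 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=100).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(dequantize(q, scale) - w).max()  # bounded by scale / 2
```

QAT wraps this same round-trip into the forward pass (with a straight-through gradient) so the model learns around the rounding error; the accuracy-vs-compression plot sweeps bit widths.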
- Port a model from Hugging Face → DeepSpeed → vLLM → ExLlama
- Profile throughput, VRAM, and latency across all three backends
- Generate toy data, add noise, dedupe, create eval splits
- Visualize model learning curves on real vs synthetic data
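For the dedupe step, an exact-match pass over whitespace/case-normalized text catches the bulk of duplicates before any fancier near-dup hashing. A sketch:

```python
def dedupe(docs):
    # Exact-match dedup on normalized text; first occurrence wins
    seen, kept = set(), []
    for doc in docs:
        key = " ".join(doc.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

Running dedup before making eval splits matters most: duplicates that straddle the train/eval boundary silently inflate the real-data learning curves.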