
Commit 4256c75

mini diffusion WIP

1 parent 47e8166

175 files changed, +32503 −0 lines


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@ conv-relu-animation/node_modules/
 conv2d-animation/node_modules/
 node_modules/
 .history/*
+mini-diffusion/target/

blog/content/_index.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
---
title: "Welcome"
---

# ML Animations Blog

Interactive explanations of machine learning concepts. Each article pairs with a [visualization](https://danielsobrado.github.io/ml-animations/).

## Recent Posts

Browse by category:

### Transformers
- [Attention Mechanism](/posts/attention-mechanism-part1/)
- [Self-Attention](/posts/self-attention/)
- [Positional Encoding](/posts/positional-encoding/)
- [Transformer Architecture](/posts/transformer-architecture/)
- [BERT](/posts/bert/)

### NLP Fundamentals
- [Word2Vec](/posts/word2vec/)
- [GloVe](/posts/glove/)
- [FastText](/posts/fasttext/)
- [Embeddings](/posts/embeddings/)
- [Tokenization](/posts/tokenization/)
- [Bag of Words](/posts/bag-of-words/)

### Neural Networks
- [ReLU](/posts/relu/)
- [Leaky ReLU](/posts/leaky-relu/)
- [Softmax](/posts/softmax/)
- [Layer Normalization](/posts/layer-normalization/)
- [LSTM](/posts/lstm/)
- [Conv2D](/posts/conv2d/)
- [Conv + ReLU](/posts/conv-relu/)

### Advanced Models
- [Fine-Tuning](/posts/fine-tuning/)
- [RAG](/posts/rag/)
- [VAE](/posts/vae/)
- [Multimodal LLM](/posts/multimodal-llm/)

### Math Fundamentals
- [Gradient Descent](/posts/gradient-descent/)
- [Linear Regression](/posts/linear-regression/)
- [Matrix Multiplication](/posts/matrix-multiplication/)
- [Eigenvalues](/posts/eigenvalue/)
- [SVD](/posts/svd/)
- [QR Decomposition](/posts/qr-decomposition/)

### Probability & Statistics
- [Probability Distributions](/posts/probability-distributions/)
- [Conditional Probability](/posts/conditional-probability/)
- [Expected Value & Variance](/posts/expected-value-variance/)
- [Entropy](/posts/entropy/)
- [Cross-Entropy](/posts/cross-entropy/)
- [Cosine Similarity](/posts/cosine-similarity/)
- [Spearman Correlation](/posts/spearman-correlation/)

### Reinforcement Learning
- [RL Foundations](/posts/rl-foundations/)
- [Q-Learning](/posts/q-learning/)
- [Exploration](/posts/rl-exploration/)
- [Markov Chains](/posts/markov-chains/)

### Algorithms
- [Bloom Filter](/posts/bloom-filter/)
- [PageRank](/posts/pagerank/)

---

[View all animations →](https://danielsobrado.github.io/ml-animations/)

blog/content/about.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
---
title: "About ML Animations"
---

This blog accompanies the [ML Animations](https://danielsobrado.github.io/ml-animations/) project - a collection of interactive visualizations explaining machine learning concepts.

## What you'll find here

Each article explains a concept from the animations in more depth:

- **Transformers & Attention**: How modern language models work
- **NLP Fundamentals**: Word2Vec, embeddings, tokenization
- **Neural Networks**: Activations, normalization, architectures
- **Math Foundations**: Linear algebra, probability, optimization
- **Reinforcement Learning**: Q-learning, exploration, MDPs
- **Algorithms**: PageRank, Bloom filters

## Why visualizations?

ML concepts click better when you see them. A picture of gradient descent navigating a loss surface beats equations. Watching attention weights form makes transformers less magical.

The animations are interactive - play with parameters, see what changes.

## About the writing

These articles try to explain things like a colleague would over coffee. Not academic papers. Occasional shortcuts and simplifications where they help understanding.

If something's unclear or wrong, open an issue.

## Links

- [Interactive Animations](https://danielsobrado.github.io/ml-animations/)
- [GitHub Repository](https://github.com/danielsobrado/ml-animations)
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
---
title: "What is Attention? I finally understood it"
date: 2024-11-28
draft: false
tags: ["transformers", "attention", "nlp", "deep-learning"]
categories: ["Machine Learning"]
series: ["Understanding Attention"]
---
So you keep hearing about the attention mechanism everywhere. Transformers this, attention that. I spent weeks trying to understand it from papers and tutorials. Most explanations made it way more complicated than it needs to be.

Let me try to explain how I finally got it.

## The database analogy that clicked for me

Think of attention as a fuzzy database lookup. Not an exact match, but a weighted combination.

You have three things:
- Query (Q) - what you're searching for
- Key (K) - the labels or titles of the items
- Value (V) - the actual content

Unlike a normal database lookup, which returns an exact match, attention returns a weighted combination of ALL the values. The weights depend on how well the query matches each key.

![Attention Mechanism Interactive Demo](https://danielsobrado.github.io/ml-animations/animation/attention-mechanism)

Check out the interactive visualization I built: [Attention Mechanism Animation](https://danielsobrado.github.io/ml-animations/animation/attention-mechanism)
## Library search example

OK, so imagine walking into a library looking for books about "machine learning".

Your query is "machine learning".

The keys are book titles:
- Neural Networks
- Python Basics
- Deep Learning
- Cooking Recipes
- AI Fundamentals
- Romance Novels

The values are the actual book contents.

Now, attention doesn't just grab one book. It looks at ALL the books and weights them by relevance:
- Deep Learning: high weight (very relevant)
- Neural Networks: high weight
- AI Fundamentals: medium-high weight
- Python Basics: some weight (related to ML coding)
- Cooking Recipes: basically zero
- Romance Novels: zero

Then it returns a weighted mix of all the contents. The relevant books contribute more.
## Why this matters

Before attention, models used RNNs. The problem was that information had to flow sequentially. By the time you reach the end of a long sentence, the beginning is mostly forgotten.

With attention? Direct access to any position. No forgetting. No distance limit.

Also, it's fully parallelizable, which is huge for training speed.
## The math (simplified)

```
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
```

Breaking it down:
1. $QK^T$ - the dot products give similarity scores
2. divide by $\sqrt{d_k}$ - a scaling factor that keeps the softmax from getting too peaky
3. softmax - converts the scores to probabilities (the weights sum to 1)
4. multiply by V - a weighted combination of the values

The scaling by $\sqrt{d_k}$ is important. Without it, the dot products grow with the dimension and the softmax becomes overconfident in a single item.
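
To make the formula concrete, here's a minimal NumPy sketch of scaled dot-product attention. The function name and toy shapes are mine, not from the animation - just enough to see the scores, the softmax, and the weighted mix of values:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                                        # weighted mix of the values

# toy example: 2 queries attending over 3 key/value pairs
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 4)
```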
## What I got wrong initially

I thought Q, K, V were separate inputs. They're not always. In self-attention, they all come from the same input, just projected differently with learned weights.
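
As a rough sketch of that idea, reusing the `scaled_dot_product_attention` function from the math section above (X and the W matrices are made-up placeholders, random here rather than learned):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4

X = rng.normal(size=(seq_len, d_model))    # one sequence of token vectors
W_q = rng.normal(size=(d_model, d_k))      # "learned" projections (random placeholders)
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# self-attention: Q, K, V are all projections of the same input X
Q, K, V = X @ W_q, X @ W_k, X @ W_v
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (5, 4)
```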
I also thought attention was expensive. It is O(n²) in the sequence length n. But the parallelization makes it faster than RNNs in practice for reasonable lengths.
## Next up

In part 2 I'll cover:
- scaled dot-product attention in detail
- multi-head attention (why multiple heads?)
- self-attention vs cross-attention

The visualization tool shows all of this interactively. Play with it: [https://danielsobrado.github.io/ml-animations/animation/attention-mechanism](https://danielsobrado.github.io/ml-animations/animation/attention-mechanism)

---

*Part of the [Understanding Attention](/series/understanding-attention/) series*

blog/content/posts/bag-of-words.md

Lines changed: 176 additions & 0 deletions
@@ -0,0 +1,176 @@
---
title: "Bag of Words - the simplest text representation"
date: 2024-11-22
draft: false
tags: ["bag-of-words", "bow", "nlp", "text-representation", "tfidf"]
categories: ["NLP Fundamentals"]
---
Before embeddings there was Bag of Words. Still useful, still relevant for some tasks. And understanding it helps you see why newer methods are better.

## What is it?

Represent a document as word counts. Ignore order completely.

"The cat sat on the mat"
"The dog sat on the log"

Vocabulary: [the, cat, sat, on, mat, dog, log]

Document 1: [2, 1, 1, 1, 1, 0, 0]
Document 2: [2, 0, 1, 1, 0, 1, 1]

That's it. Count each word.

![Bag of Words Process](https://danielsobrado.github.io/ml-animations/animation/bag-of-words)

See it visualized: [Bag of Words Animation](https://danielsobrado.github.io/ml-animations/animation/bag-of-words)
28+
29+
## Building it
30+
31+
```python
32+
from collections import Counter
33+
34+
def bag_of_words(documents):
35+
# build vocabulary
36+
vocab = set()
37+
for doc in documents:
38+
vocab.update(doc.split())
39+
vocab = sorted(vocab)
40+
word_to_idx = {w: i for i, w in enumerate(vocab)}
41+
42+
# vectorize
43+
vectors = []
44+
for doc in documents:
45+
counts = Counter(doc.split())
46+
vec = [counts.get(w, 0) for w in vocab]
47+
vectors.append(vec)
48+
49+
return vectors, vocab
50+
```
51+
52+
Or just use sklearn:
53+
```python
54+
from sklearn.feature_extraction.text import CountVectorizer
55+
56+
vectorizer = CountVectorizer()
57+
X = vectorizer.fit_transform(documents)
58+
```
## The problems

**Ignores word order**

"Dog bites man" and "Man bites dog" have identical BoW vectors. Completely different meaning.

**Sparse and high dimensional**

A 10,000-word vocabulary = 10,000-dim vectors. Mostly zeros.

**No semantic similarity**

"Happy" and "joyful" are as distant as "happy" and "angry". No meaning captured.

**Common words dominate**

"The", "is", "a" appear everywhere. They don't help distinguish documents.
## TF-IDF to the rescue

Term Frequency - Inverse Document Frequency.

Weight words by:
- How often they appear in this document (TF)
- How rare they are across all documents (IDF)

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

$$\text{IDF}(t) = \log\frac{N}{|\{d : t \in d\}|}$$

Words appearing in every document get low weight. Rare, distinctive words get high weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
```
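
To see the weighting by hand, here's a tiny sketch of the textbook formula above on a made-up three-document corpus. (Note that sklearn's `TfidfVectorizer` adds smoothing and normalization by default, so its numbers won't match this exactly.)

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]
N = len(docs)

# document frequency: how many documents contain each term at least once
df = Counter(t for doc in tokenized for t in set(doc))

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)      # raw term frequency in this document
    idf = math.log(N / df[term])     # textbook IDF, no smoothing
    return tf * idf

print(tf_idf("the", tokenized[0]))   # 0.0 - appears in every document
print(tf_idf("mat", tokenized[0]))   # ~1.10 - rare, distinctive word
```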
## N-grams

Capture some word order by including consecutive word pairs (bigrams) and triples (trigrams).

"The cat sat" with bigrams:
- Unigrams: [the, cat, sat]
- Bigrams: [the_cat, cat_sat]

```python
vectorizer = CountVectorizer(ngram_range=(1, 2))
```

The vocabulary explodes, but it captures more structure.
## When BoW still works

- Document classification (news categories, spam)
- Search and information retrieval (with TF-IDF)
- Baseline for comparison
- When you need interpretability
- Small datasets

## When it fails

- Sentiment analysis (word order matters)
- Question answering
- Anything requiring understanding
- Short texts (not enough words)
## Preprocessing matters

BoW benefits from:
- Lowercasing
- Removing punctuation
- Stop word removal
- Stemming/lemmatization

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words='english',   # sklearn's built-in English stop word list
    max_features=5000,
    ngram_range=(1, 2)
)
```
## Comparison with embeddings

| Aspect | BoW/TF-IDF | Embeddings |
|--------|------------|------------|
| Semantic similarity | No | Yes |
| Word order | No (partial with n-grams) | Yes |
| Dimensionality | High (vocab size) | Low (100-768) |
| Interpretable | Yes | No |
| Training data needed | None | Lots |
| Compute | Fast | Slower |
## Practical advice

Starting a new NLP project?

1. Try TF-IDF first (baseline)
2. If that's not good enough, try sentence embeddings
3. If that's still not enough, fine-tune BERT

I'm surprised how often TF-IDF is "good enough" for classification tasks.
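
For concreteness, here's roughly what step 1 might look like with scikit-learn - a minimal TF-IDF plus logistic regression baseline. The texts and labels are made-up placeholders, just to show the shape of the pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tiny made-up dataset, purely for illustration
texts = [
    "free money click now", "meeting moved to friday",
    "win a prize today", "lunch at noon?",
]
labels = ["spam", "ham", "spam", "ham"]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["claim your free prize today"]))  # likely ['spam']
```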
The animation shows how documents become vectors: [Bag of Words Animation](https://danielsobrado.github.io/ml-animations/animation/bag-of-words)

---

Related:
- [Embeddings - better representations](/posts/embeddings/)
- [Tokenization](/posts/tokenization/)
