---
draft: false
date: 2024-03-19
slug: lets-build-gpt-from-scratch
tags:

Karpathy's tutorial on Youtube: [Lets build GPT from scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2794s)

## [The spelled-out intro to neural networks and backpropagation: building micrograd - YouTube](https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=1)

In this video he builds micrograd.
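
For reference, a tiny usage sketch of micrograd's `Value` API (based on the repo, not code shown in these notes): scalar values are wrapped in `Value` objects, an expression graph is built with ordinary arithmetic, and `backward()` fills in the gradients.

```python
from micrograd.engine import Value

a = Value(2.0)
b = Value(-3.0)
c = a * b + a.relu()   # build a tiny expression graph: c = a*b + relu(a)
c.backward()           # backpropagate gradients through the graph

print(c.data)          # forward value: 2*(-3) + 2 = -4.0
print(a.grad, b.grad)  # dc/da = b + 1 = -2.0, dc/db = a = 2.0
```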

## [The spelled-out intro to language modeling: building makemore - YouTube](https://youtu.be/PaCmpygFfXo?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)

Building makemore: [GitHub - karpathy/makemore: An autoregressive character-level language model for making more things](https://github.com/karpathy/makemore)

Dataset: a people-names dataset from a government website.

## Iteration 1: Character-level language model

Method: bigram (predict the next character using only the previous character).

![[Pasted image 20250130124540.png]]

As seen above, it doesn't give good names. The bigram model is not good at predicting the next character.

In the bigram model, the bigram probabilities become the parameters of the language model.
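
As a rough sketch of what those parameters look like in code (an assumption-laden reconstruction: it presumes the `words` list and the `stoi` character-to-index mapping used in the snippets below, with 27 symbols including `.` as the start/end marker):

```python
import torch

# Count how often each character follows each other character, then
# normalize every row into a probability distribution over the next character.
N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

P = N.float()
P /= P.sum(1, keepdim=True)  # P[i, j] = p(next char is j | current char is i)
```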

### Quality evaluation of the model

We will be using the [[Negative maximum log likelihood estimate]]; in our problem we calculate it over the entire training set.

Note that log(1) = 0, while log(a very small number) tends to $-\infty$.
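
Spelled out (a standard formulation added here for reference), the average negative log likelihood over the $n$ training bigrams is:

$$
\text{nll} = -\frac{1}{n}\sum_{i=1}^{n}\log P(c_{i+1}\mid c_i)
$$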

We estimate the negative log likelihood as follows:

```python
# Assumes `words`, `stoi`, the bigram probability matrix `P`, and `import torch`.
log_likelihood = 0.0
n = 0
for w in words[:3]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1, ix2 = stoi[ch1], stoi[ch2]
        prob = P[ix1, ix2]  # P is the matrix that holds the bigram probabilities
        n += 1
        log_likelihood += torch.log(prob)
        print(f'{ch1}{ch2}: {prob:.4f}')

print(f'{log_likelihood=}')

# Negative log likelihood has the nice property that the error (loss) should be small, i.e. zero is good.
nll = -log_likelihood
print(f'{nll=}')

# Usually people work with the average negative log likelihood.
print(f'{nll/n=}')
```

To avoid a zero probability (and hence an infinite loss) for bigrams never seen during training, people apply model "smoothing": assign a small probability to every unlikely pair.
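
For the count-based model this amounts to add-one smoothing, discussed again in the comparison section below (a sketch reusing the hypothetical count matrix `N` from the earlier snippet):

```python
# Add a "fake" count of 1 to every bigram before normalizing, so no pair
# ever gets probability zero and the log likelihood stays finite.
P = (N + 1).float()
P /= P.sum(1, keepdim=True)
```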

## Iteration 2: Bigram Language Model using a Neural Network

We need to create a dataset for training, i.e. input and output character pairs (x and y).

One-hot encoding needs to be done before feeding the inputs into the NN.
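
A sketch of how those (x, y) pairs might be built (again assuming `words` and `stoi`; the one-hot encoding itself happens in the training snippet below):

```python
import torch

# Each training example is one bigram: xs holds the index of the current
# character, ys the index of the character that should come next.
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs = torch.tensor(xs)
ys = torch.tensor(ys)
```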

`log(counts) = logits`
`counts = exp(logits)`

```python
# Assumes `import torch.nn.functional as F` and a weight matrix
# W = torch.randn((27, 27), requires_grad=True) initialized beforehand.
xenc = F.one_hot(xs, num_classes=27).float()
for i in range(100):

    # Forward pass
    logits = xenc @ W  # predicted log-counts
    counts = logits.exp()  # counts
    probs = counts / counts.sum(1, keepdims=True)  # softmax: probability of each next character
    loss = -probs[torch.arange(228146), ys].log().mean()  # 228146 = number of bigram examples
    print(loss.item())

    # Backward pass
    W.grad = None
    loss.backward()

    # Update parameters using the gradient just calculated
    W.data += -50 * W.grad  # 50 is the learning rate; small values like 0.1 decreased the loss too slowly, so it was increased to 50
```
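
Once trained, the model is used the same way as the count-based one: sample names one character at a time. A sketch, assuming an `itos` index-to-character mapping (the inverse of `stoi`) and the trained `W` from above:

```python
import torch
import torch.nn.functional as F

# Start from the '.' token and keep sampling the next character from the
# model's distribution until '.' is produced again.
out = []
ix = stoi['.']
while True:
    xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    p = counts / counts.sum(1, keepdims=True)
    ix = torch.multinomial(p, num_samples=1, replacement=True).item()
    if ix == stoi['.']:
        break
    out.append(itos[ix])
print(''.join(out))
```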

### Thoughts and comparison of the two approaches above

In the first approach, we added 1 to the actual counts because we don't want to end up in a situation where the model gives $-\infty$ for a character pair it never saw in the training dataset. If you add a large number instead, the actual frequencies become less relevant and the distribution approaches uniform. This is called smoothing.

Similarly, the gradient-based approach has its own way of "smoothing". When you keep all values of `W` at zero, exp(W) gives all ones and the softmax assigns equal probability to every output. You incentivise this in the loss function by adding a second component, like below:

```
loss = -probs[torch.arange(228146), ys].log().mean() + (0.1 * (W**2).mean())
```

The second component pushes `W` towards zero. Here 0.1 is the regularization strength, which determines how much weight we give to this regularization component. It plays the same role as the number of "fake" counts added in the first approach.

We took two approaches:

i) Frequency-based model
ii) NN-based model (optimized with negative log likelihood)

We ended up with the same model: in the NN-based approach, `W` represents the log counts (the counterpart of the count table in the first approach), so we can exponentiate `W` to recover the counts.
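
A quick way to sanity-check that equivalence (a sketch; `P` here is the smoothed count-based matrix from Iteration 1):

```python
with torch.no_grad():
    P_from_nn = W.exp() / W.exp().sum(1, keepdim=True)  # softmax over the learned log-counts
print((P_from_nn - P).abs().max())  # how far apart the two bigram models ended up
```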