
Commit 925fedd

Added notes to Karpathy's videos
1 parent 68019a8 commit 925fedd

File tree

1 file changed: +114 -3 lines


docs/writing/notes/Karpathy's - let's build GPT from scratch.md

Lines changed: 114 additions & 3 deletions
@@ -1,5 +1,5 @@
---
-draft: true
+draft: false
date: 2024-03-19
slug: lets-build-gpt-from-scratch
tags:
@@ -15,6 +15,117 @@ authors:

Karpathy's tutorial on Youtube: [Let's build GPT from scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=2794s)

-ChatGPT is probabilistic system

## [The spelled-out intro to neural networks and backpropagation: building micrograd - YouTube](https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=1)

In this video he builds micrograd, a tiny scalar-valued autograd engine (sketched below).
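
This is my own minimal sketch of the kind of scalar autograd `Value` object built in that video, written from memory rather than copied from the repo; the operator set and names are only illustrative. Each `Value` remembers its parents and a small closure that pushes its gradient back to them.

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow backwards."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # propagates this node's grad to its parents
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad   # d(out)/d(self) = 1
            other.grad += out.grad  # d(out)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(out)/d(self) = other
            other.grad += self.data * out.grad  # d(out)/d(other) = self
        out._backward = _backward
        return out

    def backward(self):
        # topological order so every node's grad is complete before it is pushed further back
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# tiny check: c = a*b + a, so dc/da = b + 1 = 4 and dc/db = a = 2
a, b = Value(2.0), Value(3.0)
c = a * b + a
c.backward()
print(c.data, a.grad, b.grad)  # 8.0 4.0 2.0
```
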
## [The spelled-out intro to language modeling: building makemore - YouTube](https://youtu.be/PaCmpygFfXo?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)

Building makemore: [GitHub - karpathy/makemore: An autoregressive character-level language model for making more things](https://github.com/karpathy/makemore)

Dataset: a dataset of people's names from a government website

## Iteration 1: Character-level Language Model

Method: Bigram (predict the next character using only the previous character)

![[Pasted image 20250130124540.png]]

As seen above, it doesn't give good names. A bigram model is not good at predicting the next character.

In the bigram model, the probabilities become the parameters of the bigram language model (see the sketch below for how they are built and sampled from).

### Quality Evaluation of the model

We will be using the [[Negative maximum log likelihood estimate]]; in our problem we calculate it over the entire training set, as in the formula below.
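
Written out (my own notation), the quantity being computed in the snippet below is the average negative log likelihood over the $n$ training bigrams $(c_1^{(i)}, c_2^{(i)})$:

$$
\text{NLL} = -\frac{1}{n}\sum_{i=1}^{n} \log P\left(c_2^{(i)} \mid c_1^{(i)}\right)
$$
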

log 1 = 0, and the log of a very small number approaches $-\infty$.

We would estimate the negative log likelihood as follows:

```python
# uses words, stoi and the probability matrix P from the cells above
log_likelihood = 0.0
n = 0
for w in words[:3]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1, ix2 = stoi[ch1], stoi[ch2]
        prob = P[ix1, ix2]  # P is the matrix that holds the bigram probabilities
        n += 1
        log_likelihood += torch.log(prob)
        print(f'{ch1}{ch2}: {prob:.4f}')

print(f'{log_likelihood=}')

# Negative log likelihood has the nice property that the error (loss) should be small, i.e. zero is good.
nll = -log_likelihood
print(f'{nll=}')

# Usually people work with the average negative log likelihood.
print(f'{nll/n=}')
```

To avoid a zero probability (and hence an infinite loss) for some predictions, people apply model "smoothing": assigning a very small probability to unlikely scenarios, e.g. the add-1 sketch below.
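
For example, add-1 smoothing on the raw bigram count matrix (called `N` here, a name chosen for illustration) might look like:

```python
# "model smoothing": add a fake count of 1 to every bigram so no probability is exactly zero
P = (N + 1).float()
P /= P.sum(1, keepdim=True)   # rows sum to 1 again

# a much larger fake count (e.g. N + 1000) would drown out the real frequencies
# and push every row towards the uniform distribution
```
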
## Iteration 2: Bigram Language Model using Neural Network

We need to create a dataset for training, i.e. input and output character pairs (x and y), e.g. as in the sketch below.
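
A sketch of that dataset construction, reusing `words` and `stoi` from the bigram section and initialising the weight matrix `W` that the training loop below expects; the exact setup in the video may differ.

```python
import torch

# one training example per bigram: x = index of current char, y = index of next char
xs, ys = [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs = torch.tensor(xs)
ys = torch.tensor(ys)

# a single layer of 27 neurons: one logit per possible next character
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)
```
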

One-hot encoding needs to be done before feeding the inputs into the NN.

`log(counts) = logits`
`counts = exp(logits)`

```python
import torch
import torch.nn.functional as F

# xs, ys (bigram input/target indices) and W (27x27, requires_grad=True) are built above
num = xs.nelement()  # number of examples (228146 for the full names dataset)
xenc = F.one_hot(xs, num_classes=27).float()

for i in range(100):

    # Forward pass
    logits = xenc @ W                             # predicted log-counts
    counts = logits.exp()                         # counts
    probs = counts / counts.sum(1, keepdim=True)  # probabilities for the next character
    loss = -probs[torch.arange(num), ys].log().mean()
    print(loss.item())

    # Backward pass
    W.grad = None
    loss.backward()

    # Update parameters using the gradient calculated
    W.data += -50 * W.grad  # 50 is the learning rate; small values like 0.1 decreased the loss too slowly, so it was increased
```
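
Once trained, sampling from the network is analogous to sampling from the count table, except each row of probabilities now comes from a forward pass. A sketch, reusing `W` and `itos` from above:

```python
import torch
import torch.nn.functional as F

# W (trained weights) and itos come from the cells above
g = torch.Generator().manual_seed(2147483647)
ix, out = 0, []
while True:
    # forward the current character through the one-layer network
    xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    p = counts / counts.sum(1, keepdim=True)
    # draw the next character from that distribution
    ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    if ix == 0:
        break
    out.append(itos[ix])
print(''.join(out))
```
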

### Thoughts and comparison of the above two approaches

In the first approach, we added 1 to the actual counts because we don't want the model to assign zero probability (a log likelihood of $-\infty$) to a character pair it didn't see in the training dataset. If you add a large number instead, the actual frequencies become less relevant and we approach a uniform distribution. This is called smoothing.

Similarly, the gradient-based approach has its own way of "smoothing". When you keep all values of `W` at zero, `exp(W)` gives all ones and the softmax provides equal probabilities for all outputs. You incentivise this in the loss function by adding a second component, as below:

```python
loss = -probs[torch.arange(num), ys].log().mean() + (0.1 * (W**2).mean())
```

The second component pushes `W` towards zero; 0.1 is the regularization strength, which determines how much weight we give to this regularization component. It plays the same role as the number of "fake" counts you add in the first approach.

We took two approaches:

i) Frequency-based model
ii) NN-based model (optimized using the negative log likelihood)

We ended up with the same model. In the NN-based approach, `W` represents the log counts (the same information as in the first approach); we can exponentiate `W` to get back counts, as sketched below.
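
A sketch of how one might check that claim numerically; `N` is the raw count matrix from the first approach and `W` the trained weights from above. How closely the two tables match depends on how long you train and on whether the regularization strength corresponds to the fake-count smoothing.

```python
# counts implied by the network vs. smoothed counts from the frequency table
W_counts = W.detach().exp()                      # exp of the learned log-counts
P_nn = W_counts / W_counts.sum(1, keepdim=True)  # row-normalised, i.e. softmax per row

P_table = (N + 1).float()
P_table /= P_table.sum(1, keepdim=True)

# how far apart the two probability tables are; with enough training the gap shrinks
print((P_nn - P_table).abs().max())
```
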

-Transformer Neural Net is used for LLMs
