transformers.jl — forked from blegat/LINMA2472
2194 lines (1828 loc) · 74.5 KB
### A Pluto.jl notebook ###
# v0.20.19
using Markdown
using InteractiveUtils
# ╔═╡ 32621224-a782-4bf6-9570-562cf2bb7360
using PlutoUI, DataFrames, PrettyTables, LinearAlgebra, Luxor, LaTeXStrings, MathTeXEngine, PlutoUI.ExperimentalLayout, HypertextLiteral, PlutoTeachingTools
# ╔═╡ 6f72e8a5-819d-474c-a725-7f7318d964d7
include("utils.jl")
# ╔═╡ 5058e4eb-c53d-4468-b7ff-4f04ded96418
@htl("""
<p align=center style=\"font-size: 40px;\">Transformers in Large Language Models (LLMs)</p><p align=right><i>Benoît Legat</i></p>
$(PlutoTeachingTools.ChooseDisplayMode())
$(PlutoUI.TableOfContents(depth=1))
""")
# ╔═╡ 95ec4140-9147-11ef-2af4-5528bad0e6f5
md"# Large Language Models (LLMs)"
# ╔═╡ c09ec483-9fcf-48e7-b3c0-2508289e3cf3
md"## Autoregressive Models"
# ╔═╡ beccf4e8-1b01-4cb2-b23c-bc5db604f21c
md"""
Given a sequence of ``n_\text{ctx}`` past vectors ``x_{-1}, \ldots, x_{-n_\text{ctx}} \in \mathbb{R}^{n}``, "predict" the next ones. Key idea : *receding horizon*:
```math
\begin{align}
& p(x_0, x_1 | x_{-1}, \ldots, x_{-n_\text{ctx}})\\
& = p(x_0 | x_{-1}, \ldots, x_{-n_\text{ctx}})p(x_1 | x_0, x_{-1}, \ldots, x_{-n_\text{ctx}+1}, \textcolor{red}{x_{-n_\text{ctx}}})\\
& \approx p(x_0 | x_{-1}, \ldots, x_{-n_\text{ctx}}) p(x_{1} | x_0, x_{-1}, \ldots, x_{-n_\text{ctx}+1})
\end{align}
```
* **Model** : Probability of next vector ``\hat{p}(x_0 | X)`` where ``X`` concatenates ``x_{-1}, \ldots, x_{-n_\text{ctx}}``.
* **Loss** : Cross-entropy : ``\mathcal{L}_{\hat{p}}(X) \triangleq H(p, \hat{p}) = -\textbf{E}_p[\log(\hat{p})] = -\sum_{x_0} p(x_0 | X) \log(\hat{p}(x_0 | X))``
* Particular case for ``p(x_0 | X) = \delta_y`` : ``\mathcal{L}_{\hat{p}}(X) = -\log(\hat{p}(y | X))``
#### What about Language Models ?
Given "past text", predict the "following text". How to turn text into vectors of ``\mathbb{R}^n`` ?
"""
# ╔═╡ ccf2dc71-b883-497a-bc58-29ffaf9ea4ad
md"## Text to vectors : step 1 → tokenization"
# ╔═╡ 1bbf2152-4fdf-4ed2-9bdf-95d699824d11
md"""
#### Why not encode each letter ?
* **Idea** : Turn each letter into its one-hot encoding in ``\mathbb{R}^{26}``.
* **Issue** : The "past text" only has ``n_\text{ctx}`` characters so ``n_\text{ctx}`` must be **large** but transformers have a complexity **quadratic** in ``n_\text{ctx}``!
* **Practical details** : Text is encoded with [UTF-8](https://en.wikipedia.org/wiki/UTF-8) so each character is encoded into 1 to 4 bytes. We encode each byte to a vector in ``\mathbb{R}^{256}`` but care must be taken not to generate invalid UTF-8.
#### Why not encode each word ?
* **Idea** : Turn each word into its one-hot encoding in ``\mathbb{R}^n``. The value of ``n`` is the number of words. Depending on the language ([source](https://en.wikipedia.org/wiki/List_of_dictionaries_by_number_of_words)):
| Language | French | English | Dutch | German |
|----------|---------|---------|---------|---------|
| ``n`` | 408,078 | 350,000 | 350,000 | 200,000 |
* **Issue** : The value of ``n`` is **too large**. We cannot trust the words of languages to be a tokenization that optimally compresses text for our dataset.
"""
# ╔═╡ 57c2c944-0d91-489d-8ad7-f5520e71ef3e
md"## Byte Pair Encoding"
# ╔═╡ c3db7eb2-356a-428f-9777-6369662d8b06
md"""
Note that the new tokens can also be part of the most frequent pair!
"""
# ╔═╡ 2e8b1a77-1f04-4035-8d82-4061d81ecb7a
md"## Increasing length of \"past text\""
# ╔═╡ ed5b5702-4cca-4116-a70f-4a562178f490
md"""
> **Challenging tradeoff**: Encode text to **increase** length of "past text" while keeping ``n_\text{ctx}`` and ``n`` **small** enough.
Length of "past text" increases with vocabulary size ``n_\text{voc}`` and context window ``n_\text{ctx}``.
"""
# ╔═╡ e2eca085-9f99-4e3a-9db4-e7f692aedd34
md"## Text to vectors : step 2 → embedding"
# ╔═╡ 9e898325-e9e2-45bd-af74-3dd86f00f7b5
md"""
Consider one-hot encoding with vocabulary size ``n_\text{voc}`` and a *bigram model*
```math
\hat{p}(x_0 | x_{-1}) = \text{softmax}(W_d \tanh(\cdots\tanh(W_1 x_{-1})\cdots))
```
The matrix ``W_d`` has ``n_\text{voc}`` rows and ``W_1`` has ``n_\text{voc}`` columns → issue if ``n_\text{voc}`` is large
**Embedding** : Use vectors ``c_1, \ldots, c_{n_\text{voc}} \in \mathbb{R}^{d_\text{emb}}`` with *embedding size* (aka *hidden size*) ``d_\text{emb} \ll n_\text{voc}``.
Equivalently, we still use one-hot encoding but we add an encoder
``C \in \mathbb{R}^{d_\text{emb} \times n_\text{voc}}`` and decoder ``D \in \mathbb{R}^{n_\text{voc} \times d_\text{emb}}``
```math
\hat{p}(x_0 | x_{-1}) = \text{softmax}(D W_d \tanh(\cdots\tanh(W_1 C x_{-1})\cdots))
```
"""
# ╔═╡ bcf7667f-f99b-4d10-af84-5d3879f1db5d
qa(
html"What difference do you expect with respect to the previous model ?",
md"""
The products ``W_1C`` and ``DW_d`` have the same dimensions as the matrices ``W_1`` and ``W_d`` of the previous model. So the expressive power of the model was not improved while we increased the number of parameters and potentially made the loss function "even more nonconvex".
If the hidden dimension (i.e., the number of rows of ``C`` / columns of ``W_1``, or the number of rows of ``W_d`` / columns of ``D``) is much smaller than ``n_\text{voc}``, then it's faster to compute ``W_1(Cx)``. Moreover, we are forcing the matrix ``W_1C`` to have a low rank compared to the model without ``C``. This means less expressiveness but it might also prevent overfitting, so the case isn't so clear.
The case becomes clearer when the input embedding ``C`` is shared between more than one character, i.e., ``n_\text{ctx} > 1``.
The same goes for the output embedding: ``D`` is useful when it is not preceded by a linear layer with which it could simply be merged.
""")
# ╔═╡ e86a57ae-3945-4cbf-b2e4-a96f4b5295e0
qa(md"Why don't we use ``D = C^{-1}`` ?",
md"""
The ``C`` matrix is not invertible as it is not square. Given a vector ``u``, the vector ``v`` could be computed as the minimum norm solution of ``u = Cv``; this is what is done by `v = C \ u`. Computing `C \ u` would be more computational work than `C' * u`, but that's not the only reason ``C^\top`` is used.
If ``v`` is chosen so that ``u = Cv``, it means that ``v`` is a linear combination allowing to reconstruct ``u`` using the columns of ``C``.
From that perspective, ``v_i`` could be nonzero even though the direction of the vector ``u`` is far from the direction of the column ``C_{:i}``.
That does not correspond to what we want to do here.
Here, we want a probability vector ``p`` with high probability ``p_i`` when the direction of the column ``C_{:i}`` is close to the direction of ``u``.
In the vector ``v = C^\top u``, ``v_i = \langle C_{:i}, u \rangle`` is the scalar product between the ``i``th column of ``C`` and ``u``, hence it is a good measure of how close the directions are to each other.
Of course this assumes that the columns of ``C`` have the same norm, but the hope is for the model to figure out during training that the columns of ``C`` should have unit norm.
We can then simply apply softmax to turn these scalar products into probabilities ``p = \text{softmax}(v)``.
""")
# ╔═╡ 6622f9f0-cecc-476e-9d49-7d651f433b9f
md"# Pre-transformers approaches"
# ╔═╡ 6aa690e9-389f-4398-abae-b95060db4d90
md"## Shared embedding"
# ╔═╡ 6712c883-b407-47e1-a666-4de05f8f8d6e
HAlign(
md"""
```math
\begin{multline}
\hat{p}(x_0 | x_{-1}, \ldots, x_{-n_\text{ctx}}) = \\
\text{softmax}(W_2 \tanh(W_1
\begin{bmatrix}
C x_{-1}\\
\vdots\\
C x_{-n_\text{ctx}}
\end{bmatrix}
))
\end{multline}
```
""",
img("bengio2000Neural", :width => 250),
)
# ╔═╡ c4bebd0d-eacf-4db4-b5b3-4dca50ab9e1b
qa(md"What are the number of columns of ``W_1`` and number of rows of ``W_2`` now ?",
md"""
The matrix ``W_1`` has ``n_\text{ctx}d_\text{emb}`` columns. Assuming ``d_\text{emb} \ll n_\text{voc}`` and ``n_\text{ctx} \gg 1``, this is much smaller than the number ``n_\text{ctx}n_\text{voc}`` that we would have without the embedding. The number of rows of ``W_2`` is ``n_\text{voc}``, unaffected by the embedding.
""")
# ╔═╡ f8330700-e964-4e19-9c55-2b11df45789e
md"## Embedding sizes in LLMs"
# ╔═╡ 4e10271c-49f8-4f1d-869c-5fa11275d7f6
md"## Recurrent neural networks (RNN)"
# ╔═╡ d54b5390-0ec0-4ff8-ab18-51726482ca46
md"## Extensions of RNNs"
# ╔═╡ 6800afbf-8ac6-4308-b4cc-b37da57e42c1
md"# Attention is all you need"
# ╔═╡ 55435b26-7fc3-4c8b-8013-6fd4fb65a08e
md"## Numerical dictionary"
# ╔═╡ bcbb3db2-85b3-4cb0-9309-f5c032d14da5
md"
What would a numerical dictionary look like ? Consider keys ``k_i \in \mathbb{R}^{d_k}`` and values ``v_i \in \mathbb{R}^{d_v}``. Given a query ``q \in \mathbb{R}^{d_k}``,"
# ╔═╡ d558636d-c714-4033-ae73-5b92c3cdedf3
dict = Dict([1, 0] => [1, 1], [0, 1] => [-1, 1])
# ╔═╡ 70f395b2-f8c2-44d5-b0af-702659dd7fee
dict[[1, 0]]
# ╔═╡ b1a924f4-e2f0-445c-830f-94287a0e52f7
function numerical_lookup(dict, query)
    _, i = findmax([dot(query, key) for key in keys(dict)])
    return collect(values(dict))[i]
end
# ╔═╡ 95504a74-d5ef-4fb7-83a0-88914c7cbc59
numerical_lookup(dict, [0.8, 0.2])
# ╔═╡ 8d231f2c-4b0c-4c37-a746-16e98d4cafc8
md"## Attention head"
# ╔═╡ 570fa160-3adb-463e-99b8-b7dd05076908
function softmax(x)
    y = exp.(x)
    return y / sum(y)
end
# ╔═╡ 77f446ac-6030-48f2-9bea-93c427f9fcb9
function softmax_lookup(dict, query)
    α = softmax([dot(query, key) for key in keys(dict)])
    @show α
    return sum(αᵢ * value for (αᵢ, value) in zip(α, values(dict)))
end
# ╔═╡ 1faa4ab2-6c93-47dc-b631-8be52780fe7d
softmax_lookup(dict, [0.8, 0.2])
# ╔═╡ bf563783-9784-4c74-a7b1-6d7a3ed618c5
md"## Matrix form of attention"
# ╔═╡ 9ff95a9a-192b-4a12-8e2e-7acd6659c066
md"""
```math
\begin{align}
Q & = \begin{bmatrix}
q_1 & \cdots & q_{n_\text{ctx}}
\end{bmatrix} &
K & = \begin{bmatrix}
k_1 & \cdots & k_{n_\text{ctx}}
\end{bmatrix} &
K^\top Q & =
\begin{bmatrix}
\langle k_1, q_1 \rangle & \cdots & \langle k_1, q_{n_\text{ctx}} \rangle\\
\vdots & \ddots & \vdots\\
\langle k_{n_\text{ctx}}, q_1 \rangle & \cdots & \langle k_{n_\text{ctx}}, q_{n_\text{ctx}} \rangle
\end{bmatrix}
\end{align}
```
"""
# ╔═╡ 76ba4e9b-8bb0-47c4-b607-2ca711f035e6
md"## Masked Attention"
# ╔═╡ 8c27b182-0c3c-4c19-9619-df62b7dd6bf0
HAlign(
md"""
💡 **Key idea** In the model for ``\hat{p}(x_0 | x_{-1}, \ldots, x_{-n_\text{ctx}})``, incorporate sub-models
```math
\begin{align}
\bar{p}(&x_0 | x_{-1}, \ldots, x_{-n_\text{ctx}})\\
\bar{p}(&x_{-1} | x_{-2}, \ldots, x_{-n_\text{ctx}})\\
& \quad\qquad\vdots\\
\bar{p}(&x_{-n_\text{ctx}+1} | x_{-n_\text{ctx}}).
\end{align}
```
""",
md"""
The mask prevents ``\hat{p}`` from looking into the future:
```math
M
=
\begin{bmatrix}
0 & 0 & \cdots & 0\\
-\infty & 0 & \ddots & \vdots\\
\vdots & \ddots & \ddots & 0\\
-\infty & \cdots & -\infty & 0
\end{bmatrix}
```
""",
)
# ╔═╡ 0c0c1163-0aec-4089-9acc-539b3a86d0b3
md"""
```math
\text{Masked-Attention}(V, K, Q)
=
V\text{softmax}(M + K^\top Q/\sqrt{d_k})
```
"""
# ╔═╡ b7583418-f4fb-4c63-b421-b5b9af269768
md"## Multi-Head Attention"
# ╔═╡ 6fc13413-53de-4c75-9b9e-620e0b7f8a1f
qa(md"Is ``W^O`` needed if ``h = 1`` ?", md"No, if ``h = 1``, we can merge ``W^OW_1^V`` into a new ``W_1^V``.")
# ╔═╡ d05e6f0f-0081-4fb6-91e9-ac2f58beda4a
md"# Decoder-only transformer"
# ╔═╡ a3efd921-eb14-4901-9d6c-800cc812fe02
md"## Self-Attention"
# ╔═╡ 4b61363d-87c9-4755-8286-44df34e9dd6a
qa(
html"Is the order between the tokens taken into account by the model ?",
md"""
No. Since the same matrices ``W_j^V``, ``W_j^K`` and ``W_j^Q`` multiply every position, the **position** information is completely **lost**!
"""
)
# ╔═╡ 453544fc-0e3e-4e04-8c0c-192f3a038884
md"## Positional encoding"
# ╔═╡ 92e01e21-ca77-43fc-9bf8-0c5a7aaed1bb
md"## Residual connection"
# ╔═╡ f2cba2aa-c541-4692-a441-e65741750a15
md"## Layer normalization"
# ╔═╡ e383bb72-49a1-4df1-84c3-b95a2ffe00f5
md"## Feed-Forward network"
# ╔═╡ af8194a1-a358-4cf7-b446-6b377cb76687
md"The feed-forward network is applied **independently** to the output of each query, so each query can be processed independently through each **layer**. The next layer then allows each query to look at the results of the previous layer for **past** (because of the mask) queries."
# ╔═╡ 79e6c4a8-cc1e-40cc-bb09-e9a7a9a8e475
md"## Transformer variations"
# ╔═╡ a5b20939-9afa-48c0-aa67-cbca6bc99804
md"## Cost of LLMs"
# ╔═╡ a14e505e-2e4a-4c73-8133-7560ba58916b
md"## Key-Value (KV) cache"
# ╔═╡ 9a8eef1b-27c0-4d57-a389-53708ade9058
md"""
Let ``\hat{Y}_{i}`` be the intermediate output of layer ``i \in \{1, \ldots, N\}``.
The columns of the matrix ``\text{softmax}(C^\top \hat{Y}_i)`` (column-wise softmax) can be thought of as intermediate probabilities that we denote ``\hat{p}_i``:
```math
(\hat{p}_i(x_{-n_\text{ctx}+1} | x_{-n_\text{ctx}}), \ldots, \hat{p}_i(x_{-1} | x_{-2}, \ldots, x_{-n_\text{ctx}}), \hat{p}_i(x_0 | x_{-1}, \ldots, x_{-n_\text{ctx}}))
```
and we predict the next token using ``\hat{p}_N(x_0 | x_{-1}, \ldots, x_{-n_\text{ctx}})``.
"""
# ╔═╡ 728f5fdf-77a5-46c7-b3ee-01064ef1b7e2
qa(md"Should we discard all these intermediate ``\hat{Y}_i`` we computed or can we reuse them for the following token ?",
md"""
For the next token, the corresponding intermediate probabilities would be:
```math
(\hat{p}_i(x_{-n_\text{ctx}+2} | x_{-n_\text{ctx}+1}), \ldots, \hat{p}_i(x_{0} | x_{-1}, \ldots, x_{-n_\text{ctx}+1}), \hat{p}_i(x_1 | x_{0}, \ldots, x_{-n_\text{ctx}+1}))
```
Note that
```math
\begin{align}
\hat{p}_i(x_{-n_\text{ctx}+2} | x_{-n_\text{ctx}+1})
& \approx
\hat{p}_i(x_{-n_\text{ctx}+2} | x_{-n_\text{ctx}+1}, x_{-n_\text{ctx}})\\
\hat{p}_i(x_{0} | x_{-1}, \ldots, x_{-n_\text{ctx}+1})
& \approx
\hat{p}_i(x_0 | x_{-1}, \ldots, x_{-n_\text{ctx}})
\end{align}
```
So for any ``j < n_\text{ctx}``, the ``j``th column of the ``\hat{Y}_i'`` that should be computed for the new token is approximately equal to
the ``(j+1)``th column of ``\hat{Y}_i`` that we already computed for the previous token.
What's more, that column of ``\hat{Y}_i`` was computed with one more token of context compared to what we need to compute in ``\hat{Y}_i'``.
So even though it's not equal, reusing what we computed in ``\hat{Y}_i`` should provide even better results, assuming the trained transformer uses this auto-regressive structure in its layers.
""")
# ╔═╡ c8923675-e73e-4621-82b9-966d8b003b97
md"# Encoder-decoder transformer"
# ╔═╡ 04e9b912-6712-4290-acc4-f24bb27a1469
md"## Machine translation"
# ╔═╡ 8b78360a-21cb-4574-a84d-46ea4d0cedb1
img("sutskever2014Sequence")
# ╔═╡ 6bff7bca-ea1d-44c6-b8c3-040250f90654
md"## Cross-Attention"
# ╔═╡ f572e113-b36b-4a6b-96c7-c26f100e1ad4
md"## Utils"
# ╔═╡ f6f7376e-9984-4289-b8ff-9d47e5358791
import DocumenterCitations, CSV, Logging
# ╔═╡ 1d5b1b7c-828c-4a16-b446-cff21b015d45
biblio = load_biblio!()
# ╔═╡ 94ae440d-0644-49db-9461-f1a1ff1d7f87
cite(args...) = bibcite(biblio, args...)
# ╔═╡ f4366cf6-2be0-42b8-96c4-120be3f5c25e
md"""
### References
* Recurrent neural networks : $(cite("goodfellow2016Deep", "Chapter 10")) and Section 4.7 of [The Elements of Differentiable Programming book](https://diffprog.github.io/)
* Transformers : $(cite("vaswani2017Attentiona")) and Section 4.8 of [The Elements of Differentiable Programming book](https://diffprog.github.io/)
* [Neural Networks: Zero to Hero](https://karpathy.ai/zero-to-hero.html) by Andrej Karpathy
"""
# ╔═╡ 0583ee0c-3802-4e81-b179-a80a82493b43
md"""
Byte Pair Encoding algorithm $(cite("sennrich2016Neural")) greedily merges the most frequent pair of tokens over the dataset into a new token.
Most used implementations are `SentencePiece` $(cite("kudo2018SentencePiece")) and `tiktoken` (play with it [here](https://tiktokenizer.vercel.app/)). For instance, on [this example](https://en.wikipedia.org/wiki/Byte_pair_encoding), the pair `('a', 'a')` is the most frequent so we substitute it by a new token, say `'Z'`:
"""
# ╔═╡ 2a7e5096-1e8d-4506-96d2-86de0a7d39aa
md"""
Forcing ``D = C^\top`` appears to work well in practice $(cite("press2017Using")); this is what is used in $(cite("vaswani2017Attentiona")).
"""
# ╔═╡ 9cb90e76-3bb5-41ff-bc79-c4949400d904
md"""
With ``n_\text{ctx} > 1``, the encoder ``C`` is shared by all tokens.
See for instance the network below taken from $(cite("bengio2000Neural", "Figure 1")), the first popular application of neural nets to language modeling:
"""
# ╔═╡ 55a09acc-84da-491c-86ba-9a66f4ea52fe
HAlign(
md"""
```math
\begin{align}
h^{(t+1)} & = \tanh(Wh^{(t)} + Ux^{(t+1)} + b)\\
o^{(t)} &= Vh^{(t)} + c\\
\hat{y}^{(t)} &= \text{softmax}(o^{(t)})
\end{align}
```
Illustrated on the right $(cite("goodfellow2016Deep", "Figure 10.3")).
RNNs as language model showcased in $(cite("mikolov2010Recurrent")).
**Issue**: Training time and space complexity is proportional to ``n_\text{ctx}`` and **cannot parallelize** to speed up.
""",
img("RNN")
)
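The recurrence above can be sketched as follows (a hedged illustration: dimensions and random parameters are hypothetical, and the sequential `foldl` makes the "cannot parallelize" point concrete):

```julia
# h⁽ᵗ⁺¹⁾ = tanh(W h⁽ᵗ⁾ + U x⁽ᵗ⁺¹⁾ + b): each step depends on the previous
d_h, d_x = 4, 3
W, U, b = randn(d_h, d_h), randn(d_h, d_x), randn(d_h)

rnn_step(h, x) = tanh.(W * h + U * x + b)

X = randn(d_x, 5)  # a sequence of 5 input vectors, processed one by one
h = foldl(rnn_step, eachcol(X); init = zeros(d_h))
@assert length(h) == d_h && all(abs.(h) .≤ 1)  # tanh keeps entries in (-1, 1)
```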
# ╔═╡ 225e58ba-b78d-4a0a-be4f-ad642c879b93
md"""
It's difficult to model long-term dependencies as their gradients either vanish or explode exponentially (think of the power method) $(cite("goodfellow2016Deep", "Section 10.7")).
*Gated* extensions attempt to solve this issue $(cite("goodfellow2016Deep", "Section 10.10")):
* Long short-term memory (LSTM) $(cite("graves2014Generating"))
* Gated recurrent unit (GRU) $(cite("cho2014Properties"))
"""
# ╔═╡ a21fbc70-9137-4d0e-8c8c-cbdc5269778f
md"""
Recently, Mamba suggested a solution to the complexity issue $(cite("gu2024Mamba")). As it scales better with ``n_\text{ctx}``, it is even suggested to get rid of the tokenizer: $(cite("wang2024MambaByte")).
"""
# ╔═╡ 86101f07-67c5-4df2-911c-4013c44d6c5b
md"""
Attention head provides a differentiable numerical dictionary $(cite("bahdanau2016Neural"))
```math
\begin{align}
\alpha
& =
\text{softmax}(\langle q, k_1 \rangle, \ldots, \langle q, k_{n_\text{ctx}}\rangle)
&
\text{Attention}(q, k, v)
& =
\sum_{i=1}^{n_\text{ctx}} \alpha_i v_i
\end{align}
```
"""
# ╔═╡ 5150d8f3-6e85-43f2-801a-eae5cc3e3095
HAlign(
md"""
`softmax` is then applied to each **column**:
```math
\text{softmax}(K^\top Q/\sqrt{d_k})
```
Division by ``\sqrt{d_k}`` scales the input of softmax to
preferable regions $(cite("vaswani2017Attentiona", "Section 3.2.1")).
Illustrated on the right from $(cite("bahdanau2016Neural", "Figure 3(a)")).
```math
\text{Attention}(V, K, Q) = V\text{softmax}(K^\top Q/\sqrt{d_k})
```
""",
img("attention_matrix"),
)
# ╔═╡ d014e6aa-92f6-4ca1-be47-516565d1bb20
HAlign((
md"""
Heads focus on different aspects. Their outputs are **combined** with ``W^O \in \mathbb{R}^{d_\text{emb} \times hd_v}``:
```math
\begin{align}
\text{head}_j & = \text{Attention}(W_j^VV, W_j^KK, W_j^QQ)\\
\text{MultiHead}(V, K, Q)
& =
W^O\text{vcat}(\text{head}_1, \ldots, \text{head}_h)
\end{align}
```
See $(cite("vaswani2017Attentiona", "Figure 2")) on the right.
Similarly, in the masked case:
```math
\begin{align}
\text{head}_j & = \text{Masked-Attention}(W_j^VV, W_j^KK, W_j^QQ)\\
\text{Masked-MultiHead}&(V, K, Q)
=
W^O\text{vcat}(\text{head}_1, \ldots, \text{head}_h)
\end{align}
```
""",
img("multi-head", :width => 250)),
[70, 30],
)
# ╔═╡ 25b79953-fd7c-46c1-b760-d57c09910981
qa(md"""
How does the number of parameters of transformers compare with $(cite("bengio2000Neural")) or RNNs for large ``n_\text{ctx}`` ?
""",
md"""
* The number of parameters of the transformer does **not** depend on ``n_\text{ctx}``.
* The number of parameters of $(cite("bengio2000Neural")) depends linearly on ``n_\text{ctx}``. Assuming that the number of hidden neurons scales proportionally with ``n_\text{ctx}``, the number of parameters even scales quadratically with ``n_\text{ctx}``!
* For RNNs, if the dimension of the internal state scales proportionally with ``n_\text{ctx}``, the number of parameters is also proportional to ``n_\text{ctx}``! If the dimension of the internal state is kept too small, increasing the context won't be so helpful, due to the *encoder bottleneck*, see next slide.
""")
# ╔═╡ c1437dcc-22cb-424f-9b8e-326172f82d86
md"""
* LSTM **encoder** → **context** → LSTM **decoder** $(cite("sutskever2014Sequence")). See $(cite("sutskever2014Sequence", "Figure 1")) below.
* Issue with *encoder bottleneck*. All information has to be summarized in the **context**.
"""
# ╔═╡ 4f1d5112-dbac-4eb6-8518-0dc4193c3f8e
bib(args...) = bibrefs(biblio, args...)
# ╔═╡ 61dc1905-338f-4bfd-a158-2f6bacff769e
bib(["goodfellow2016Deep", "vaswani2017Attentiona"])
# ╔═╡ 4df0a18d-cb14-41b1-ba40-fd6bfcbb0b03
bib(["sennrich2016Neural", "kudo2018SentencePiece"])
# ╔═╡ 97463c54-7cc7-4497-a8a6-6422f5f582bd
bib(["team2024Gemini", "team2024Geminia", "team2024Gemma", "team2024Gemmaa", "sennrich2016Neural", "radford2019Language", "brown2020Language", "touvron2023Llama", "yu2023MEGABYTE"])
# ╔═╡ eb18303f-3dfb-4b87-90f2-f6dc542d7221
bib(["press2017Using", "vaswani2017Attentiona"])
# ╔═╡ 76e2f97b-1c06-40cd-b134-d5155aa5587d
bib(["bengio2000Neural"])
# ╔═╡ 75ca478c-916f-464a-9435-8208ee726d50
bib(["team2024Gemma", "team2024Gemmaa", "radford2019Language", "touvron2023Llama"])
# ╔═╡ 5b4a67a9-e33e-4dc6-b9f0-fd9a2cca6f2a
bib(["mikolov2010Recurrent", "goodfellow2016Deep"])
# ╔═╡ 8eafcfed-9771-4d99-b0c5-bd75a6dab012
bib(["cho2014Properties", "graves2014Generating", "goodfellow2016Deep", "gu2024Mamba", "wang2024MambaByte"])
# ╔═╡ e41d13ca-1dc1-45ae-9fa6-a83c4101120d
bib(["bahdanau2016Neural"])
# ╔═╡ c032b3ff-c539-4e38-81d0-39b28b3a8076
bib(["bahdanau2016Neural", "vaswani2017Attentiona"])
# ╔═╡ b56e9e56-e74a-401b-b4b5-f36bb33341d5
bib(["he2015Deep"])
# ╔═╡ 2a8433e3-9a3b-487b-abf3-09278ea42389
bib(["ioffe2015Batch", "ba2016Layer", "vaswani2017Attentiona"])
# ╔═╡ 4dd7083a-e730-4f4b-bde8-fc1a5b08ebfc
bib(["he2016Identity", "radford2019Language", "su2023RoFormer"])
# ╔═╡ 45efc71d-d5f8-474e-9b89-e72fac7110fd
bib("bengio2000Neural")
# ╔═╡ f7ca738d-5215-4e91-a2f3-a5ff10911313
bib("sutskever2014Sequence")
# ╔═╡ 85a10748-8d19-44a8-a1c5-0d13b093f1bf
function draw_transformer(decoder_only = true)
    scale(0.4, 0.4)
    Luxor.placeimage(readpng("images/transformer.png"), centered = true)
    if decoder_only
        sethue("red")
        setopacity(0.4)
        box(Point(-350, -160), Point(320, 20), :fill)
        box(Point(-350, 20), Point(0, 460), :fill)
        translate(Point(-170, -190))
        setopacity(1)
        fontsize(32)
        text("Not used for now", halign = :center)
    end
end
# ╔═╡ d1ba8da3-add8-4dbe-9ebf-9a32fa5cd5dd
HAlign(
md"""
*Pre-activation* for residual neural networks introduced in $(cite("he2016Identity")) and used in GPT-2 $(cite("radford2019Language")). See figure on the right.
*Rotary Positional Encoding* $(cite("su2023RoFormer")) replaces
``W^K(Cx_i + p_i)`` and ``W^Q(Cx_i + p_i)``
by ``R^i W^KCx_i`` and ``R^i W^QCx_i`` where ``R`` is a rotation matrix.
Advantage : ``\langle k_i, q_j \rangle`` contains ``R^{i - j}`` → **relative** difference of position.
""",
HTML(html(@draw begin
draw_transformer()
sethue("blue")
scale(2, 2)
arrow(Point(175, -36), Point(180, -43), Point(170, -48), Point(145, -53), :stroke, startarrow=false, finisharrow=true)
arrow(Point(175, -30), Point(190, -10), Point(195, 10), Point(145, 17), :stroke, startarrow=false, finisharrow=true)
arrow(Point(175, 120), Point(190, 130), Point(195, 150), Point(145, 192), :stroke, startarrow=false, finisharrow=true)
end 300 400)),
)
# ╔═╡ 8d6ec2b3-997e-4df5-a3b2-c1dffa53d0ec
qa(
md"What is the time complexity of inference with respect to ``d_\text{emb}``, ``n_\text{voc}``, ``n_\text{ctx}``, ``d_\text{ff}``, ``h`` and ``N`` ?",
HAlign(
md"""
| Input | Parameters | Time |
|-------|------------|------|
| ``CX + P \in \mathbb{R}^{d_\text{emb} \times n_\text{ctx}}`` | ``W_j^V \in \mathbb{R}^{d_v \times d_\text{emb}}`` | ``O(d_v d_\text{emb} n_\text{ctx})`` |
| ``CX + P \in \mathbb{R}^{d_\text{emb} \times n_\text{ctx}}`` | ``W_j^K, W_j^Q \in \mathbb{R}^{d_k \times d_\text{emb}}`` | ``O(d_k d_\text{emb} n_\text{ctx})`` |
| ``K, Q \in \mathbb{R}^{d_k \times n_\text{ctx}}`` | | ``O(d_k n_\text{ctx}^2)`` |
| ``V \in \mathbb{R}^{d_v \times n_\text{ctx}}, ... \in \mathbb{R}^{n_\text{ctx} \times n_\text{ctx}}`` | | ``O(d_v n_\text{ctx}^2)`` |
| ``... \in \mathbb{R}^{d_v \times n_\text{ctx}}`` | ``W^O \in \mathbb{R}^{d_\text{emb} \times d_v}`` | ``O(d_\text{emb} d_v n_\text{ctx})`` |
| ``... \in \mathbb{R}^{d_\text{emb} \times n_\text{ctx}}`` | ``W_1 \in \mathbb{R}^{d_\text{ff} \times d_\text{emb}}`` | ``O(d_\text{emb} d_\text{ff} n_\text{ctx})`` |
| ``... \in \mathbb{R}^{d_\text{ff} \times n_\text{ctx}}`` | ``W_2 \in \mathbb{R}^{d_\text{emb} \times d_\text{ff}}`` | ``O(d_\text{emb} d_\text{ff} n_\text{ctx})`` |
So for ``N`` layers (ignoring the complexity of the embedding):
```math
O(Nn_\text{ctx}(n_\text{ctx}(d_v + d_k) + d_\text{emb}(d_v+d_k+d_\text{ff})))
```
Assuming that ``d_v, d_k, d_\text{ff}`` have the same scale as ``d_\text{emb}``:
```math
O(Nn_\text{ctx}^2d_\text{emb} + Nn_\text{ctx}d_\text{emb}^2)
```
""",
HTML(html(@draw begin
draw_transformer()
translate(-10, 150)
scale(0.6)
Luxor.placeimage(readpng("images/multi-head.png"), centered = true)
end 300 400))
)
)
# ╔═╡ a873f760-bfc1-489f-a58e-75e12afa54f2
function highlight(a, b, c, d)
    sethue("green")
    setopacity(0.4)
    polysmooth(box(Point(a, b), Point(c, d), vertices=true), 10, action = :fill)
    setopacity(1)
    polysmooth(box(Point(a, b), Point(c, d), vertices=true), 10, action = :stroke)
end
# ╔═╡ b9caae1a-38aa-4d01-9cda-3d6782fb0e03
HAlign(md"""
*Self-Attention* with embedding ``C`` is:
```math
\text{Masked-MultiHead}(CX, CX, CX)
```
The embedding vectors ``CX`` then take different projections
for value, key, query and also for different heads!
```math
\text{head}_j = \text{Masked-Attention}(W_j^VCX, W_j^KCX, W_j^QCX)
```
""",
HTML(html(@draw begin
draw_transformer()
highlight(200, 250, 375, 350)
end 300 400)),
)
# ╔═╡ c5be3956-5102-4d88-bfdb-9813c0555fe1
HAlign(
md"""
We cannot sum ``Cx_i + e_i`` with the one-hot encoding ``e_i \in \mathbb{R}^{n_\text{ctx}}`` as ``Cx_i`` lives in ``\mathbb{R}^{d_\text{emb}}``.
So we instead add a positional embedding ``P`` : ``Cx_i + Pe_i = Cx_i + p_i``.
With Self-Attention:
```math
\text{Self-MultiHead}(CX + P, CX + P, CX + P)
```
""",
HTML(html(@draw begin
draw_transformer()
highlight(310, 410, 530, 510)
end 300 400))
)
# ╔═╡ 18c26901-85eb-45ac-89bf-b03bd255007a
HAlign(
md"""
Residual connection $(cite("he2015Deep"))
$(img("resnet"))
""",
HTML(html(@draw begin
draw_transformer()
highlight(210, -85, 290, -45)
highlight(360, -75, 440, 50)
highlight(210, 210, 290, 250)
highlight(360, 220, 440, 420)
end 300 400))
)
# ╔═╡ 5f05e717-a51a-4a99-bb11-cc493217f93f
HAlign(
md"""
Norm of gradient increases exponentially with depth.
This is an issue for deep neural nets.
Consider output
```math
\begin{bmatrix}
y_{1,1} & \ldots & y_{1,d_\text{emb}}\\
\vdots & \ddots & \vdots\\
y_{d_\text{batch},1} & \ldots & y_{d_\text{batch},d_\text{emb}}
\end{bmatrix}
```
Normalization : ``y_{i,j} \mapsto g(y_{i,j} - \mu_{i,j})/\sigma_{i,j}`` for gain ``g``, mean ``\mu`` and standard deviation ``\sigma``.
* Batch normalization : ``\sigma_{i,j} = \sigma_{j}`` $(cite("ioffe2015Batch"))
* Layer normalization : ``\sigma_{i,j} = \sigma_{i}`` $(cite("ba2016Layer"))
Batch norm depends on the batch hence [is tricky to implement](https://www.youtube.com/watch?v=P6sfmUTpUmc). Layer normalization is used in $(cite("vaswani2017Attentiona")).
""",
HTML(html(@draw begin
draw_transformer()
highlight(290, -85, 360, -45)
highlight(290, 210, 360, 250)
end 300 400))
)
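The layer-normalization formula above can be sketched for a single row of the output matrix (a hedged illustration with gain ``g = 1`` and a small ``\epsilon`` for numerical stability — the ``\epsilon`` is an assumption, standard in practice but not shown on the slide):

```julia
using Statistics

# Layer norm: each row (one example) is normalized across its d_emb
# features using its own mean and standard deviation.
layernorm_row(y; g = 1.0, ϵ = 1e-5) =
    g .* (y .- mean(y)) ./ (std(y, corrected = false) + ϵ)

y = [1.0, 2.0, 3.0, 4.0]
z = layernorm_row(y)   # mean ≈ 0, standard deviation ≈ 1
```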
# ╔═╡ 3d8add97-59e1-444a-838b-85c2a2ac60b3
HAlign(
md"""
*Cross-Attention* between
* values and keys ``E(CX + P)`` where ``E`` is the encoder, and ``X`` is the matrix of input tokens
* query ``Q`` depending on past output ``Y`` and number of layers already applied
```math
\text{MultiHead}(E(CX + P), E(CX + P), Q)
```
The embedding vectors ``CX`` then take different projections
for value, key, query and also for different heads!
```math
\begin{multline}
\text{head}_j = \text{Attention}(W_j^VV, W_j^KK, W_j^QQ)\\
\text{where } V = K = E(CX + P)
\end{multline}
```
""",
HTML(html(@draw begin
draw_transformer(false)
highlight(31, -97, 205, 5)
#sethue("red")
#setopacity(1)
fontsize(32)
text("CX + P", Point(-60, 252), halign = :center)
text("CY + P", Point(65, 252), halign = :center)
text(L"E(CX + P)", Point(-70, -160), halign = :center)
end 300 400)),
)
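# ╔═╡ 7d3e4f5a-6b7c-4d80-8e9f-3a4b5c6d7e8f
# Minimal sketch (not from the lecture code) of one cross-attention head, with
# tokens stored as columns and the Attention(V, K, Q) argument order used on
# the slide. All names here (`enc_out`, `Q_toy`, `Wv`, `Wk`, `Wq`) are made-up
# toys standing in for E(CX + P), the query, and the projections W_j^V, W_j^K, W_j^Q.
begin
col_softmax(S) = exp.(S) ./ sum(exp.(S), dims = 1)
attn(V, K, Q) = V * col_softmax((K' * Q) ./ sqrt(size(K, 1)))
enc_out = randn(4, 5)                      # E(CX + P): 5 encoder tokens of dimension 4
Q_toy = randn(4, 3)                        # 3 decoder queries
Wv, Wk, Wq = randn(4, 4), randn(4, 4), randn(4, 4)
head = attn(Wv * enc_out, Wk * enc_out, Wq * Q_toy)
size(head)                                 # one output column per query
end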
# ╔═╡ d050a7ee-3aa7-4539-a236-5b6446599ded
struct BPE
text::String
pairs::Dict{Tuple{Char,Char},Char}
end
# ╔═╡ 29474a70-32eb-4281-8626-87819afa7267
function add_pair(bpe::BPE, subs)
pairs = copy(bpe.pairs)
push!(pairs, subs)
return BPE(replace(bpe.text, prod(subs.first) => subs.second), pairs)
end
# ╔═╡ 89305cae-098f-4644-9109-d00f1e3bc04c
function pair_stats(text::String)
stats = Dict{Tuple{Char,Char},Int}()
for i in eachindex(text)
j = nextind(text, i)
if j > lastindex(text)
break
end
a = text[i]
b = text[j]
stats[(a, b)] = get(stats, (a, b), 0) + 1
end
return stats
end
# ╔═╡ 728c16b7-50cd-43fe-a0d7-61d37952a6b3
pair_stats("aaabdaaabac")
# ╔═╡ c7f318b9-30e6-4b79-b7da-52f70904d246
function substitute(text::String, pair::Tuple{Char,Char})
# Use the largest unused character smaller than every character of `text`
new_char = min('Z' + 1, minimum(text)) - 1
return replace(text, prod(pair) => new_char)
end
# ╔═╡ f1afaf8c-d9ad-446a-9826-9c4cda19993f
new_token(text::String) = new_token(BPE(text, Dict()))
# ╔═╡ 01372b00-ecb2-42bd-b408-13234717d969
function new_token(bpe::BPE)
stats = pair_stats(bpe.text)
pair = findmax(stats)[2]
new_char = min('Z' + 1, minimum(bpe.text)) - 1
return add_pair(bpe, pair => new_char)
end
# ╔═╡ f7ca3ff7-b5cf-452b-b955-7219e7397324
iter_1 = new_token("aaabdaaabac")
# ╔═╡ 736920df-e4bb-4535-b982-e397aa0a782d
iter_2 = new_token(iter_1)
# ╔═╡ c3a9a0ce-3450-4b17-8696-2ab8534b29f2
iter_3 = new_token(iter_2)
# ╔═╡ 579a203b-e6f7-4190-b874-18b00a5c3f77
function load_llms()
llms = DataFrame(CSV.File("llms.csv"))
rename!(llms, "Embedding dimension" => "``d_\\text{emb}``")
rename!(llms, "Vocabulary size" => "``n_\\text{voc}``")
rename!(llms, "Context window" => "``n_\\text{ctx}``")
rename!(llms, "Feed-Forward hidden dimension" => "``d_\\text{ff}``")
return llms
end
# ╔═╡ f39305ea-f7f5-440e-ac55-c83e27f6e7fc
llms = load_llms()
# ╔═╡ 93200f46-7c8f-4362-a445-43c57b50a2d2
names(llms)
# ╔═╡ 7e27c349-ee76-46bd-b1c2-a9ce54974e10
function table(df; mandatory_columns = String[], included_columns = nothing)
for col in mandatory_columns
df = df[(!ismissing).(df[!, col]), :]
end
if !isnothing(included_columns)
df = unique(df[!, included_columns])
end
Markdown.parse(pretty_table(
String,
sort(df),
backend = :markdown,
column_labels = names(df),
allow_markdown_in_cells = true,
formatters = [(v, _, _) -> ismissing(v) ? "" : v],
))
end
# ╔═╡ 771d39a5-74dc-494e-929e-1164bb08b983
table(llms, mandatory_columns = ["``n_\\text{ctx}``"], included_columns = [
"Name",
"Ref",
"``n_\\text{voc}``",
"``n_\\text{ctx}``",
"Tokenizer",
])
# ╔═╡ 91abc03b-fef7-4f93-96fc-13f1cf654f0d
HAlign(
md"""
See the table below for the size of embeddings of large language models:
$(table(llms, mandatory_columns = ["``d_\\text{emb}``"], included_columns = [
"Name",
"Num params",
"Ref",
"``n_\\text{voc}``",
"``d_\\text{emb}``",
]))
""",
HTML(html(@draw begin
draw_transformer()
highlight(200, -160, 375, -110)
highlight(200, 490, 375, 565)
end 300 400));
)
# ╔═╡ f95a6de6-5e02-4237-88ba-ec44ef3d38c3
HAlign(
md"""
Different weights
``W_1 \in \mathbb{R}^{d_\text{ff} \times d_\text{emb}}``, ``W_2 \in \mathbb{R}^{d_\text{emb} \times d_\text{ff}}`` for each layer:
```math
x \mapsto W_2\max(0, W_1x + b_1) + b_2
```
Expansion factor ``d_\text{ff} / d_\text{emb}`` is typically 4, as suggested in $(cite("vaswani2017Attentiona")) (but not for Gemma)
$(table(llms, mandatory_columns = ["``d_\\text{ff}``"], included_columns = [
"Name",
"Ref",
"``d_\\text{emb}``",
"``d_\\text{ff}``",
]))
""",
HTML(html(@draw begin
draw_transformer()
highlight(200, -45, 375, 25)
end 300 400)),
)
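# ╔═╡ 6c4d5e6f-7a8b-4c90-9daf-4b5c6d7e8f90
# Sketch (not from the lecture code) of the position-wise feed-forward layer
# above with the 4x expansion; the sizes and random weights are toys, and
# `ffn` is a made-up name.
begin
d_emb, d_ff = 2, 4 * 2
W1, b1 = randn(d_ff, d_emb), randn(d_ff)
W2, b2 = randn(d_emb, d_ff), randn(d_emb)
ffn(x) = W2 * max.(0, W1 * x .+ b1) + b2   # x ↦ W₂ max(0, W₁x + b₁) + b₂
ffn(randn(d_emb))                          # maps back to a d_emb-dimensional vector
end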
# ╔═╡ 00000000-0000-0000-0000-000000000001
PLUTO_PROJECT_TOML_CONTENTS = """
[deps]
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
DocumenterCitations = "daee34ce-89f3-4625-b898-19384cb65244"
HypertextLiteral = "ac1192a8-f4b3-4bfe-ba22-af5b92cd3ab2"
LaTeXStrings = "b964fa9f-0449-5b57-a5c2-d3ea65f4040f"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
Luxor = "ae8d54c2-7ccd-5906-9d76-62fc9837b5bc"
MathTeXEngine = "0a4f8689-d25c-4efe-a92b-7142dfc1aa53"
PlutoTeachingTools = "661c6b06-c737-4d37-b85c-46df65de6f69"
PlutoUI = "7f904dfe-b85e-4ff6-b463-dae2292396a8"
PrettyTables = "08abe8d2-0d0c-5749-adfa-8a2ac140af0d"
[compat]
CSV = "~0.10.15"
DataFrames = "~1.8.1"
DocumenterCitations = "~1.4.1"
HypertextLiteral = "~0.9.5"
LaTeXStrings = "~1.4.0"
Luxor = "~4.3.0"
MathTeXEngine = "~0.6.6"
PlutoTeachingTools = "~0.3.1"
PlutoUI = "~0.7.72"
PrettyTables = "~3.1.0"
"""
# ╔═╡ 00000000-0000-0000-0000-000000000002
PLUTO_MANIFEST_TOML_CONTENTS = """
# This file is machine-generated - editing it directly is not advised
julia_version = "1.12.1"
manifest_format = "2.0"
project_hash = "8cf5017c347c067a952c4fa0584b3d429744517c"
[[deps.ANSIColoredPrinters]]
git-tree-sha1 = "574baf8110975760d391c710b6341da1afa48d8c"
uuid = "a4c015fc-c6ff-483c-b24f-f7ea428134e9"
version = "0.0.1"
[[deps.AbstractPlutoDingetjes]]
deps = ["Pkg"]
git-tree-sha1 = "6e1d2a35f2f90a4bc7c2ed98079b2ba09c35b83a"
uuid = "6e696c72-6542-2067-7265-42206c756150"
version = "1.3.2"
[[deps.AbstractTrees]]
git-tree-sha1 = "2d9c9a55f9c93e8887ad391fbae72f8ef55e1177"
uuid = "1520ce14-60c1-5f80-bbc7-55ef81b5835c"
version = "0.4.5"
[[deps.ArgTools]]
uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f"
version = "1.1.2"
[[deps.Artifacts]]
uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"
version = "1.11.0"
[[deps.Automa]]
deps = ["PrecompileTools", "SIMD", "TranscodingStreams"]
git-tree-sha1 = "a8f503e8e1a5f583fbef15a8440c8c7e32185df2"
uuid = "67c07d97-cdcb-5c2c-af73-a7f9c32a568b"
version = "1.1.0"
[[deps.Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"
version = "1.11.0"
[[deps.BaseDirs]]
git-tree-sha1 = "bca794632b8a9bbe159d56bf9e31c422671b35e0"
uuid = "18cc8868-cbac-4acf-b575-c8ff214dc66f"
version = "1.3.2"
[[deps.BibInternal]]
deps = ["TestItems"]
git-tree-sha1 = "b3107800faf461eca3281f89f8d768f4b3e99969"
uuid = "2027ae74-3657-4b95-ae00-e2f7d55c3e64"
version = "0.3.7"
[[deps.BibParser]]
deps = ["BibInternal", "DataStructures", "Dates", "JSONSchema", "TestItems", "YAML"]
git-tree-sha1 = "33478bed83bd124ea8ecd9161b3918fb4c70e529"
uuid = "13533e5b-e1c2-4e57-8cef-cac5e52f6474"
version = "0.2.2"
[[deps.Bibliography]]
deps = ["BibInternal", "BibParser", "DataStructures", "Dates", "FileIO", "TestItems", "YAML"]
git-tree-sha1 = "0f25be9708ae20d7b94d3bf9d0a91defcca4c884"
uuid = "f1be7e48-bf82-45af-a471-ae754a193061"
version = "0.3.0"
[[deps.Bijections]]
git-tree-sha1 = "a2d308fcd4c2fb90e943cf9cd2fbfa9c32b69733"
uuid = "e2ed5e7c-b2de-5872-ae92-c73ca462fb04"
version = "0.2.2"
[[deps.Bzip2_jll]]
deps = ["Artifacts", "JLLWrappers", "Libdl"]