<html lang="en">
<head>
<meta charset="UTF-8">
<title>LLM Tokenization</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 0 auto;
max-width: 800px;
padding: 20px;
}
h1, h2, h3 {
margin-bottom: 10px;
}
img {
display: block;
margin: 20px auto;
max-width: 100%;
}
pre {
background-color: #f4f4f4;
padding: 10px;
}
</style>
</head>
<body>
<h1>LLM Tokenization</h1>
<p>Hi everyone, today we are going to look at tokenization in Large Language Models (LLMs). Sadly, tokenization is a relatively complex and gnarly component of state-of-the-art LLMs, but it is necessary to understand it in some detail, because many of the strange behaviors of LLMs that may be attributed to the neural network, or that otherwise appear mysterious, actually trace back to tokenization.</p>
<h2>Previously: character-level tokenization</h2>
<p>So what is tokenization? Well it turns out that in our previous video, Let's build GPT from scratch, we already covered tokenization but it was only a very simple, naive, character-level version of it. When you go to the Google colab for that video, you'll see that we started with our training data (Shakespeare), which is just a large string in Python:</p>
<img src="frames/00_01_00.jpg" alt="Code snippet showing character-level tokenization">
<p>But how do we feed strings into a language model? Well, we saw that we did this by first constructing a vocabulary of all the possible characters we found in the entire training set:</p>
<pre>
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(vocab_size)
print(''.join(chars))
</pre>
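<p>From that vocabulary, the notebook then builds simple lookup tables to translate between characters and integers. Here is a minimal, self-contained sketch of that character-level encode/decode; the short sample text stands in for the full Shakespeare dataset:</p>

```python
# Character-level tokenization: every unique character becomes one token.
text = "First Citizen: Before we proceed any further, hear me speak."
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer -> string

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

ids = encode("hear me")
print(decode(ids))  # hear me
```

Encoding followed by decoding is lossless here, because every character in the input appears in the vocabulary.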
<h2>Byte Pair Encoding (BPE)</h2>
<p>In practice, state-of-the-art language models use a lot more complicated schemes for constructing token vocabularies. They deal on the chunk level rather than the character level, and these character chunks are constructed using algorithms such as the byte pair encoding algorithm, which we will cover in detail in this video.</p>
<p>The GPT-2 paper introduced byte-level encoding as a mechanism for tokenization in the context of large language models:</p>
<img src="frames/00_03_40.jpg" alt="Excerpt from the GPT-2 paper discussing byte pair encoding">
<p>Tokens are the fundamental unit of large language models. Everything is in units of tokens and tokenization is the process for translating strings or text into sequences of tokens and vice versa.</p>
<h2>Tokenization issues</h2>
<p>Tokenization is at the heart of much weirdness in LLMs. A lot of issues that may look like they are with the neural network architecture or the large language model itself are actually issues with the tokenization. For example:</p>
<ul>
<li>LLMs can't spell words or do simple string processing tasks like reversing a string.</li>
<li>LLMs are worse at non-English languages.</li>
<li>LLMs are bad at simple arithmetic.</li>
<li>GPT-2 had quite a bit more trouble coding in Python.</li>
<li>Weird warnings about trailing whitespace.</li>
<li>"Solid Gold Magikarp" would make LLMs go off on unrelated tangents.</li>
<li>YAML is preferred over JSON with LLMs.</li>
</ul>
</body>
</html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Tokenization by Example in a Web UI (Tiktokenizer)</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 20px;
}
h1, h2, h3 {
margin-top: 30px;
}
img {
max-width: 100%;
display: block;
margin: 20px auto;
}
figcaption {
text-align: center;
font-style: italic;
margin-bottom: 20px;
}
pre {
background-color: #f4f4f4;
padding: 10px;
overflow-x: auto;
}
.callout {
background-color: #f8f8f8;
border-left: 4px solid #2980b9;
padding: 15px;
margin-bottom: 20px;
}
</style>
</head>
<body>
<h1>Tokenization by Example in a Web UI (Tiktokenizer)</h1>
<p>Tokenization is at the heart of much of the weirdness of Large Language Models (LLMs). We shouldn't brush it off: it is necessary to understand in some detail, because many of the strange behaviors of LLMs that may be attributed to the neural network, or that otherwise appear mysterious, actually trace back to tokenization.</p>
<h2>Exploring the Tiktokenizer Web App</h2>
<p>Let's explore the <a href="https://tiktokenizer.vercel.app">Tiktokenizer web app</a>. What's great about it is that tokenization runs live in your browser in JavaScript. You can type something like "hello world", and it shows the whole string broken into tokens:</p>
<figure>
<img src="frames/00_06_50.jpg" alt="Tiktokenizer web app screenshot">
<figcaption>Tiktokenizer web app showing how "hello world" is tokenized</figcaption>
</figure>
<p>On the left is the input string, and on the right we see it tokenized using the GPT-2 tokenizer into 2 tokens: "hello" (token 31373) and " world" (token 995). Note that the leading space is part of the second token.</p>
<div class="callout">
<p>Be careful because spaces, newlines, and other whitespace characters are included as tokens, but you can hide whitespace tokens for clarity.</p>
</div>
<h2>Tokenization of Numbers, Code, and Non-English Text</h2>
<p>Numbers are handled in an arbitrary way - sometimes multiple digits are a single token, sometimes individual digits are separate tokens. The LLM has to learn from patterns in the training data that these represent the same concept.</p>
<p>Code, especially Python, is tokenized inefficiently by GPT-2. Each space of indentation becomes a separate token, wasting sequence length. GPT-4 handles this much better by grouping indentation spaces.</p>
<figure>
<img src="frames/00_12_00.jpg" alt="Python code tokenized by GPT-2">
<figcaption>GPT-2 tokenizes each space in Python indentation separately, wasting sequence length</figcaption>
</figure>
<p>Non-English languages like Korean tend to get broken into more tokens compared to the equivalent English text. This stretches out the sequence length, causing the attention mechanism to run out of context.</p>
<h2>GPT-4's Improved Tokenizer</h2>
<p>The GPT-4 tokenizer (<code>cl100k_base</code>) has roughly double the number of tokens compared to GPT-2 (about 100K vs. 50K). This allows the same text to be represented in fewer tokens, effectively doubling the context length the attention mechanism can utilize.</p>
<pre>
GPT-2 token count: 300
GPT-4 token count: 185
</pre>
<p>GPT-4 also deliberately groups more whitespace into single tokens, especially for Python indentation. This densifies the code representation, allowing the model to attend to more useful context when generating the next token.</p>
<p>In summary, many of the quirks and limitations of LLMs can be traced back to details of the tokenizer used. The improvements from GPT-2 to GPT-4 are not just from the model architecture, but also the more efficient tokenizer enabling the model to effectively utilize more context.</p>
</body>
</html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Tokenization in Large Language Models</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 20px 100px;
}
h1, h2, h3 { margin-top: 30px; }
img {
display: block;
margin: 20px auto;
max-width: 80%;
}
pre {
background-color: #f4f4f4;
padding: 10px;
overflow-x: auto;
}
</style>
</head>
<body>
<h1>Tokenization in Large Language Models</h1>
<p>When training large language models (LLMs), we need to take strings and feed them into the models. For that, we need to tokenize the strings into integers from a fixed vocabulary. These integers are then used to look up vectors in an embedding table, which are fed into the Transformer as input. This process gets tricky because we don't just want to support the simple English alphabet, but also different languages and special characters like emojis.</p>
<h2>Strings in Python and Unicode Code Points</h2>
<p>In Python, strings are immutable sequences of Unicode code points. Unicode code points are defined by the Unicode Consortium as part of the Unicode standard, which currently defines roughly 150,000 characters across 161 scripts. The standard is very much alive, with the latest version 15.1 released in September 2023.</p>
<p>We can access the Unicode code point for a single character using the <code>ord()</code> function in Python. For example:</p>
<pre>
ord("h") # 104
ord("😊") # 128522
ord("안") # 50504
</pre>
<p>However, we can't simply use these raw code point integers for tokenization, as the vocabulary would be too large (150,000+) and unstable due to the evolving Unicode standard.</p>
<img src="frames/00_17_45.jpg" alt="Python code showing Unicode code points for a Korean string">
<figcaption>Using ord() to get Unicode code points for characters in a string.</figcaption>
<h2>Unicode Byte Encodings</h2>
<p>To find a better solution for tokenization, we turn to Unicode byte encodings like ASCII, UTF-8, UTF-16, and UTF-32. These encodings define how to translate the abstract Unicode code points into actual bytes that can be stored and transmitted.</p>
</body>
</html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Unicode Byte Encodings, ASCII, UTF-8, UTF-16, UTF-32</title>
</head>
<body>
<article>
<h1>Unicode Byte Encodings, ASCII, UTF-8, UTF-16, UTF-32</h1>
<p>The Unicode Consortium defines three types of encodings: UTF-8, UTF-16 and UTF-32. These encodings are the way by which we can take Unicode text and translate it into binary data or byte streams.</p>
<p>UTF-8 is by far the most common encoding. It takes every single Unicode code point and translates it to a byte stream between one to four bytes long, depending on the code point. The first 128 code points (ASCII) only need one byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all Latin-script alphabets. Three bytes are needed for the remaining 61,440 code points of the Basic Multilingual Plane (BMP). Four bytes cover the other planes of Unicode, which include less common CJK characters, various historic scripts, and mathematical symbols.</p>
<img src="frames/00_19_50.jpg" alt="Wikipedia article on UTF-8 encoding">
<figcaption>The Wikipedia article on UTF-8 encoding shows how different ranges of Unicode code points map to byte sequences of varying length.</figcaption>
<p>UTF-16 and UTF-32, while having some advantages like fixed-width encoding, are significantly more wasteful in terms of space, especially for simpler ASCII characters. Most standards explicitly favor UTF-8.</p>
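<p>We can see the size difference directly in Python. A quick sketch (the byte counts for UTF-16 and UTF-32 include the byte-order mark Python prepends):</p>

```python
s = "hello"
for enc in ["utf-8", "utf-16", "utf-32"]:
    print(enc, len(s.encode(enc)))
# utf-8 needs 5 bytes; utf-16 and utf-32 pad every ASCII character
# to 2 and 4 bytes respectively (plus a 2- or 4-byte byte-order mark).
```

For mostly-ASCII text, UTF-8 is therefore several times more compact, which is one reason most standards favor it.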
<p>In Python, we can use the <code>encode()</code> method on strings to get their UTF-8 byte representation:</p>
<pre><code>
In [64]: list("안녕하세요 👋 (hello in Korean)".encode("utf-8"))
Out[64]: [236, 149, 136, 235, 133, 149, 237, 149, 152, 236, 132, 184,
          236, 154, 148, 240, 159, 145, 139, 32, 240, 159, 140, 136,
          108, 101, 108, 108, 111, 32, 105, 110, 32, 75, 111, 114,
          101, 97, 110, 41]
</code></pre>
<p>However, directly using the raw UTF-8 bytes would be very inefficient for language models. It would lead to extremely long sequences with a small vocabulary size of only 256 possible byte values. This prevents attending to sufficiently long contexts.</p>
<p>The solution is to use a byte pair encoding (BPE) algorithm to compress these byte sequences to a variable amount. This allows efficiently representing the text with a larger but tunable vocabulary size.</p>
</article>
</body>
</html>
<html>
<head>
<title>Daydreaming: Deleting Tokenization in Language Models</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 0 auto;
max-width: 800px;
padding: 20px;
}
h1, h2 {
color: #333;
}
img {
display: block;
margin: 20px auto;
max-width: 100%;
}
code {
background-color: #f4f4f4;
padding: 2px 4px;
}
.caption {
font-style: italic;
text-align: center;
}
</style>
</head>
<body>
<h1>Daydreaming: Deleting Tokenization in Language Models</h1>
<p>In this article, we explore the idea of removing tokenization from language models, as proposed in a paper from the summer of last year. While this would be an amazing achievement, allowing us to feed byte streams directly into our models, it comes with some challenges.</p>
<h2>The Problem with Long Sequences</h2>
<p>One of the main issues with removing tokenization is that attention becomes extremely expensive for very long sequences. To address this, the paper proposes a hierarchical structuring of the Transformer architecture that could allow feeding in raw bytes.</p>
<img src="frames/00_23_00.jpg" alt="Overview of MegaByte">
<p class="caption">Figure 1. Overview of MegaByte. A small local model autoregressively predicts each patch byte-by-byte, using the output of a larger global model to condition on previous patches. Global and Local inputs are padded by a token to avoid leaking information about future tokens.</p>
<h2>Establishing the Viability of Tokenization-Free Modeling</h2>
<p>The paper concludes by stating that their results "establish the viability of tokenization-free autoregressive sequence modeling at scale." However, this approach has not yet been proven out by sufficiently many groups and at a sufficient scale.</p>
<h2>The Current State of Affairs</h2>
<p>For now, we still need to compress text using the Byte Pair Encoding (BPE) algorithm before feeding it into language models. The BPE algorithm is not overly complicated, and its Wikipedia page provides a good walkthrough.</p>
<p>Tokenization-free modeling would be a significant breakthrough, and hopefully, future research will make it a reality. Until then, we must rely on established methods like BPE to preprocess our input data.</p>
</body>
</html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Byte Pair Encoding (BPE) Algorithm Walkthrough</title>
</head>
<body>
<h1>Byte Pair Encoding (BPE) Algorithm Walkthrough</h1>
<p>The Byte Pair Encoding (BPE) algorithm is quite instructive for understanding the basic idea of tokenization. Let's walk through an example to see how it works.</p>
<h2>Example</h2>
<p>Suppose we have a vocabulary of only four elements: a, b, c, and d. Our input sequence is:</p>
<img src="frames/00_24_00.jpg" alt="Input sequence: aaabdaaabac">
<figcaption>Input sequence: aaabdaaabac</figcaption>
<p>The sequence is too long, and we'd like to compress it. The BPE algorithm iteratively finds the pair of tokens that occur most frequently and replaces that pair with a single new token.</p>
<p>In the first iteration, the byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z". The data and replacement table become:</p>
<img src="frames/00_25_00.jpg" alt="After first iteration">
<figcaption>After first iteration: ZabdZabac, Z=aa</figcaption>
<p>The process is repeated with byte pair "ab", replacing it with "Y":</p>
<img src="frames/00_25_30.jpg" alt="After second iteration">
<figcaption>After second iteration: ZYdZYac, Y=ab, Z=aa</figcaption>
<p>In the final round, the pair "ZY" is most common and replaced with "X":</p>
<img src="frames/00_26_05.jpg" alt="After final iteration">
<figcaption>After final iteration: XdXac, X=ZY, Y=ab, Z=aa</figcaption>
<h2>Compression Result</h2>
<p>After going through this process, instead of having a sequence of 11 tokens with a vocabulary length of 4, we now have a sequence of 5 tokens with a vocabulary length of 7.</p>
<p>The BPE algorithm can be applied in the same way to byte sequences. Starting with a vocabulary size of 256, we iteratively find the byte pairs that occur most frequently, mint new tokens, append them to the vocabulary, and replace occurrences in the data. This results in a compressed dataset and an encoding/decoding algorithm.</p>
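<p>The walkthrough above can be reproduced in a few lines of Python. This sketch implements only the replacement step, with the pairs to merge supplied by hand to match the example:</p>

```python
def merge(seq, pair, new_symbol):
    """Replace every occurrence of the adjacent pair with new_symbol."""
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

seq = list("aaabdaaabac")          # 11 tokens, vocabulary {a, b, c, d}
seq = merge(seq, ("a", "a"), "Z")  # ZabdZabac
seq = merge(seq, ("a", "b"), "Y")  # ZYdZYac
seq = merge(seq, ("Z", "Y"), "X")  # XdXac
print("".join(seq))  # XdXac
```

The result is the 5-token sequence from the walkthrough, with three new symbols added to the vocabulary.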
</body>
</html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>LLM Tokenization</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 0 auto;
max-width: 800px;
padding: 20px;
}
h1, h2, h3 {
margin-top: 30px;
}
img {
display: block;
margin: 20px auto;
max-width: 100%;
}
pre {
background-color: #f4f4f4;
padding: 10px;
}
.image-caption {
font-style: italic;
text-align: center;
}
</style>
</head>
<body>
<h1>LLM Tokenization</h1>
<p>In this chapter, we will start the implementation of tokenization in large language models (LLMs). Tokenization is a crucial component of state-of-the-art LLMs, and it is necessary to understand it in some detail, because many of the strange behaviors of LLMs that may be attributed to the neural network actually trace back to tokenization.</p>
<h2>Unicode Tokenization</h2>
<p>To get the tokens, we take our input text and encode it into UTF-8. At this point, the text is a single raw stream of bytes. To make it easier to work with, we convert those bytes to integers and create a list out of them for easier manipulation and visualization in Python.</p>
<img src="frames/00_27_45.jpg" alt="Tokenization code">
<p class="image-caption">Converting text to a list of token integers</p>
<p>The original paragraph has a length of 533 code points, but after encoding into UTF-8, it expands to 616 bytes or tokens. This is because while many simple ASCII characters become a single byte, more complex Unicode characters can take up to four bytes each.</p>
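<p>A short mixed string illustrates this one-to-four-byte expansion (the sample string here is mine, not the paragraph from the video):</p>

```python
text = "Hello 👋"
tokens = list(text.encode("utf-8"))
print(len(text), len(tokens))  # 7 code points -> 10 bytes
print(tokens)  # [72, 101, 108, 108, 111, 32, 240, 159, 145, 139]
```

Each ASCII character becomes one byte, while the emoji expands to four bytes.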
<h2>Finding the Most Common Byte Pair</h2>
<p>As a first step in the algorithm, we want to iterate over the bytes and find the pair of bytes that occur most frequently, as we will then merge them.</p>
<img src="frames/00_28_15.jpg" alt="Counting consecutive pairs">
<p class="image-caption">Iterating over bytes to find the most common pair</p>
<p>There are many different ways to approach counting consecutive pairs and finding the most common pair. Here is one implementation in Python:</p>
<pre>
def get_most_frequent_pair(tokens):
    pairs = {}
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if pair not in pairs:
            pairs[pair] = 0
        pairs[pair] += 1
    most_frequent_pair = None
    max_frequency = 0
    for pair, frequency in pairs.items():
        if frequency > max_frequency:
            most_frequent_pair = pair
            max_frequency = frequency
    return most_frequent_pair
</pre>
<p>This function takes the list of token integers, counts the frequency of each consecutive pair, and returns the pair that appears most often. This is a key step in the byte pair encoding algorithm used for tokenization in many LLMs.</p>
</body>
</html>
<html>
<head>
<title>Finding Most Common Consecutive Pairs in Tokenized Text</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 40px;
}
h1, h2, h3 {
margin-top: 30px;
}
img {
display: block;
margin: 20px auto;
max-width: 100%;
border: 1px solid #ccc;
}
pre {
background-color: #f4f4f4;
padding: 10px;
overflow-x: auto;
}
.caption {
font-style: italic;
text-align: center;
margin-top: -10px;
}
</style>
</head>
<body>
<h1>Finding Most Common Consecutive Pairs in Tokenized Text</h1>
<p>In this chapter, we will explore how to find the most commonly occurring consecutive pairs in a list of tokenized integers. We'll implement a function called <code>get_stats</code> that takes a list of integers and returns a dictionary keeping track of the counts of each consecutive pair.</p>
<h2>Implementing <code>get_stats</code></h2>
<p>Here's how we can implement <code>get_stats</code> in Python:</p>
<pre>
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):  # Pythonic way to iterate consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts
</pre>
<p>This function uses <code>zip(ids, ids[1:])</code> to iterate over consecutive elements of the input list in a Pythonic way. It then builds a dictionary <code>counts</code> where the keys are tuples of consecutive elements and the values are the number of occurrences.</p>
<p>Let's see it in action on a sample list of tokenized integers:</p>
<pre>
tokens = [1, 115, 32, 99, 97, 110, 32, 98, 101, 32, 109, 111, 114, 101, 32, 116, 104, 97, 110, 32, 97, 32, 108, 105, 116]
stats = get_stats(tokens)
print(sorted([(v,k) for k,v in stats.items()], reverse=True))
</pre>
<p>Here we print the (value, key) pairs from the <code>stats</code> dictionary, sorted by value in descending order. This gives us a nice view of the most common consecutive pairs:</p>
<img src="frames/00_29_45.jpg" alt="Printing stats dictionary">
<p class="caption">Printing the stats dictionary sorted by value in descending order</p>
<p>We can see that the pair <code>(101, 32)</code> is the most commonly occurring, appearing 20 times in the full training text. Using <code>chr()</code> we can convert these Unicode code points to characters:</p>
<pre>
print(chr(101), chr(32))
# prints "e" followed by a space
</pre>
<p>So the most common consecutive pair is "e" followed by a space, which makes sense as many English words end with "e".</p>
<h2>Conclusion</h2>
<p>The <code>get_stats</code> function provides a straightforward way to find the most common consecutive pairs in a tokenized list using a dictionary to count occurrences. This can be a useful building block for various text analysis tasks.</p>
</body>
</html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>LLM Tokenization</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 20px;
}
h1, h2 {
color: #333;
}
img {
max-width: 100%;
height: auto;
display: block;
margin: 20px 0;
}
.code {
background-color: #f4f4f4;
padding: 10px;
border-radius: 5px;
font-family: "Courier New", monospace;
font-size: 14px;
overflow-x: auto;
}
</style>
</head>
<body>
<h1>LLM Tokenization</h1>
<p>In the previous video, "Let's build GPT from scratch", we covered a simple, naive, character-level version of tokenization. When it comes to large language models (LLMs), however, a more complex approach is used, and it is necessary to understand it in some detail, because many behaviors that may be attributed to the neural network, or that otherwise appear mysterious, actually trace back to tokenization.</p>
<h2>Merging the Most Common Pair</h2>
<p>The tokenization process begins by iterating over the sequence and minting a new token with the ID of 256, since the current tokens range from 0 to 255. During this iteration, every occurrence of the most common pair (in this case, <code>(101, 32)</code>) is replaced with the new token ID.</p>
<img src="frames/00_33_30.jpg" alt="Python code for merging the most common pair">
<pre class="code">
def merge(ids, pair, idx):
    # in the list of ints (ids), replace all consecutive occurrences
    # of pair with the new token idx
    newids = []
    i = 0
    while i &lt; len(ids):
        # if we are not at the very last position AND the pair matches, replace it
        if i &lt; len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids
</pre>
<p>The <code>merge</code> function takes a list of IDs, the pair to be replaced, and the new index (idx) as arguments. It iterates through the list, checking for consecutive occurrences of the pair and replacing them with the new token ID. If a match is found, the new ID is appended to the <code>newids</code> list, and the position is incremented by two to skip over the entire pair. If no match is found, the element at the current position is copied over, and the position is incremented by one.</p>
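<p>We can check the behavior on a small hand-made example. The function is repeated here so the snippet is self-contained:</p>

```python
def merge(ids, pair, idx):
    # in the list of ints (ids), replace all consecutive
    # occurrences of pair with the new token idx
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))  # [5, 6, 99, 9, 1]
```

The single occurrence of the pair (6, 7) is collapsed into the new token 99, and every other element is copied through unchanged.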
<h2>Iterating the Tokenization Process</h2>
<p>The tokenization process is repeated iteratively, finding the most common pair at each step and replacing it with a new token ID. The number of iterations is a hyperparameter that can be tuned to find the optimal vocabulary size. Typically, larger vocabulary sizes result in shorter sequences, and there is a sweet spot that works best in practice. For example, GPT-4 currently uses around 100,000 tokens, which is considered a reasonable number for large language models.</p>
</body>
</html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Training the Tokenizer: Adding the While Loop, Compression Ratio</title>
<style>
body {
font-family: Arial, sans-serif;
max-width: 800px;
margin: 0 auto;
padding: 20px;
}
h1, h2 {
color: #1a0dab;
}
img {
display: block;
margin: 20px auto;
max-width: 100%;
border: 1px solid #ccc;
}
pre {
background-color: #f0f0f0;
padding: 10px;
overflow-x: auto;
}
</style>
</head>
<body>
<h1>Training the Tokenizer: Adding the While Loop, Compression Ratio</h1>
<p>In this section, we will dive into the process of training the tokenizer by adding a while loop and examining the compression ratio achieved.</p>
<h2>Preparing the Training Data</h2>
<p>To begin, we grabbed the entire blog post and stretched it out into a single line to use as our training data. Using longer text allows us to have more representative token statistics and obtain more sensible results.</p>
<p>Here's a snapshot of the code used to prepare the training data:</p>
<img src="frames/00_35_10.jpg" alt="Preparing the training data">
<figcaption>Preparing the training data by encoding the text and converting it to a list of integers.</figcaption>
<h2>Implementing the Merging Loop</h2>
<p>Next, we implemented the merging loop to iteratively combine the most frequently occurring pair of tokens. The key steps are:</p>
<ol>
<li>Set the desired final vocabulary size (e.g., 276).</li>
<li>Create a copy of the initial tokens list.</li>
<li>Initialize a merges dictionary to store the mappings of merged tokens.</li>
<li>Iterate for a specified number of merges (e.g., 20).</li>
<li>In each iteration:
<ul>
<li>Find the most commonly occurring pair of tokens.</li>
<li>Assign a new token ID to the merged pair.</li>
<li>Replace all occurrences of the pair with the new token ID.</li>
<li>Record the merge in the merges dictionary.</li>
</ul>
</li>
</ol>
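<p>The steps above can be sketched as a compact, self-contained loop. The helper functions and the tiny sample text are repeated and invented here for illustration; the video trains on a much longer blog post:</p>

```python
def get_stats(ids):
    # count occurrences of each consecutive pair
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace all consecutive occurrences of pair with the new token idx
    newids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

text = "a toy training text; a longer corpus gives better statistics"
tokens = list(text.encode("utf-8"))

vocab_size = 260                 # small target vocabulary for the toy example
num_merges = vocab_size - 256
ids = list(tokens)               # copy so we don't destroy the original list
merges = {}                      # (int, int) -> int
for i in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)  # most frequent pair
    idx = 256 + i
    ids = merge(ids, pair, idx)
    merges[pair] = idx

print(f"compression ratio: {len(tokens) / len(ids):.2f}X")
```

With the real training text and 20 merges, this loop produces the compression ratio shown below.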
<p>Here's the complete code for the merging loop:</p>
<img src="frames/00_38_20.jpg" alt="Merging loop code">
<figcaption>Code implementation of the merging loop.</figcaption>
<h2>Compression Ratio</h2>
<p>After performing the merging process, we can examine the compression ratio achieved. In our example, we started with 24,597 bytes and ended up with 19,438 tokens after 20 merges. The compression ratio is calculated by dividing the original length by the compressed length:</p>
<pre>
tokens length: 24597
ids length: 19438
compression ratio: 1.27X
</pre>
<p>The compression ratio indicates the amount of compression achieved on the text with the specified number of merges. As more vocabulary elements are added, the compression ratio increases.</p>
<h2>Conclusion</h2>
<p>In this chapter, we covered the process of training the tokenizer by adding a while loop to perform iterative merging of token pairs. We also examined the compression ratio achieved through this process. The trained tokenizer is a separate stage from the language model itself and plays a crucial role in preparing the input data for the model.</p>
</body>
</html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>The Tokenizer: A Separate Stage from the LLM</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 0 auto;
max-width: 800px;
padding: 20px;
}
h1, h2, h3 {
margin-top: 20px;
}
img {
display: block;
margin: 20px auto;
max-width: 100%;
}
pre {
background-color: #f4f4f4;
padding: 10px;
}
</style>
</head>
<body>
<h1>The Tokenizer: A Separate Stage from the LLM</h1>
<p>It's important to understand that the tokenizer is a completely separate, independent module from the LLM. It has its own training dataset of text (which could be different from that of the LLM), on which you train the vocabulary using the Byte Pair Encoding (BPE) algorithm. The LLM later only ever sees the tokens and never directly deals with any text.</p>
<img src="frames/00_40_55.jpg" alt="Tokenizer/LLM Diagram">
<figcaption>The tokenizer translates between raw text and sequences of tokens. The LLM only sees token sequences.</figcaption>
<p>Once you have trained the tokenizer and have the vocabulary and merges, you can do both encoding and decoding. The tokenizer is a translation layer between raw text (a sequence of Unicode code points) and token sequences. It can take raw text and turn it into a token sequence, and vice versa.</p>
<h2>Tokenizer Training and LLM Training</h2>
<p>Typically, in a state-of-the-art application, you might take all of your training data for the language model and run it through the tokenizer as a single, massive pre-processing step. This translates everything into token sequences, which are then stored on disk. The large language model reads these token sequences during training.</p>
<p>You may want the training sets for the tokenizer and the large language model to be different. For example, when training the tokenizer, you might care about performance across many different languages and both code and non-code data. The mixture of languages and amount of code in your tokenizer training set will determine the number of merges and the density of token representation for those data types.</p>
<p>Roughly speaking, if you add a large amount of data in a particular language to your tokenizer training set, more tokens in that language will get merged. This results in shorter sequences for that language, which can be beneficial for the LLM, as it has a finite context length it can work with in the token space.</p>
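<p>This effect can be demonstrated with a toy sketch. The corpora and helper names below (<code>get_stats</code>, <code>merge</code>, <code>train</code>, <code>encode</code>) are illustrative, not the production pipeline: a minimal BPE trainer that greedily merges the most frequent pair, and a matching encoder. A tokenizer trained on data rich in a pattern compresses text containing that pattern; one trained on unrelated data leaves it at the raw byte length.</p>

```python
def get_stats(ids):
    # count occurrences of each consecutive pair
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every occurrence of `pair` with the new token `idx`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, num_merges):
    # greedily merge the most frequent pair, num_merges times
    ids = list(text.encode("utf-8"))
    merges = {}
    for i in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break
        pair = max(stats, key=stats.get)
        merges[pair] = 256 + i
        ids = merge(ids, pair, 256 + i)
    return merges

def encode(text, merges):
    # apply learned merges, earliest-learned first
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        ids = merge(ids, pair, merges[pair])
    return ids

sample = "abab abab abab"            # made-up "language" built from "ab"
merges_rich = train("ab " * 50, 3)   # training set rich in that pattern
merges_poor = train("xy " * 50, 3)   # training set with none of it
print(len(encode(sample, merges_rich)), len(encode(sample, merges_poor)))  # → 6 14
```

<p>The same 14-byte string encodes to 6 tokens under the pattern-rich tokenizer and stays at 14 under the unrelated one, which is exactly why the tokenizer's training mixture matters for sequence lengths.</p>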
<h2>Conclusion</h2>
<p>In summary, the tokenizer is a crucial, separate stage from the LLM itself. It is trained on its own dataset using the BPE algorithm to create a vocabulary and merges. This allows it to translate between raw text and token sequences, which the LLM then operates on. The composition of the tokenizer's training set can significantly impact the token representation and sequence lengths for different types of data.</p>
</body>
</html>
<html>
<head>
<title>Decoding Tokens to Strings in Large Language Models</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 40px;
}
h1, h2, h3 {
margin-top: 30px;
}
img {
display: block;
margin: 20px auto;
max-width: 100%;
}
pre {
background-color: #f4f4f4;
padding: 10px;
overflow-x: auto;
}
</style>
</head>
<body>
<h1>Decoding Tokens to Strings</h1>
<p>Let's begin with decoding, which is the process of taking a token sequence and using the tokenizer to get back a Python string object representing the raw text. This is the function we'd like to implement.</p>
<p>There are many different ways to implement the decode function. Here's one approach:</p>
<pre>
def decode(ids):
    # given ids (list of integers), return Python string
    vocab = {idx: bytes([idx]) for idx in range(256)}
    for (p0, p1), idx in merges.items():
        vocab[idx] = vocab[p0] + vocab[p1]
    tokens = b"".join(vocab[idx] for idx in ids)
    text = tokens.decode("utf-8")
    return text
</pre>
<p>We create a preprocessing variable called <code>vocab</code>, which is a dictionary mapping from token ID to the bytes object for that token. We begin with the raw bytes for tokens 0 to 255. Then we iterate over the merges in the order they were inserted, populating the vocab dictionary by concatenating the bytes of the children tokens. It's important to iterate over the dictionary items in the same order they were inserted, which is guaranteed in Python 3.7+.</p>
<p>Given the token IDs, we look up their bytes in the vocab, concatenate them together, and decode from UTF-8 to get the final string.</p>
<img src="frames/00_45_10.jpg" alt="Decode function in Python">
<figcaption>The decode function implemented in Python.</figcaption>
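<p>The insertion-order point can be seen with a tiny, hypothetical two-entry merge table: the second merge's children include the first merge's token, so its bytes can only be built after the first entry has been processed.</p>

```python
# Hypothetical merge table: token 256 = "h"+"i", token 257 = 256 + "!"
merges = {(104, 105): 256, (256, 33): 257}

vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]  # children are already in vocab

# look up each id's bytes, concatenate, and decode from UTF-8
print(b"".join(vocab[i] for i in [257, 32, 257]).decode("utf-8"))  # → "hi! hi!"
```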
<h2>Handling Invalid UTF-8 Sequences</h2>
<p>One issue with this implementation is that it can throw an error if the token sequence is not a valid UTF-8 encoding. For example, trying to decode the single token 128 results in an error:</p>
<pre>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
</pre>
<p>This is because the binary representation of 128 is <code>10000000</code>; in UTF-8, bytes of the form <code>10xxxxxx</code> are continuation bytes and can never start a character, so this is not a valid UTF-8 sequence according to the encoding rules. To handle this, we can pass <code>errors="replace"</code> to the <code>decode</code> function, which replaces invalid bytes with the Unicode replacement character (�).</p>
<p>The standard practice is to use <code>errors="replace"</code> when decoding. If you see the replacement character in your output, it means something went wrong and the language model output an invalid sequence of tokens.</p>
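<p>Both behaviors are easy to see directly on the byte <code>0x80</code>:</p>

```python
b = bytes([128])  # 0b10000000: a continuation byte, invalid as a start byte
try:
    b.decode("utf-8")  # strict (default) error handling raises
except UnicodeDecodeError as e:
    print("strict decoding fails:", e)

# errors="replace" substitutes U+FFFD, the replacement character
print(b.decode("utf-8", errors="replace"))  # → "�"
```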
<p>In summary, decoding is the process of converting a sequence of token IDs back into a human-readable string. It involves looking up the bytes for each token, concatenating them, and decoding the result from UTF-8. Handling invalid UTF-8 sequences is important to avoid errors and indicate when the model produces an invalid output.</p>
</body>
</html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Encoding Strings to Tokens in Language Models</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 0 auto;
max-width: 800px;
padding: 20px;
}
h1, h2, h3 {
margin-bottom: 20px;
}
img {
display: block;
margin: 20px auto;
max-width: 100%;
}
pre {
background-color: #f4f4f4;
padding: 10px;
overflow-x: auto;
}
.caption {
font-style: italic;
text-align: center;
margin-top: -10px;
margin-bottom: 20px;
}
</style>
</head>
<body>
<h1>Encoding Strings to Tokens in Language Models</h1>
<p>In this article, we will explore how to implement the encoding of strings into tokens, an essential part of the tokenization process in large language models (LLMs).</p>
<h2>Overview</h2>
<p>The goal is to create a function that takes a string as input and returns a list of integers representing the tokens. This is the opposite direction of the decoding process covered in the previous section.</p>
<p>The implementation will involve several steps, including:</p>
<ol>
<li>Encoding the text into UTF-8 bytes</li>
<li>Converting the bytes into a list of integers</li>
<li>Iteratively merging pairs of tokens based on the merges dictionary</li>
</ol>
<h2>Implementation</h2>
<p>Here's the Python code for the <code>encode</code> function:</p>
<pre>
def encode(text):
    # given a string, return list of integers (the tokens)
    tokens = list(text.encode("utf-8"))
    while True:
        stats = get_stats(tokens)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break # nothing else can be merged
        idx = merges[pair]
        tokens = merge(tokens, pair, idx)
    return tokens
</pre>
<p>Let's break down the key steps:</p>
<ol>
<li>The input text is encoded into UTF-8 bytes using <code>text.encode("utf-8")</code>.</li>
<li>The bytes are converted into a list of integers using <code>list(...)</code>.</li>
<li>A while loop is used to iteratively merge pairs of tokens based on the merges dictionary. This continues until no more merges can be performed.</li>
<li>Inside the loop, the <code>get_stats</code> function is used to count the occurrences of each pair in the current token sequence.</li>
<li>The eligible pair that was learned earliest during training (i.e., has the lowest merge index) is selected using <code>min(...)</code> and a custom key function; pairs absent from the merges dictionary map to infinity and are never chosen.</li>
<li>If the selected pair is not found in the merges dictionary, the loop breaks since no more merges can be performed.</li>
<li>Otherwise, the selected pair is merged into a single token using the <code>merge</code> function and the corresponding index from the merges dictionary.</li>
<li>The loop continues until no more merges can be performed, and the final list of tokens is returned.</li>
</ol>
<img src="frames/00_54_50.jpg" alt="Code snippet of the encode function">
<p class="caption">The encode function implementation in Python</p>
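<p>The pair-selection trick in step 5 can be sketched in isolation. The merge table and pair counts below are hypothetical:</p>

```python
# Hypothetical merge table: lower index = learned earlier = merged first
merges = {(101, 32): 256, (100, 32): 257}
# Hypothetical pair counts as returned by get_stats for the current tokens
stats = {(100, 32): 2, (101, 32): 3, (97, 98): 1}

# Pairs not in `merges` map to infinity, so they are never selected;
# among mergeable pairs, the lowest merge index wins (frequency is ignored)
pair = min(stats, key=lambda p: merges.get(p, float("inf")))
print(pair)  # → (101, 32)
```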
<h2>Handling Special Cases</h2>
<p>It's important to handle special cases, such as when the input string is empty or encodes to a single byte. In those cases there are fewer than two tokens, so there are no pairs to merge; <code>get_stats</code> returns an empty dictionary, and calling <code>min</code> on it would raise an error. The function should simply return the tokens as is.</p>
<p>To handle this, a condition can be added at the beginning of the function:</p>
<pre>
if len(tokens) < 2:
    return tokens
</pre>
<h2>Testing and Validation</h2>
<p>To ensure the correctness of the implementation, it's crucial to test the <code>encode</code> function with various input strings and validate the results. Here are a few test cases:</p>
<ol>
<li>Encoding and decoding a string should result in the same string.</li>
<li>Encoding and decoding the training text used to train the tokenizer should produce the same text.</li>
<li>Encoding and decoding validation data that the tokenizer hasn't seen before should also work correctly.</li>
</ol>
<img src="frames/00_57_00.jpg" alt="Diagram illustrating the encoding and decoding process">
<p class="caption">The encoding and decoding process in the tokenizer</p>
<h2>Conclusion</h2>
<p>Implementing the encoding of strings into tokens is a fundamental component of tokenization in large language models. By following the steps outlined in this article, you can create a function that takes a string and returns a list of integers representing the tokens.</p>
<p>Remember to handle special cases, test your implementation thoroughly, and validate the results to ensure the correctness of your tokenizer.</p>
</body>
</html>
<html>
<head>
<title>Forced Splits Using Regex Patterns in GPT Tokenization</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 0 auto;
max-width: 800px;
padding: 20px;
}
h1, h2 {
color: #333;
}
img {
display: block;
margin: 20px auto;
max-width: 100%;
}
pre {
background-color: #f4f4f4;
padding: 10px;
}
.caption {
font-style: italic;
text-align: center;
}
</style>
</head>
<body>
<h1>Forced Splits Using Regex Patterns in GPT Tokenization</h1>
<p>In the GPT-2 series, the tokenizer uses the byte pair encoding (BPE) algorithm on byte sequences. The paper observes that applying BPE naively to bytes produces suboptimal merges: common words get merged together with the punctuation that follows them, so the vocabulary ends up containing many versions of a word like "dog" (e.g., "dog.", "dog!", "dog?"), wasting limited vocabulary slots and model capacity.</p>
<p>To avoid this, GPT-2 prevents BPE from merging across character categories by using a regex pattern. The relevant code is found in the <code>encoder.py</code> file:</p>
<img src="frames/00_59_35.jpg" alt="GPT-2 encoder.py code snippet">
<p class="caption">The <code>re.compile()</code> function in GPT-2's <code>encoder.py</code> file.</p>
<p>This regex pattern is designed to split the input text into chunks, ensuring that certain character types (letters, numbers, punctuation) are never merged together during the BPE process.</p>
<h2>Analyzing the Regex Pattern</h2>
<p>The regex pattern is composed of several parts, separated by the "or" operator (<code>|</code>). Let's break it down:</p>
<ul>
<li><code>'s|'t|'re|'ve|'m|'ll|'d</code>: Matches common English contractions, so they split off as their own chunks.</li>
<li><code> ?\p{L}+</code>: Matches an optional leading space followed by one or more Unicode letters.</li>
<li><code> ?\p{N}+</code>: Matches an optional leading space followed by one or more Unicode digits.</li>
<li><code> ?[^\s\p{L}\p{N}]+</code>: Matches an optional leading space followed by one or more characters that are not whitespace, letters, or digits (e.g., punctuation).</li>
<li><code>\s+(?!\S)</code>: Matches runs of whitespace up to, but not including, the last whitespace character before non-whitespace.</li>
<li><code>\s+</code>: Matches any remaining whitespace.</li>
</ul>
<p>By applying this regex pattern to the input text, GPT-2 ensures that the BPE algorithm only merges tokens within each chunk, preventing undesired merges across character types.</p>
<img src="frames/01_10_20.jpg" alt="Example of regex pattern splitting Python code">
<p class="caption">The regex pattern splits the Python code into chunks based on character types.</p>
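<p>A rough, ASCII-only approximation of this splitting can be sketched with Python's standard <code>re</code> module. Note this is not the exact GPT-2 pattern: the real one uses <code>\p{L}</code> and <code>\p{N}</code> Unicode property classes, which require the third-party <code>regex</code> package; here they are replaced with ASCII character classes for illustration.</p>

```python
import re

# ASCII stand-ins for \p{L} (letters) and \p{N} (digits); a sketch only
pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"""
)

# Each chunk is tokenized independently, so letters, digits, punctuation,
# and contractions can never be merged with each other by BPE
print(pat.findall("Hello world123 how's are you!!!?"))
# → ['Hello', ' world', '123', ' how', "'s", ' are', ' you', '!!!?']
```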
<h2>Conclusion</h2>
<p>GPT-2's tokenizer employs a regex pattern to force splits across character categories before applying the BPE algorithm. This approach helps maintain the integrity of the tokenization process and prevents suboptimal merges that could negatively impact the model's performance. While the exact training code for the GPT-2 tokenizer has not been released, understanding the regex pattern provides valuable insights into how the tokenizer handles input text.</p>
</body>