description: "Understand how large language models (LLMs) use tokens to analyze ..."
author: haywoodsloan
ms.topic: concept-article
ms.date: 12/19/2024
#customer intent: As a .NET developer, I want to understand how large language models (LLMs) use tokens so I can add semantic analysis and text generation capabilities to my .NET projects.
---

# Understand tokens
Tokens are words, character sets, or combinations of words and punctuation that are generated by large language models (LLMs) when they decompose text. Tokenization is the first step in training. The LLM analyzes the semantic relationships between tokens, such as how commonly they're used together or whether they're used in similar contexts. After training, the LLM uses those patterns and relationships to generate a sequence of output tokens based on the input sequence.

## Turn text into tokens
The set of unique tokens that an LLM is trained on is known as its _vocabulary_.

For example, consider the following sentence:

> `I heard a dog bark loudly at a cat`

This text could be tokenized as:

- `I`
- `heard`
- `a`
- `dog`
- `bark`
- `loudly`
- `at`
- `a`
- `cat`
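
The exact split depends on the tokenizer, but word-level tokenization can be sketched with nothing more than a whitespace split. The following C# snippet is a minimal illustration, not a real tokenizer: the `Tokenize` helper is hypothetical, and production tokenizers also handle punctuation, casing, and subwords.

```csharp
using System;

// Minimal word-level tokenizer: split on whitespace only.
static string[] Tokenize(string text) =>
    text.Split(' ', StringSplitOptions.RemoveEmptyEntries);

string sentence = "I heard a dog bark loudly at a cat";

// Prints one token per line: I, heard, a, dog, bark, loudly, at, a, cat.
foreach (string token in Tokenize(sentence))
{
    Console.WriteLine(token);
}
```
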
By having a sufficiently large set of training text, tokenization can compile a vocabulary of many thousands of tokens.

There are benefits and disadvantages to each tokenization method:

| Tokenization method | Benefits | Disadvantages |
|---------------------|----------|---------------|
| Smaller tokens (character or subword tokenization) | - Enables the model to handle a wider range of inputs, such as unknown words, typos, or complex syntax.<br>- Might allow the vocabulary size to be reduced, requiring fewer memory resources. | - A given text is broken into more tokens, requiring additional computational resources while processing.<br>- Given a fixed token limit, the maximum size of the model's input and output is smaller. |
| Larger tokens (word tokenization) | - A given text is broken into fewer tokens, requiring fewer computational resources while processing.<br>- Given the same token limit, the maximum size of the model's input and output is larger. | - Might cause an increased vocabulary size, requiring more memory resources.<br>- Can limit the model's ability to handle unknown words, typos, or complex syntax. |
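
As a rough illustration of this trade-off, the following sketch counts the tokens the example sentence produces under a simplified word-level split and a simplified character-level split. Both splits are stand-ins for real tokenizers and are only meant to show how token counts diverge.

```csharp
using System;
using System.Linq;

string sentence = "I heard a dog bark loudly at a cat";

// Word tokenization: fewer, larger tokens (9 for this sentence).
int wordTokens = sentence.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

// Character tokenization: more, smaller tokens (26 non-space characters).
int charTokens = sentence.Count(c => !char.IsWhiteSpace(c));

Console.WriteLine($"Word tokens: {wordTokens}");
Console.WriteLine($"Character tokens: {charTokens}");
```
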
## How LLMs use tokens
After the LLM completes tokenization, it assigns an ID to each unique token.

Consider the example sentence:

> `I heard a dog bark loudly at a cat`

If the model uses a word tokenization method, it could assign token IDs as follows:

- `I` (1)
- `heard` (2)
- `a` (3)
- `dog` (4)
- `bark` (5)
- `loudly` (6)
- `at` (7)
- `a` (the "a" token is already assigned an ID of 3)
- `cat` (8)

By assigning IDs, text can be represented as a sequence of token IDs. The example sentence would be represented as [1, 2, 3, 4, 5, 6, 7, 3, 8]. The sentence "`I heard a cat`" would be represented as [1, 2, 3, 8].

As training continues, the model adds any new tokens in the training text to its vocabulary and assigns each an ID. For example:

- `meow` (9)
- `run` (10)
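
A minimal sketch of this ID assignment is shown below. The `vocabulary` dictionary and `Encode` helper are illustrative only; a real model fixes its vocabulary during tokenizer training rather than growing it one sentence at a time.

```csharp
using System;
using System.Collections.Generic;

var vocabulary = new Dictionary<string, int>();

// Assign the next free ID to any token not seen before,
// then return the text as a sequence of token IDs.
List<int> Encode(string text)
{
    var ids = new List<int>();
    foreach (string token in text.Split(' ', StringSplitOptions.RemoveEmptyEntries))
    {
        if (!vocabulary.TryGetValue(token, out int id))
        {
            id = vocabulary.Count + 1;   // IDs start at 1
            vocabulary[token] = id;
        }
        ids.Add(id);
    }
    return ids;
}

Console.WriteLine(string.Join(", ", Encode("I heard a dog bark loudly at a cat")));
// 1, 2, 3, 4, 5, 6, 7, 3, 8 (the second "a" reuses ID 3)

Console.WriteLine(string.Join(", ", Encode("I heard a cat")));
// 1, 2, 3, 8

// New training text adds new tokens, which get the next available IDs.
Encode("a cat meow run");
Console.WriteLine("meow: " + vocabulary["meow"] + ", run: " + vocabulary["run"]);   // meow: 9, run: 10
```
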
The semantic relationships between the tokens can be analyzed by using these token ID sequences. Multi-valued numeric vectors, known as [embeddings](embeddings.md), are used to represent these relationships. An embedding is assigned to each token based on how commonly it's used together with, or in similar contexts to, the other tokens.
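
The idea of "used in similar contexts" can be made concrete with a small similarity check. The following sketch compares invented four-dimensional vectors with cosine similarity; real embeddings have hundreds or thousands of dimensions, and their values are learned by the model, not hand-picked as they are here.

```csharp
using System;

// Cosine similarity: values near 1 mean the vectors point in a similar
// direction; values near 0 mean the tokens rarely share a context.
static double CosineSimilarity(double[] a, double[] b)
{
    double dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}

// Invented embeddings for illustration only.
double[] dog = { 0.9, 0.1, 0.8, 0.3 };
double[] cat = { 0.8, 0.2, 0.9, 0.2 };
double[] loudly = { 0.1, 0.9, 0.2, 0.7 };

Console.WriteLine($"dog vs cat:    {CosineSimilarity(dog, cat):F2}");    // high similarity
Console.WriteLine($"dog vs loudly: {CosineSimilarity(dog, loudly):F2}"); // lower similarity
```
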

LLMs have limitations regarding the maximum number of tokens that can be used as input or generated as output. This limit typically applies to the input and output tokens combined into a shared maximum context window. Taken together, a model's token limit and tokenization method determine the maximum length of text that can be provided as input or generated as output.

For example, consider a model that has a maximum context window of 100 tokens. The model processes the example sentences as input text:

> `I heard a dog bark loudly at a cat`

With a word-based tokenization method, the input is nine tokens. This leaves 91 **word** tokens available for the output.
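
The token budget in that example is simple arithmetic, sketched below. The 100-token window and the whitespace split mirror the example; a real service counts tokens with its own tokenizer.

```csharp
using System;

const int contextWindow = 100;   // maximum input + output tokens in this example

string input = "I heard a dog bark loudly at a cat";

// Word-based tokenization: count whitespace-separated tokens.
int inputTokens = input.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
int availableForOutput = contextWindow - inputTokens;

Console.WriteLine($"Input tokens: {inputTokens}");                   // 9
Console.WriteLine($"Tokens left for output: {availableForOutput}");  // 91
```
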
## Related content

- [How generative AI and LLMs work](how-genai-and-llms-work.md)
- [Understand embeddings](embeddings.md)
- [Work with vector databases](vector-databases.md)