
Commit 300b4bc

prevent overlocalization of tokens (#44627)
1 parent 6bad9ed commit 300b4bc


docs/ai/conceptual/understanding-tokens.md

Lines changed: 33 additions & 36 deletions
@@ -4,34 +4,31 @@ description: "Understand how large language models (LLMs) use tokens to analyze
author: haywoodsloan
ms.topic: concept-article
ms.date: 12/19/2024
-
#customer intent: As a .NET developer, I want to understand how large language models (LLMs) use tokens so I can add semantic analysis and text generation capabilities to my .NET projects.
-
---
-
# Understand tokens

Tokens are words, character sets, or combinations of words and punctuation that are generated by large language models (LLMs) when they decompose text. Tokenization is the first step in training. The LLM analyzes the semantic relationships between tokens, such as how commonly they're used together or whether they're used in similar contexts. After training, the LLM uses those patterns and relationships to generate a sequence of output tokens based on the input sequence.

- ## Turning text into tokens
+ ## Turn text into tokens

The set of unique tokens that an LLM is trained on is known as its _vocabulary_.

For example, consider the following sentence:

- > I heard a dog bark loudly at a cat
+ > `I heard a dog bark loudly at a cat`

This text could be tokenized as:

- - I
- - heard
- - a
- - dog
- - bark
- - loudly
- - at
- - a
- - cat
+ - `I`
+ - `heard`
+ - `a`
+ - `dog`
+ - `bark`
+ - `loudly`
+ - `at`
+ - `a`
+ - `cat`
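
To make the idea concrete, here's a minimal, hypothetical C# sketch of word-level tokenization. Splitting on spaces is an assumption for illustration only; production tokenizers also handle punctuation, casing, and subwords.

```csharp
using System;

// Oversimplified word-level tokenization: split the sentence on spaces.
// Real LLM tokenizers also handle punctuation, casing, and subwords.
string sentence = "I heard a dog bark loudly at a cat";
string[] tokens = sentence.Split(' ', StringSplitOptions.RemoveEmptyEntries);

foreach (string token in tokens)
{
    Console.WriteLine(token); // I, heard, a, dog, bark, loudly, at, a, cat
}
```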

By having a sufficiently large set of training text, tokenization can compile a vocabulary of many thousands of tokens.

@@ -47,37 +44,37 @@ For example, the GPT models, developed by OpenAI, use a type of subword tokeniza

There are benefits and disadvantages to each tokenization method:

- | Token size | Pros | Cons |
- | -------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
- | Smaller tokens (character or subword tokenization) | - Enables the model to handle a wider range of inputs, such as unknown words, typos, or complex syntax.<br>- Might allow the vocabulary size to be reduced, requiring fewer memory resources. | - A given text is broken into more tokens, requiring additional computational resources while processing<br>- Given a fixed token limit, the maximum size of the model's input and output is smaller |
- | Larger tokens (word tokenization) | - A given text is broken into fewer tokens, requiring fewer computational resources while processing.<br>- Given the same token limit, the maximum size of the model's input and output is larger. | - Might cause an increased vocabulary size, requiring more memory resources.<br>- Can limit the models ability to handle unknown words, typos, or complex syntax. |
+ | Token size | Pros | Cons |
+ |----------------------------------------------------|------|------|
+ | Smaller tokens (character or subword tokenization) | - Enables the model to handle a wider range of inputs, such as unknown words, typos, or complex syntax.<br>- Might allow the vocabulary size to be reduced, requiring fewer memory resources. | - A given text is broken into more tokens, requiring additional computational resources while processing.<br>- Given a fixed token limit, the maximum size of the model's input and output is smaller. |
+ | Larger tokens (word tokenization) | - A given text is broken into fewer tokens, requiring fewer computational resources while processing.<br>- Given the same token limit, the maximum size of the model's input and output is larger. | - Might cause an increased vocabulary size, requiring more memory resources.<br>- Can limit the model's ability to handle unknown words, typos, or complex syntax. |
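
As a rough, hypothetical sketch of the trade-off described in the table: the same text produces far more character-level tokens than word-level tokens, so a fixed token limit is consumed more quickly with smaller tokens. The counts below assume a naive whitespace split and one token per character.

```csharp
using System;

// Naive comparison of token counts for the same text under
// word-level versus character-level tokenization.
string text = "I heard a dog bark loudly at a cat";

int wordTokens = text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
int characterTokens = text.Length; // treat every character, including spaces, as a token

Console.WriteLine($"Word tokens: {wordTokens}");           // 9
Console.WriteLine($"Character tokens: {characterTokens}"); // 34
```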

## How LLMs use tokens

After the LLM completes tokenization, it assigns an ID to each unique token.

Consider our example sentence:

- > I heard a dog bark loudly at a cat
+ > `I heard a dog bark loudly at a cat`

After the model uses a word tokenization method, it could assign token IDs as follows:

- - I (1)
- - heard (2)
- - a (3)
- - dog (4)
- - bark (5)
- - loudly (6)
- - at (7)
- - a (the "a" token is already assigned an ID of 3)
- - cat (8)
+ - `I` (1)
+ - `heard` (2)
+ - `a` (3)
+ - `dog` (4)
+ - `bark` (5)
+ - `loudly` (6)
+ - `at` (7)
+ - `a` (the "a" token is already assigned an ID of 3)
+ - `cat` (8)

- By assigning IDs, text can be represented as a sequence of token IDs. The example sentence would be represented as [1, 2, 3, 4, 5, 6, 7, 3, 8]. The sentence "I heard a cat" would be represented as [1, 2, 3, 8].
+ By assigning IDs, text can be represented as a sequence of token IDs. The example sentence would be represented as [1, 2, 3, 4, 5, 6, 7, 3, 8]. The sentence "`I heard a cat`" would be represented as [1, 2, 3, 8].
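
A hypothetical C# sketch of that ID assignment and encoding step: a dictionary stands in for the model's vocabulary, IDs are handed out in order of first appearance, and repeated tokens reuse their existing ID.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A dictionary stands in for the model's vocabulary:
// each unique token gets the next available ID, starting at 1.
var vocabulary = new Dictionary<string, int>();

int[] Encode(string text) =>
    text.Split(' ', StringSplitOptions.RemoveEmptyEntries)
        .Select(token =>
        {
            if (!vocabulary.TryGetValue(token, out int id))
            {
                id = vocabulary.Count + 1; // new token: assign the next ID
                vocabulary[token] = id;
            }
            return id; // known token: reuse its existing ID
        })
        .ToArray();

Console.WriteLine(string.Join(", ", Encode("I heard a dog bark loudly at a cat")));
// 1, 2, 3, 4, 5, 6, 7, 3, 8
Console.WriteLine(string.Join(", ", Encode("I heard a cat")));
// 1, 2, 3, 8
```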

As training continues, the model adds any new tokens in the training text to its vocabulary and assigns each an ID. For example:

- - meow (9)
- - run (10)
+ - `meow` (9)
+ - `run` (10)

The semantic relationships between the tokens can be analyzed by using these token ID sequences. Multi-valued numeric vectors, known as [embeddings](embeddings.md), are used to represent these relationships. An embedding is assigned to each token based on how commonly it's used together with, or in similar contexts to, the other tokens.
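
To hint at how "used in similar contexts" can be measured, here's a hypothetical sketch that compares two small embedding vectors with cosine similarity. The vectors and their three dimensions are invented for illustration; real embeddings come from the trained model and have hundreds or thousands of dimensions.

```csharp
using System;

// Invented three-dimensional embeddings for two tokens.
double[] dogEmbedding = { 0.8, 0.1, 0.6 };
double[] catEmbedding = { 0.7, 0.2, 0.6 };

static double CosineSimilarity(double[] a, double[] b)
{
    double dot = 0, magnitudeA = 0, magnitudeB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magnitudeA += a[i] * a[i];
        magnitudeB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magnitudeA) * Math.Sqrt(magnitudeB));
}

// Values close to 1 suggest tokens that appear in similar contexts.
Console.WriteLine(CosineSimilarity(dogEmbedding, catEmbedding));
```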

@@ -91,9 +88,9 @@ Output generation is an iterative operation. The model appends the predicted tok

LLMs have limitations regarding the maximum number of tokens that can be used as input or generated as output. This limitation often causes the input and output tokens to be combined into a maximum context window. Taken together, a model's token limit and tokenization method determine the maximum length of text that can be provided as input or generated as output.

- For example, consider a model that has a maximum context window of 100 tokens. The model processes our example sentences as input text:
+ For example, consider a model that has a maximum context window of 100 tokens. The model processes the example sentences as input text:

- > I heard a dog bark loudly at a cat
+ > `I heard a dog bark loudly at a cat`

By using a word-based tokenization method, the input is nine tokens. This leaves 91 **word** tokens available for the output.
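
A hypothetical sketch of that budget arithmetic, assuming word tokenization and a 100-token context window:

```csharp
using System;

// Estimate how many output tokens remain after the input consumes
// part of a fixed context window (word tokenization assumed).
const int contextWindow = 100;
string input = "I heard a dog bark loudly at a cat";

int inputTokens = input.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length; // 9
int availableForOutput = contextWindow - inputTokens;                             // 91

Console.WriteLine($"Input tokens: {inputTokens}, available for output: {availableForOutput}");
```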

@@ -107,6 +104,6 @@ Generative AI services might also be limited regarding the maximum number of tok

## Related content

- - [How Generative AI and LLMs work](how-genai-and-llms-work.md)
- - [Understanding embeddings](embeddings.md)
- - [Working with vector databases](vector-databases.md)
+ - [How generative AI and LLMs work](how-genai-and-llms-work.md)
+ - [Understand embeddings](embeddings.md)
+ - [Work with vector databases](vector-databases.md)
