Commit a5e763a ("human edits")
1 parent 693150c

docs/ai/how-to/use-tokenizers.md

Lines changed: 16 additions & 19 deletions
```diff
@@ -4,17 +4,16 @@ description: Learn how to use the Microsoft.ML.Tokenizers library to tokenize te
 ms.topic: how-to
 ms.date: 10/29/2025
 ai-usage: ai-assisted
-#customer intent: As a .NET developer, I want to use the Microsoft.ML.Tokenizers library to tokenize text so I can work with AI models, manage costs, and handle token limits effectively.
 ---
 # Use Microsoft.ML.Tokenizers for text tokenization
 
-The [Microsoft.ML.Tokenizers](https://www.nuget.org/packages/Microsoft.ML.Tokenizers) library provides a comprehensive set of tools for tokenizing text in .NET applications. Tokenization is essential when working with large language models (LLMs), as it allows you to manage token counts, estimate costs, and preprocess text for AI models.
+The [Microsoft.ML.Tokenizers](https://www.nuget.org/packages/Microsoft.ML.Tokenizers) library provides a comprehensive set of tools for tokenizing text in .NET applications. Tokenization is essential when you work with large language models (LLMs), as it allows you to manage token counts, estimate costs, and preprocess text for AI models.
 
 This article shows you how to use the library's key features and work with different tokenizer models.
 
 ## Prerequisites
 
-- [.NET 8.0 SDK](https://dotnet.microsoft.com/download/dotnet/8.0) or later
+- [.NET 8 SDK](https://dotnet.microsoft.com/download/dotnet/8.0) or later
 
 ## Install the package
 
```
```diff
@@ -35,7 +34,7 @@ dotnet add package Microsoft.ML.Tokenizers.Data.O200kBase
 The Microsoft.ML.Tokenizers library provides:
 
 - **Extensible tokenizer architecture**: Allows specialization of Normalizer, PreTokenizer, Model/Encoder, and Decoder components.
-- **Multiple tokenization algorithms**: Supports BPE (Byte Pair Encoding), Tiktoken, Llama, CodeGen, and more.
+- **Multiple tokenization algorithms**: Supports BPE (byte-pair encoding), Tiktoken, Llama, CodeGen, and more.
 - **Token counting and estimation**: Helps manage costs and context limits when working with AI services.
 - **Flexible encoding options**: Provides methods to encode text to token IDs, count tokens, and decode tokens back to text.
 
```
```diff
@@ -45,11 +44,9 @@ The Tiktoken tokenizer is commonly used with OpenAI models like GPT-4. The follo
 
 :::code language="csharp" source="./snippets/use-tokenizers/csharp/TokenizersExamples/TiktokenExample.cs" id="TiktokenBasic":::
 
-The tokenizer instance should be cached and reused throughout your application for better performance.
+For better performance, you should cache and reuse the tokenizer instance throughout your app.
 
-### Manage token limits
-
-When working with LLMs, you often need to manage text within token limits. The following example shows how to trim text to a specific token count:
+When you work with LLMs, you often need to manage text within token limits. The following example shows how to trim text to a specific token count:
 
 :::code language="csharp" source="./snippets/use-tokenizers/csharp/TokenizersExamples/TiktokenExample.cs" id="TiktokenTrim":::
 
```
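The trimming pattern the changed text describes can be sketched as follows. This is a hedged illustration, not the article's actual snippet: the model name `gpt-4o` and the `GetIndexByTokenCount` out-parameter overload are assumptions based on the library's published API, so verify them against your installed package version.

```csharp
using System;
using Microsoft.ML.Tokenizers;

class TrimExample
{
    static void Main()
    {
        // Create once and cache; construction loads vocabulary data.
        // Requires the matching data package (for example,
        // Microsoft.ML.Tokenizers.Data.O200kBase for gpt-4o).
        Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

        string text = "A long prompt that might exceed the model's context window.";
        int maxTokens = 5;

        // GetIndexByTokenCount returns the character index that covers at
        // most maxTokens tokens from the start of the text.
        int index = tokenizer.GetIndexByTokenCount(
            text, maxTokens, out string? normalizedText, out int tokenCount);

        // Keep only the text that fits within the token budget.
        string trimmed = text.Substring(0, index);
        Console.WriteLine($"Kept {tokenCount} tokens: \"{trimmed}\"");
    }
}
```

The same shape works from the end of the text with `GetIndexByTokenCountFromEnd`, which the article's method table also lists.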
```diff
@@ -59,30 +56,30 @@ The Llama tokenizer is designed for the Llama family of models. It requires a to
 
 :::code language="csharp" source="./snippets/use-tokenizers/csharp/TokenizersExamples/LlamaExample.cs" id="LlamaBasic":::
 
-### Advanced encoding options
-
 The tokenizer supports advanced encoding options, such as controlling normalization and pretokenization:
 
 :::code language="csharp" source="./snippets/use-tokenizers/csharp/TokenizersExamples/LlamaExample.cs" id="LlamaAdvanced":::
 
 ## Use BPE tokenizer
 
-Byte Pair Encoding (BPE) is the underlying algorithm used by many tokenizers, including Tiktoken. The following example demonstrates BPE tokenization:
+Byte-pair encoding (BPE) is the underlying algorithm used by many tokenizers, including Tiktoken. The following example demonstrates BPE tokenization:
 
 :::code language="csharp" source="./snippets/use-tokenizers/csharp/TokenizersExamples/BpeExample.cs" id="BpeBasic":::
 
-The library also provides specialized tokenizers like `BpeTokenizer` and `EnglishRobertaTokenizer` that you can configure with custom vocabularies for specific models.
+The library also provides specialized tokenizers like <xref:Microsoft.ML.Tokenizers.BpeTokenizer> and <xref:Microsoft.ML.Tokenizers.EnglishRobertaTokenizer> that you can configure with custom vocabularies for specific models.
 
 ## Common tokenizer operations
 
-All tokenizers in the library implement the `Tokenizer` base class, which provides a consistent API:
+All tokenizers in the library implement the <xref:Microsoft.ML.Tokenizers.Tokenizer> base class. The following table shows the available methods.
 
-- **`EncodeToIds`**: Converts text to a list of token IDs
-- **`Decode`**: Converts token IDs back to text
-- **`CountTokens`**: Returns the number of tokens in a text string
-- **`EncodeToTokens`**: Returns detailed token information including values and IDs
-- **`GetIndexByTokenCount`**: Finds the character index for a specific token count from the start
-- **`GetIndexByTokenCountFromEnd`**: Finds the character index for a specific token count from the end
+| Method | Description |
+|--------|-------------|
+| <xref:Microsoft.ML.Tokenizers.Tokenizer.EncodeToIds*> | Converts text to a list of token IDs |
+| <xref:Microsoft.ML.Tokenizers.Tokenizer.Decode*> | Converts token IDs back to text |
+| <xref:Microsoft.ML.Tokenizers.Tokenizer.CountTokens*> | Returns the number of tokens in a text string |
+| <xref:Microsoft.ML.Tokenizers.Tokenizer.EncodeToTokens*> | Returns detailed token information including values and IDs |
+| <xref:Microsoft.ML.Tokenizers.Tokenizer.GetIndexByTokenCount*> | Finds the character index for a specific token count from the start |
+| <xref:Microsoft.ML.Tokenizers.Tokenizer.GetIndexByTokenCountFromEnd*> | Finds the character index for a specific token count from the end |
 
 ## Migration from other libraries
 
```
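The `Tokenizer` base-class methods tabulated in this hunk can be exercised together in one short program. This is a hedged sketch rather than the docs' own snippet: the model name `gpt-4o` is an assumption for illustration, and the exact return types should be checked against your installed Microsoft.ML.Tokenizers version.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

class CommonOperations
{
    static void Main()
    {
        // Requires the Microsoft.ML.Tokenizers package plus the matching
        // data package for this model's encoding.
        Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
        string text = "Tokenization in .NET";

        // EncodeToIds: text -> token IDs.
        IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);

        // CountTokens: token count without keeping the IDs around,
        // useful for cost estimation and context-limit checks.
        int count = tokenizer.CountTokens(text);

        // Decode: token IDs -> text.
        string? decoded = tokenizer.Decode(ids);

        Console.WriteLine($"{count} tokens; decoded: \"{decoded}\"");
    }
}
```

`EncodeToTokens` follows the same pattern but returns richer per-token information (token values and IDs, not just IDs) for when you need to inspect how text was split.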