Description
Add a conceptual doc for using the Microsoft.ML.Tokenizers package, which is technically part of the ML.NET set of libraries. It's an area where we don't have much documentation beyond what's in the NuGet README, but it has received more investment in the past few months.
The new article should live with either the ML.NET docs or the .NET AI docs.
Content from NuGet readme:
About
Microsoft.ML.Tokenizers provides an abstraction for tokenizers as well as implementations of common tokenization algorithms.
Key Features
- Extensible tokenizer architecture that allows for specialization of Normalizer, PreTokenizer, Model/Encoder, Decoder (see the sketch after this list)
- BPE - Byte pair encoding model
- English Roberta model
- Tiktoken model
- Llama model
- Phi2 model
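A minimal sketch of how those pieces might be composed for the BPE model. The exact BpeTokenizer.Create overload and the vocab/merges file names are assumptions based on the package's naming conventions, not taken from the README; check the API reference before relying on them.
using Microsoft.ML.Tokenizers;
using System.IO;
// Assumption: BpeTokenizer.Create accepts vocab/merges streams (optional PreTokenizer
// and Normalizer arguments are also assumed to exist but are not passed here).
using Stream vocab = File.OpenRead("vocab.json");   // hypothetical local vocab file
using Stream merges = File.OpenRead("merges.txt");  // hypothetical local merges file
Tokenizer bpeTokenizer = BpeTokenizer.Create(vocab, merges);
int count = bpeTokenizer.CountTokens("Hello, world!");
Console.WriteLine($"Tokens: {count}");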
 
How to Use
using Microsoft.ML.Tokenizers;
using System.IO;
using System.Net.Http;
//
// Using Tiktoken Tokenizer
//
// Initialize the tokenizer for the `gpt-4o` model. This instance should be cached for all subsequent use.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
string source = "Text tokenization is the process of splitting a string into a list of tokens.";
Console.WriteLine($"Tokens: {tokenizer.CountTokens(source)}");
// prints: Tokens: 16
var trimIndex = tokenizer.GetIndexByTokenCountFromEnd(source, 5, out string processedText, out _);
Console.WriteLine($"5 tokens from end: {processedText.Substring(trimIndex)}");
// prints: 5 tokens from end:  a list of tokens.
trimIndex = tokenizer.GetIndexByTokenCount(source, 5, out processedText, out _);
Console.WriteLine($"5 tokens from start: {processedText.Substring(0, trimIndex)}");
// prints: 5 tokens from start: Text tokenization is the
IReadOnlyList<int> ids = tokenizer.EncodeToIds(source);
Console.WriteLine(string.Join(", ", ids));
// prints: 1199, 4037, 2065, 374, 279, 1920, 315, 45473, 264, 925, 1139, 264, 1160, 315, 11460, 13
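// Not shown in the README snippet: a hedged sketch of the reverse direction, assuming the
// Tokenizer base class exposes Decode(IEnumerable<int>) as it does in recent package releases.
string decoded = tokenizer.Decode(ids);
Console.WriteLine(decoded);
// expected to print the original sentence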
//
// Using Llama Tokenizer
//
// Open a stream to the remote Llama tokenizer model data file.
using HttpClient httpClient = new();
const string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model";
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);
// Create the Llama tokenizer using the remote stream. This should be cached for all subsequent use.
Tokenizer llamaTokenizer = LlamaTokenizer.Create(remoteStream);
string input = "Hello, world!";
ids = llamaTokenizer.EncodeToIds(input);
Console.WriteLine(string.Join(", ", ids));
// prints: 1, 15043, 29892, 3186, 29991
Console.WriteLine($"Tokens: {llamaTokenizer.CountTokens(input)}");
// prints: Tokens: 5
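Both examples above note that the tokenizer instance should be created once and reused. A minimal sketch of that caching pattern; the class and field names are illustrative rather than part of the package:
using Microsoft.ML.Tokenizers;
// Illustrative caching pattern: construct the tokenizer once and reuse it,
// because building the model (and any remote download) is comparatively expensive.
static class TokenizerCache
{
    // Only TiktokenTokenizer.CreateForModel comes from the README; the names here are hypothetical.
    public static readonly Tokenizer Gpt4o = TiktokenTokenizer.CreateForModel("gpt-4o");
}
// Usage: int count = TokenizerCache.Gpt4o.CountTokens("some text");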
Main Types
The main types provided by this library are (a small sketch of using them through the base class follows the list):
- Microsoft.ML.Tokenizers.Tokenizer
- Microsoft.ML.Tokenizers.BpeTokenizer
- Microsoft.ML.Tokenizers.EnglishRobertaTokenizer
- Microsoft.ML.Tokenizers.TiktokenTokenizer
- Microsoft.ML.Tokenizers.Normalizer
- Microsoft.ML.Tokenizers.PreTokenizer
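Because every concrete tokenizer derives from the abstract Tokenizer type, helpers can be written once against the base class. A small sketch that mirrors the GetIndexByTokenCount call from the example above; the helper name is hypothetical:
using Microsoft.ML.Tokenizers;
// Hypothetical helper: trims text to at most maxTokens tokens using any Tokenizer implementation.
static string TruncateToTokenBudget(Tokenizer tokenizer, string text, int maxTokens)
{
    // Same call pattern as the README's Tiktoken example.
    int index = tokenizer.GetIndexByTokenCount(text, maxTokens, out string processedText, out _);
    return processedText.Substring(0, index);
}
// Usage: TruncateToTokenBudget(TiktokenTokenizer.CreateForModel("gpt-4o"), someText, 100);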
Additional Documentation