## About

Microsoft.ML.Tokenizers provides implementations of the tokenization algorithms used with NLP transformer models.

## Key Features

* Extensible tokenizer architecture that allows specializing the Normalizer, PreTokenizer, Model/Encoder, and Decoder components
* BPE (byte pair encoding) model
* English RoBERTa model
* Tiktoken model

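Beyond the Tiktoken helper shown below, a tokenizer can be composed from the other bundled models. A minimal sketch, assuming locally available BPE vocabulary and merges files (the file names here are hypothetical, and the `Bpe` constructor signature may vary across preview versions of the package):

```csharp
using Microsoft.ML.Tokenizers;

// Hypothetical local data files for a trained BPE model;
// the package does not ship them.
Tokenizer bpeTokenizer = new Tokenizer(new Bpe("vocab.json", "merges.txt"));

IReadOnlyList<int> ids = bpeTokenizer.EncodeToIds("Hello world");
```
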
## How to Use

```csharp
using Microsoft.ML.Tokenizers;

// Initialize the tokenizer for the `gpt-4` model, downloading the data files.
Tokenizer tokenizer = await Tiktoken.CreateByModelNameAsync("gpt-4");

string source = "Text tokenization is the process of splitting a string into a list of tokens.";

Console.WriteLine($"Tokens: {tokenizer.CountTokens(source)}");
// prints: Tokens: 16

var trimIndex = tokenizer.LastIndexOfTokenCount(source, 5, out string processedText, out _);
Console.WriteLine($"5 tokens from end: {processedText.Substring(trimIndex)}");
// prints: 5 tokens from end: a list of tokens.

trimIndex = tokenizer.IndexOfTokenCount(source, 5, out processedText, out _);
Console.WriteLine($"5 tokens from start: {processedText.Substring(0, trimIndex)}");
// prints: 5 tokens from start: Text tokenization is the

IReadOnlyList<int> ids = tokenizer.EncodeToIds(source);
Console.WriteLine(string.Join(", ", ids));
// prints: 1199, 4037, 2065, 374, 279, 1920, 315, 45473, 264, 925, 1139, 264, 1160, 315, 11460, 13
```
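
Continuing the example above, the encoded ids can be turned back into text. A short sketch, assuming the `Tokenizer.Decode` overload that accepts a sequence of token ids:

```csharp
// Round-trip the ids produced by EncodeToIds back to a string.
string? roundTripped = tokenizer.Decode(ids);
Console.WriteLine(roundTripped);
```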

## Main Types

The main types provided by this library are:

* `Microsoft.ML.Tokenizers.Tokenizer`
* `Microsoft.ML.Tokenizers.Bpe`
* `Microsoft.ML.Tokenizers.EnglishRoberta`
* `Microsoft.ML.Tokenizers.Tiktoken`
* `Microsoft.ML.Tokenizers.TokenizerDecoder`
* `Microsoft.ML.Tokenizers.Normalizer`
* `Microsoft.ML.Tokenizers.PreTokenizer`

## Additional Documentation

* [Conceptual documentation](TODO)
* [API documentation](https://learn.microsoft.com/en-us/dotnet/api/microsoft.ml.tokenizers)

## Related Packages

<!-- The related packages associated with this package -->

## Feedback & Contributing

Microsoft.ML.Tokenizers is released as open source under the [MIT license](https://licenses.nuget.org/MIT). Bug reports and contributions are welcome at [the GitHub repository](https://github.com/dotnet/machinelearning).