Move the Tokenizer's data into separate packages. (dotnet#7248)
* Move the Tokenizer's data into separate packages.
* Address the feedback
* More feedback addressing
* More feedback addressing
* Trimming/AoT support
* Make data types internal
<PackageDescription>The Microsoft.ML.Tokenizers.Data.Cl100kBase package includes the Tiktoken tokenizer data file cl100k_base.tiktoken, which is used by models such as GPT-4.</PackageDescription>
</PropertyGroup>

<ItemGroup>
<!--
The following files are compressed using DeflateStream and embedded as resources in the assembly.
The files are downloaded from the following sources and compressed to the Destination.
The `Microsoft.ML.Tokenizers.Data.Cl100kBase` package includes the Tiktoken tokenizer data file `cl100k_base.tiktoken`, which is used by models such as GPT-4.

## Key Features

* This package contains the `cl100k_base.tiktoken` data file used by the Tiktoken tokenizer for the following models:
    1. gpt-4
    2. gpt-3.5-turbo
    3. gpt-3.5-turbo-16k
    4. gpt-35
    5. gpt-35-turbo
    6. gpt-35-turbo-16k
    7. text-embedding-ada-002
    8. text-embedding-3-small
    9. text-embedding-3-large

## How to Use
Reference this package in your project to use the Tiktoken tokenizer with the specified models.
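
As a rough illustration of what referencing the package could look like, the fragment below adds it to a consuming project file. This is a sketch under stated assumptions: the `x.y.z` version values are placeholders rather than real release numbers, and listing `Microsoft.ML.Tokenizers` explicitly is an assumption made for clarity, since it may already be pulled in as a dependency of the data package.

```xml
<!-- Illustrative project-file fragment; the version values are placeholders, not real releases. -->
<ItemGroup>
  <!-- Core tokenizer library (named as the related package in this README). -->
  <PackageReference Include="Microsoft.ML.Tokenizers" Version="x.y.z" />
  <!-- Data package that carries the embedded cl100k_base.tiktoken resource. -->
  <PackageReference Include="Microsoft.ML.Tokenizers.Data.Cl100kBase" Version="x.y.z" />
</ItemGroup>
```

Replace the placeholders with the versions published for your target release before restoring packages.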
```csharp
// Create a tokenizer for the specified model or any other listed model name
<!-- The related packages associated with this package -->
Microsoft.ML.Tokenizers

## Feedback & Contributing

Microsoft.ML.Tokenizers.Data.Cl100kBase is released as open source under the [MIT license](https://licenses.nuget.org/MIT). Bug reports and contributions are welcome at [the GitHub repository](https://github.com/dotnet/machinelearning).
<PackageDescription>The Microsoft.ML.Tokenizers.Data.Gpt2 package includes the Tiktoken tokenizer data file gpt2.tiktoken, which is used by models such as GPT-2.</PackageDescription>
</PropertyGroup>

<ItemGroup>
<!--
The following files are compressed using DeflateStream and embedded as resources in the assembly.
The files are downloaded from the following sources and compressed to the Destination.
<!-- The related packages associated with this package -->
Microsoft.ML.Tokenizers

## Feedback & Contributing

Microsoft.ML.Tokenizers.Data.Gpt2 is released as open source under the [MIT license](https://licenses.nuget.org/MIT). Bug reports and contributions are welcome at [the GitHub repository](https://github.com/dotnet/machinelearning).