@@ -7,6 +7,9 @@ OpenAI's models.
import tiktoken
enc = tiktoken.get_encoding("gpt2")
assert enc.decode(enc.encode("hello world")) == "hello world"
+
+ # To get the tokeniser corresponding to a specific model in the OpenAI API:
+ enc = tiktoken.encoding_for_model("text-davinci-003")
```

The open source version of `tiktoken` can be installed from PyPI:
@@ -16,7 +19,9 @@ pip install tiktoken

The tokeniser API is documented in `tiktoken/core.py`.

- Example code using `tiktoken` can be found in the [OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
+ Example code using `tiktoken` can be found in the
+ [OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
+

## Performance

@@ -28,3 +33,72 @@ Performance measured on 1GB of text using the GPT-2 tokeniser, using `GPT2Tokeni
`tokenizers==0.13.2` and `transformers==4.24.0`.


+ ## Getting help
+
+ Please post questions in the [issue tracker](https://github.com/openai/tiktoken/issues).
+
+ If you work at OpenAI, make sure to check the internal documentation or feel free to contact
+ @shantanu.
+
+
+ ## Extending tiktoken
+
+ You may wish to extend `tiktoken` to support new encodings. There are two ways to do this.
+
+
+ **Create your `Encoding` object exactly the way you want and simply pass it around.**
+
+ ```python
+ cl100k_base = tiktoken.get_encoding("cl100k_base")
+
+ # In production, load the arguments directly instead of accessing private attributes
+ # See openai_public.py for examples of arguments for specific encodings
+ enc = tiktoken.Encoding(
+     # If you're changing the set of special tokens, make sure to use a different name
+     # It should be clear from the name what behaviour to expect.
+     name="cl100k_im",
+     pat_str=cl100k_base._pat_str,
+     mergeable_ranks=cl100k_base._mergeable_ranks,
+     special_tokens={
+         **cl100k_base._special_tokens,
+         "<|im_start|>": 100264,
+         "<|im_end|>": 100265,
+     }
+ )
+ ```
+
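The IDs chosen for new special tokens (100264 and 100265 in the example above) should not collide with IDs already used by the mergeable ranks or by existing special tokens. A minimal sketch of such a check follows. `find_id_collisions` is a hypothetical helper, not part of `tiktoken`, and the toy byte-level vocabulary is an assumption standing in for real encoding data:

```python
def find_id_collisions(mergeable_ranks, special_tokens):
    """Return the sorted token IDs that are assigned to more than one token."""
    seen = set()
    collisions = set()
    for token_id in list(mergeable_ranks.values()) + list(special_tokens.values()):
        if token_id in seen:
            collisions.add(token_id)
        seen.add(token_id)
    return sorted(collisions)


# Toy data (assumption): a 256-entry byte-level vocabulary with no merges.
toy_ranks = {bytes([i]): i for i in range(256)}

good_specials = {"<|im_start|>": 256, "<|im_end|>": 257}
bad_specials = {"<|im_start|>": 255, "<|im_end|>": 257}  # 255 is already a rank

print(find_id_collisions(toy_ranks, good_specials))  # []
print(find_id_collisions(toy_ranks, bad_specials))   # [255]
```

Running a check like this before constructing the `Encoding` can catch an accidental overlap early.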
+ **Use the `tiktoken_ext` plugin mechanism to register your `Encoding` objects with `tiktoken`.**
+
+ This is only useful if you need `tiktoken.get_encoding` to find your encoding; otherwise, prefer
+ option 1.
+
+ To do this, you'll need to create a namespace package under `tiktoken_ext`.
+
+ Lay out your project like this, making sure to omit the `tiktoken_ext/__init__.py` file:
+ ```
+ my_tiktoken_extension
+ ├── tiktoken_ext
+ │   └── my_encodings.py
+ └── setup.py
+ ```
+
+ `my_encodings.py` should be a module that contains a variable named `ENCODING_CONSTRUCTORS`.
+ This is a dictionary from an encoding name to a function that takes no arguments and returns
+ arguments that can be passed to `tiktoken.Encoding` to construct that encoding. For an example, see
+ `tiktoken_ext/openai_public.py`. For precise details, see `tiktoken/registry.py`.
+
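As an illustration of the shape described above, a `my_encodings.py` might look like the following sketch. The encoding name `my_toy_encoding`, the split pattern, the byte-level ranks, and the special token are all made up for this example. The dictionary keys mirror the keyword arguments used in the `tiktoken.Encoding(...)` call earlier in this section:

```python
# tiktoken_ext/my_encodings.py (hypothetical contents)


def my_toy_encoding():
    """Return keyword arguments for tiktoken.Encoding (toy values, not a real vocabulary)."""
    # Assumption: one rank per byte value and no merges, so every input byte is a token.
    mergeable_ranks = {bytes([i]): i for i in range(256)}
    return {
        "name": "my_toy_encoding",
        # Assumption: a simple split pattern chosen only for illustration.
        "pat_str": r"\s+|\S+",
        "mergeable_ranks": mergeable_ranks,
        # Special token IDs start after the last mergeable rank.
        "special_tokens": {"<|endoftext|>": len(mergeable_ranks)},
    }


# Maps each encoding name to a zero-argument function returning the arguments above.
ENCODING_CONSTRUCTORS = {"my_toy_encoding": my_toy_encoding}
```

Under these assumptions, `tiktoken.Encoding(**ENCODING_CONSTRUCTORS["my_toy_encoding"]())` would be the equivalent manual construction.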
+ Your `setup.py` should look something like this:
+ ```python
+ from setuptools import setup, find_namespace_packages
+
+ setup(
+     name="my_tiktoken_extension",
+     packages=find_namespace_packages(include=['tiktoken_ext.*']),
+     install_requires=["tiktoken"],
+     ...
+ )
+ ```
+
+ Then simply `pip install my_tiktoken_extension` and you should be able to use your custom encodings!
+ Make sure **not** to use an editable install.
+
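The plugin mechanism relies on `tiktoken_ext` being a namespace package, which is why `tiktoken_ext/__init__.py` has to be omitted. The stdlib-only sketch below shows how modules inside a namespace package can be discovered with `pkgutil`. It uses a hypothetical package name, `demo_ext`, in place of `tiktoken_ext`, and it approximates the idea only; `tiktoken/registry.py` remains the reference for the actual behaviour:

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path

# "demo_ext" is a hypothetical stand-in for tiktoken_ext, used so this sketch
# cannot interfere with a real tiktoken installation.
NAMESPACE = "demo_ext"

PLUGIN_SOURCE = '''
def toy_encoding():
    return {"name": "toy_encoding"}

ENCODING_CONSTRUCTORS = {"toy_encoding": toy_encoding}
'''


def discover_constructors(namespace):
    """Collect ENCODING_CONSTRUCTORS from every module in a namespace package."""
    package = importlib.import_module(namespace)
    constructors = {}
    for module_info in pkgutil.iter_modules(package.__path__, package.__name__ + "."):
        module = importlib.import_module(module_info.name)
        constructors.update(module.ENCODING_CONSTRUCTORS)
    return constructors


with tempfile.TemporaryDirectory() as root:
    # No __init__.py is written, so demo_ext is a namespace package.
    plugin_dir = Path(root) / NAMESPACE
    plugin_dir.mkdir()
    (plugin_dir / "my_encodings.py").write_text(PLUGIN_SOURCE)

    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        found = discover_constructors(NAMESPACE)
    finally:
        sys.path.remove(root)

print(sorted(found))
```

Because no `__init__.py` claims the package, several independently installed distributions could each contribute modules to the same namespace.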