|
1 | | -# An optimization approach to tokenization |
| 1 | +# A partition cover approach to tokenization |
2 | 2 |
|
3 | | -### Greedy Approximate Solution |
4 | | -1. Install dependencies for C++ code, we use oneTBB to parallelize the code: |
| 3 | +### GreedTok |
| 4 | +1. Install dependencies for the C++ code. We use oneTBB to parallelize the code; the simplest way to install it is with Conda:
5 | 5 | ``` |
6 | | -wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/96aa5993-5b22-4a9b-91ab-da679f422594/intel-oneapi-base-toolkit-2025.0.0.885_offline.sh |
7 | | -sudo sh ./intel-oneapi-base-toolkit-2025.0.0.885_offline.sh -a --cli |
8 | | -cd <install_dir> |
| 6 | +conda install tbb-devel |
9 | 7 | ``` |
10 | | -2. Initialize environment variables: |
11 | | -``` |
12 | | -cd <install_dir> |
13 | | -. ./oneapi/tbb/latest/env/vars.sh |
14 | | -``` |
15 | | -3. Compile greedy_cache.cpp: |
16 | | -``` |
17 | | -c++ -std=c++20 -o greedy.exe greedy_cache.cpp -ltbb -O3 |
18 | | -``` |
19 | | -4. Prepare inputs (refer to [cpp_inputs](https://github.com/PreferredAI/aoatt/blob/main/cpp_inputs) for examples): |
20 | | - * counts: a file with '\n' delimited integers |
21 | | - * words: a file with ' ' (space) delimited words |
22 | | -5. Run compiled program |
23 | | - * currently looks for domain inputs in fixed path under cpp_inputs/* |
24 | | - * To-do: pybind11 implementation |
25 | | -``` |
26 | | -./greedy.exe <domain> <k> |
27 | | -``` |
28 | | - |
29 | | -6. Now we obtained our ranked token sequence (refer to [cpp_outputs](https://github.com/PreferredAI/aoatt/blob/main/cpp_outputs/) for examples): |
30 | | - * merges: the number of merges at each step, delimited by '\n' |
31 | | - * tokens: byte sequences in hex-format, delimited by '\n' |
32 | | - * If only valid byte sequences are required, we have to prune the candidate token space [To-do] |
33 | | - * Current implementation sees every possible substring |
34 | | - |
35 | | -### Using the obtained ranked token sequence |
36 | | -To use the tokenizer, we also need the previous oneTBB dependency. |
37 | | -1. Additionally, install pybind11 dependency, simply: |
38 | | -``` |
39 | | -pip3 install pybind11 |
40 | | -``` |
41 | | -2. Compile greedy_builder.cpp |
42 | | -``` |
43 | | -c++ -O3 -Wall -shared -std=c++20 -ltbb -fPIC $(python3 -m pybind11 --includes) greedy_builder.cpp -o greedy_builder$(python3-config --extension-suffix) |
44 | | -``` |
45 | | -3. Import in python |
| 8 | +2. If using the Python wrapper (to-do: automate pip installation):
46 | 9 |
|
47 | | -Examples in [eval_tokenizer_example.ipynb](https://github.com/PreferredAI/aoatt/blob/main/eval_tokenizer_example.ipynb) |
| 10 | + a. Install pybind11, simply: |
| 11 | + ``` |
| 12 | + pip install pybind11 |
| 13 | + ``` |
| 14 | + b. Compile greedy_builder |
| 15 | + ``` |
| 16 | + c++ -O3 -Wall -shared -std=c++20 \ |
| 17 | + -fPIC $(python3 -m pybind11 --includes) \ |
| 18 | + -I$CONDA_PREFIX/include/ \ |
| 19 | + -I$CONDA_PREFIX/include/tbb \ |
| 20 | + -I$CONDA_PREFIX/include/oneapi \ |
| 21 | + -L$CONDA_PREFIX/lib/ \ |
| 22 | + -l tbb \ |
| 23 | + ./pcatt/greedy_builder.cpp \ |
| 24 | + -o ./pcatt/greedy_builder$(python3-config --extension-suffix) |
| 25 | + ``` |
| 26 | + c. Import and use. Examples are in [eval_tokenizer_example.ipynb](https://github.com/PreferredAI/aoatt/blob/main/eval_tokenizer_example.ipynb)
| 27 | +3. If using C++ files directly |
| 28 | +
|
| 29 | + a. Compile greedy_cache.cpp
| 30 | + ``` |
| 31 | + c++ -O3 -std=c++20 \ |
| 32 | + -I$CONDA_PREFIX/include/ \ |
| 33 | + -I$CONDA_PREFIX/include/tbb \ |
| 34 | + -I$CONDA_PREFIX/include/oneapi \ |
| 35 | + -L$CONDA_PREFIX/lib/ \ |
| 36 | + -l tbb \ |
| 37 | + pcatt/greedy_cache.cpp \ |
| 38 | + -o pcatt/greedy.exe |
| 39 | + ``` |
| 40 | + b. Prepare inputs (refer to [cpp_inputs](https://github.com/PreferredAI/aoatt/blob/main/cpp_inputs) for examples): |
| 41 | + * counts: a file with '\n' delimited integers |
| 42 | + * words: a file with ' ' (space) delimited words |
| 43 | + |
| 44 | + c. Run compiled program (currently looks for domain inputs in fixed path under cpp_inputs/*) |
| 45 | + ``` |
| 46 | + ./greedy.exe <domain> <k> |
| 47 | + ``` |
| 48 | + d. The program outputs the ranked token sequence (refer to [cpp_outputs](https://github.com/PreferredAI/aoatt/blob/main/cpp_outputs/) for examples):
| 49 | + * merges: the number of covers at each step, delimited by '\n' |
| 50 | + * tokens: byte sequences in hex-format, delimited by '\n' |
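The input and output layouts described in step 3 can be illustrated with a small Python sketch. It is not part of the repository: `format_inputs` and `parse_tokens` are hypothetical helper names, and the sketch assumes that the i-th integer in `counts` belongs to the i-th word in `words` and that each line of `tokens` is plain hex with no prefix. Check the files under cpp_inputs and cpp_outputs before relying on either assumption.

```python
from collections import Counter


def format_inputs(text: str) -> tuple[str, str]:
    """Build the contents of the `words` and `counts` files from raw text.

    Assumption: counts[i] is the frequency of words[i]. Words are split on
    whitespace, so no word can itself contain a space.
    """
    items = Counter(text.split()).most_common()
    words = " ".join(word for word, _ in items)        # ' ' delimited words
    counts = "\n".join(str(count) for _, count in items)  # '\n' delimited integers
    return words, counts


def parse_tokens(hex_text: str) -> list[bytes]:
    """Decode the `tokens` output: one hex-encoded byte sequence per line.

    Assumption: lines are bare hex digits (no '0x' prefix); blank lines are skipped.
    """
    return [bytes.fromhex(line) for line in hex_text.splitlines() if line.strip()]
```

The two strings returned by `format_inputs` would be written to the `words` and `counts` files, and the contents of the `tokens` file would be passed to `parse_tokens` to recover the byte sequences in rank order.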
48 | 51 |
|
49 | 52 | Evaluations in [eval_notebook.ipynb](https://github.com/PreferredAI/aoatt/blob/main/eval_notebook.ipynb) |
50 | 53 |