1 | | -# aoatt |
| 1 | +# An optimization approach to tokenization |
2 | 2 |
3 | | -Install dependencies for C++ code |
| 3 | +### Greedy Approximate Solution |
| 4 | +1. Install the dependencies for the C++ code. We use oneTBB to parallelize the code:
4 | 5 | ``` |
5 | 6 | wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/96aa5993-5b22-4a9b-91ab-da679f422594/intel-oneapi-base-toolkit-2025.0.0.885_offline.sh |
6 | 7 | sudo sh ./intel-oneapi-base-toolkit-2025.0.0.885_offline.sh -a --cli |
7 | 8 | cd <install_dir> |
8 | 9 | ``` |
9 | | -Initialize environment variables |
| 10 | +2. Initialize environment variables: |
10 | 11 | ``` |
11 | 12 | cd <install_dir> |
12 | 13 | . ./oneapi/tbb/latest/env/vars.sh |
13 | 14 | ``` |
| 15 | +3. Compile greedy_cache.cpp: |
| 16 | +``` |
| 17 | +c++ -std=c++20 -o greedy.exe greedy_cache.cpp -ltbb -O3 |
| 18 | +``` |
| 19 | +4. Prepare inputs (refer to [cpp_inputs](https://github.com/PreferredAI/aoatt/blob/main/cpp_inputs) for examples): |
| 20 | + * counts: a file with '\n' delimited integers |
| 21 | + * words: a file with ' ' (space) delimited words |
| 22 | +5. Run the compiled program:
| 23 | + * It currently looks for the domain inputs at a fixed path under cpp_inputs/*
| 24 | + * To-do: pybind11 implementation |
| 25 | +``` |
| 26 | +./greedy.exe <domain> <k> |
| 27 | +``` |
| 28 | + |
| 29 | +6. The result is the ranked token sequence (refer to [cpp_outputs](https://github.com/PreferredAI/aoatt/blob/main/cpp_outputs/) for examples):
| 30 | + * merges: the number of merges at each step, delimited by '\n' |
| 31 | + * tokens: byte sequences in hex-format, delimited by '\n' |
| 32 | + * If only valid byte sequences are required, the candidate token space has to be pruned [To-do]
| 33 | + * The current implementation considers every possible substring
14 | 34 |
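The input and output file layouts described in steps 4 and 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not part of the repository: the helper names (`write_inputs`, `read_tokens`, `is_valid_utf8`) and all file paths are hypothetical, the i-th count is assumed to belong to the i-th word, the hex lines are assumed to be plain hex digits without a `0x` prefix, and "valid byte sequence" is assumed to mean valid UTF-8. The actual file names expected under cpp_inputs/* are not specified here, so check the linked examples before use.

```python
from collections import Counter
from pathlib import Path


def write_inputs(corpus: str, counts_path: Path, words_path: Path) -> None:
    """Write word frequencies as a counts file and a words file."""
    freq = Counter(corpus.split())
    # Assumption: the i-th count corresponds to the i-th word.
    words = sorted(freq, key=lambda w: (-freq[w], w))
    # counts: '\n' delimited integers
    counts_path.write_text("\n".join(str(freq[w]) for w in words), encoding="utf-8")
    # words: ' ' (space) delimited words
    words_path.write_text(" ".join(words), encoding="utf-8")


def read_tokens(tokens_path: Path) -> list[bytes]:
    """Decode '\n' delimited hex strings into raw byte sequences."""
    lines = tokens_path.read_text(encoding="ascii").splitlines()
    # Assumption: each line is plain hex digits, e.g. "6162" for b"ab".
    return [bytes.fromhex(line) for line in lines if line.strip()]


def is_valid_utf8(token: bytes) -> bool:
    """Return True if the token decodes as UTF-8 on its own."""
    try:
        token.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Filtering the decoded tokens with `is_valid_utf8` only hides invalid tokens after the fact. It is not a substitute for pruning the candidate token space during the greedy search, which remains a to-do item.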
| 35 | +### Using the obtained ranked token sequence |
| 36 | +To use the tokenizer, the oneTBB dependency installed above is also required.
| 37 | +1. Additionally, install the pybind11 dependency:
| 38 | +``` |
| 39 | +pip3 install pybind11 |
| 40 | +``` |
| 41 | +2. Compile greedy_builder.cpp:
15 | 42 | ``` |
16 | 43 | c++ -O3 -Wall -shared -std=c++20 -ltbb -fPIC $(python3 -m pybind11 --includes) greedy_builder.cpp -o greedy_builder$(python3-config --extension-suffix) |
17 | | -``` |
| 44 | +``` |
| 45 | +3. Import in Python
| 46 | + |
| 47 | +Examples in [eval_tokenizer_example.ipynb](https://github.com/PreferredAI/aoatt/blob/main/eval_tokenizer_example.ipynb) |
| 48 | + |
| 49 | +Evaluations in [eval_notebook.ipynb](https://github.com/PreferredAI/aoatt/blob/main/eval_notebook.ipynb) |
| 50 | + |
| 51 | +### Citation |
| 52 | +TBD |