Commit 18a1124

Update README.md
1 parent 45a7ae0 commit 18a1124

1 file changed

+39
-4
lines changed

README.md

Lines changed: 39 additions & 4 deletions
@@ -1,17 +1,52 @@
1-
# aoatt
1+
# An optimization approach to tokenization
22

3-
Install dependencies for C++ code
3+
### Greedy Approximate Solution
4+
1. Install the dependencies for the C++ code. We use oneTBB for parallelization:
45
```
56
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/96aa5993-5b22-4a9b-91ab-da679f422594/intel-oneapi-base-toolkit-2025.0.0.885_offline.sh
67
sudo sh ./intel-oneapi-base-toolkit-2025.0.0.885_offline.sh -a --cli
78
cd <install_dir>
89
```
9-
Initialize environment variables
10+
2. Initialize environment variables:
1011
```
1112
cd <install_dir>
1213
. ./oneapi/tbb/latest/env/vars.sh
1314
```
15+
3. Compile greedy_cache.cpp:
16+
```
17+
c++ -std=c++20 -o greedy.exe greedy_cache.cpp -ltbb -O3
18+
```
19+
4. Prepare inputs (refer to [cpp_inputs](https://github.com/PreferredAI/aoatt/blob/main/cpp_inputs) for examples):
20+
* counts: a file of '\n'-delimited integers
21+
* words: a file of ' ' (space)-delimited words
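
As a rough illustration of the two input files described in step 4, the sketch below builds them from a small in-memory corpus. The helper `write_inputs`, the file names `words.txt` and `counts.txt`, and the assumption that the i-th count belongs to the i-th word are all hypothetical; check the examples under cpp_inputs for the layout the program actually expects.

```python
from collections import Counter
from pathlib import Path


def write_inputs(corpus: str, out_dir: Path) -> tuple[Path, Path]:
    """Write a space-delimited words file and a newline-delimited counts file.

    Assumption (not confirmed by the README): the i-th count belongs to
    the i-th word. The file names are hypothetical placeholders.
    """
    freq = Counter(corpus.split())
    # Most frequent words first; ties broken alphabetically for determinism.
    items = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))

    out_dir.mkdir(parents=True, exist_ok=True)
    words_path = out_dir / "words.txt"
    counts_path = out_dir / "counts.txt"
    words_path.write_text(" ".join(w for w, _ in items), encoding="utf-8")
    counts_path.write_text("\n".join(str(c) for _, c in items), encoding="utf-8")
    return words_path, counts_path
```

Sorting by frequency is a choice made here for readability only; nothing in the README says the program requires a particular order.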
22+
5. Run the compiled program:
23+
* The program currently looks for domain inputs at a fixed path under cpp_inputs/*
24+
* To-do: pybind11 implementation
25+
```
26+
./greedy.exe <domain> <k>
27+
```
28+
29+
6. We have now obtained the ranked token sequence (refer to [cpp_outputs](https://github.com/PreferredAI/aoatt/blob/main/cpp_outputs/) for examples):
30+
* merges: the number of merges at each step, delimited by '\n'
31+
* tokens: byte sequences in hex format, delimited by '\n'
32+
* If only valid byte sequences are required, the candidate token space has to be pruned [To-do]
33+
* The current implementation considers every possible substring
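
The output format and the pruning note above can be sketched in Python. This is a minimal sketch, not the repository's implementation: the helper names `parse_hex_tokens` and `keep_valid_utf8` are hypothetical, the hex lines are assumed to be plain hex digits with no `0x` prefix, and "valid byte sequence" is read here as "decodes as UTF-8", which the README does not confirm. It filters the finished token list after the fact rather than pruning the candidate space during the search.

```python
def parse_hex_tokens(text: str) -> list[bytes]:
    """Parse newline-delimited hex strings into byte tokens, keeping rank order."""
    return [bytes.fromhex(line.strip()) for line in text.splitlines() if line.strip()]


def keep_valid_utf8(tokens: list[bytes]) -> list[bytes]:
    """Drop tokens that do not decode as UTF-8 (one possible reading of 'valid')."""
    valid = []
    for tok in tokens:
        try:
            tok.decode("utf-8")
        except UnicodeDecodeError:
            continue
        valid.append(tok)
    return valid


# Hypothetical sample in the described format:
# "th", "é" (two bytes), and a lone UTF-8 continuation byte.
sample = "7468\nc3a9\na9\n"
tokens = parse_hex_tokens(sample)
valid_tokens = keep_valid_utf8(tokens)
```

To apply this to real output, pass the contents of a tokens file to `parse_hex_tokens` in place of `sample`.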
1434

35+
### Using the obtained ranked token sequence
36+
Using the tokenizer also requires the oneTBB dependency installed above.
37+
1. Additionally, install the pybind11 dependency:
38+
```
39+
pip3 install pybind11
40+
```
41+
2. Compile greedy_builder.cpp:
1542
```
1643
c++ -O3 -Wall -shared -std=c++20 -ltbb -fPIC $(python3 -m pybind11 --includes) greedy_builder.cpp -o greedy_builder$(python3-config --extension-suffix)
17-
```
44+
```
45+
3. Import the compiled module in Python
46+
47+
Examples in [eval_tokenizer_example.ipynb](https://github.com/PreferredAI/aoatt/blob/main/eval_tokenizer_example.ipynb)
48+
49+
Evaluations in [eval_notebook.ipynb](https://github.com/PreferredAI/aoatt/blob/main/eval_notebook.ipynb)
50+
51+
### Citation
52+
TBD
