Commit 337b492: Update README.md (1 parent: 866ec4d)
README.md: 46 additions & 43 deletions

# A partition cover approach to tokenization

### GreedTok

1. Install the dependencies for the C++ code. We use oneTBB to parallelize it; the simplest way to install oneTBB is through Conda:
```
conda install tbb-devel
```
2. If you are using the Python wrapper (To-do: automate installation via pip):

a. Install pybind11:
```
pip install pybind11
```
b. Compile greedy_builder:
```
c++ -O3 -Wall -shared -std=c++20 \
  -fPIC $(python3 -m pybind11 --includes) \
  -I$CONDA_PREFIX/include/ \
  -I$CONDA_PREFIX/include/tbb \
  -I$CONDA_PREFIX/include/oneapi \
  -L$CONDA_PREFIX/lib/ \
  -ltbb \
  ./pcatt/greedy_builder.cpp \
  -o ./pcatt/greedy_builder$(python3-config --extension-suffix)
```
c. Import and use! Examples in [eval_tokenizer_example.ipynb](https://github.com/PreferredAI/aoatt/blob/main/eval_tokenizer_example.ipynb); a minimal Python sketch also appears after this list.
3. If you are using the C++ files directly:

a. Compile greedy_cache.cpp:
```
c++ -O3 -std=c++20 \
  -I$CONDA_PREFIX/include/ \
  -I$CONDA_PREFIX/include/tbb \
  -I$CONDA_PREFIX/include/oneapi \
  -L$CONDA_PREFIX/lib/ \
  -ltbb \
  pcatt/greedy_cache.cpp \
  -o pcatt/greedy.exe
```
b. Prepare the inputs (refer to [cpp_inputs](https://github.com/PreferredAI/aoatt/blob/main/cpp_inputs) for examples; a sketch that writes both files appears after this list):
* counts: a file of '\n' delimited integers
* words: a file of ' ' (space) delimited words

c. Run the compiled program (it currently looks for the domain inputs at a fixed path under cpp_inputs/*):
```
./greedy.exe <domain> <k>
```
d. We now have our ranked token sequence (refer to [cpp_outputs](https://github.com/PreferredAI/aoatt/blob/main/cpp_outputs/) for examples; a sketch that parses these files appears after this list):
* merges: the number of covers at each step, delimited by '\n'
* tokens: byte sequences in hex format, delimited by '\n'
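To make step 2c concrete, here is a minimal Python sketch of loading the compiled wrapper. The module name `greedy_builder` and its location under `pcatt/` follow from the compile command in step 2b; the call at the end is a hypothetical placeholder, not the wrapper's actual API, so refer to [eval_tokenizer_example.ipynb](https://github.com/PreferredAI/aoatt/blob/main/eval_tokenizer_example.ipynb) for the real function names.
```
import sys

# The extension was written to pcatt/ (see the -o target in step 2b),
# so make that directory importable first.
sys.path.append("pcatt")

import greedy_builder  # the compiled pybind11 module

# Hypothetical usage only: the real function names and signatures are
# defined in greedy_builder.cpp and demonstrated in the example notebook.
# words  = ["low", "lower", "lowest"]
# counts = [5, 2, 3]
# ranked_tokens = greedy_builder.build(words, counts, k=10)
```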
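For step 3b, a short Python sketch of writing the two input files from a toy word count. Only the formats (newline-delimited integers, space-delimited words, in matching order) come from the list above; the file names and the `cpp_inputs/demo_*` paths are assumptions for illustration.
```
import os
from collections import Counter

# Toy corpus; in practice the counts come from your own data.
corpus = "the cat sat on the mat the cat".split()
word_counts = Counter(corpus)

words = list(word_counts)
counts = [word_counts[w] for w in words]

os.makedirs("cpp_inputs", exist_ok=True)

# words: ' ' (space) delimited words; counts: '\n' delimited integers,
# with counts[i] corresponding to words[i]. Paths are illustrative.
with open("cpp_inputs/demo_words.txt", "w") as f:
    f.write(" ".join(words))
with open("cpp_inputs/demo_counts.txt", "w") as f:
    f.write("\n".join(str(c) for c in counts))
```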
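For step 3d, a sketch of reading the outputs back in Python: each line of the tokens file is a hex-encoded byte sequence, so `bytes.fromhex` recovers the raw token, and the merges file is a column of integers. The `cpp_outputs/demo_*` paths are placeholders; see the linked cpp_outputs examples for real files.
```
# Placeholder paths; see cpp_outputs/ in the repository for actual examples.
with open("cpp_outputs/demo_merges.txt") as f:
    covers = [int(line) for line in f if line.strip()]

with open("cpp_outputs/demo_tokens.txt") as f:
    tokens = [bytes.fromhex(line.strip()) for line in f if line.strip()]

# The i-th token was chosen at step i; covers[i] is the count recorded
# at that step (the number of covers, per step 3d above).
for rank, (tok, num_covers) in enumerate(zip(tokens, covers), start=1):
    print(rank, tok, num_covers)
```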
Evaluations in [eval_notebook.ipynb](https://github.com/PreferredAI/aoatt/blob/main/eval_notebook.ipynb)