
Commit 8863021

add compatibility to huggingface interface
1 parent 4e21361

File tree

13 files changed: +3454 −2 lines

13 files changed

+3454
-2
lines changed

.gitattributes

Lines changed: 3 additions & 1 deletion
@@ -6,4 +6,6 @@ pcatt/slow_bpe.py export-ignore
 pcatt/greedy_cache.cpp export-ignore
 pcatt/max_cover.cpp export-ignore
 eval_notebook.ipynb export-ignore
-eval_tokenizer_example.ipynb export-ignore
+eval_tokenizer_example.ipynb export-ignore
+pcatt/hf/examples export-ignore
+eval_hf.ipynb export-ignore

README.md

Lines changed: 23 additions & 1 deletion
@@ -2,7 +2,29 @@
 In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm **GreedTok**.
 Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple $(1 - 1/e)$-approximation greedy algorithm.
 
-To do: Huggingface AutoTokenizer interface
+### Beta: Huggingface AutoTokenizer interface
+
+For "training", use either:
+```
+from pcatt.hf.greedtok import GreedTok
+greedtok = GreedTok().train_new_from_iterator(word_iterator, 100, max_token_length=5, min_word_count=1)
+```
+or
+```
+from pcatt.hf.greedtok import GreedTok
+greedtok = GreedTok().train_new_from_counts(word_count_dict, 100, max_token_length=5, min_word_count=1)
+```
+To load a trained tokenizer, use either:
+```
+from pcatt.hf.greedtok import GreedTok
+greedtok = GreedTok.from_pretrained(greedtok_file_directory)
+```
+or
+```
+import pcatt.hf
+from transformers import AutoTokenizer
+greedtok = AutoTokenizer.from_pretrained("greedtok_file_directory")
+```
+Refer to [eval_hf.ipynb](https://github.com/PreferredAI/aoatt/blob/main/eval_hf.ipynb)
 
 ### GreedTok
 1. If using python wrapper
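
The `train_new_from_counts` call in the README above takes a word-count mapping. As a minimal sketch of preparing that input with only the standard library (the exact shape GreedTok expects is an assumption here; `eval_hf.ipynb` is authoritative):

```python
from collections import Counter

# Hypothetical pre-tokenization: split a toy corpus into whitespace-separated
# words and count occurrences. The resulting mapping (word -> count) is the
# assumed shape of word_count_dict passed to train_new_from_counts.
corpus = ["the cat sat on the mat", "the cat ate"]
word_count_dict = Counter(word for line in corpus for word in line.split())

print(word_count_dict["the"])  # 3
print(word_count_dict["cat"])  # 2
```

A real pipeline would stream documents instead of a list, which is what the `train_new_from_iterator` variant shown above is for.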

eval_hf.ipynb

Lines changed: 920 additions & 0 deletions
Large diffs are not rendered by default.

pcatt/__init__.py

Whitespace-only changes.

pcatt/hf/__init__.py

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+from pcatt.hf.greedtok import GreedTok
+from transformers import PretrainedConfig, AutoConfig, AutoTokenizer
+
+
+class GreedTokConfig(PretrainedConfig):
+    model_type = "greedtok"
+
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+
+
+AutoConfig.register("greedtok", GreedTokConfig)
+AutoTokenizer.register(GreedTokConfig, GreedTok)
Lines changed: 265 additions & 0 deletions
@@ -0,0 +1,265 @@
+6161
+6262
+616263
+6263
+3132
+313233
+3334
+3c7061643e
+3c656f733e
+00
+01
+02
+03
+04
+05
+06
+07
+08
+09
+0a
+0b
+0c
+0d
+0e
+0f
+10
+11
+12
+13
+14
+15
+16
+17
+18
+19
+1a
+1b
+1c
+1d
+1e
+1f
+20
+21
+22
+23
+24
+25
+26
+27
+28
+29
+2a
+2b
+2c
+2d
+2e
+2f
+30
+31
+32
+33
+34
+35
+36
+37
+38
+39
+3a
+3b
+3c
+3d
+3e
+3f
+40
+41
+42
+43
+44
+45
+46
+47
+48
+49
+4a
+4b
+4c
+4d
+4e
+4f
+50
+51
+52
+53
+54
+55
+56
+57
+58
+59
+5a
+5b
+5c
+5d
+5e
+5f
+60
+61
+62
+63
+64
+65
+66
+67
+68
+69
+6a
+6b
+6c
+6d
+6e
+6f
+70
+71
+72
+73
+74
+75
+76
+77
+78
+79
+7a
+7b
+7c
+7d
+7e
+7f
+80
+81
+82
+83
+84
+85
+86
+87
+88
+89
+8a
+8b
+8c
+8d
+8e
+8f
+90
+91
+92
+93
+94
+95
+96
+97
+98
+99
+9a
+9b
+9c
+9d
+9e
+9f
+a0
+a1
+a2
+a3
+a4
+a5
+a6
+a7
+a8
+a9
+aa
+ab
+ac
+ad
+ae
+af
+b0
+b1
+b2
+b3
+b4
+b5
+b6
+b7
+b8
+b9
+ba
+bb
+bc
+bd
+be
+bf
+c0
+c1
+c2
+c3
+c4
+c5
+c6
+c7
+c8
+c9
+ca
+cb
+cc
+cd
+ce
+cf
+d0
+d1
+d2
+d3
+d4
+d5
+d6
+d7
+d8
+d9
+da
+db
+dc
+dd
+de
+df
+e0
+e1
+e2
+e3
+e4
+e5
+e6
+e7
+e8
+e9
+ea
+eb
+ec
+ed
+ee
+ef
+f0
+f1
+f2
+f3
+f4
+f5
+f6
+f7
+f8
+f9
+fa
+fb
+fc
+fd
+fe
+ff
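
Each line of the token file above is a hex-encoded byte sequence: the first entries are multi-byte tokens plus the special tokens, followed by all 256 single-byte values. A quick stdlib check of the encoding (the token strings are taken from the file above):

```python
# Decode hex-encoded token strings back into raw bytes.
tokens = ["6161", "3c7061643e", "3c656f733e", "ff"]
decoded = [bytes.fromhex(t) for t in tokens]

print(decoded[0])  # b'aa'
print(decoded[1])  # b'<pad>'
print(decoded[2])  # b'<eos>'
print(decoded[3])  # b'\xff'
```

Hex-encoding the vocabulary keeps the file pure ASCII even for tokens that are not valid UTF-8 on their own.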
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+{
+  "eos_token": "<eos>",
+  "pad_token": "<pad>"
+}
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+{
+  "clean_up_tokenization_spaces": false,
+  "model_max_length": 1000000000000000019884624838656,
+  "tokenizer_class": "GreedTok"
+}
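
The two JSON fragments above are ordinary JSON and parse with the standard library; the large `model_max_length` is `int(1e30)`, Transformers' conventional sentinel for "no length limit". A small sanity check (the literals below are copied from the diffs above):

```python
import json

# Parse the special-tokens map and tokenizer config as they appear in the diffs.
special_tokens_map = json.loads('{"eos_token": "<eos>", "pad_token": "<pad>"}')
tokenizer_config = json.loads(
    '{"clean_up_tokenization_spaces": false,'
    ' "model_max_length": 1000000000000000019884624838656,'
    ' "tokenizer_class": "GreedTok"}'
)

print(special_tokens_map["pad_token"])       # <pad>
print(tokenizer_config["tokenizer_class"])   # GreedTok
print(tokenizer_config["model_max_length"] == int(1e30))  # True
```

The `tokenizer_class` field is what lets `AutoTokenizer.from_pretrained` resolve the directory to the registered `GreedTok` class.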
