Commit 9f7f69d
Add possessive quantifiers to avoid catastrophic backtracking (#258)
Fixes the crash in #245 by
prohibiting the regex engine from backtracking catastrophically via
[possessive
quantifiers](https://www.regular-expressions.info/possessive.html).
<img width="400" alt="image"
src="https://github.com/openai/tiktoken/assets/1841944/ed341153-4cf4-4c1c-93d6-3f5e32133569">
Interestingly, these possessives also make the encoding a lot faster again in
`fancy-regex`.
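To illustrate what a possessive quantifier changes, here is a minimal sketch using Python's standard `re` module rather than `fancy-regex`. The pattern and input are made up for illustration and are not the tokenizer's actual patterns. It assumes Python 3.11+ for the `*+` syntax and falls back to the lookahead-plus-backreference emulation on older versions.

```python
import re
import sys

# Possessive syntax (`*+`, `++`, `?+`) was added to Python's `re` in 3.11.
# On older versions, fall back to the lookahead + backreference emulation,
# which consumes the same text and likewise never gives it back.
if sys.version_info >= (3, 11):
    POSSESSIVE_A_STAR = r"a*+"
else:
    POSSESSIVE_A_STAR = r"(?=(a*))\1"

greedy = re.compile(r"a*a")
possessive = re.compile(POSSESSIVE_A_STAR + r"a")

text = "aaa"

# Greedy: `a*` first takes all three characters, then backtracks one step
# so that the trailing `a` can still match.
greedy_match = greedy.fullmatch(text)

# Possessive: the quantifier takes all three characters and never gives any
# back, so the trailing `a` has nothing left to match and the attempt fails
# without exploring shorter alternatives.
possessive_match = possessive.fullmatch(text)

print("greedy matched:", greedy_match is not None)
print("possessive matched:", possessive_match is not None)
```

The engine stops revisiting shorter matches of the quantified part, which is what removes the exponential retry behaviour on pathological inputs. The trade-off is that a possessive quantifier is only a safe drop-in where the following token could never match what the quantifier consumed.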
Before this change (but with large byte pair merge PR cherry-picked):
```
num_threads: 1, num_bytes: 98379553
tiktoken 11,946,036 bytes / s
tiktoken 11,961,343 bytes / s
tiktoken 11,995,846 bytes / s
tiktoken 11,951,263 bytes / s
tiktoken 11,983,405 bytes / s
```
Same, with these changes applied:
```
num_threads: 1, num_bytes: 98379553
tiktoken 14,511,827 bytes / s
tiktoken 14,638,134 bytes / s
tiktoken 14,644,029 bytes / s
tiktoken 14,729,030 bytes / s
tiktoken 14,666,903 bytes / s
```
Updating the regex libs makes it a tiny bit faster still:
```
num_threads: 1, num_bytes: 98379553
tiktoken 14,485,590 bytes / s
tiktoken 14,854,049 bytes / s
tiktoken 14,891,086 bytes / s
tiktoken 14,843,007 bytes / s
tiktoken 14,874,520 bytes / s
```
This is almost 2x faster than [before any of the
optimizations](#234).
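For reference, the relative speedups between the three runs listed above can be worked out with a short script. This is only a sketch over the numbers quoted in this message; the labels (`before`, `possessive`, `updated_libs`) are names chosen here for illustration.

```python
from statistics import mean

# Throughput samples in bytes/s, copied from the three benchmark runs above.
RUNS = {
    "before": [11_946_036, 11_961_343, 11_995_846, 11_951_263, 11_983_405],
    "possessive": [14_511_827, 14_638_134, 14_644_029, 14_729_030, 14_666_903],
    "updated_libs": [14_485_590, 14_854_049, 14_891_086, 14_843_007, 14_874_520],
}


def speedup(label: str, baseline: str = "before") -> float:
    """Ratio of mean throughput for `label` over mean throughput for `baseline`."""
    return mean(RUNS[label]) / mean(RUNS[baseline])


for label, samples in RUNS.items():
    print(f"{label:>12}: {mean(samples):>12,.0f} bytes/s  ({speedup(label):.2f}x)")
```

On these figures the possessive quantifiers alone give roughly a 1.22x improvement over the cherry-picked baseline, and about 1.24x with the updated regex libraries. The "almost 2x" figure is relative to the earlier baseline in #234, whose numbers are not reproduced here.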
-------
Opened an issue for increasing the [default backtrack
limit](https://github.com/fancy-regex/fancy-regex/blob/bf2c807447f72ee20ae839e0f8cb3a06fc79982c/src/lib.rs#L407),
see: fancy-regex/fancy-regex#134, but it
shouldn't be necessary here anymore.
---------
Co-authored-by: Lőrinc <[email protected]>

1 parent c0ba74c commit 9f7f69d
File tree: 4 files changed (+43, -11 lines) under:
- src
- tests
- tiktoken_ext