Commit 32c79ca
authored
chore: use only regex for contains_english_word. (#382)
Updates the characters to split when creating candidate english words. Now uses regex to parse out non-alphabetic characters for each word
Note: This was originally an attempt to speedup contains_english_word() but there is no measurable change in performance.1 parent e5dd9d5 commit 32c79ca
File tree
4 files changed
+19
-8
lines changed- test_unstructured/partition
- unstructured
- partition
4 files changed
+19
-8
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | | - | |
| 1 | + | |
2 | 2 | | |
3 | 3 | | |
4 | 4 | | |
| |||
44 | 44 | | |
45 | 45 | | |
46 | 46 | | |
47 | | - | |
48 | | - | |
49 | 47 | | |
50 | 48 | | |
51 | 49 | | |
| |||
57 | 55 | | |
58 | 56 | | |
59 | 57 | | |
| 58 | + | |
| 59 | + | |
60 | 60 | | |
61 | 61 | | |
62 | 62 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
190 | 190 | | |
191 | 191 | | |
192 | 192 | | |
| 193 | + | |
| 194 | + | |
| 195 | + | |
| 196 | + | |
| 197 | + | |
193 | 198 | | |
194 | 199 | | |
195 | 200 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | | - | |
| 1 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
20 | 20 | | |
21 | 21 | | |
22 | 22 | | |
23 | | - | |
| 23 | + | |
| 24 | + | |
24 | 25 | | |
25 | 26 | | |
26 | 27 | | |
| |||
188 | 189 | | |
189 | 190 | | |
190 | 191 | | |
191 | | - | |
192 | | - | |
| 192 | + | |
| 193 | + | |
| 194 | + | |
| 195 | + | |
| 196 | + | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
193 | 200 | | |
194 | 201 | | |
195 | | - | |
196 | 202 | | |
197 | 203 | | |
198 | 204 | | |
| |||
0 commit comments