Commit cbc2cee
Improve URL_PATTERN and handling in tokenizer (#4374)
* Move prefix and suffix detection for URL_PATTERN
Move prefix and suffix detection for `URL_PATTERN` into the tokenizer.
Remove associated lookahead and lookbehind from `URL_PATTERN`.
Fix tokenization for Hungarian given new modified handling of prefixes
and suffixes.
* Match a wider range of URI schemes1 parent e65dffd commit cbc2cee
File tree
4 files changed
+15
-17
lines changed- spacy
- lang
- hu
- tests/tokenizer
4 files changed
+15
-17
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
10 | 10 | | |
11 | 11 | | |
12 | 12 | | |
| 13 | + | |
13 | 14 | | |
14 | 15 | | |
15 | | - | |
16 | | - | |
| 16 | + | |
17 | 17 | | |
18 | 18 | | |
19 | 19 | | |
| |||
29 | 29 | | |
30 | 30 | | |
31 | 31 | | |
32 | | - | |
| 32 | + | |
33 | 33 | | |
34 | 34 | | |
35 | 35 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
10 | 10 | | |
11 | 11 | | |
12 | 12 | | |
13 | | - | |
14 | | - | |
15 | | - | |
16 | | - | |
17 | | - | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
18 | 16 | | |
19 | 17 | | |
20 | 18 | | |
| |||
43 | 41 | | |
44 | 42 | | |
45 | 43 | | |
46 | | - | |
47 | | - | |
48 | | - | |
49 | | - | |
50 | | - | |
| 44 | + | |
51 | 45 | | |
52 | 46 | | |
53 | 47 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
12 | 12 | | |
13 | 13 | | |
14 | 14 | | |
| 15 | + | |
15 | 16 | | |
16 | 17 | | |
17 | 18 | | |
| |||
45 | 46 | | |
46 | 47 | | |
47 | 48 | | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
48 | 53 | | |
49 | 54 | | |
50 | 55 | | |
| |||
81 | 86 | | |
82 | 87 | | |
83 | 88 | | |
84 | | - | |
85 | 89 | | |
86 | 90 | | |
87 | 91 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
227 | 227 | | |
228 | 228 | | |
229 | 229 | | |
230 | | - | |
| 230 | + | |
| 231 | + | |
| 232 | + | |
231 | 233 | | |
232 | 234 | | |
233 | 235 | | |
| |||
243 | 245 | | |
244 | 246 | | |
245 | 247 | | |
246 | | - | |
247 | | - | |
248 | 248 | | |
249 | 249 | | |
250 | 250 | | |
| |||
0 commit comments