Commit 638702f
Fix bug in Longest Matching tokenizer to preprocess spaces consistently
Fixes #1061
Update the Longest Matching tokenizer to preprocess spaces consistently with the Multi-Cut tokenizer.
* Modify `pythainlp/tokenize/longest.py` to group consecutive spaces into one token using a regex (a sketch of the idea follows after this list).
* Add test cases in `tests/core/test_tokenize.py` to verify consistent preprocessing of spaces between the Longest Matching and Multi-Cut tokenizers (see the test sketch further below).

1 parent 9a9d11f · commit 638702f
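The core change, per the description, is to make the Longest Matching tokenizer group runs of consecutive spaces into a single token before dictionary matching, the way the Multi-Cut tokenizer already does. The actual patch body is not preserved in this capture, so the following is only a minimal sketch of that idea; `_RE_SPACES` and `_split_on_spaces` are hypothetical names, not the functions in `longest.py`:

```python
import re
from typing import List

# Hypothetical names for illustration; the committed code is not
# preserved in this capture. The idea: collapse each run of consecutive
# spaces into a single segment before longest matching, mirroring the
# Multi-Cut tokenizer's preprocessing.
_RE_SPACES = re.compile(r" +")


def _split_on_spaces(text: str) -> List[str]:
    """Split text so each run of consecutive spaces is one segment.

    "กิน  ข้าว" -> ["กิน", "  ", "ข้าว"]. Each non-space segment can then
    be fed to the longest-matching search, while each space run is
    emitted as a single whitespace token.
    """
    segments: List[str] = []
    last = 0
    for match in _RE_SPACES.finditer(text):
        if match.start() > last:  # text before this space run
            segments.append(text[last:match.start()])
        segments.append(match.group())  # the whole space run as one token
        last = match.end()
    if last < len(text):  # trailing text after the final space run
        segments.append(text[last:])
    return segments
```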
2 files changed: +23 −1 lines changed
[Diff body not preserved in this capture. First file (presumably `pythainlp/tokenize/longest.py`, per the description): +11 −1. The surviving line numbers show one line added in the import block (new line 4), one line added around line 42, and a block near the end of the tokenizer body where original line 137 is replaced by new lines 139-147.]
[Diff body not preserved in this capture. Second file (presumably `tests/core/test_tokenize.py`, per the description): +12. The line numbers show twelve new lines (393-404) inserted after line 392.]
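The new tests presumably compare whitespace handling across the two engines. Since the exact test bodies are not preserved here, this is only a sketch of the kind of consistency check the description implies, using PyThaiNLP's public `word_tokenize` API (the engine names `"longest"` and `"mm"` are real; the sample text and assertion are illustrative):

```python
from pythainlp.tokenize import word_tokenize

# Sketch only: the committed test bodies are not preserved in this
# capture. The description implies both engines should now treat a run
# of consecutive spaces the same way during preprocessing.
text = "การ  ทดสอบ"  # two consecutive spaces between Thai words

longest_tokens = word_tokenize(text, engine="longest")
multicut_tokens = word_tokenize(text, engine="mm")  # "mm" = Multi-Cut

# The two algorithms may split words differently, but the space run
# should come out as the same single whitespace token in both outputs.
assert longest_tokens.count("  ") == multicut_tokens.count("  ")
```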