Commit 5a24b9a
Ensure that words never contain RTL/LTR char mixtures
Previously, only white space was used to break text into words. This could put RTL and LTR characters into the same word string, which in turn made it impossible to produce satisfying extractions of mixed RTL / LTR text.
This change ensures that every word string contains either only RTL characters or no RTL characters at all.
Additional standard word delimiters:
0x202A: LEFT-TO-RIGHT EMBEDDING
0x202B: RIGHT-TO-LEFT EMBEDDING
0x202C: POP DIRECTIONAL FORMATTING
0x202D: LEFT-TO-RIGHT OVERRIDE
0x202E: RIGHT-TO-LEFT OVERRIDE
Word breaks will be generated at the occurrence of any of these characters.
In addition, a break is made between two consecutive characters that do not share the same direction, i.e. that are not both RTL or both LTR.
1 parent a2b4ba3 commit 5a24b9a
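The two break rules described in the commit message (break at white space or a directional formatting character, and break whenever the direction changes between neighbouring characters) can be sketched roughly as follows. This is only an illustration, not the code from this commit: the names `split_words`, `is_rtl`, and `DIRECTIONAL_DELIMITERS` are hypothetical, and classifying a character as RTL via the Unicode bidi classes `R` and `AL` is an assumption. The actual change may detect RTL characters differently.

```python
import unicodedata

# The five directional formatting characters listed in the commit message.
DIRECTIONAL_DELIMITERS = frozenset("\u202a\u202b\u202c\u202d\u202e")


def is_rtl(ch: str) -> bool:
    """Assumption: bidi classes R and AL count as RTL, everything else as LTR."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def split_words(text: str) -> list:
    """Split text so that no word mixes RTL and LTR characters."""
    words = []
    current = []
    for ch in text:
        # Rule 1: white space and directional formatting characters end a word
        # and are not part of any word themselves.
        if ch.isspace() or ch in DIRECTIONAL_DELIMITERS:
            if current:
                words.append("".join(current))
                current = []
            continue
        # Rule 2: a direction change between neighbouring characters ends a word.
        if current and is_rtl(current[-1]) != is_rtl(ch):
            words.append("".join(current))
            current = []
        current.append(ch)
    if current:
        words.append("".join(current))
    return words
```

For example, `split_words("abc\u05d0\u05d1 def")` yields three words: `"abc"`, the two Hebrew letters, and `"def"`. Note that this sketch treats digits and punctuation as LTR, which is a simplification.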
5 files changed, +52 -8 lines changed, in src, tests, and resources.

| File (in page order) | Lines added | Lines removed | Note |
|---|---|---|---|
| 1 | 20 | 4 | modified between lines 12800 and 15401 |
| 2 | 16 | 3 | modified between lines 1676 and 3273 |
| 3 | 1 | 1 | line 635 replaced |
| 4 | | | Binary file not shown. |
| 5 | 15 | 0 | lines 1 to 15 added, no original lines |