Description
This feature implements full UTF-16 / Unicode support by correctly interpreting UTF-16 surrogate pairs as single Unicode code points and extending the predefined character classes to the full Unicode code points range.
New syntax
- Add escape sequence
\Uhhhhhhhh (character with hex code hhhhhhhh)
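As a rough illustration of what recognizing this escape involves, here is a minimal Python sketch. It assumes the escape always carries exactly eight hex digits, as the notation above suggests, and that values above 0x10FFFF are rejected. The helper name `parse_big_u_escape` is hypothetical and is not part of the library discussed here.

```python
import re

# Hypothetical helper, for illustration only.
# Matches a backslash, an upper-case U, and exactly eight hex digits.
_BIG_U_ESCAPE = re.compile(r"\\U([0-9A-Fa-f]{8})")

def parse_big_u_escape(pattern, pos=0):
    """Return (code_point, next_pos) if the escape starts at pos, else None."""
    match = _BIG_U_ESCAPE.match(pattern, pos)
    if match is None:
        return None
    code_point = int(match.group(1), 16)
    # Assumption: anything beyond the Unicode range is treated as an error.
    if code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range: %#x" % code_point)
    return code_point, match.end()
```

For example, `parse_big_u_escape(r"\U0001F600")` yields the code point 0x1F600 together with the position just past the escape.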
New RegEx mode
- (?u): activates UTF-16 mode. The predefined character classes \w, \d etc. are then extended to the full range of Unicode code points (0x01 up to 0x10FFFF).
- (?-u) (default mode): deactivates UTF-16 mode. The predefined character classes \w, \d etc. then cover, as before, only the Unicode code points representable in UCS-2 (0x01 up to 0xFFFF).
New Parameter
AddNfa(..., "\w", #RegExMode_Unicode) is the same as AddNfa(..., "(?u)\w")
According to the official documentation, PureBasic's string functions use UCS-2 encoding in Unicode mode. However, PureBasic displays strings through the operating system's API functions, and on all supported systems (Windows, Linux and macOS) these interpret a PureBasic string as UTF-16, so programs written in PureBasic can display all Unicode characters.
Currently, the RegEx engine also works with this UCS-2 encoding, so UTF-16 surrogate pairs are interpreted as two separate UCS-2 characters.
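To make the difference concrete, the following Python sketch shows the standard UTF-16 arithmetic for splitting a code point into a surrogate pair and for combining the pair again. The function names are hypothetical and only illustrate the encoding. They do not show how the engine is implemented.

```python
def to_utf16_units(code_point):
    """Split a Unicode code point into one or two UTF-16 code units."""
    if code_point < 0x10000:
        return (code_point,)              # fits in a single unit (BMP)
    offset = code_point - 0x10000         # 20-bit value
    high = 0xD800 + (offset >> 10)        # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # lower 10 bits
    return (high, low)

def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

A UCS-2 view of U+1F600 sees the two units 0xD83D and 0xDE00 as separate characters, while a UTF-16 view combines them back into the one code point 0x1F600.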
To write Unicode code points outside the range supported by UCS-2 (that is, outside Unicode's Basic Multilingual Plane) in a regex, the two halves of the UTF-16 surrogate pair currently have to be written separately. This is inconvenient to write, and it also breaks case-insensitive mode, which can only work correctly if a UTF-16 surrogate pair is interpreted as a single Unicode code point.
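The case-insensitivity problem can be sketched in Python with a cased letter from outside the Basic Multilingual Plane. The example uses U+10400 (a Deseret capital letter), whose lowercase form is U+10428. The helpers `utf16_units` and `lower_per_unit` are hypothetical and merely imitate an engine that lowercases each 16-bit unit on its own. They are not taken from the library.

```python
def utf16_units(text):
    """Return the UTF-16 code units of a string as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def lower_per_unit(units):
    """Lowercase every code unit in isolation, as a UCS-2 view would."""
    result = []
    for unit in units:
        lowered = chr(unit).lower()
        # Keep the unit unchanged if lowercasing expands to several characters.
        result.append(ord(lowered) if len(lowered) == 1 else unit)
    return result

upper = "\U00010400"   # capital letter outside the BMP
lower = "\U00010428"   # its lowercase counterpart

# Working on whole code points finds the lowercase form.
folds_by_code_point = upper.lower() == lower

# Working unit by unit leaves the surrogates untouched, so no match.
folds_by_unit = lower_per_unit(utf16_units(upper)) == utf16_units(lower)
```

Here the comparison by code point succeeds and the comparison by unit fails, because a lone surrogate has no case mapping. Only the low surrogate differs between the two letters (0xDC00 versus 0xDC28), and that difference is invisible without combining the pair first.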