Add full support for UTF-16 / Unicode #31

This feature implements full UTF-16 / Unicode support: UTF-16 surrogate pairs are correctly interpreted as single Unicode code points, and the predefined character classes are extended to the full Unicode code point range.

**New syntax**
- Add escape sequence `\Uhhhhhhhh` (character with hex code `hhhhhhhh`)
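
As a rough illustration of what the `\Uhhhhhhhh` escape has to do, the following Python sketch parses such an escape into a code point and then splits it into UTF-16 code units. The helper names `parse_long_escape` and `to_utf16_units` are hypothetical and are not part of the library. The sketch assumes exactly eight hex digits and rejects values above 0x10FFFF; the real engine may validate differently.

```python
import string


def parse_long_escape(escape: str) -> int:
    """Return the code point named by a backslash-U escape with 8 hex digits.

    Hypothetical helper for illustration only.
    """
    digits = escape[2:]
    if (
        len(escape) != 10
        or not escape.startswith("\\U")
        or any(ch not in string.hexdigits for ch in digits)
    ):
        raise ValueError("expected \\U followed by exactly 8 hex digits")
    code_point = int(digits, 16)
    if code_point > 0x10FFFF:
        raise ValueError("code point is above U+10FFFF")
    return code_point


def to_utf16_units(code_point: int) -> list[int]:
    """Encode one code point as one or two UTF-16 code units."""
    if code_point < 0x10000:
        # Inside the Basic Multilingual Plane: a single code unit.
        return [code_point]
    # Outside the BMP: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


if __name__ == "__main__":
    cp = parse_long_escape("\\U0001F600")
    print(hex(cp), [hex(u) for u in to_utf16_units(cp)])
```

With such an escape, a pattern author can name a code point outside the Basic Multilingual Plane directly instead of spelling out both surrogates.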

**New RegEx mode**
- `(?u)`: activates UTF-16 mode. The predefined character classes `\w`, `\d`, etc. are then extended to cover the full Unicode code point range (0x01 to 0x10FFFF).
- `(?-u)` (default mode): deactivates UTF-16 mode. The predefined character classes `\w`, `\d`, etc. then cover, as before, only the Unicode code points representable in UCS-2 (0x01 to 0xFFFF).

**New parameter**
- `AddNfa(..., "\w", #RegExMode_Unicode)` is the same as `AddNfa(..., "(?u)\w")`

According to the official documentation, PureBasic's string functions use UCS-2 encoding in Unicode mode. To display strings, however, PureBasic calls the operating system's API functions, and on all supported platforms (Windows, Linux, and macOS) these interpret a PureBasic string as UTF-16. Programs written in PureBasic can therefore display all Unicode characters.

The RegEx engine currently works with the same UCS-2 encoding, so it interprets a UTF-16 surrogate pair as two separate UCS-2 characters.
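
The difference between the two interpretations can be sketched in Python. This is only an illustration of the general UTF-16 decoding rule, not the engine's actual code, and the function names `utf16_code_units` and `decode_code_points` are hypothetical. Unpaired surrogates are passed through unchanged here, which is an assumption about how they should be handled.

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit code units of text, as a UCS-2 engine would see them."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def decode_code_points(units: list[int]) -> list[int]:
    """Combine surrogate pairs into single code points (UTF-16 view)."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A valid pair: rebuild the code point from both halves.
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            )
            i += 2
        else:
            # BMP character or unpaired surrogate: keep as is (assumption).
            code_points.append(unit)
            i += 1
    return code_points


if __name__ == "__main__":
    units = utf16_code_units("a\U0001F600")
    print([hex(u) for u in units])                      # three code units
    print([hex(c) for c in decode_code_points(units)])  # two code points
```

In the UCS-2 view, the string above consists of three characters, so a single `.` in a pattern would match only half of the emoji. In the UTF-16 view it consists of two.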

To match a Unicode code point outside the range supported by UCS-2 (which covers only Unicode's Basic Multilingual Plane), the two halves of its UTF-16 surrogate pair currently have to be written separately in the regex. This is inconvenient to write, and it also breaks case-insensitive mode: case mapping is defined per code point, so the engine would have to interpret a surrogate pair as a single Unicode code point to handle it correctly.
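
The case-insensitivity problem can be reproduced with a character that has a case mapping outside the Basic Multilingual Plane, for example the Deseret letters U+10400 (capital) and U+10428 (small). The Python sketch below compares a per-code-unit case fold, which stands in for what a UCS-2 engine effectively does, with a per-code-point fold. The helper names are hypothetical, and Python's `str.lower` is used only as a stand-in for the engine's case mapping.

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit UTF-16 code units of text."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def fold_per_unit(units: list[int]) -> list[int]:
    """Lowercase each 16-bit unit on its own, as a UCS-2 engine would."""
    folded = []
    for unit in units:
        lowered = chr(unit).lower()
        # Keep the unit unchanged if lowercasing expands to several characters.
        folded.append(ord(lowered) if len(lowered) == 1 else unit)
    return folded


def equal_ignoring_case_ucs2(a: str, b: str) -> bool:
    """Case-insensitive comparison on code units (surrogates never change)."""
    return fold_per_unit(utf16_code_units(a)) == fold_per_unit(utf16_code_units(b))


def equal_ignoring_case_utf16(a: str, b: str) -> bool:
    """Case-insensitive comparison on whole code points."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    capital, small = "\U00010400", "\U00010428"
    print(equal_ignoring_case_ucs2(capital, small))   # surrogate halves differ
    print(equal_ignoring_case_utf16(capital, small))  # code points fold together
```

The two letters share the high surrogate 0xD801 but differ in the low surrogate (0xDC00 versus 0xDC28). Because surrogates have no case mapping of their own, the per-unit comparison cannot see that the letters are a case pair.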
