Add full support for UTF-16 / Unicode #31

@SicroAtGit

This feature implements full UTF-16 / Unicode support by correctly interpreting UTF-16 surrogate pairs as single Unicode code points and by extending the predefined character classes to the full Unicode code point range.

New syntax

  • Add escape sequence \Uhhhhhhhh (character with hex code hhhhhhhh)
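
To make the proposed escape concrete, here is a minimal Python sketch of how a `\Uhhhhhhhh` sequence could be read from a pattern string. The helper name `parse_big_u_escape` is hypothetical and not part of the engine. The requirement of exactly eight hex digits and the rejection of values above 0x10FFFF are assumptions based on the syntax and range given in this issue.

```python
import re

# Hypothetical helper for illustration only; not the engine's parser.
# Assumes the escape always carries exactly eight hex digits.
_BIG_U = re.compile(r"\\U([0-9A-Fa-f]{8})")

def parse_big_u_escape(pattern, pos=0):
    """Return (code_point, next_pos) for an escape starting at pos."""
    m = _BIG_U.match(pattern, pos)
    if m is None:
        raise ValueError("expected backslash, 'U' and exactly 8 hex digits")
    code_point = int(m.group(1), 16)
    # Assumption: anything above the Unicode maximum is rejected.
    if code_point > 0x10FFFF:
        raise ValueError("code point above U+10FFFF")
    return code_point, m.end()
```

With this sketch, the pattern text `\U0001F600` yields the single code point 0x1F600 instead of two separately written surrogate halves.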

New RegEx mode

  • (?u): activates UTF-16 mode. The predefined character classes (\w, \d, etc.) then cover the full Unicode code point range (0x01 to 0x10FFFF).
  • (?-u) (default mode): deactivates UTF-16 mode. The predefined character classes (\w, \d, etc.) then cover, as before, only the Unicode code points representable in UCS-2 (0x01 to 0xFFFF).
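
The difference between the two modes can be sketched in Python as the sequence of values a matcher would see per step. This is an illustration under assumptions, not the engine's code: the function name `iter_symbols` is hypothetical, the input is modelled as a list of 16-bit code units, and unpaired surrogates are assumed to pass through unchanged.

```python
def iter_symbols(units, utf16_mode):
    """Yield one value per matching step from a list of 16-bit code units."""
    i = 0
    while i < len(units):
        unit = units[i]
        is_pair = (
            utf16_mode
            and 0xD800 <= unit <= 0xDBFF          # high surrogate
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF  # low surrogate
        )
        if is_pair:
            # Combine both halves into one supplementary code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # UCS-2 view, or a unit that is not part of a valid pair.
            yield unit
            i += 1

units = [0x0041, 0xD83D, 0xDE00]  # "A" followed by U+1F600
ucs2_view = list(iter_symbols(units, utf16_mode=False))
utf16_view = list(iter_symbols(units, utf16_mode=True))
```

Here `ucs2_view` holds three values (0x41, 0xD83D, 0xDE00), while `utf16_view` holds two (0x41, 0x1F600), so a class such as \w can only be tested against U+1F600 in the second view.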

New parameter

  • AddNfa(..., "\w", #RegExMode_Unicode) is the same as AddNfa(..., "(?u)\w")

According to the official documentation, PureBasic's string functions use UCS-2 encoding in Unicode mode. However, PureBasic displays strings through the operating system's API functions, and on all supported platforms (Windows, Linux and macOS) these interpret a PureBasic string as UTF-16. Programs written in PureBasic can therefore display all Unicode characters.

Currently, the RegEx engine also works with this UCS-2 encoding, so UTF-16 surrogate pairs are interpreted as two separate UCS-2 characters.

To match Unicode code points outside the range supported by UCS-2 (that is, outside Unicode's Basic Multilingual Plane), the two halves of each UTF-16 surrogate pair currently have to be written separately in the regex. This is inconvenient to write, and it also breaks case-insensitive mode, which only works correctly if the engine can interpret a UTF-16 surrogate pair as a single Unicode code point.
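
The case-insensitivity problem can be illustrated with a short Python sketch. It uses Python's own Unicode case mapping as a stand-in for the engine's, and the helper `to_surrogate_pair` is a hypothetical name. The example character, U+10400 (DESERET CAPITAL LETTER LONG I), is chosen only because it lies outside the Basic Multilingual Plane and has a lowercase form.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

upper = 0x10400
lower = ord(chr(upper).lower())  # case mapping applied to the whole code point

upper_pair = to_surrogate_pair(upper)
lower_pair = to_surrogate_pair(lower)

# A single surrogate half has no case mapping of its own.
low_half_unchanged = chr(upper_pair[1]).lower() == chr(upper_pair[1])
```

The two pairs share the high surrogate 0xD801 and differ only in the low surrogate (0xDC00 versus 0xDC28). Since neither half has a case mapping on its own, an engine that treats the halves as independent UCS-2 characters has no basis for matching one spelling against the other.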
