Are the codes in UnicodeCharSets.h accurate, and what version of Unicode standard are they for? #123560

@mrolle45

The answer is yes: the codes are accurate, and they correspond to Unicode 15.1 (as of clang 20.0). I found the 15.1 version number in the history of UnicodeCharSets.h.

Sorry for the noise. I thought there were differences between the clang source code (UnicodeCharSets.h) and the Unicode standard data file (https://www.unicode.org/Public/15.1.0/ucd/DerivedCoreProperties.txt), but I had been comparing against the version in Public/13.0.0 instead. About 4,000 new XID_START code points were added between the two versions.

I have verified, with a Python script, that clang and the Unicode 15.1 data specify exactly the same code points for XID_START and XID_CONTINUE.
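For anyone who wants to repeat this kind of check, here is a minimal sketch of the Unicode side of the comparison. It is not the script mentioned above. The helper names `parse_property` and `to_ranges` are hypothetical, and extracting the ranges from UnicodeCharSets.h is left out. It assumes the usual UCD line layout (`0041..005A ; XID_Start # ...`, or a single code point before the semicolon).

```python
import re

# A data line looks like "0041..005A    ; XID_Start # L&  [26] ..." or
# "00AA          ; XID_Start # Lo ...". Comment and blank lines do not match.
_LINE = re.compile(r"^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)")


def parse_property(text, prop):
    """Return the set of code points that `text` assigns to property `prop`."""
    codepoints = set()
    for line in text.splitlines():
        m = _LINE.match(line)
        if m is None or m.group(3) != prop:
            continue
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        codepoints.update(range(first, last + 1))
    return codepoints


def to_ranges(codepoints):
    """Collapse a set of code points into sorted, inclusive (first, last) ranges."""
    ranges = []
    for cp in sorted(codepoints):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))  # start a new run
    return ranges
```

Comparing sets (or ranges normalized with `to_ranges`) rather than raw lines matters, because the UCD file splits adjacent ranges by general category, so the same code points may be grouped differently in another source.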

By the way, I had been looking at 13.0 because I am writing a Python program that emulates the clang 20.0 preprocessor, and clang lexes an identifier based on the XID_START and XID_CONTINUE properties of the code points it sees. I was using the unicodedata module with Python 3.10.16, whose database is Unicode 13.0. The latest Python, 3.13, uses Unicode 15.1.
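A quick way to avoid this kind of mismatch is to check which Unicode version the running interpreter ships with before relying on it. The sketch below uses only `unicodedata.unidata_version` from the standard library; the helper name `unicode_version` and the 15.1 threshold are my own choices for illustration.

```python
import sys
import unicodedata


def unicode_version():
    """Return the interpreter's Unicode database version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


if __name__ == "__main__":
    print("Python", sys.version.split()[0],
          "uses Unicode", unicodedata.unidata_version)
    if unicode_version() < (15, 1):
        print("warning: older than the Unicode 15.1 tables used by clang 20.0")
```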
