@Danvil Danvil commented Jul 16, 2025

This RFC extends the set of Unicode characters which can be used in identifiers with ID_Compat_Math_Start and ID_Compat_Math_Continue, most notably: ∇, ∂, ∞, superscripts ⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ and subscripts ₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎.

This can be a boon to implementers of scientific concepts, as they can write, for example, `let ∇E₁₂ = 0.5;`.
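
To make the proposed rule concrete, here is a minimal sketch of an identifier check that accepts the characters named in the description. It is an approximation under stated assumptions, not the compiler's implementation: the function and constant names are hypothetical, `char::is_alphabetic` stands in for XID_Start and XID_Continue (which the standard library does not expose), and only the characters listed above are included rather than the full profile from UAX 31.

```rust
/// Start characters named in the description (a subset of ID_Compat_Math_Start).
const MATH_START: &[char] = &['∇', '∂', '∞'];

/// Superscripts and subscripts named in the description.
/// These may continue an identifier but may not start one.
const MATH_CONTINUE_ONLY: &str = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎";

/// Assumption: `is_alphabetic` is a rough stand-in for XID_Start.
fn is_start(c: char) -> bool {
    c == '_' || c.is_alphabetic() || MATH_START.contains(&c)
}

/// Assumption: alphabetic characters plus ASCII digits roughly stand in for XID_Continue.
fn is_continue(c: char) -> bool {
    c == '_'
        || c.is_alphabetic()
        || c.is_ascii_digit()
        || MATH_START.contains(&c)
        || MATH_CONTINUE_ONLY.contains(c)
}

/// Returns true if `s` would be accepted as an identifier under this sketch.
fn is_proposed_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // A lone underscore is a wildcard pattern in Rust, not an identifier.
        Some(first) if s != "_" && is_start(first) => chars.all(is_continue),
        _ => false,
    }
}
```

Under this sketch `∇E₁₂` and `x⁽³⁾` are accepted, while `₁a` is rejected because a subscript is continue-only.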

Rendered

@clarfonthey

While I mostly sympathise with this and think that it's probably fine to do this, I think that an RFC suggesting this should at minimum:

  1. Reference the actual section of UAX 31 that defines these groups of characters: https://www.unicode.org/reports/tr31/#Standard_Profiles
  2. Reference the section of UTS 55 linked in the above section that explains why you might not want to use these groups of characters, which currently cites Rust as an existing example: https://www.unicode.org/reports/tr55/#General-Security-Profile
  3. Reference the section of UTS 39 linked in the above section that explains the exact mechanisms by which the above can be made safe: https://www.unicode.org/reports/tr39/#General_Security_Profile

Note that your reference to NFKC is technically correct: Not_NFKC is one of the restricted security profile cases that is covered by UTS 39, but it's not the only one, and it's worth discussing whether Rust's handling would need to be expanded because of this case.

FWIW, I very much sympathise with both the desire to have more scientific characters in variables and the desire to hand-wave away the issues as being already solved. It's also harder than ever before to do proper research online due to the shift of focus toward crystal-ball-based decisionmaking. I mostly want to clarify where you can find the relevant Unicode resources discussing this issue, and I think that the RFC should be updated to directly reference them so that we don't try and reinvent the wheel and redo all their hard work.

Also, I think it's pretty great that Rust is explicitly mentioned in the Unicode standard as someone who does this right! I didn't know this was the case until now.

* Added links to UAX31 and others as requested in CR
* Fixed typos as requested in CR
* Extended the drawbacks section
* Other improvements
@Danvil
Author

Danvil commented Jul 16, 2025

@clarfonthey Thanks for the review! I made the requested changes and added more links to the Unicode resources and expanded some sections.
@programmerjake Thanks for the review - typos are fixed.

@ehuss ehuss added the T-lang Relevant to the language team, which will review and decide on the RFC. label Jul 16, 2025

* Rust might want to decide in the future to give certain superscripts and subscripts syntactic meaning. For example they might want to interpret `a²` as `a * a` or `a₁` as `a[0]`. The latter sounds especially unlikely though due to the general disagreement of 0-based vs 1-based indexing.

# Rationale and alternatives
Member

@Noratrieb Noratrieb Jul 17, 2025

Rust currently just follows Unicode's recommendation on what should be allowed as a programming language identifier: https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html (Annex 31).

This seems like a reasonable choice, letting the Unicode Consortium handle Unicode decisions, so while I can certainly see the motivation you presented, I am cautious about this change.

It would be very good to have a description here of why Annex 31 does not contain these symbols, if such discussion can be found anywhere, to ensure that we are not missing something important and are sure about our choice to deviate from the recommendation.

Member

@Manishearth

> It would be very good to have a description here of why Annex 31 does not contain these symbols

UAX 31 does contain these symbols, that's what this profile comes from: https://www.unicode.org/reports/tr31/#Mathematical_Compatibility_Notation_Profile

For the question "why are they not in the default profile", the answer is basically to leave room for languages that want to do custom operators, or use these as builtin operators.

It's also just caution in expanding the set to include new meanings: while the XID set expands with each Unicode release as new characters get added, it would not be good for new types of characters to get included. If a programming language cared only about linguistic content in identifiers, it would perhaps be surprised if mathematical subscripts entered the fray. This separate profile allows for explicit choice.

> This seems like a reasonable choice, letting the Unicode Consortium handle Unicode decisions, so while I can certainly see the motivation you presented, I am cautious about this change.

The mathematical profile is included in UAX 31, the identifiers standard: that is the Unicode consortium making a Unicode decision that these are acceptable in identifiers. It's a choice from a menu that programming languages may choose from. Rust is currently following Unicode's recommendation, but this RFC would have Rust continuing to follow Unicode's recommendation.

Author

Changed formulation related to UAX31 a bit.

@Noratrieb
Member

cc @Manishearth as our Unicode person

Member

@Manishearth Manishearth left a comment

Overall seems fine to me. I didn't include this in the original RFC since IIRC the mathematical profile was still being worked on, and I didn't wish to have this facet be another thing that needed to be discussed.


* Clarified choice between syntactic and identifier use
* Added link to a similar C++ proposal
* Expanded the alternatives section discussing how characters
  could be given syntactic meaning instead
@Danvil
Author

Danvil commented Jul 17, 2025

@Manishearth @kennytm @Noratrieb thank you for the review! I added your suggestions and comments to the draft.

Similarly `let a = [2, 0]; let b = a₁;` will naturally give a compiler error that `a₁` is an unknown identifier and not be interpreted as `let b = a[0];`.
`∞` will just be a character usable in identifiers and not be a synonym to the likes of `f32::INFINITY`.

The characters from 5) are added to the set of Rust identifiers, but will trigger an NFKC or `uncommon_codepoints` warning when used, depending on their Unicode classification.

FWIW, it's worth noting that actually the characters from 3) and 4) would also be included in the NFKC warning; if you look at the definition of NFKC, superscripts and subscripts are an explicit example: https://unicode.org/reports/tr15/#Compatibility_Composite_Figure

So, it's worth noting that only three characters from this wouldn't trigger the warning. These are still three good characters to include, but it's worth noting for accuracy.

Another side note to mention, also to clarify the above, is that NFKC effectively removes super/subscripts when normalizing, and I personally think it's kind of weird that this results in some normalized characters which are not normally allowed in identifiers. (for example, parentheses and +/- signs)

I think it's particularly strange that these are included in the definition at all, but I guess it kind of makes sense.
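
The effect described above can be illustrated with a small sketch. This is not a real NFKC implementation: it is a hand-written table covering only the fifteen superscripts and fifteen subscripts under discussion, mapped to the base characters of their compatibility decompositions, and the function names are hypothetical. A real check would use a full Unicode normalization library.

```rust
// The two script tables are index-aligned with PLAIN.
const SUPERSCRIPTS: &str = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾";
const SUBSCRIPTS: &str = "₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎";
// The script minus signs decompose to U+2212 MINUS SIGN, not ASCII '-'.
const PLAIN: &str = "0123456789+\u{2212}=()";

/// Maps one superscript or subscript to its plain counterpart.
/// Every other character (including ∇, ∂, ∞) is returned unchanged.
fn fold_char(c: char) -> char {
    for table in [SUPERSCRIPTS, SUBSCRIPTS] {
        if let Some(i) = table.chars().position(|t| t == c) {
            // Both tables have the same length as PLAIN, so `nth` succeeds.
            return PLAIN.chars().nth(i).unwrap();
        }
    }
    c
}

/// Applies the fold to a whole string (a stand-in for NFKC on this subset).
fn compat_fold(s: &str) -> String {
    s.chars().map(fold_char).collect()
}

/// True if the fold changes the string, i.e. it is not stable on this subset.
fn changes_under_fold(s: &str) -> bool {
    compat_fold(s) != s
}
```

With this table `x⁽³⁾` folds to `x(3)`, which contains parentheses and therefore is no longer a valid identifier, whereas `∞` is left untouched.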

Author

In my personal experience x⁽³⁾, x⁺, x⁻, x₃ all appear readily in scientific texts. Unfortunately x*, which is also quite common, is not possible as there is no superscript * and x* itself is of course an invalid identifier.

@joshtriplett
Member

@Manishearth Do the lists of confusables have appropriate entries for these new symbols? For instance, are the bold/italic/bold-italic/sans-serif variations of Nabla already marked as confusable with each other?

Contributor

@teor2345 teor2345 left a comment

Some grammar and readability suggestions

@clarfonthey

> @Manishearth Do the lists of confusables have appropriate entries for these new symbols? For instance, are the bold/italic/bold-italic/sans-serif variations of Nabla already marked as confusable with each other?

Yes. (Not Manish, but I also read all the docs and this is correct.)

Also, as mentioned in this comment thread all but the infinity, nabla, and partial differential symbols are not NFKC and would thus show up as confusable for that reason as well.

@Manishearth
Member

Right, we have multiple reasons as to why the existing lints would fire on these things.

(There's a reason why I spent so much time trying to figure out the underlying concerns on the original non-ASCII issue: this way we were able to design lints that directly addressed those concerns, which is more future-proof.)

@joshtriplett
Member

Given appropriate confusables tables, the ID_Compat_Math_Start symbols seem entirely fine.

For ID_Compat_Math_Continue, most of them seem fine. I think it's unlikely that we're going to add functionality that would make e.g. ² raise something to the second power. The operators (⁺⁻⁼₊₋₌) seem a little more odd, but hopefully people will use them reasonably and we will have lints helping people avoid confusion.

What are the expected use cases for those operators in superscripts/subscripts? In particular, I'm wondering if there'd be any value in a lint for "characters that usually shouldn't be the last character in an identifier". Because, for instance, x₌ seems like a much more confusing variable name than something with ₌ in the middle of the name.
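
A lint along those lines could be sketched as a simple predicate. This is purely hypothetical: no such lint exists, the names are invented for illustration, and the character set is just the six operator-like scripts mentioned above.

```rust
/// Superscript and subscript operators singled out in the discussion.
const SCRIPT_OPERATORS: &str = "⁺⁻⁼₊₋₌";

/// Hypothetical lint predicate: true if the identifier's last character
/// is one of the operator-like superscripts or subscripts.
fn ends_with_script_operator(ident: &str) -> bool {
    ident
        .chars()
        .last()
        .map_or(false, |c| SCRIPT_OPERATORS.contains(c))
}
```

Such a check would flag `x₌` but not `x₊y`, where the operator sits in the middle of the name.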

@Jules-Bertholet
Contributor

I could totally see x₌ being used to denote, e.g., “the quantity of X necessary to keep Y constant”. x₊ and x₋ also seem useful. (Though ₋ is potentially confusable with _.)
