RFC: ID_Compat_Math characters allowed in identifiers #3840
While I mostly sympathise with this and think that it's probably fine to do this, I think that an RFC suggesting this should at minimum:
Note that your reference to NFKC is technically correct: FWIW, I very much sympathise with both the desire to have more scientific characters in variables and the desire to hand-wave away the issues as being already solved. It's also harder than ever before to do proper research online due to the shift of focus toward crystal-ball-based decision-making. I mostly want to clarify where you can find the relevant Unicode resources discussing this issue, and I think that the RFC should be updated to reference them directly so that we don't try to reinvent the wheel and redo all their hard work. Also, I think it's pretty great that Rust is explicitly mentioned in the Unicode standard as someone who does this right! I didn't know this was the case until now.
* Added links to UAX31 and others as requested in CR
* Fixed typos as requested in CR
* Extended the drawbacks section
* Other improvements
@clarfonthey Thanks for the review! I made the requested changes, added more links to the Unicode resources, and expanded some sections.
text/0000-compat-math-identifiers.md

* Rust might want to decide in the future to give certain superscripts and subscripts syntactic meaning. For example they might want to interpret `a²` as `a * a` or `a₁` as `a[0]`. The latter sounds especially unlikely though due to the general disagreement of 0-based vs 1-based indexing.

# Rationale and alternatives
Rust currently just follows Unicode's recommendation on what should be allowed as a programming language identifier: https://rust-lang.github.io/rfcs/2457-non-ascii-idents.html (Annex 31).
This seems like a reasonable choice, letting the Unicode Consortium handle Unicode decisions, so while I can certainly see the motivation you presented, I am cautious about this change.
It would be very good to have a description here of why Annex 31 does not contain these symbols, if such discussion can be found anywhere, to ensure that we are not missing something important and are sure about our choice to deviate from the recommendation.
> It would be very good to have a description here of why Annex 31 does not contain these symbols
UAX 31 does contain these symbols, that's what this profile comes from: https://www.unicode.org/reports/tr31/#Mathematical_Compatibility_Notation_Profile
For the question "why are they not in the default profile", the answer is basically to leave room for languages that want to do custom operators, or use these as builtin operators.
It's also just caution in expanding the set to include new meanings: while the XID set expands with each Unicode release as new characters get added, it would not be good for new types of characters to get included. If a programming language cared only about linguistic content in identifiers, it would perhaps be surprised if mathematical subscripts entered the fray. This separate profile allows for explicit choice.
> This seems like a reasonable choice, letting the Unicode Consortium handle Unicode decisions, so while I can certainly see the motivation you presented, I am cautious about this change.
The mathematical profile is included in UAX 31, the identifiers standard: that is the Unicode Consortium making a Unicode decision that these are acceptable in identifiers. It's one option on a menu that programming languages may choose from. Rust is currently following Unicode's recommendation, and with this RFC Rust would continue to follow Unicode's recommendation.
Changed formulation related to UAX31 a bit.
cc @Manishearth as our Unicode person
Overall seems fine to me. I didn't include this in the original RFC since IIRC the mathematical profile was still being worked on, and I didn't wish to have this facet be another thing that needed to be discussed.
* Clarified choice between syntactic and identifier use
* Added link to a similar C++ proposal
* Expanded the alternatives section discussing how characters could be given syntactic meaning instead
@Manishearth @kennytm @Noratrieb thank you for the review! I added your suggestions and comments to the draft.
Similarly `let a = [2, 0]; let b = a₁;` will naturally give a compiler error that `a₁` is an unknown identifier and not be interpreted as `let b = a[0];`.
`∞` will just be a character usable in identifiers and not be a synonym to the likes of `f32::INFINITY`.

The characters 5) are added to the set of Rust identifiers, but will trigger an NFKC or `uncommon_codepoints` warning when used depending on their Unicode classification.
FWIW, it's worth noting that actually the characters from 3) and 4) would also be included in the NFKC warning; if you look at the definition of NFKC, superscripts and subscripts are an explicit example: https://unicode.org/reports/tr15/#Compatibility_Composite_Figure
So, it's worth noting that only three characters from this wouldn't trigger the warning. These are still three good characters to include, but it's worth noting for accuracy.
Another side note to mention, also to clarify the above, is that NFKC effectively removes super/subscripts when normalizing, and I personally think it's kind of weird that this results in some normalized characters which are not normally allowed in identifiers. (for example, parentheses and +/- signs)
I think it's particularly strange that these are included in the definition at all, but I guess it kind of makes sense.
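To make the NFKC point concrete, here is a minimal sketch of what the compatibility mapping does to these characters. It is not a real NFKC implementation (the standard library has none, and a real one would typically come from a crate): `compat_fold`, `fold_ident`, and `changes_under_fold` are hypothetical names, and the table is a hand-written excerpt covering only the superscripts and subscripts discussed in this thread.

```rust
/// Hypothetical helper: hand-written excerpt of the Unicode compatibility
/// mappings for superscript/subscript digits, signs, and parentheses.
/// Every other character is returned unchanged (an assumption of this sketch).
fn compat_fold(c: char) -> char {
    match c {
        '\u{00B9}' => '1', // ¹
        '\u{00B2}' => '2', // ²
        '\u{00B3}' => '3', // ³
        '\u{2070}' => '0', // ⁰
        // ⁴..⁹ map to the ASCII digits 4..9
        '\u{2074}'..='\u{2079}' => char::from_u32((c as u32) - 0x2070 + ('0' as u32)).unwrap(),
        // ₀..₉ map to the ASCII digits 0..9
        '\u{2080}'..='\u{2089}' => char::from_u32((c as u32) - 0x2080 + ('0' as u32)).unwrap(),
        '\u{207A}' | '\u{208A}' => '+',
        // Superscript/subscript minus map to U+2212 MINUS SIGN, not ASCII '-'
        '\u{207B}' | '\u{208B}' => '\u{2212}',
        '\u{207C}' | '\u{208C}' => '=',
        '\u{207D}' | '\u{208D}' => '(',
        '\u{207E}' | '\u{208E}' => ')',
        other => other,
    }
}

/// Applies the excerpted mapping to every character of an identifier.
fn fold_ident(s: &str) -> String {
    s.chars().map(compat_fold).collect()
}

/// True if the identifier is changed by the mapping, i.e. it would not be
/// in normalized form as far as this excerpt is concerned.
fn changes_under_fold(s: &str) -> bool {
    fold_ident(s) != s
}
```

With this table, `fold_ident("x⁽³⁾")` yields `x(3)`, which contains parentheses that are not valid in an identifier, while an identifier built only from ∇, ∂, and ∞ is left unchanged.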
In my personal experience `x⁽³⁾`, `x⁺`, `x⁻`, `x₃` all appear readily in scientific texts. Unfortunately `x*`, which is also quite common, is not possible, as there is no superscript `*` and `x*` itself is of course an invalid identifier.
@Manishearth Do the lists of confusables have appropriate entries for these new symbols? For instance, are the bold/italic/bold-italic/sans-serif variations of Nabla already marked as confusable with each other?
Some grammar and readability suggestions
Yes. (Not Manish, but I also read all the docs and this is correct.) Also, as mentioned in this comment thread, all but the infinity, nabla, and partial differential symbols are not NFKC and would thus show up as confusable for that reason as well.
Right, we have multiple reasons as to why the existing lints would fire on these things. (There's a reason why I spent so much time trying to figure out the underlying concerns on the original non-ASCII issue: this way we were able to design lints that directly addressed those concerns, which is more future-proof.)
Given appropriate confusables tables, the For What are the expected use cases for those operators in superscripts/subscripts? In particular, I'm wondering if there'd be any value in a lint for "characters that usually shouldn't be the last character in an identifier". Because, for instance, |
I could totally see |
This RFC extends the set of Unicode characters which can be used in identifiers with ID_Compat_Math_Start and ID_Compat_Math_Continue, most notably ∇, ∂, ∞, superscripts ⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ and subscripts ₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎.
This can be a boon to implementers of scientific concepts, as they can write, for example, `let ∇E₁₂ = 0.5;`.
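As a rough illustration of the start/continue split, the sketch below checks an identifier against only the characters named in the summary above. The function names are hypothetical, ASCII letters, digits, and `_` stand in for the real XID_Start/XID_Continue sets, and the actual Unicode properties are assumed to cover more than this (for instance the styled variants of nabla mentioned in this thread), so treat it as a simplification rather than the compiler's logic.

```rust
/// Hypothetical stand-in for ID_Compat_Math_Start, limited to the three
/// symbols named in the RFC summary.
fn is_compat_math_start(c: char) -> bool {
    matches!(c, '\u{2202}' | '\u{2207}' | '\u{221E}') // ∂ ∇ ∞
}

/// Hypothetical stand-in for ID_Compat_Math_Continue: the start characters
/// plus the superscripts and subscripts named in the RFC summary.
fn is_compat_math_continue(c: char) -> bool {
    is_compat_math_start(c)
        || matches!(
            c,
            '\u{00B2}' | '\u{00B3}' | '\u{00B9}' // ² ³ ¹
            | '\u{2070}'                         // ⁰
            | '\u{2074}'..='\u{207E}'            // ⁴..⁹ ⁺ ⁻ ⁼ ⁽ ⁾
            | '\u{2080}'..='\u{208E}'            // ₀..₉ ₊ ₋ ₌ ₍ ₎
        )
}

/// Simplified identifier check. Assumption: ASCII letters, digits, and '_'
/// stand in for XID_Start / XID_Continue; keywords and the lone `_` pattern
/// are not handled.
fn is_ident_sketch(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' || is_compat_math_start(c) => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || is_compat_math_continue(c))
}
```

Under these assumptions `∇E₁₂` and `x⁽³⁾` are accepted, while `₁x` is rejected because subscripts may continue an identifier but not start one, and `x*` is rejected because `*` is in neither set.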