Elixir conforms to the clauses outlined in the [Unicode Technical Standard #39](https://unicode.org/reports/tr39/) on Security, version 15.0.
+
Elixir conforms to the clauses outlined in the [Unicode Technical Standard #39](https://unicode.org/reports/tr39/) on Security, version 17.0.
### C1. General Security Profile for Identifiers
-
Elixir will not allow tokenization of identifiers with codepoints in `\p{Identifier_Status=Restricted}`.
+
Elixir will not allow tokenization of identifiers with codepoints in `\p{Identifier_Status=Restricted}`, except as outlined in the 'Additional Normalizations' section below.
> An implementation following the General Security Profile does not permit any characters in \p{Identifier_Status=Restricted}, ...
-
For instance, the 'HANGUL FILLER' (`ㅤ`) character, which is often invisible, is an uncommon codepoint and will trigger this warning.
-
-
See the note below about additional normalizations, which can perform automatic replacement of some Restricted identifiers.
+
For instance, the 'HANGUL FILLER' (`ㅤ`) character, which is often invisible, is an uncommon codepoint and will trigger a warning.
### C2. Confusable detection
Elixir will warn on identifiers that look the same, but aren't. Examples: in `а = a = 1`, the two 'a' characters are Cyrillic and Latin, and could be confused for each other; in `力 = カ = 1`, both are Japanese, but different codepoints, in different scripts of that writing system. Confusable identifiers can lead to hard-to-catch bugs (say, due to copy-pasted code) and can be unsafe, so we will warn about identifiers within a single file that could be confused with each other.
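To make the first example concrete, here is a minimal sketch showing that the two visually similar letters are distinct codepoints. The variable names are illustrative only, and the characters are written as escapes so they cannot be mistaken for one another.

```elixir
# Hypothetical variable names, used only for illustration.
cyrillic_a = "\u{0430}" # CYRILLIC SMALL LETTER A
latin_a = "\u{0061}"    # LATIN SMALL LETTER A

# Both strings typically render as "a", yet they hold different codepoints.
cyrillic_codepoints = String.to_charlist(cyrillic_a)
latin_codepoints = String.to_charlist(latin_a)

IO.inspect(cyrillic_codepoints, charlists: :as_lists) # prints [1072]
IO.inspect(latin_codepoints, charlists: :as_lists)    # prints [97]
IO.inspect(cyrillic_a == latin_a)                     # prints false
```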
-
We use the means described in Section 4, 'Confusable Detection', with one noted modification
+
We use the means described in Section 4, 'Confusable Detection', with one noted modification:
> Alternatively, it shall declare that it uses a modification, and provide a precise list of character mappings that are added to or removed from the provided ones.
Elixir will not warn on confusability for identifiers made up exclusively of characters in a-z, A-Z, 0-9, and _. This is because ASCII identifiers have existed for so long that the programming community has developed its own means of dealing with confusability between identifiers like `l,1` or `O,0` (for instance, fonts designed for programming usually make it easy to differentiate between those characters).
### C3. Mixed Script Detection
-
Elixir will not allow tokenization of mixed-script identifiers unless it is via chunks separated by an underscore, like `http_сервер`. We use the means described in Section 5.1, Mixed-Script Detection, to determine if script mixing is occurring, with the modification documented in the section 'Additional Normalizations', below.
+
Elixir will not allow tokenization of mixed-script identifiers unless it is via chunks separated by an underscore, like `http_сервер`. We use the means described in Section 5.1, Mixed-Script Detection, to determine if script mixing is occurring, with the modifications documented in the 'Additional Normalizations' section below.
-
Examples: Elixir allows identifiers like `幻ㄒㄧㄤ`, even though it includes characters from multiple 'scripts', because those scripts all 'resolve' to Japanese when applying the resolution rules from UTS 39 5.1. When mixing Latin and Japanese scripts, underscores are necessary, as in `:T_シャツ` (the Japanese word for 't-shirt' with an additional underscore separating the letter T).
+
Examples: Elixir allows identifiers like `幻한`, even though it includes characters from multiple 'scripts', as Han characters may be mixed with Japanese and Korean, according to the rules from UTS 39 5.1. When mixing Latin and Japanese scripts, underscores are necessary, as in `:T_シャツ` (the Japanese word for 't-shirt' with an additional underscore separating the letter T).
Elixir does not allow code like `if аdmin, do: :ok, else: :err`, where the scriptset for the 'a' character is {Cyrillic} but all other characters have scriptsets of {Latin}. The scriptsets fail to resolve and a descriptive error is shown.
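One way to observe this behavior is to hand source text to `Code.string_to_quoted/1`, which returns `{:ok, ast}` or `{:error, reason}` instead of raising. The sketch below assumes an Elixir version that implements the mixed-script check described here; the `check` helper and the variable names are hypothetical, and the non-Latin letters are written as escapes so the scripts are unambiguous.

```elixir
# Cyrillic "а" (U+0430) followed by Latin "dmin": one chunk, two scripts.
mixed = "\u{0430}dmin"

# Latin "http", an underscore, then Cyrillic "сервер": one script per chunk.
chunked = "http_\u{0441}\u{0435}\u{0440}\u{0432}\u{0435}\u{0440}"

# Hypothetical helper: reports whether the tokenizer accepted the source.
check = fn source ->
  case Code.string_to_quoted(source) do
    {:ok, _ast} -> :accepted
    {:error, _reason} -> :rejected
  end
end

IO.inspect(check.(mixed), label: "mixed")
IO.inspect(check.(chunked), label: "chunked")
```

Matching on the `{:error, reason}` tuple rather than discarding it gives access to the descriptive error mentioned above.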
@@ -148,16 +146,14 @@ Elixir does not allow code like `if аdmin, do: :ok, else: :err`, where the scri
'C5 - Mixed number detection' conformance is inapplicable as Elixir does not support Unicode numbers.
-
### Additional normalizations and documented UTS 39 modifications
+
### Additional Normalizations
As of Elixir 1.14, some codepoints in `\p{Identifier_Status=Restricted}` are *normalized* to other, unrestricted codepoints.
-
Initially this is only done to translate MICRO SIGN `µ` to Greek lowercase mu, `μ`.
-
-
This is not a modification of UTS39 clauses C1 (General Security Profile) or C2 (Confusability Detection); however, it is a documented modification of C3, 'Mixed-Script detection'.
+
This is currently only applied to translate MICRO SIGN (`µ`) to Greek lowercase mu (`μ`).
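The two characters are hard to tell apart by eye but are distinct codepoints. A minimal sketch follows. NFKC is used only because it is a standard Unicode normalization form that happens to map MICRO SIGN to mu. That choice is an assumption made for illustration, not a description of what the tokenizer does internally.

```elixir
micro_sign = "\u{00B5}" # MICRO SIGN
greek_mu = "\u{03BC}"   # GREEK SMALL LETTER MU

# Different codepoints, so the strings are not equal.
IO.inspect(micro_sign == greek_mu) # prints false

# Compatibility normalization (NFKC) maps MICRO SIGN to Greek mu.
IO.inspect(String.normalize(micro_sign, :nfkc) == greek_mu) # prints true
```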
-
Mixed-script detection is modified by these normalizations to the extent that the normalized codepoint is given the union of scriptsets from both characters.
+
The normalization avoids confusability, and mixed-script detection is modified so that the normalized codepoint is given the union of the scriptsets of both characters.
-
* For instance, in the example of MICRO => MU, Micro was a 'Common'-script character -- the same script given to the '_' underscore codepoint -- and thus the normalized character's scriptset will be {Greek, Common}. 'Common' intersects with all non-empty scriptsets, and thus the normalized character can be used in tokens written in any script without causing script mixing.
+
* For instance, in the example of MICRO => MU, MICRO was a 'Common'-script character - the same script given to the '_' underscore codepoint - and thus the normalized character's scriptset will be {Greek, Common}. 'Common' intersects with all non-empty scriptsets, and thus the normalized character can be used in tokens written in any script without causing script mixing.
* The code points normalized in this fashion are those that are in use in the community, and judged not likely to cause issues with unsafe script mixing. For instance, the MICRO or MU codepoint may be used in an atom or variable dealing with microseconds.
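The scriptset reasoning above can be sketched as a small intersection check. This is a simplified illustration, not Elixir's tokenizer: the `ScriptSetSketch` module, its function names, and the hand-written scriptsets are all hypothetical, and only a few codepoint ranges are covered (anything else raises a `FunctionClauseError`). A real implementation would derive scriptsets from Unicode data.

```elixir
defmodule ScriptSetSketch do
  # Hypothetical scriptsets for a handful of codepoints.
  # Mu carries the union {Greek, Common} described for MICRO => MU.
  defp scriptset(0x03BC), do: MapSet.new([:greek, :common])
  defp scriptset(c) when c in ?a..?z, do: MapSet.new([:latin])
  defp scriptset(c) when c in 0x0430..0x044F, do: MapSet.new([:cyrillic])
  defp scriptset(c) when c in 0x03B1..0x03C9, do: MapSet.new([:greek])

  # 'Common' matches every script, so a set containing it never narrows the result.
  defp intersect(a, b) do
    cond do
      MapSet.member?(a, :common) -> b
      MapSet.member?(b, :common) -> a
      true -> MapSet.intersection(a, b)
    end
  end

  # A chunk mixes scripts when the intersection of its scriptsets is empty.
  defp chunk_mixed?(chunk) do
    resolved =
      chunk
      |> String.to_charlist()
      |> Enum.map(&scriptset/1)
      |> Enum.reduce(&intersect/2)

    MapSet.size(resolved) == 0
  end

  # Chunks separated by underscores are checked independently.
  def mixed_script?(identifier) do
    identifier
    |> String.split("_", trim: true)
    |> Enum.any?(&chunk_mixed?/1)
  end
end

IO.inspect(ScriptSetSketch.mixed_script?("\u{0430}dmin")) # Cyrillic а + Latin
IO.inspect(ScriptSetSketch.mixed_script?("\u{03BC}s"))    # normalized mu + Latin
```

In this sketch the first call reports script mixing and the second does not, because the `:common` member of mu's scriptset lets it combine with Latin letters.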