Commit fd9dbf8

Update Unicode to version 17.0.0 (#14760)
This is an automated commit created by the Maintenance project: https://github.com/eksperimental/maintenance
Please read the release notes by visiting <http://www.unicode.org/versions/Unicode17.0.0/>.
1 parent 7f45bef commit fd9dbf8

13 files changed: 4179 additions & 585 deletions

lib/elixir/lib/string.ex

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ defmodule String do
 "hello world"

 The functions in this module act according to
-[The Unicode Standard, Version 16.0.0](http://www.unicode.org/versions/Unicode16.0.0/).
+[The Unicode Standard, Version 17.0.0](http://www.unicode.org/versions/Unicode17.0.0/).

 ## Interpolation
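
As an illustration of what "act according to The Unicode Standard" means for the `String` module (this snippet is not part of the commit), the sketch below compares codepoint and grapheme counts for a decomposed "é" and then normalizes it to NFC. It only uses `String.codepoints/1`, `String.length/1`, and `String.normalize/2`; the variable names are chosen here for illustration.

```elixir
# "e" followed by COMBINING ACUTE ACCENT (U+0301): two codepoints, one grapheme.
decomposed = "e\u0301"

# String.codepoints/1 splits on Unicode codepoints.
codepoint_count = decomposed |> String.codepoints() |> length()

# String.length/1 counts grapheme clusters, not bytes or codepoints.
grapheme_count = String.length(decomposed)

# NFC normalization composes the pair into the single codepoint U+00E9.
composed = String.normalize(decomposed, :nfc)

IO.inspect({codepoint_count, grapheme_count, composed == "\u00E9"})
```

The grapheme count is what most user-facing code wants, which is why `String.length/1` differs from `byte_size/1` on non-ASCII input.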

lib/elixir/pages/references/unicode-syntax.md

Lines changed: 11 additions & 15 deletions
@@ -48,7 +48,7 @@ For the technical details, see the next sections that cover the technical Unicod

 ## Unicode Annex #31

-Elixir implements the requirements outlined in the [Unicode Annex #31](https://unicode.org/reports/tr31/), version 15.0.
+Elixir implements the requirements outlined in the [Unicode Annex #31](https://unicode.org/reports/tr31/), version 17.0.

 ### R1. Default Identifiers

@@ -112,33 +112,31 @@ Choosing requirement R4 automatically excludes requirements R5, R6, and R7.

 ## Unicode Technical Standard #39

-Elixir conforms to the clauses outlined in the [Unicode Technical Standard #39](https://unicode.org/reports/tr39/) on Security, version 15.0.
+Elixir conforms to the clauses outlined in the [Unicode Technical Standard #39](https://unicode.org/reports/tr39/) on Security, version 17.0.

 ### C1. General Security Profile for Identifiers

-Elixir will not allow tokenization of identifiers with codepoints in `\p{Identifier_Status=Restricted}`.
+Elixir will not allow tokenization of identifiers with codepoints in `\p{Identifier_Status=Restricted}`, except for the outlined 'Additional normalizations' section below.

 > An implementation following the General Security Profile does not permit any characters in \p{Identifier_Status=Restricted}, ...

-For instance, the 'HANGUL FILLER' (`ㅤ`) character, which is often invisible, is an uncommon codepoint and will trigger this warning.
-
-See the note below about additional normalizations, which can perform automatic replacement of some Restricted identifiers.
+For instance, the 'HANGUL FILLER' (`ㅤ`) character, which is often invisible, is an uncommon codepoint and will trigger a warning.

 ### C2. Confusable detection

 Elixir will warn on identifiers that look the same, but aren't. Examples: in `а = a = 1`, the two 'a' characters are Cyrillic and Latin, and could be confused for each other; in `力 = カ = 1`, both are Japanese, but different codepoints, in different scripts of that writing system. Confusable identifiers can lead to hard-to-catch bugs (say, due to copy-pasted code) and can be unsafe, so we will warn about identifiers within a single file that could be confused with each other.

-We use the means described in Section 4, 'Confusable Detection', with one noted modification
+We use the means described in Section 4, 'Confusable Detection', with one noted modification:

 > Alternatively, it shall declare that it uses a modification, and provide a precise list of character mappings that are added to or removed from the provided ones.

 Elixir will not warn on confusability for identifiers made up exclusively of characters in a-z, A-Z, 0-9, and _. This is because ASCII identifiers have existed for so long that the programming community has had their own means of dealing with confusability between identifiers like `l,1` or `O,0` (for instance, fonts designed for programming usually make it easy to differentiate between those characters).

 ### C3. Mixed Script Detection

-Elixir will not allow tokenization of mixed-script identifiers unless it is via chunks separated by an underscore, like `http_сервер`. We use the means described in Section 5.1, Mixed-Script Detection, to determine if script mixing is occurring, with the modification documented in the section 'Additional Normalizations', below.
+Elixir will not allow tokenization of mixed-script identifiers unless it is via chunks separated by an underscore, like `http_сервер`. We use the means described in Section 5.1, Mixed-Script Detection, to determine if script mixing is occurring, with the 'Additional Normalizations' documented in.

-Examples: Elixir allows an identifiers like `幻ㄒㄧㄤ`, even though it includes characters from multiple 'scripts', because those scripts all 'resolve' to Japanese when applying the resolution rules from UTS 39 5.1. When mixing Latin and Japanese scripts, underscores are necessary, as in `:T_シャツ` (the Japanese word for 't-shirt' with an additional underscore separating the letter T).
+Examples: Elixir allows an identifiers like `幻한`, even though it includes characters from multiple 'scripts', as Han characters may be mixed with Japanese and Korean, according to the rules from UTS 39 5.1. When mixing Latin and Japanese scripts, underscores are necessary, as in `:T_シャツ` (the Japanese word for 't-shirt' with an additional underscore separating the letter T).

 Elixir does not allow code like `if аdmin, do: :ok, else: :err`, where the scriptset for the 'a' character is {Cyrillic} but all other characters have scriptsets of {Latin}. The scriptsets fail to resolve and a descriptive error is shown.

@@ -148,16 +146,14 @@ Elixir does not allow code like `if аdmin, do: :ok, else: :err`, where the scri

 'C5 - Mixed number detection' conformance is inapplicable as Elixir does not support Unicode numbers.

-### Addition normalizations and documented UTS 39 modifications
+### Addition Normalizations

 As of Elixir 1.14, some codepoints in `\p{Identifier_Status=Restricted}` are *normalized* to other, unrestricted codepoints.

-Initially this is only done to translate MICRO SIGN `µ` to Greek lowercase mu, `μ`.
-
-This is not a modification of UTS39 clauses C1 (General Security Profile) or C2 (Confusability Detection); however, it is a documented modification of C3, 'Mixed-Script detection'.
+This is currently only applied to translate MICRO SIGN (`µ`) to Greek lowercase mu (`μ`).

-Mixed-script detection is modified by these normalizations to the extent that the normalized codepoint is given the union of scriptsets from both characters.
+The normalization avoids confusability and the mixed-script detection is modified to the extent that the normalized codepoint is given the union of scriptsets from both characters.

-* For instance, in the example of MICRO => MU, Micro was a 'Common'-script character -- the same script given to the '_' underscore codepoint -- and thus the normalized character's scriptset will be {Greek, Common}. 'Common' intersects with all non-empty scriptsets, and thus the normalized character can be used in tokens written in any script without causing script mixing.
+* For instance, in the example of MICRO => MU, MICRO was a 'Common'-script character - the same script given to the '_' underscore codepoint - and thus the normalized character's scriptset will be {Greek, Common}. 'Common' intersects with all non-empty scriptsets, and thus the normalized character can be used in tokens written in any script without causing script mixing.

 * The code points normalized in this fashion are those that are in use in the community, and judged not likely to cause issues with unsafe script mixing. For instance, the MICRO or MU codepoint may be used in an atom or variable dealing with microseconds.
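
The mixed-script rules described in the documentation above can be observed from the parser's results. The following is a minimal sketch, not part of the commit, assuming a recent Elixir release (1.14 or later, per the text above). It relies only on `Code.string_to_quoted/1`; the variable names are illustrative, and the escapes `\u0430` and `\u00B5` are used so that the Cyrillic letter and the MICRO SIGN are unambiguous in the source.

```elixir
# Each snippet is only tokenized and parsed, never evaluated.
parse = fn source -> Code.string_to_quoted(source) end

# Han + Hangul: both scriptsets contain Korean, so the identifier resolves.
han_hangul = parse.("幻한 = 1")

# Latin and Cyrillic chunks separated by an underscore are allowed.
chunked = parse.("http_сервер = 1")

# CYRILLIC SMALL LETTER A (U+0430) followed by Latin "dmin":
# the scriptsets fail to resolve, so tokenization is rejected.
mixed = parse.("\u0430dmin = 1")

# MICRO SIGN (U+00B5) followed by Latin "s": accepted, because the
# normalized character also carries the 'Common' script.
micro = parse.("\u00B5s = 1")

for {label, result} <- [han_hangul: han_hangul, chunked: chunked, mixed: mixed, micro: micro] do
  IO.puts("#{label}: #{elem(result, 0)}")
end
```

Using `Code.string_to_quoted/1` rather than `Code.eval_string/1` keeps the check limited to tokenization and parsing, which is where these rules are enforced.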

lib/elixir/test/elixir/kernel/string_tokenizer_test.exs

Lines changed: 2 additions & 2 deletions
@@ -134,8 +134,8 @@ defmodule Kernel.StringTokenizerTest do

 test "allows legitimate script mixing" do
 # Mixed script with supersets, numbers, and underscores
-assert Code.eval_string("幻ㄒㄧㄤ = 1") == {1, [幻ㄒㄧㄤ: 1]}
-assert Code.eval_string("幻ㄒㄧㄤ1 = 1") == {1, [幻ㄒㄧㄤ1: 1]}
+assert Code.eval_string("幻한 = 1") == {1, [幻한: 1]}
+assert Code.eval_string("幻한1 = 1") == {1, [幻한1: 1]}
 assert Code.eval_string("__सवव_1? = 1") == {1, [__सवव_1?: 1]}

 # Elixir's normalizations combine scriptsets of the 'from' and 'to' characters,
