docs/ldml/tr35-general.md
+14 -3 (14 additions & 3 deletions)
@@ -516,13 +516,24 @@ Exemplars are characters used by a language, separated into different categories
516
516
| --------------- | ----------- | -------- |
517
517
| main / standard | Main letters used in the language | a-z å æ ø |
518
518
| auxiliary | Additional characters for common foreign words, technical usage | á à ă â å ä ã ā æ ç é è ĕ ê ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ú ù ŭ û ü ū ÿ |
519
+
| numbers | Main characters needed to display the common number formats: decimal, percent, and currency. |\[\\u061C\\u200E \\- , ٫ ٬ . % ٪ ‰ ؉ + 0٠ 1١ 2٢ 3٣ 4٤ 5٥ 6٦ 7٧ 8٨ 9٩\]|
520
+
| numbers-auxiliary | Additional characters for use with numbers (technical or older usage) ||
| numbers | The characters needed to display the common number formats: decimal, percent, and currency. |\[\\u061C\\u200E \\- , ٫ ٬ . % ٪ ‰ ؉ + 0٠ 1١ 2٢ 3٣ 4٤ 5٥ 6٦ 7٧ 8٨ 9٩\]|
522
525
523
526
The basic exemplar character sets (main and auxiliary) contain the commonly used letters for a given modern form of a language, which can be used for testing and for determining the appropriate repertoire of letters for charset conversion or collation. ("Letter" is interpreted broadly, as anything having the property Alphabetic in the [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)], which also includes syllabaries and ideographs.) It is not a complete set of letters used for a language, nor should it be considered to apply to multiple languages in a particular country. Punctuation and other symbols should not be included in the main and auxiliary sets. In particular, format characters like CGJ are not included.
524
527
525
-
There are five sets altogether: main, auxiliary, punctuation, numbers, and index. The _main_ set should contain the minimal set required for users of the language, while the _auxiliary_ exemplar set is designed to encompass additional characters: those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on. Major style guidelines are good references for the auxiliary set. So, for example, if Irish newspapers and magazines would commonly have Danish names using å, for example, then it would be appropriate to include å in the auxiliary exemplar characters; just not in the main exemplar set. Thus English has the following:
528
+
There are 4 types of sets altogether: main, numbers, punctuation, and index.
529
+
Within each type, there are subtypes:
530
+
a _main_ set containing the minimal set required for users of the language,
531
+
and an _auxiliary_ set, which is designed to encompass additional characters —
532
+
those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on.
533
+
There are two exceptions: an index set doesn't have an _auxiliary_ set,
534
+
and the punctuation set has an additional subtype for person-name punctuation (see [Person Name Validation](tr35-personNames.html#person-name-validation)).
535
+
536
+
Major style guidelines are good references for an auxiliary set. For example, if Irish newspapers and magazines commonly print Danish names using å, then it would be appropriate to include å in the auxiliary exemplar characters, just not in the main exemplar set. Thus English has the following:
526
537
527
538
```xml
528
539
<exemplarCharacters>[a b c d e f g h i j k l m n o p q r s t u v w x y z]</exemplarCharacters>
docs/ldml/tr35-personNames.md
+65 (65 additions & 0 deletions)
@@ -1097,6 +1097,71 @@ For the expected sample name items, assume a name such as Mr. Richard “Rich”
1097
1097
1098
1098
The `nameField` values and their modifiers are described in the [Person Name Object](#person-name-object) and [namePattern Syntax](#namepattern-syntax) sections.
1099
1099
1100
+
## Person Name Validation
1101
+
1102
+
When implementations allow entry of person names, they are often too strict; there are many instances where people can’t enter their real names, such as O’Brian, Stéphanie, Wałęsa, Þjóðólfr. Conversely, when an implementation is too lenient, it allows names like Ȟěl̀a, or B🅾️b. (See also [Zalgo](https://en.wikipedia.org/wiki/Zalgo_text).)
1103
+
1104
+
Sometimes the constraints are imposed by limitations of outdated software or databases (such as not supporting Unicode characters), or by legal restrictions (such as only accepting names legal in Switzerland on native Swiss passports).
1105
+
1106
+
However, when the limitations are due to unfamiliarity with the kinds of characters that can appear in languages, Unicode properties and CLDR data can help implementers to avoid being either too strict or too lenient.
1107
+
1108
+
### Letters
1109
+
1110
+
A common restriction is that all letters in a name must come from a single script. That may be too lenient: there are over 1,450 letters in the Latin script in Unicode 17\!
1111
+
1112
+
To narrow it down, an implementation may form the union of exemplar characters from a set of languages in CLDR (together with their uppercase equivalents); these include letters and combining marks (accents). Here are some examples:
1113
+
1114
+
| Language | Exemplars (Main) |
1115
+
| :---- | :---: |
1116
+
| Icelandic | a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö |
1117
+
| Polish | a ą b c ć d e ę f g h i j k l ł m n ń o ó p r s ś t u w y z ź ż |
1118
+
| Arabic | ً ٌ ٍ َ ُ ِ ّ ْ ٰ ء أ ؤ إ ئ ا آ ب ة ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ى ي |
1119
+
| Urdu | ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ھ ء ی ے |
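
The union described above can be sketched in a few lines of Python. This is a minimal illustration, not part of CLDR: the exemplar strings are copied from the table, while `MAIN_EXEMPLARS`, `allowed_letters`, and `letters_ok` are hypothetical names, and a real implementation would read the exemplar sets from CLDR data rather than hard-coding them.

```python
import unicodedata

# Main exemplars copied from the table above; hard-coded for illustration only.
MAIN_EXEMPLARS = {
    "is": "a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö",
    "pl": "a ą b c ć d e ę f g h i j k l ł m n ń o ó p r s ś t u w y z ź ż",
}


def allowed_letters(languages):
    """Union of the main exemplars of the given languages, plus uppercase forms."""
    allowed = set()
    for lang in languages:
        for item in MAIN_EXEMPLARS[lang].split():
            item = unicodedata.normalize("NFC", item)
            allowed.add(item)
            allowed.add(item.upper())
    return allowed


def letters_ok(name, allowed):
    """True if every character of the NFC-normalized name is in the allowed set."""
    return all(ch in allowed for ch in unicodedata.normalize("NFC", name))
```

With only Polish selected, `Wałęsa` passes but `Þjóðólfr` does not; adding Icelandic to the union accepts both. This sketch checks letters only, so spaces and punctuation would have to be handled separately.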
1120
+
1121
+
Auxiliary exemplars (in the same script) should also be included: they are not part of the core alphabet, but are in use, typically in loan words or names.
1122
+
For example, in English someone would not be surprised to see a name such as René or Schröder.
1123
+
1124
+
| Language | Exemplars (Auxiliary) |
1125
+
| :---- | :---: |
1126
+
| Polish (aux) | à â å ä æ ç é è ê ë î ï ô ö œ q ß ù û ü v x ÿ |
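
One way to use the auxiliary set is to report which tier a name needs, so that an implementation can decide how lenient to be. The sketch below assumes the Polish main and auxiliary exemplars shown in the tables; `with_uppercase` and `classify` are hypothetical helper names, and the three result labels are an illustration rather than anything defined by CLDR.

```python
import unicodedata

# Polish exemplars copied from the tables above; hard-coded for illustration only.
POLISH_MAIN = "a ą b c ć d e ę f g h i j k l ł m n ń o ó p r s ś t u w y z ź ż"
POLISH_AUX = "à â å ä æ ç é è ê ë î ï ô ö œ q ß ù û ü v x ÿ"


def with_uppercase(exemplars):
    """Exemplar items plus their single-character uppercase forms."""
    items = {unicodedata.normalize("NFC", x) for x in exemplars.split()}
    # Skip multi-character results such as "ß".upper() == "SS".
    return items | {x.upper() for x in items if len(x.upper()) == 1}


def classify(name):
    """Return "main", "auxiliary", or "rejected" for the letters of a name."""
    main = with_uppercase(POLISH_MAIN)
    aux = with_uppercase(POLISH_AUX)
    name = unicodedata.normalize("NFC", name)
    if all(ch in main for ch in name):
        return "main"
    if all(ch in main or ch in aux for ch in name):
        return "auxiliary"
    return "rejected"
```

Under these assumptions `Wałęsa` needs only the main set, `René` and `Schröder` need the auxiliary set, and `Þjóðólfr` is rejected until Icelandic exemplars are added to the union.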
1127
+
1128
+
It is often useful to explicitly include the exemplars from multiple languages.
1129
+
For example, an implementation may choose to include the exemplars from the official languages of the EU, or from the major languages of Africa.
1130
+
There is data in CLDR for the populations of languages in countries, and their official status, that may be useful for this.
1131
+
1132
+
### Non-Letters
1133
+
1134
+
Names, even for a single name field like the family name, may have spaces, such as “de Silva”. Some additional punctuation characters commonly used in names are provided by the punctuation-person exemplars.
1135
+
1136
+
| Language | Exemplars (Punctuation-person) |
| :---- | :---: |
| Polish (punct-person) | , . \- / |
1138
+
1139
+
Those may include some variants of the ASCII hyphen; typically the best approach is to normalize them as below.
1140
+
1141
+
Examples include: Jean-Luc; Dr. Doom; James Smith Jr., MD
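
A check on non-letters might look like the following sketch. It assumes the Polish punctuation-person exemplars from the table above (with the Markdown-escaped hyphen written as a plain `-`) and treats any character whose Unicode general category starts with `L` or `M` as a letter or mark; `PUNCT_PERSON_PL` and `non_letters_ok` are hypothetical names. Whether the letters themselves are acceptable is a separate test against the letter exemplars.

```python
import unicodedata

# Polish punctuation-person exemplars from the table above; illustration only.
PUNCT_PERSON_PL = ", . - /"


def non_letters_ok(name, punct_exemplars=PUNCT_PERSON_PL):
    """True if every non-letter in the name is a space or an allowed punctuation mark."""
    allowed = set(punct_exemplars.split()) | {" "}
    for ch in unicodedata.normalize("NFC", name):
        is_letter_or_mark = unicodedata.category(ch)[0] in "LM"
        if not is_letter_or_mark and ch not in allowed:
            return False
    return True
```

The examples above (`Jean-Luc`, `Dr. Doom`, `James Smith Jr., MD`) all pass this check, while a name containing an emoji such as `B🅾️b` fails because U+1F17E is a symbol, not a letter or allowed punctuation.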
1142
+
1143
+
### Normalization
1144
+
1145
+
When names are input from the keyboard, they should be normalized before validation. Typically the best foundation for that is Unicode Normalization Form C (NFC). Additional useful normalizations are:
1146
+
1147
+
* Replacement of arbitrary sequences of whitespace characters by a single space.
1148
+
    * \\p{whitespace}{2,∞} → U+0020
1149
+
* Replacement of U+2010 HYPHEN and U+2011 NON-BREAKING HYPHEN by U+002D HYPHEN-MINUS
1150
+
    * \[‐‑\] → \-
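
The normalization steps above can be sketched with the Python standard library. Two points are assumptions rather than part of the text: `normalize_name` is a hypothetical name, and Python's `\s` is used as an approximation of the Unicode White_Space property (the two differ for a few control characters). The sketch also maps a single non-space whitespace character, such as a tab, to U+0020, which goes slightly beyond the `{2,∞}` pattern as written.

```python
import re
import unicodedata


def normalize_name(raw):
    """Normalize a typed name before validation (illustrative sketch)."""
    # 1. Unicode Normalization Form C.
    name = unicodedata.normalize("NFC", raw)
    # 2. Any run of whitespace becomes a single U+0020 SPACE.
    name = re.sub(r"\s+", " ", name)
    # 3. U+2010 HYPHEN and U+2011 NON-BREAKING HYPHEN become U+002D.
    name = re.sub("[\u2010\u2011]", "-", name)
    return name
```

For example, `normalize_name("Jean\u2010Luc  \t Picard")` yields `Jean-Luc Picard`, and a decomposed `e` followed by U+0301 is composed into `é`.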
1151
+
1152
+
### Additional Possible Constraints
1153
+
1154
+
Other useful constraints include testing for extremely unusual cases, which may be mistakes or jokes ([Zalgo](https://en.wikipedia.org/wiki/Zalgo_text)). For these it is helpful to transform first into NFD, then apply the tests.
1155
+
1156
+
* Too many identical grapheme clusters in a sequence
1157
+
* (Tóóóóóm)
1158
+
* Too many non-letters in a row
1159
+
* (Jean—Luc Jr..,, MD)
1160
+
* Too many combining marks in a row
1161
+
* (Faruq̣̣̈̈)
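
The three tests above can be sketched as follows. The text gives no thresholds, so `max_repeats`, `max_non_letters`, and `max_marks` are hypothetical defaults that an implementation would need to tune. The standard library has no grapheme cluster segmentation, so `clusters` approximates a cluster as a base character plus any following combining marks; a full implementation would follow UAX #29.

```python
import unicodedata


def clusters(text):
    """Approximate grapheme clusters: a base character plus trailing marks."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def longest_run(flags):
    """Length of the longest run of consecutive true values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def suspicious(name, max_repeats=3, max_non_letters=3, max_marks=2):
    """Return the list of tests that the name fails, after converting to NFD."""
    nfd = unicodedata.normalize("NFD", name)
    cl = clusters(nfd)
    problems = []
    # A run of n "same as previous" flags means n + 1 identical clusters.
    same = [i > 0 and cl[i] == cl[i - 1] for i in range(len(cl))]
    if longest_run(same) + 1 > max_repeats:
        problems.append("repeated clusters")
    # First letter of the general category: L = letter, M = mark, and so on.
    cats = [unicodedata.category(ch)[0] for ch in nfd]
    if longest_run(c not in "LM" for c in cats) > max_non_letters:
        problems.append("non-letters")
    if longest_run(c == "M" for c in cats) > max_marks:
        problems.append("combining marks")
    return problems
```

With these defaults, each of the three examples in the list fails exactly one test, while an ordinary name such as `Stéphanie` returns an empty list.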
1162
+
1163
+
For further information, including confusables, mixed script detection, and so on, see [UTS \#39: Unicode Security Mechanisms](https://www.unicode.org/reports/tr39/).