Commit 8b3718c

authored
CLDR-18987 Update tr35-personNames.md (#5045)
1 parent 08286bb commit 8b3718c

2 files changed

+79
-3
lines changed


docs/ldml/tr35-general.md

Lines changed: 14 additions & 3 deletions
@@ -516,13 +516,24 @@ Exemplars are characters used by a language, separated into different categories
516516
| --------------- | ----------- | -------- |
517517
| main / standard | Main letters used in the language | a-z å æ ø |
518518
| auxiliary | Additional characters for common foreign words, technical usage | á à ă â å ä ã ā æ ç é è ĕ ê ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ú ù ŭ û ü ū ÿ |
519+
| numbers | Main characters needed to display the common number formats: decimal, percent, and currency. | \[\\u061C\\u200E \\- , ٫ ٬ . % ٪ ‰ ؉ + 0٠ 1١ 2٢ 3٣ 4٤ 5٥ 6٦ 7٧ 8٨ 9٩\] |
520+
| numbers-auxiliary | Additional characters for use with numbers (technical or older usage) | |
521+
| punctuation | Main punctuation characters | - ‐ – — , ; \\: ! ? . … “ ” ‘ ’ ( ) [ ] § @ * / & # † ‡ ′ ″ |
522+
| punctuation-auxiliary | Additional punctuation (technical or older usage) | |
523+
| punctuation-person | Punctuation used in person names, such as "Jean-Luc Smith Ph.D., MD" | - / . , |
519524
| index | Characters for the header of an index | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z |
520-
| punctuation | Common punctuation | - ‐ – — , ; \\: ! ? . … “ ” ‘ ’ ( ) [ ] § @ * / & # † ‡ ′ ″ |
521-
| numbers | The characters needed to display the common number formats: decimal, percent, and currency. | \[\\u061C\\u200E \\- , ٫ ٬ . % ٪ ‰ ؉ + 0٠ 1١ 2٢ 3٣ 4٤ 5٥ 6٦ 7٧ 8٨ 9٩\] |
522525

523526
The basic exemplar character sets (main and auxiliary) contain the commonly used letters for a given modern form of a language, which can be used for testing and for determining the appropriate repertoire of letters for charset conversion or collation. ("Letter" is interpreted broadly, as anything having the property Alphabetic in the [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)], which also includes syllabaries and ideographs.) It is not a complete set of letters used for a language, nor should it be considered to apply to multiple languages in a particular country. Punctuation and other symbols should not be included in the main and auxiliary sets. In particular, format characters like CGJ are not included.
524527

525-
There are five sets altogether: main, auxiliary, punctuation, numbers, and index. The _main_ set should contain the minimal set required for users of the language, while the _auxiliary_ exemplar set is designed to encompass additional characters: those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on. Major style guidelines are good references for the auxiliary set. So, for example, if Irish newspapers and magazines would commonly have Danish names using å, for example, then it would be appropriate to include å in the auxiliary exemplar characters; just not in the main exemplar set. Thus English has the following:
528+
There are 4 types of sets altogether: main, numbers, punctuation, and index.
529+
Within each type, there are subtypes:
530+
a _main_ set containing the minimal set required for users of the language,
531+
and an _auxiliary_ set, which is designed to encompass additional characters —
532+
those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on.
533+
There are two exceptions: an index set doesn't have an _auxiliary_ set,
534+
and the punctuation set has an additional subtype for person-name punctuation (see [Person Name Validation](tr35-personNames.html#person-name-validation)).
535+
536+
Major style guidelines are good references for an auxiliary set. For example, if Irish newspapers and magazines commonly have Danish names using å, then it would be appropriate to include å in the auxiliary exemplar characters, just not in the main exemplar set. Thus English has the following:
526537

527538
```xml
528539
<exemplarCharacters>[a b c d e f g h i j k l m n o p q r s t u v w x y z]</exemplarCharacters>

docs/ldml/tr35-personNames.md

Lines changed: 65 additions & 0 deletions
@@ -1097,6 +1097,71 @@ For the expected sample name items, assume a name such as Mr. Richard “Rich”
10971097

10981098
The `nameField` values and their modifiers are described in the [Person Name Object](#person-name-object) and [namePattern Syntax](#namepattern-syntax) sections.
10991099

1100+
## Person Name Validation
1101+
1102+
When implementations allow entry of person names, they are often too strict; there are many instances where people can’t enter their real names, such as O’Brian, Stéphanie, Wałęsa, Þjóðólfr. Conversely, when an implementation is too lenient, it allows names like Ȟěl̀a, or B🅾️b. (See also [Zalgo](https://en.wikipedia.org/wiki/Zalgo_text).)
1103+
1104+
Sometimes the constraints are imposed by limitations of outdated software or databases (such as not supporting Unicode characters), or by legal restrictions (such as only accepting names legal in Switzerland on native Swiss passports).
1105+
1106+
However, when the limitations are due to unfamiliarity with the kinds of characters that can appear in languages, Unicode properties and CLDR data can help implementers to avoid being either too strict or too lenient.
1107+
1108+
### Letters
1109+
1110+
A common restriction is that the letters in a name only come from a single script. That may be too lenient: there are over 1,453 letters in the Latin script in Unicode 17\!
1111+
1112+
To narrow it down, an implementation may form the union of exemplar characters from a set of languages in CLDR (together with their uppercase equivalents); these include letters and combining marks (accents). Here are some examples:
1113+
1114+
| Language | Exemplars (Main) |
1115+
| :---- | :---: |
1116+
| Icelandic | a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö |
1117+
| Polish | a ą b c ć d e ę f g h i j k l ł m n ń o ó p r s ś t u w y z ź ż |
1118+
| Arabic | ً ٌ ٍ َ ُ ِ ّ ْ ٰ ء أ ؤ إ ئ ا آ ب ة ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ى ي |
1119+
| Urdu | ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ھ ء ی ے |
1120+
1121+
Auxiliary exemplars (in the same script) should also be included: they are not part of the core alphabet but are in use, typically in loan words or names.
1122+
For example, in English someone would not be surprised to see a name such as René or Schröder.
1123+
1124+
| Language | Exemplars (Auxiliary) |
1125+
| :---- | :---: |
1126+
| Polish (aux) | à â å ä æ ç é è ê ë î ï ô ö œ q ß ù û ü v x ÿ |
1127+
1128+
It is often useful to explicitly include the exemplars from multiple languages.
1129+
For example, an implementation may choose to include the exemplars from the official languages of the EU, or from the major languages of Africa.
1130+
There is data in CLDR for the populations of languages in countries, and their official status, that may be useful for this.
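
The union of exemplars described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `EXEMPLARS` dictionary and both function names are hypothetical, and the Icelandic and Polish sets are copied by hand from the table above, whereas a real implementation would read them from CLDR data.

```python
import unicodedata

# Main exemplars copied from the table above (assumption: hard-coded here
# for illustration; real code would load them from CLDR).
EXEMPLARS = {
    "is": "a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö",
    "pl": "a ą b c ć d e ę f g h i j k l ł m n ń o ó p r s ś t u w y z ź ż",
}

def allowed_letters(languages):
    """Union of the exemplars for the given languages, plus uppercase forms."""
    allowed = set()
    for lang in languages:
        for item in EXEMPLARS[lang].split():
            item = unicodedata.normalize("NFC", item)
            allowed.add(item)
            allowed.add(item.upper())
    return allowed

def uses_only_allowed_letters(name, allowed):
    """True if every letter or mark in the NFC form of name is in allowed."""
    for ch in unicodedata.normalize("NFC", name):
        # General categories starting with L (letters) or M (marks) are
        # checked; spaces and punctuation are left to a separate test.
        if unicodedata.category(ch)[0] in "LM" and ch not in allowed:
            return False
    return True
```

With the union of both languages, `Wałęsa` and `Þjóðólfr` pass, while `Jürgen` does not until a set containing ü (such as the Polish auxiliary exemplars) is added.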
1131+
1132+
### Non-Letters
1133+
1134+
Names, even for a single name field like the family name, may have spaces, such as “de Silva”. Some additional punctuation characters commonly used in names are provided by the punctuation-person exemplars.
1135+
1136+
| Polish (punct-person) | , . \- / |
1137+
| :---- | :---: |
1138+
1139+
Those may include some variants of the ASCII hyphen; typically the best approach is to normalize them as below.
1140+
1141+
Examples include: Jean-Luc; Dr. Doom; James Smith Jr., MD
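
A check on the non-letters in a name might look like the sketch below. The function name and the `PERSON_PUNCTUATION` constant are hypothetical; the set is the Polish punctuation-person row from the table above plus U+0020 SPACE, and other locales may supply different characters.

```python
import unicodedata

# Polish punctuation-person exemplars from the table above, plus a space
# (assumption: hard-coded for illustration).
PERSON_PUNCTUATION = set(", . - /".split()) | {" "}

def non_letters_allowed(name, allowed=PERSON_PUNCTUATION):
    """True if every character that is not a letter or mark is in allowed."""
    for ch in unicodedata.normalize("NFC", name):
        is_letter_or_mark = unicodedata.category(ch)[0] in "LM"
        if not is_letter_or_mark and ch not in allowed:
            return False
    return True
```

The three example names above pass this check, while a name containing an emoji does not.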
1142+
1143+
### Normalization
1144+
1145+
When names are input from the keyboard, they should be normalized before validation. Typically the best foundation for that is Unicode Normalization Form C (NFC). Additional useful normalizations are:
1146+
1147+
* Replacement of arbitrary sequences of whitespace characters by a single space.
1148+
* \\p{whitespace}{2,∞} → U+0020
1149+
* Replacement of U+2010 HYPHEN and U+2011 NON-BREAKING HYPHEN by U+002D HYPHEN-MINUS
1150+
* \[‐‑\] → \-
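
The normalization steps in this section could be combined roughly as follows. This is a sketch under stated assumptions: `normalize_name` is a hypothetical name, Python's `\s` is used as an approximation of the Unicode White_Space property, single whitespace characters such as a tab are also mapped to U+0020, and trimming leading and trailing spaces is an extra step not listed above.

```python
import re
import unicodedata

def normalize_name(raw):
    """NFC-normalize, unify hyphen variants, and collapse whitespace."""
    text = unicodedata.normalize("NFC", raw)
    # U+2010 HYPHEN and U+2011 NON-BREAKING HYPHEN -> U+002D HYPHEN-MINUS
    text = re.sub("[\u2010\u2011]", "-", text)
    # Any run of whitespace (including a single tab or no-break space)
    # -> one U+0020 SPACE. For str patterns, \s matches Unicode whitespace.
    text = re.sub(r"\s+", " ", text)
    # Assumption: leading and trailing spaces are not wanted in a name.
    return text.strip()
```

Running the validation checks on the output of such a function, rather than on the raw input, keeps the allowed punctuation set small.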
1151+
1152+
### Additional possible constraints
1153+
1154+
Other useful constraints include testing for extremely unusual cases, which may be mistakes or jokes ([Zalgo](https://en.wikipedia.org/wiki/Zalgo_text)). For these it is helpful to transform first into NFD, then apply the tests.
1155+
1156+
* Too many identical grapheme clusters in a sequence
1157+
* (Tóóóóóm)
1158+
* Too many non-letters in a row
1159+
* (Jean—Luc Jr..,, MD)
1160+
* Too many combining marks in a row
1161+
* Faruq̣̣̈̈
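
The three tests listed above can be approximated as in the sketch below. All names and thresholds are hypothetical choices for illustration. The Python standard library has no grapheme cluster segmentation, so a cluster is approximated here as a base character followed by its combining marks; a production implementation would use proper segmentation per UAX #29.

```python
import unicodedata

def _is_mark(ch):
    return unicodedata.category(ch).startswith("M")

def _is_letter(ch):
    return unicodedata.category(ch).startswith("L")

def approximate_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and _is_mark(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def longest_run(flags):
    """Length of the longest run of consecutive true values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

def suspicious_reasons(name, max_repeat=3, max_non_letters=3, max_marks=2):
    """Return the list of tests that the name fails (thresholds are guesses)."""
    nfd = unicodedata.normalize("NFD", name)
    clusters = approximate_clusters(nfd)
    reasons = []
    # A run of k equal neighbours means k + 1 identical clusters in a row.
    same_as_previous = [a == b for a, b in zip(clusters, clusters[1:])]
    if longest_run(same_as_previous) + 1 > max_repeat:
        reasons.append("repeated cluster")
    if longest_run(not (_is_letter(c) or _is_mark(c)) for c in nfd) > max_non_letters:
        reasons.append("non-letter run")
    if longest_run(_is_mark(c) for c in nfd) > max_marks:
        reasons.append("combining-mark run")
    return reasons
```

The default of two combining marks is chosen so that Vietnamese names such as Nguyễn, whose vowels can carry two marks in NFD, are not flagged; suitable limits depend on the languages an implementation supports.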
1162+
1163+
For further information, including confusables, mixed script detection, and so on, see [UTS \#39: Unicode Security Mechanisms](https://www.unicode.org/reports/tr39/).
1164+
11001165
## PersonName Data Interface Examples
11011166

11021167
### Example 1
