CLDR-18987 Update tr35-personNames.md (#5045)

macchiati · web-flow · commit 8b3718c0ee97 · 2025-09-17T15:40:17.000-07:00
diff --git a/docs/ldml/tr35-general.md b/docs/ldml/tr35-general.md
@@ -516,13 +516,24 @@ Exemplars are characters used by a language, separated into different categories
 | --------------- | ----------- | -------- |
 | main / standard | Main letters used in the language | a-z å æ ø |
 | auxiliary       | Additional characters for common foreign words, technical usage | á à ă â å ä ã ā æ ç é è ĕ ê ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ú ù ŭ û ü ū ÿ |
+| numbers         | Main characters needed to display the common number formats: decimal, percent, and currency. | \[\\u061C\\u200E \\- , ٫ ٬ . % ٪ ‰ ؉ + 0٠ 1١ 2٢ 3٣ 4٤ 5٥ 6٦ 7٧ 8٨ 9٩\] |
+| numbers-auxiliary         | Additional characters for use with numbers (technical or older usage) |  |
+| punctuation     | Main punctuation characters | - ‐ – — , ; \\: ! ? . … “ ” ‘ ’ ( ) [ ] § @ * / & # † ‡ ′ ″ |
+| punctuation-auxiliary     | Additional punctuation (technical or older usage) |  |
+| punctuation-person     | Punctuation used in people's names, such as "Jean-Luc Smith Ph.D., MD". | - / . , |
 | index           | Characters for the header of an index | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z |
-| punctuation     | Common punctuation | - ‐ – — , ; \\: ! ? . … “ ” ‘ ’ ( ) [ ] § @ * / & # † ‡ ′ ″ |
-| numbers         | The characters needed to display the common number formats: decimal, percent, and currency. | \[\\u061C\\u200E \\- , ٫ ٬ . % ٪ ‰ ؉ + 0٠ 1١ 2٢ 3٣ 4٤ 5٥ 6٦ 7٧ 8٨ 9٩\] |
 
 The basic exemplar character sets (main and auxiliary) contain the commonly used letters for a given modern form of a language, which can be for testing and for determining the appropriate repertoire of letters for charset conversion or collation. ("Letter" is interpreted broadly, as anything having the property Alphabetic in the [[UAX44](https://www.unicode.org/reports/tr41/#UAX44)], which also includes syllabaries and ideographs.) It is not a complete set of letters used for a language, nor should it be considered to apply to multiple languages in a particular country. Punctuation and other symbols should not be included in the main and auxiliary sets. In particular, format characters like CGJ are not included.
 
-There are five sets altogether: main, auxiliary, punctuation, numbers, and index. The _main_ set should contain the minimal set required for users of the language, while the _auxiliary_ exemplar set is designed to encompass additional characters: those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on. Major style guidelines are good references for the auxiliary set. So, for example, if Irish newspapers and magazines would commonly have Danish names using å, for example, then it would be appropriate to include å in the auxiliary exemplar characters; just not in the main exemplar set. Thus English has the following:
+There are 4 types of sets altogether: main, numbers, punctuation, and index.
+Within each type, there are subtypes: 
+a _main_ set containing the minimal set required for users of the language, 
+and an _auxiliary_ set, which is designed to encompass additional characters —
+those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on.
+There are two exceptions: an index set doesn't have an _auxiliary_ set, 
+and the punctuation set has an additional subtype for person-name punctuation (see [Person Name Validation](tr35-personNames.html#person-name-validation)).
+
+Major style guidelines are good references for an auxiliary set. For example, if Irish newspapers and magazines commonly print Danish names using å, then it would be appropriate to include å in the auxiliary exemplar characters, just not in the main exemplar set. Thus English has the following:
 
 ```xml
 <exemplarCharacters>[a b c d e f g h i j k l m n o p q r s t u v w x y z]</exemplarCharacters>
diff --git a/docs/ldml/tr35-personNames.md b/docs/ldml/tr35-personNames.md
@@ -1097,6 +1097,71 @@ For the expected sample name items, assume a name such as Mr. Richard “Rich”
 
 The `nameField` values and their modifiers are described in the [Person Name Object](#person-name-object) and [namePattern Syntax](#namepattern-syntax) sections.
 
+## Person Name Validation
+
+When implementations allow entry of person names, they are often too strict; there are many instances where people can’t enter their real names, such as O’Brian, Stéphanie, Wałęsa, Þjóðólfr. Conversely, when an implementation is too lenient, it allows names like Ȟěl̀a, or B🅾️b. (See also [Zalgo](https://en.wikipedia.org/wiki/Zalgo_text).) 
+
+Sometimes the constraints are imposed by limitations of outdated software or databases (such as not supporting Unicode characters), or by legal restrictions (such as only accepting names legal in Switzerland on native Swiss passports). 
+
+However, when the limitations are due to unfamiliarity with the kinds of characters that can appear in languages, Unicode properties and CLDR data can help implementers to avoid being either too strict or too lenient.
+
+### Letters
+
+A common restriction is that the letters in a name only come from a single script. That may be too lenient: there are over 1,453 letters in the Latin script in Unicode 17\!
+
+To narrow it down, an implementation may form the union of exemplar characters from a set of languages in CLDR (together with their uppercase equivalents); these include letters and combining marks (accents). Here are some examples:
+
+| Language | Exemplars (Main) |
+| :---- | :---: |
+| Icelandic | a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö |
+| Polish | a ą b c ć d e ę f g h i j k l ł m n ń o ó p r s ś t u w y z ź ż |
+| Arabic | ً ٌ ٍ َ ُ ِ ّ ْ ٰ ء أ ؤ إ ئ ا آ ب ة ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ى ي |
+| Urdu | ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ھ ء ی ے |
+
+Auxiliary exemplars (in the same script) should also be included: they are not part of the core alphabet, but are in use, typically in loan words or names.
+For example, an English reader would not be surprised to see a name such as René or Schröder. 
+
+| Language | Exemplars (Auxiliary) |
+| :---- | :---: |
+| Polish (aux) | à â å ä æ ç é è ê ë î ï ô ö œ q ß ù û ü v x ÿ |
+
+It is often useful to explicitly include the exemplars from multiple languages. 
+For example, an implementation may choose to include the exemplars from official languages of the EU, or for major languages of Africa.
+There is data in CLDR for the populations of languages in countries, and their official status, that may be useful for this.
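
The union-of-exemplars approach can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not part of the specification: the exemplar strings are hard-coded from the tables above (a real implementation would read them from CLDR data), and the names `EXEMPLARS`, `allowed_letters`, and `unexpected_letters` are hypothetical.

```python
import unicodedata

# Exemplar strings copied from the tables above. Assumption: a real
# implementation would load these from CLDR data instead of hard-coding them.
EXEMPLARS = {
    "is": "a á b d ð e é f g h i í j k l m n o ó p r s t u ú v x y ý þ æ ö",
    "pl": "a ą b c ć d e ę f g h i j k l ł m n ń o ó p r s ś t u w y z ź ż",
    "pl-aux": "à â å ä æ ç é è ê ë î ï ô ö œ q ß ù û ü v x ÿ",
}


def allowed_letters(*keys):
    """Union of the chosen exemplar sets plus their uppercase equivalents."""
    allowed = set()
    for key in keys:
        for item in EXEMPLARS[key].split():
            for form in (item, item.upper()):
                # update() adds each code point, so "ß".upper() == "SS" adds "S".
                allowed.update(unicodedata.normalize("NFC", form))
    return allowed


def unexpected_letters(name, allowed):
    """Return the letters and combining marks in `name` outside `allowed`."""
    name = unicodedata.normalize("NFC", name)
    return [
        ch
        for ch in name
        # General categories L* (letters) and M* (marks); other characters
        # such as punctuation and spaces are checked separately.
        if unicodedata.category(ch)[0] in "LM" and ch not in allowed
    ]
```

With the union of all three sets, `unexpected_letters("Wałęsa", ...)` returns an empty list, while a name containing Ȟ or a stray combining grave accent is reported. Only letters and marks are tested here; spaces and punctuation are left to a separate check.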
+
+### Non-Letters
+
+Names, even for a single name field like the family name, may have spaces, such as “de Silva”. Some additional punctuation characters commonly used in names are provided by the punctuation-person exemplars.
+
+| Language | Exemplars (Punctuation-Person) |
+| :---- | :---: |
+| Polish (punct-person) | , . \- / |
+
+Those may include some variants of the ASCII hyphen; typically the best approach is to normalize them as below.
+
+Examples include: Jean-Luc; Dr. Doom; James Smith Jr., MD
+
+### Normalization
+
+When names are input from the keyboard, they should be normalized before validation. Typically the best foundation for that is Unicode Normalization Form C (NFC). Additional useful normalizations are:
+
+* Replacement of arbitrary sequences of whitespace characters by a single space.  
+  * \\p{whitespace}{2,∞} → U+0020  
+* Replacement of U+2010 HYPHEN and U+2011 NON-BREAKING HYPHEN by U+002D HYPHEN-MINUS.  
+  * \[‐‑\] → \-
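
A minimal Python sketch of these normalization steps, using only the standard library, might look like the following. The function name `normalize_name` is hypothetical. Python's `\s` is used as an approximation of the Unicode White_Space property: it follows `str.isspace()`, which differs slightly (for example, it also matches U+001C..U+001F).

```python
import re
import unicodedata


def normalize_name(raw):
    """Apply NFC, then the whitespace and hyphen replacements listed above."""
    text = unicodedata.normalize("NFC", raw)
    # Collapse any run of whitespace (tabs, no-break spaces, ...) to U+0020.
    # For str patterns, \s matches Unicode whitespace, approximating
    # \p{whitespace}.
    text = re.sub(r"\s+", " ", text)
    # Map U+2010 HYPHEN and U+2011 NON-BREAKING HYPHEN to U+002D HYPHEN-MINUS.
    text = re.sub("[\u2010\u2011]", "-", text)
    return text
```

For example, `normalize_name("Jean\u2010Luc \t Smith")` yields `"Jean-Luc Smith"`. Leading and trailing whitespace is collapsed but not removed; whether to trim it is left to the implementation.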
+
+### Additional possible constraints
+
+Other useful constraints include testing for extremely unusual cases, which may be mistakes or jokes ([Zalgo](https://en.wikipedia.org/wiki/Zalgo_text)). For these it is helpful to transform first into NFD, then apply the tests.
+
+* Too many identical grapheme clusters in a sequence  
+  *  (Tóóóóóm)  
+* Too many non-letters in a row   
+  * (Jean—Luc Jr..,, MD)  
+* Too many combining marks in a row  
+  * Faruq̣̣̈̈
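
The three checks above could be sketched in Python as follows. Several things here are assumptions made for illustration: the thresholds are arbitrary, the names `simple_clusters` and `suspicious_features` are hypothetical, and the cluster splitting (a base character plus its trailing combining marks after NFD) is only a rough stand-in for real grapheme cluster segmentation as defined in UAX #29.

```python
import unicodedata


def simple_clusters(text):
    """Split NFD text into base character + trailing combining marks.

    Rough approximation of grapheme clusters, not the UAX #29 algorithm.
    """
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def suspicious_features(name, max_repeat=3, max_nonletters=3, max_marks=2):
    """Return a sorted list of findings; thresholds are illustrative only."""
    findings = set()
    previous = None
    repeat = 0
    nonletters = 0
    for cluster in simple_clusters(name):
        # Count identical clusters in a row (e.g. the run of ó in "Tóóóóóm").
        repeat = repeat + 1 if cluster == previous else 1
        previous = cluster
        # Count consecutive clusters whose base is not a letter
        # (spaces count as non-letters).
        is_letter = unicodedata.category(cluster[0]).startswith("L")
        nonletters = 0 if is_letter else nonletters + 1
        # Count combining marks stacked within this cluster.
        marks = sum(1 for ch in cluster
                    if unicodedata.category(ch).startswith("M"))
        if repeat > max_repeat:
            findings.add("repeated cluster")
        if nonletters > max_nonletters:
            findings.add("non-letter run")
        if marks > max_marks:
            findings.add("combining-mark run")
    return sorted(findings)
```

With these thresholds, "James Smith Jr., MD" passes (its longest non-letter run is three characters, counting the space), while "Jr..,, MD" is reported. Names in languages that legitimately stack two marks on one letter, such as Vietnamese, also pass with `max_marks=2`.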
+
+For further information, including confusables, mixed script detection, and so on, see [UTS \#39: Unicode Security Mechanisms](https://www.unicode.org/reports/tr39/). 
+
 ## PersonName Data Interface Examples
 
 ### Example 1