Simplify Names section

encukou · StanFromIreland · blaisep · encukou · commit 4606120878c1 · 2025-10-08T17:58:54.000+02:00
Co-authored-by: Stan Ulbrych &lt;89152624+StanFromIreland@users.noreply.github.com&gt;
Co-authored-by: Blaise Pabon &lt;blaise@gmail.com&gt;
Co-authored-by: Micha Albert &lt;info@micha.zone&gt;
Co-authored-by: KeithTheEE &lt;kmurrayis@gmail.com&gt;
diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
@@ -386,73 +386,29 @@ Names (identifiers and keywords)
 :data:`~token.NAME` tokens represent *identifiers*, *keywords*, and
 *soft keywords*.
 
-Within the ASCII range (U+0001..U+007F), the valid characters for names
-include the uppercase and lowercase letters (``A-Z`` and ``a-z``),
-the underscore ``_`` and, except for the first character, the digits
-``0`` through ``9``.
+Names are composed of the following characters:
+
+* Uppercase and lowercase letters (``A-Z`` and ``a-z``)
+* The underscore (``_``)
+* Digits (``0`` through ``9``), which cannot appear as the first character
+* Non-ASCII characters. Valid names may only contain "letter-like" and
+  "digit-like" characters; see :ref:`lexical-names-nonascii` for details.
 
 Names must contain at least one character, but have no upper length limit.
 Case is significant.
 
-Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like"
-and "number-like" characters from outside the ASCII range, as detailed below.
-
-All identifiers are converted into the `normalization form`_ NFKC while
-parsing; comparison of identifiers is based on NFKC.
-
-Formally, the first character of a normalized identifier must belong to the
-set ``id_start``, which is the union of:
-
-* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
-* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
-* Unicode category ``<Lt>`` - titlecase letters
-* Unicode category ``<Lm>`` - modifier letters
-* Unicode category ``<Lo>`` - other letters
-* Unicode category ``<Nl>`` - letter numbers
-* {``"_"``} - the underscore
-* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
-  to support backwards compatibility
-
-The remaining characters must belong to the set ``id_continue``, which is the
-union of:
-
-* all characters in ``id_start``
-* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
-* Unicode category ``<Pc>`` - connector punctuations
-* Unicode category ``<Mn>`` - nonspacing marks
-* Unicode category ``<Mc>`` - spacing combining marks
-* ``<Other_ID_Continue>`` - another explicit set of characters in
-  `PropList.txt`_ to support backwards compatibility
-
-Unicode categories use the version of the Unicode Character Database as
-included in the :mod:`unicodedata` module.
-
-These sets are based on the Unicode standard annex `UAX-31`_.
-See also :pep:`3131` for further details.
-
-Even more formally, names are described by the following lexical definitions:
+Formally, names are described by the following lexical definitions:
 
 .. grammar-snippet::
    :group: python-grammar
 
-   NAME:         `xid_start` `xid_continue`*
-   id_start:     <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start>
-   id_continue:  `id_start` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue>
-   xid_start:    <all characters in `id_start` whose NFKC normalization is
-                  in (`id_start` `xid_continue`*)">
-   xid_continue: <all characters in `id_continue` whose NFKC normalization is
-                  in (`id_continue`*)">
-   identifier:   <`NAME`, except keywords>
-
-A non-normative listing of all valid identifier characters as defined by
-Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
-Character Database.
-
+   NAME:          `name_start` `name_continue`*
+   name_start:    "a".."z" | "A".."Z" | "_" | <non-ASCII character>
+   name_continue: name_start | "0".."9"
+   identifier:    <`NAME`, except keywords>
 
-.. _UAX-31: https://www.unicode.org/reports/tr31/
-.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
-.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
-.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
+Note that not all names matched by this grammar are valid; see
+:ref:`lexical-names-nonascii` for details.
 
 
 .. _keywords:
@@ -555,6 +511,74 @@ characters:
    :ref:`atom-identifiers`.
 
 
+.. _lexical-names-nonascii:
+
+Non-ASCII characters in names
+-----------------------------
+
+Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can use "letter-like"
+and "number-like" characters from outside the ASCII range,
+as detailed in this sections.
+
+All names are converted into the `normalization form`_ NFKC while parsing.
+This means that, for example, some typographic variants of characters are
+converted to their "basic" form, for example::
+
+   >>> nᵘₘᵇₑʳ = 3
+   >>> number
+   3
+
+.. note::
+
+   Normalization is done at the lexical level only.
+   Run-time functions that take names as *strings* generally do not normalize
+   their arguments.
+   For example, the variable defined above is accessible in the
+   :func:`globals` dictionary as ``globals()["number"]`` but not
+   ``globals()["nᵘₘᵇₑʳ"]``.
+
+The first character of a normalized identifier must be "letter-like".
+Formally, this means it must belong to the set ``id_start``,
+which is the union of:
+
+* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
+* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
+* Unicode category ``<Lt>`` - titlecase letters
+* Unicode category ``<Lm>`` - modifier letters
+* Unicode category ``<Lo>`` - other letters
+* Unicode category ``<Nl>`` - letter numbers
+* {``"_"``} - the underscore
+* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
+  to support backwards compatibility
+
+The remaining characters must be "letter-like" or "digit-like".
+Formally, they must belong to the set ``id_continue``, which is the union of:
+
+* ``id_start`` (see above)
+* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
+* Unicode category ``<Pc>`` - connector punctuations
+* Unicode category ``<Mn>`` - nonspacing marks
+* Unicode category ``<Mc>`` - spacing combining marks
+* ``<Other_ID_Continue>`` - another explicit set of characters in
+  `PropList.txt`_ to support backwards compatibility
+
+Unicode categories use the version of the Unicode Character Database as
+included in the :mod:`unicodedata` module.
+
+These sets are based on the Unicode standard annex `UAX-31`_.
+See also :pep:`3131` for further details.
+
+A non-normative listing of all valid identifier characters as defined by
+Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
+Character Database.
+
+
+.. _UAX-31: https://www.unicode.org/reports/tr31/
+.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
+.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
+.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
+
+
 .. _literals:
 
 Literals