Commit 4606120
encukou, StanFromIreland, blaisep, MichaByte, and KeithTheEE committed
Simplify Names section

Co-authored-by: Stan Ulbrych <[email protected]>
Co-authored-by: Blaise Pabon <[email protected]>
Co-authored-by: Micha Albert <[email protected]>
Co-authored-by: KeithTheEE <[email protected]>

1 parent 59a6f9d commit 4606120

1 file changed

+82
-58
lines changed

Doc/reference/lexical_analysis.rst

Lines changed: 82 additions & 58 deletions
Original file line number | Diff line number | Diff line change
@@ -386,73 +386,29 @@ Names (identifiers and keywords)
386386
:data:`~token.NAME` tokens represent *identifiers*, *keywords*, and
387387
*soft keywords*.
388388

389-
Within the ASCII range (U+0001..U+007F), the valid characters for names
390-
include the uppercase and lowercase letters (``A-Z`` and ``a-z``),
391-
the underscore ``_`` and, except for the first character, the digits
392-
``0`` through ``9``.
389+
Names are composed of the following characters:
390+
391+
* Uppercase and lowercase letters (``A-Z`` and ``a-z``)
392+
* The underscore (``_``)
393+
* Digits (``0`` through ``9``), which cannot appear as the first character
394+
* Non-ASCII characters. Valid names may only contain "letter-like" and
395+
"digit-like" characters; see :ref:`lexical-names-nonascii` for details.
393396

394397
Names must contain at least one character, but have no upper length limit.
395398
Case is significant.
396399

397-
Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like"
398-
and "number-like" characters from outside the ASCII range, as detailed below.
399-
400-
All identifiers are converted into the `normalization form`_ NFKC while
401-
parsing; comparison of identifiers is based on NFKC.
402-
403-
Formally, the first character of a normalized identifier must belong to the
404-
set ``id_start``, which is the union of:
405-
406-
* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
407-
* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
408-
* Unicode category ``<Lt>`` - titlecase letters
409-
* Unicode category ``<Lm>`` - modifier letters
410-
* Unicode category ``<Lo>`` - other letters
411-
* Unicode category ``<Nl>`` - letter numbers
412-
* {``"_"``} - the underscore
413-
* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
414-
to support backwards compatibility
415-
416-
The remaining characters must belong to the set ``id_continue``, which is the
417-
union of:
418-
419-
* all characters in ``id_start``
420-
* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
421-
* Unicode category ``<Pc>`` - connector punctuations
422-
* Unicode category ``<Mn>`` - nonspacing marks
423-
* Unicode category ``<Mc>`` - spacing combining marks
424-
* ``<Other_ID_Continue>`` - another explicit set of characters in
425-
`PropList.txt`_ to support backwards compatibility
426-
427-
Unicode categories use the version of the Unicode Character Database as
428-
included in the :mod:`unicodedata` module.
429-
430-
These sets are based on the Unicode standard annex `UAX-31`_.
431-
See also :pep:`3131` for further details.
432-
433-
Even more formally, names are described by the following lexical definitions:
400+
Formally, names are described by the following lexical definitions:
434401

435402
.. grammar-snippet::
436403
:group: python-grammar
437404

438-
NAME: `xid_start` `xid_continue`*
439-
id_start: <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start>
440-
id_continue: `id_start` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue>
441-
xid_start: <all characters in `id_start` whose NFKC normalization is
442-
in (`id_start` `xid_continue`*)">
443-
xid_continue: <all characters in `id_continue` whose NFKC normalization is
444-
in (`id_continue`*)">
445-
identifier: <`NAME`, except keywords>
446-
447-
A non-normative listing of all valid identifier characters as defined by
448-
Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
449-
Character Database.
450-
405+
NAME: `name_start` `name_continue`*
406+
name_start: "a".."z" | "A".."Z" | "_" | <non-ASCII character>
407+
name_continue: `name_start` | "0".."9"
408+
identifier: <`NAME`, except keywords>
451409

452-
.. _UAX-31: https://www.unicode.org/reports/tr31/
453-
.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
454-
.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
455-
.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
410+
Note that not all names matched by this grammar are valid; see
411+
:ref:`lexical-names-nonascii` for details.
456412

457413

458414
.. _keywords:
@@ -555,6 +511,74 @@ characters:
555511
:ref:`atom-identifiers`.
556512

557513

514+
.. _lexical-names-nonascii:
515+
516+
Non-ASCII characters in names
517+
-----------------------------
518+
519+
Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can use "letter-like"
520+
and "number-like" characters from outside the ASCII range,
521+
as detailed in this section.
522+
523+
All names are converted into the `normalization form`_ NFKC while parsing.
524+
This means that some typographic variants of characters are
525+
converted to their "basic" form, for example::
526+
527+
>>> nᵘₘᵇₑʳ = 3
528+
>>> number
529+
3
530+
531+
.. note::
532+
533+
Normalization is done at the lexical level only.
534+
Run-time functions that take names as *strings* generally do not normalize
535+
their arguments.
536+
For example, the variable defined above is accessible in the
537+
:func:`globals` dictionary as ``globals()["number"]`` but not
538+
``globals()["nᵘₘᵇₑʳ"]``.
539+
540+
The first character of a normalized identifier must be "letter-like".
541+
Formally, this means it must belong to the set ``id_start``,
542+
which is the union of:
543+
544+
* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
545+
* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
546+
* Unicode category ``<Lt>`` - titlecase letters
547+
* Unicode category ``<Lm>`` - modifier letters
548+
* Unicode category ``<Lo>`` - other letters
549+
* Unicode category ``<Nl>`` - letter numbers
550+
* {``"_"``} - the underscore
551+
* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
552+
to support backwards compatibility
553+
554+
The remaining characters must be "letter-like" or "digit-like".
555+
Formally, they must belong to the set ``id_continue``, which is the union of:
556+
557+
* ``id_start`` (see above)
558+
* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
559+
* Unicode category ``<Pc>`` - connector punctuations
560+
* Unicode category ``<Mn>`` - nonspacing marks
561+
* Unicode category ``<Mc>`` - spacing combining marks
562+
* ``<Other_ID_Continue>`` - another explicit set of characters in
563+
`PropList.txt`_ to support backwards compatibility
564+
565+
Unicode categories use the version of the Unicode Character Database as
566+
included in the :mod:`unicodedata` module.
567+
568+
These sets are based on the Unicode standard annex `UAX-31`_.
569+
See also :pep:`3131` for further details.
570+
571+
A non-normative listing of all valid identifier characters as defined by
572+
Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
573+
Character Database.
574+
575+
576+
.. _UAX-31: https://www.unicode.org/reports/tr31/
577+
.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
578+
.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
579+
.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
580+
581+
558582
.. _literals:
559583

560584
Literals
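
The behaviour described in the new "Non-ASCII characters in names" section can be sketched with the standard library alone. This is a rough illustration, not CPython's tokenizer: ``looks_like_name``, ``is_start``, ``is_continue`` and the two category sets are hypothetical names introduced here. The check uses only the general categories listed in the diff and ignores ``Other_ID_Start``, ``Other_ID_Continue`` and the extra ``xid_*`` rules, so it may disagree with the real lexer on unusual characters.

```python
import unicodedata

# General categories the diff lists for id_start, plus the extra ones
# allowed by id_continue. Other_ID_Start / Other_ID_Continue are omitted
# (an assumption that keeps the sketch short).
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA = {"Nd", "Pc", "Mn", "Mc"}


def is_start(ch):
    # "Letter-like": one of the id_start categories, or the underscore.
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def is_continue(ch):
    # "Letter-like" or "digit-like": id_start plus the extra categories.
    return is_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


def looks_like_name(text):
    """Approximate name check; hypothetical helper, not the real tokenizer."""
    normalized = unicodedata.normalize("NFKC", text)
    if not normalized:
        return False
    return is_start(normalized[0]) and all(
        is_continue(ch) for ch in normalized[1:]
    )


# The example name from the diff: NFKC folds the typographic variants.
fancy = "nᵘₘᵇₑʳ"
normalized = unicodedata.normalize("NFKC", fancy)

# Source code is normalized when compiled, but dictionary keys passed as
# strings at run time are not.
namespace = {}
exec(fancy + " = 3", namespace)
stored_under_plain = "number" in namespace
stored_under_fancy = fancy in namespace
```

Here ``normalized`` is ``"number"``, and the assignment is stored under that key rather than under the original spelling, matching the note about ``globals()`` in the diff. For simple inputs such as ``"_x1"``, ``"2fast"`` and ``"a-b"`` the helper agrees with ``str.isidentifier()``, which is the more reliable check in practice.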
