@@ -386,73 +386,29 @@ Names (identifiers and keywords)
386386:data: `~token.NAME ` tokens represent *identifiers *, *keywords *, and
387387*soft keywords *.
388388
389- Within the ASCII range (U+0001..U+007F), the valid characters for names
390- include the uppercase and lowercase letters (``A-Z `` and ``a-z ``),
391- the underscore ``_ `` and, except for the first character, the digits
392- ``0 `` through ``9 ``.
389+ Names are composed of the following characters:
390+
391+ * Uppercase and lowercase letters (``A-Z `` and ``a-z ``)
392+ * The underscore (``_ ``)
393+ * Digits (``0 `` through ``9 ``), which cannot appear as the first character
394+ * Non-ASCII characters. Valid names may only contain "letter-like" and
395+ "digit-like" characters; see :ref: `lexical-names-nonascii ` for details.
393396
394397Names must contain at least one character, but have no upper length limit.
395398Case is significant.
396399
397- Besides ``A-Z ``, ``a-z ``, ``_ `` and ``0-9 ``, names can also use "letter-like"
398- and "number-like" characters from outside the ASCII range, as detailed below.
399-
400- All identifiers are converted into the `normalization form `_ NFKC while
401- parsing; comparison of identifiers is based on NFKC.
402-
403- Formally, the first character of a normalized identifier must belong to the
404- set ``id_start ``, which is the union of:
405-
406- * Unicode category ``<Lu> `` - uppercase letters (includes ``A `` to ``Z ``)
407- * Unicode category ``<Ll> `` - lowercase letters (includes ``a `` to ``z ``)
408- * Unicode category ``<Lt> `` - titlecase letters
409- * Unicode category ``<Lm> `` - modifier letters
410- * Unicode category ``<Lo> `` - other letters
411- * Unicode category ``<Nl> `` - letter numbers
412- * {``"_" ``} - the underscore
413- * ``<Other_ID_Start> `` - an explicit set of characters in `PropList.txt `_
414- to support backwards compatibility
415-
416- The remaining characters must belong to the set ``id_continue ``, which is the
417- union of:
418-
419- * all characters in ``id_start ``
420- * Unicode category ``<Nd> `` - decimal numbers (includes ``0 `` to ``9 ``)
421- * Unicode category ``<Pc> `` - connector punctuations
422- * Unicode category ``<Mn> `` - nonspacing marks
423- * Unicode category ``<Mc> `` - spacing combining marks
424- * ``<Other_ID_Continue> `` - another explicit set of characters in
425- `PropList.txt `_ to support backwards compatibility
426-
427- Unicode categories use the version of the Unicode Character Database as
428- included in the :mod: `unicodedata ` module.
429-
430- These sets are based on the Unicode standard annex `UAX-31 `_.
431- See also :pep: `3131 ` for further details.
432-
433- Even more formally, names are described by the following lexical definitions:
400+ Formally, names are described by the following lexical definitions:
434401
435402.. grammar-snippet ::
436403 :group: python-grammar
437404
438- NAME: `xid_start ` `xid_continue`*
439- id_start: <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start>
440- id_continue: `id_start ` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue>
441- xid_start: <all characters in `id_start ` whose NFKC normalization is
442- in (`id_start ` `xid_continue`*)">
443- xid_continue: <all characters in `id_continue ` whose NFKC normalization is
444- in (`id_continue`*)">
445- identifier: <`NAME `, except keywords>
446-
447- A non-normative listing of all valid identifier characters as defined by
448- Unicode is available in the `DerivedCoreProperties.txt `_ file in the Unicode
449- Character Database.
450-
405+ NAME: `name_start ` `name_continue`*
406+ name_start: "a".."z" | "A".."Z" | "_" | <non-ASCII character>
407+ name_continue: name_start | "0".."9"
408+ identifier: <`NAME `, except keywords>
451409
452- .. _UAX-31 : https://www.unicode.org/reports/tr31/
453- .. _PropList.txt : https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
454- .. _DerivedCoreProperties.txt : https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
455- .. _normalization form : https://www.unicode.org/reports/tr15/#Norm_Forms
410+ Note that not all names matched by this grammar are valid; see
411+ :ref: `lexical-names-nonascii ` for details.
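
As a rough illustration of why the grammar alone is not sufficient, the
following sketch transcribes the ``NAME`` rule into a regular expression and
compares it with what the compiler accepts. The helper names
``matches_name_grammar`` and ``compiles_as_name`` are hypothetical, chosen for
this example only, and the pattern assumes that "non-ASCII character" means
any code point from U+0080 upward.

```python
import re

# Rough transcription of the NAME grammar: ASCII letters, "_" and any
# non-ASCII character may start a name; ASCII digits may also follow.
NAME_GRAMMAR = re.compile(
    r"[A-Za-z_\u0080-\U0010FFFF][A-Za-z0-9_\u0080-\U0010FFFF]*"
)


def matches_name_grammar(text):
    """Return True if the whole string fits the simplified NAME grammar."""
    return NAME_GRAMMAR.fullmatch(text) is not None


def compiles_as_name(text):
    """Return True if Python accepts the string as an assignment target."""
    try:
        compile(text + " = 1", "<example>", "exec")
    except SyntaxError:
        return False
    return True


# The euro sign is non-ASCII, so the grammar matches it, but it is a
# currency symbol rather than a "letter-like" character.
candidate = "€uro"
grammar_ok = matches_name_grammar(candidate)
actually_ok = compiles_as_name(candidate)
```

Here ``grammar_ok`` is true while ``actually_ok`` is false: the grammar
matches the string, but the additional rules for non-ASCII characters
reject it.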
456412
457413
458414.. _keywords :
@@ -555,6 +511,74 @@ characters:
555511 :ref: `atom-identifiers `.
556512
557513
514+ .. _lexical-names-nonascii :
515+
516+ Non-ASCII characters in names
517+ -----------------------------
518+
519+ Besides ``A-Z ``, ``a-z ``, ``_ `` and ``0-9 ``, names can use "letter-like"
520+ and "digit-like" characters from outside the ASCII range,
521+ as detailed in this section.
522+
523+ All names are converted into the `normalization form `_ NFKC while parsing.
524+ This means that some typographic variants of characters are
525+ converted to their "basic" form, for example::
526+
527+ >>> nᵘₘᵇₑʳ = 3
528+ >>> number
529+ 3
530+
531+ .. note ::
532+
533+ Normalization is done at the lexical level only.
534+ Run-time functions that take names as *strings * generally do not normalize
535+ their arguments.
536+ For example, the variable defined above is accessible in the
537+ :func: `globals ` dictionary as ``globals()["number"] `` but not
538+ ``globals()["nᵘₘᵇₑʳ"] ``.
539+
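The effect described above can be reproduced with
``unicodedata.normalize()``. This is a minimal sketch rather than a
description of how the parser is implemented; the variable names are
illustrative only.

```python
import unicodedata

# The name from the example above, written with superscript and
# subscript letter variants.
written = "nᵘₘᵇₑʳ"

# NFKC folds these compatibility variants into plain ASCII letters.
normalized = unicodedata.normalize("NFKC", written)

# Compile and run an assignment that uses the typographic spelling.
namespace = {}
exec(written + " = 3", namespace)

# The namespace key is the normalized spelling, not the written one.
stored_under_normalized = "number" in namespace
stored_under_written = written in namespace
```

Code that looks up names from strings, for example with ``getattr()`` or a
namespace dictionary, can apply ``unicodedata.normalize("NFKC", ...)`` itself
if it needs to accept the same spellings as source code.
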
540+ The first character of a normalized identifier must be "letter-like".
541+ Formally, this means it must belong to the set ``id_start ``,
542+ which is the union of:
543+
544+ * Unicode category ``<Lu> `` - uppercase letters (includes ``A `` to ``Z ``)
545+ * Unicode category ``<Ll> `` - lowercase letters (includes ``a `` to ``z ``)
546+ * Unicode category ``<Lt> `` - titlecase letters
547+ * Unicode category ``<Lm> `` - modifier letters
548+ * Unicode category ``<Lo> `` - other letters
549+ * Unicode category ``<Nl> `` - letter numbers
550+ * {``"_" ``} - the underscore
551+ * ``<Other_ID_Start> `` - an explicit set of characters in `PropList.txt `_
552+ to support backwards compatibility
553+
554+ The remaining characters must be "letter-like" or "digit-like".
555+ Formally, they must belong to the set ``id_continue ``, which is the union of:
556+
557+ * ``id_start `` (see above)
558+ * Unicode category ``<Nd> `` - decimal numbers (includes ``0 `` to ``9 ``)
559+ * Unicode category ``<Pc> `` - connector punctuations
560+ * Unicode category ``<Mn> `` - nonspacing marks
561+ * Unicode category ``<Mc> `` - spacing combining marks
562+ * ``<Other_ID_Continue> `` - another explicit set of characters in
563+ `PropList.txt `_ to support backwards compatibility
564+
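The two sets can be approximated with ``unicodedata.category()``. The sketch
below is an assumption-laden simplification: it ignores ``<Other_ID_Start>``
and ``<Other_ID_Continue>``, does not apply NFKC normalization, and uses
hypothetical helper names. For real code, ``str.isidentifier()`` is the
reliable check; note that it also returns true for keywords.

```python
import unicodedata

# General categories listed for id_start (the underscore is handled
# separately below).
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Additional general categories allowed after the first character.
ID_CONTINUE_EXTRA = {"Nd", "Pc", "Mn", "Mc"}


def is_id_start(ch):
    """Approximate membership in id_start for a single character."""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def is_id_continue(ch):
    """Approximate membership in id_continue for a single character."""
    return is_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


def looks_like_name(text):
    """Approximate check: first character in id_start, rest in id_continue."""
    if not text:
        return False
    return is_id_start(text[0]) and all(is_id_continue(c) for c in text[1:])


samples = ["café", "_private", "x9", "名前", "9lives", "a-b", "€uro", ""]
results = {s: looks_like_name(s) for s in samples}
```

For these samples the approximation agrees with ``str.isidentifier()``;
it is not guaranteed to agree for every string.
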
565+ Unicode categories use the version of the Unicode Character Database as
566+ included in the :mod: `unicodedata ` module.
567+
568+ These sets are based on the Unicode standard annex `UAX-31 `_.
569+ See also :pep: `3131 ` for further details.
570+
571+ A non-normative listing of all valid identifier characters as defined by
572+ Unicode is available in the `DerivedCoreProperties.txt `_ file in the Unicode
573+ Character Database.
574+
575+
576+ .. _UAX-31 : https://www.unicode.org/reports/tr31/
577+ .. _PropList.txt : https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
578+ .. _DerivedCoreProperties.txt : https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
579+ .. _normalization form : https://www.unicode.org/reports/tr15/#Norm_Forms
580+
581+
558582.. _literals :
559583
560584Literals