Commit 1b4e8fa

Lexical analysis: improve section on names
1 parent 3bace0a commit 1b4e8fa

1 file changed

+59
-45
lines changed

Doc/reference/lexical_analysis.rst

Lines changed: 59 additions & 45 deletions
@@ -272,67 +272,80 @@ possible string that forms a legal token, when read from left to right.
272272

273273
.. _identifiers:
274274

275-
Identifiers and keywords
276-
========================
275+
Names (identifiers and keywords)
276+
================================
277277

278278
.. index:: identifier, name
279279

280280
:data:`~token.NAME` tokens represent *identifiers*, *keywords*, and
281281
*soft keywords*.
282282

283283
Within the ASCII range (U+0001..U+007F), the valid characters for names
284-
include the uppercase and lowercase letters (``A`` through
285-
``Z``), the underscore ``_`` and, except for the first character, the digits
284+
include the uppercase and lowercase letters (``A`` through ``Z`` and ``a`` through
285+
``z``), the underscore ``_`` and, except for the first character, the digits
286286
``0`` through ``9``.
287287

288288
Names must contain at least one character, but have no upper length limit.
289289
Case is significant.
290290

291-
Besizes ``A-Z`` and ``0-9``, names can also use "letter-like" and "number-like"
292-
characters from outside the ASCII range. For these characters, the
293-
classification uses the version of the Unicode Character Database as included
294-
in the :mod:`unicodedata` module.
291+
Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like"
292+
and "number-like" characters from outside the ASCII range, as detailed below.
295293

296-
The exact definition of "letter-like" and "number-like" characters is based on
297-
the Unicode standard annex `UAX-31`_, with elaboration and changes as
298-
defined below. See also :pep:`3131` for further details.
294+
All identifiers are converted into the `normalization form`_ NFKC while
295+
parsing; comparison of identifiers is based on NFKC.
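The effect of NFKC normalization can be observed from Python itself. The following is a minimal sketch, not part of the commit; the variable names are illustrative, and it assumes that the ligature U+FB01 is accepted as the first character of a name.

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI becomes the two letters "fi" under NFKC.
ligature_name = "\ufb01nd"
normalized = unicodedata.normalize("NFKC", ligature_name)

namespace = {}
# The source text spells the name with the ligature...
exec(f"{ligature_name} = 42", namespace)

# ...but the name is stored under its NFKC-normalized spelling.
stored_under_normalized = "find" in namespace
stored_under_ligature = ligature_name in namespace
```

Here ``normalized`` is ``"find"``, and only the normalized spelling appears as a key in ``namespace``.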
299296

300-
All identifiers are converted into the normal form NFKC while parsing;
301-
comparison of identifiers is based on NFKC.
297+
Formally, the first character of a normalized identifier must belong to the
298+
set ``id_start``, which is the union of:
302299

303-
Formally, names are described by the following lexical definitions.
300+
* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
301+
* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
302+
* Unicode category ``<Lt>`` - titlecase letters
303+
* Unicode category ``<Lm>`` - modifier letters
304+
* Unicode category ``<Lo>`` - other letters
305+
* Unicode category ``<Nl>`` - letter numbers
306+
* {``"_"``} - the underscore
307+
* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
308+
to support backwards compatibility
304309

305-
.. productionlist:: python-grammar
306-
NAME: `xid_start` `xid_continue`*
307-
id_start: <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
308-
id_continue: <all characters in `id_start`, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
309-
xid_start: <all characters in `id_start` whose NFKC normalization is in "id_start xid_continue*">
310-
xid_continue: <all characters in `id_continue` whose NFKC normalization is in "id_continue*">
311-
identifier: <`NAME`, except keywords>
312-
313-
The Unicode category codes mentioned above stand for:
314-
315-
* *Lu* - uppercase letters
316-
* *Ll* - lowercase letters
317-
* *Lt* - titlecase letters
318-
* *Lm* - modifier letters
319-
* *Lo* - other letters
320-
* *Nl* - letter numbers
321-
* *Mn* - nonspacing marks
322-
* *Mc* - spacing combining marks
323-
* *Nd* - decimal numbers
324-
* *Pc* - connector punctuations
325-
* *Other_ID_Start* - explicit list of characters in `PropList.txt
326-
<https://www.unicode.org/Public/16.0.0/ucd/PropList.txt>`_ to support backwards
327-
compatibility
328-
* *Other_ID_Continue* - likewise
310+
The remaining characters must belong to the set ``id_continue``, which is the
311+
union of:
312+
313+
* all characters in ``id_start``
314+
* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
315+
* Unicode category ``<Pc>`` - connector punctuations
316+
* Unicode category ``<Mn>`` - nonspacing marks
317+
* Unicode category ``<Mc>`` - spacing combining marks
318+
* ``<Other_ID_Continue>`` - another explicit set of characters in
319+
`PropList.txt`_ to support backwards compatibility
320+
321+
Unicode categories use the version of the Unicode Character Database as
322+
included in the :mod:`unicodedata` module.
323+
324+
These sets are based on the Unicode standard annex `UAX-31`_.
325+
See also :pep:`3131` for further details.
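The two sets can be approximated with :mod:`unicodedata`. This is a rough sketch under stated assumptions: the helper names (``approx_id_start``, ``approx_id_continue``, ``approx_is_name``) are hypothetical, and the ``Other_ID_Start`` and ``Other_ID_Continue`` sets are deliberately left out, so a small number of characters will be classified differently from the real tokenizer.

```python
import unicodedata

# General categories taken from the lists above.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA_CATEGORIES = {"Nd", "Pc", "Mn", "Mc"}


def approx_id_start(ch):
    """Approximate id_start, ignoring Other_ID_Start."""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def approx_id_continue(ch):
    """Approximate id_continue, ignoring Other_ID_Continue."""
    return (
        approx_id_start(ch)
        or unicodedata.category(ch) in ID_CONTINUE_EXTRA_CATEGORIES
    )


def approx_is_name(text):
    """Check a candidate name after NFKC normalization."""
    text = unicodedata.normalize("NFKC", text)
    if not text:
        return False  # names must contain at least one character
    return approx_id_start(text[0]) and all(
        approx_id_continue(ch) for ch in text[1:]
    )
```

For everyday checks, the built-in ``str.isidentifier()`` is the better tool; the sketch only makes the structure of the two sets visible.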
326+
327+
Even more formally, names are described by the following lexical definitions:
328+
329+
.. grammar-snippet::
330+
:group: python-grammar
331+
332+
NAME: `xid_start` `xid_continue`*
333+
id_start: <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start>
334+
id_continue: `id_start` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue>
335+
xid_start: <all characters in `id_start` whose NFKC normalization is
336+
in (`id_start` `xid_continue`*)>
337+
xid_continue: <all characters in `id_continue` whose NFKC normalization is
338+
in (`id_continue`*)>
339+
identifier: <`NAME`, except keywords>
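The last rule, ``identifier: <NAME, except keywords>``, can be mirrored with the standard library. The sketch below is illustrative only; ``is_plain_identifier`` is a hypothetical helper name, built on ``str.isidentifier()`` and ``keyword.iskeyword()``.

```python
import keyword


def is_plain_identifier(text):
    """Return True for a valid name that is not a reserved keyword."""
    return text.isidentifier() and not keyword.iskeyword(text)


results = {
    candidate: is_plain_identifier(candidate)
    for candidate in ("spam", "class", "2fast", "match")
}
```

``"class"`` is rejected because it is a keyword and ``"2fast"`` because it starts with a digit; ``"match"`` passes, since ``keyword.iskeyword()`` does not report soft keywords.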
329340

330341
A non-normative file listing all valid identifier characters for Unicode
331342
16.0.0 can be found at
332343
https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt
333344

334345

335346
.. _UAX-31: https://www.unicode.org/reports/tr31/
347+
.. _PropList.txt: https://www.unicode.org/Public/16.0.0/ucd/PropList.txt
348+
.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
336349

337350

338351
.. _keywords:
@@ -344,7 +357,7 @@ Keywords
344357
single: keyword
345358
single: reserved word
346359

347-
The following identifiers are used as reserved words, or *keywords* of the
360+
The following names are used as reserved words, or *keywords* of the
348361
language, and cannot be used as ordinary identifiers. They must be spelled
349362
exactly as written here:
350363

@@ -368,18 +381,19 @@ Soft Keywords
368381

369382
.. versionadded:: 3.10
370383

371-
Some identifiers are only reserved under specific contexts. These are known as
372-
*soft keywords*. The identifiers ``match``, ``case``, ``type`` and ``_`` can
373-
syntactically act as keywords in certain contexts,
384+
Some names are only reserved under specific contexts. These are known as
385+
*soft keywords*:
386+
387+
- ``match``, ``case``, and ``_``, when used in the :keyword:`match` statement.
388+
- ``type``, when used in the :keyword:`type` statement.
389+
390+
These syntactically act as keywords in their specific contexts,
374391
but this distinction is made at the parser level, not when tokenizing.
375392

376393
As soft keywords, their use in the grammar is possible while still
377394
preserving compatibility with existing code that uses these names as
378395
identifier names.
379396

380-
``match``, ``case``, and ``_`` are used in the :keyword:`match` statement.
381-
``type`` is used in the :keyword:`type` statement.
382-
383397
.. versionchanged:: 3.12
384398
``type`` is now a soft keyword.
385399
