
Commit 4f2b85b

gh-135676: Add a summary of source characters
1 parent 0dbbf61 commit 4f2b85b

File tree

1 file changed (+76, -5 lines)


Doc/reference/lexical_analysis.rst

Lines changed: 76 additions & 5 deletions
@@ -10,12 +10,81 @@ Lexical analysis
 A Python program is read by a *parser*. Input to the parser is a stream of
 :term:`tokens <token>`, generated by the *lexical analyzer* (also known as
 the *tokenizer*).
-This chapter describes how the lexical analyzer breaks a file into tokens.
+This chapter describes how the lexical analyzer produces these tokens.
 
-Python reads program text as Unicode code points; the encoding of a source file
-can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
-for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
-raised.
+.. note::
+
+   A ":dfn:`stream`" is a *sequence*, in the general sense of the word
+   (not necessarily a Python :term:`sequence object <sequence>`).
+
+The lexical analyzer determines the program text's :ref:`encoding <encodings>`
+(UTF-8 by default), and decodes the text into
+:ref:`source characters <lexical-source-character>`.
+If the text cannot be decoded, a :exc:`SyntaxError` is raised.
+
+The lexical analyzer then generates a stream of tokens from the source
+characters.
+The type of each generated token, or other special behavior of the analyzer,
+generally depends on the first source character that hasn't yet been processed.
+The following table gives a quick summary of these characters,
+with links to sections that contain more information.
+
+.. list-table::
+   :header-rows: 1
+
+   * * Character
+     * Next token (or other relevant documentation)
+
+   * * * space
+       * tab
+       * formfeed
+     * * :ref:`Whitespace <whitespace>`
+
+   * * * CR, LF
+     * * :ref:`New line <line-structure>`
+       * :ref:`Indentation <indentation>`
+
+   * * * backslash (``\``)
+     * * :ref:`Explicit line joining <explicit-joining>`
+       * (Also significant in :ref:`string escape sequences <escape-sequences>`)
+
+   * * * hash (``#``)
+     * * :ref:`Comment <comments>`
+
+   * * * quote (``'``, ``"``)
+     * * :ref:`String literal <strings>`
+
+   * * * ASCII letter (``a``-``z``, ``A``-``Z``)
+       * non-ASCII character
+     * * :ref:`Name <identifiers>`
+       * Prefixed :ref:`string or bytes literal <strings>`
+
+   * * * underscore (``_``)
+     * * :ref:`Name <identifiers>`
+       * (Can also be part of :ref:`numeric literals <numbers>`)
+
+   * * * number (``0``-``9``)
+     * * :ref:`Numeric literal <numbers>`
+
+   * * * dot (``.``)
+     * * :ref:`Numeric literal <numbers>`
+       * :ref:`Operator <operators>`
+
+   * * * question mark (``?``)
+       * dollar (``$``)
+       *
+         .. (the following uses zero-width-joiner characters to render
+         .. a literal backquote)
+
+         backquote (``‍`‍``)
+       * control character
+     * * Error (outside string literals and comments)
+
+   * * * other printing character
+     * * :ref:`Operator or delimiter <operators>`
+
+   * * * end of file
+     * * :ref:`End marker <endmarker-token>`
 
 
 .. _line-structure:
@@ -120,6 +189,8 @@ If an encoding is declared, the encoding name must be recognized by Python
 encoding is used for all lexical analysis, including string literals, comments
 and identifiers.
 
+.. _lexical-source-character:
+
 All lexical analysis, including string literals, comments
 and identifiers, works on Unicode text decoded using the source encoding.
 Any Unicode code point, except the NUL control character, can appear in