
Commit 1c1a0bd

miss-islington, encukou, willingc, StanFromIreland, and blaisep authored
[3.14] gh-135676: Add a summary of source characters (GH-138194) (GH-139781)
(cherry picked from commit 59a6f9d)

Co-authored-by: Petr Viktorin
Co-authored-by: Carol Willing
Co-authored-by: Stan Ulbrych
Co-authored-by: Blaise Pabon
Co-authored-by: Micha Albert
Co-authored-by: KeithTheEE
1 parent 14c923c commit 1c1a0bd

1 file changed: +71 -5 lines changed


Doc/reference/lexical_analysis.rst

Lines changed: 71 additions & 5 deletions
@@ -10,12 +10,76 @@ Lexical analysis
 A Python program is read by a *parser*. Input to the parser is a stream of
 :term:`tokens <token>`, generated by the *lexical analyzer* (also known as
 the *tokenizer*).
-This chapter describes how the lexical analyzer breaks a file into tokens.
+This chapter describes how the lexical analyzer produces these tokens.
 
-Python reads program text as Unicode code points; the encoding of a source file
-can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
-for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
-raised.
+The lexical analyzer determines the program text's :ref:`encoding <encodings>`
+(UTF-8 by default), and decodes the text into
+:ref:`source characters <lexical-source-character>`.
+If the text cannot be decoded, a :exc:`SyntaxError` is raised.
+
+Next, the lexical analyzer uses the source characters to generate a stream of tokens.
+The type of a generated token generally depends on the next source character to
+be processed. Similarly, other special behavior of the analyzer depends on
+the first source character that hasn't yet been processed.
+The following table gives a quick summary of these source characters,
+with links to sections that contain more information.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Character
+     - Next token (or other relevant documentation)
+
+   * - * space
+       * tab
+       * formfeed
+     - * :ref:`Whitespace <whitespace>`
+
+   * - * CR, LF
+     - * :ref:`New line <line-structure>`
+       * :ref:`Indentation <indentation>`
+
+   * - * backslash (``\``)
+     - * :ref:`Explicit line joining <explicit-joining>`
+       * (Also significant in :ref:`string escape sequences <escape-sequences>`)
+
+   * - * hash (``#``)
+     - * :ref:`Comment <comments>`
+
+   * - * quote (``'``, ``"``)
+     - * :ref:`String literal <strings>`
+
+   * - * ASCII letter (``a``-``z``, ``A``-``Z``)
+       * non-ASCII character
+     - * :ref:`Name <identifiers>`
+       * Prefixed :ref:`string or bytes literal <strings>`
+
+   * - * underscore (``_``)
+     - * :ref:`Name <identifiers>`
+       * (Can also be part of :ref:`numeric literals <numbers>`)
+
+   * - * number (``0``-``9``)
+     - * :ref:`Numeric literal <numbers>`
+
+   * - * dot (``.``)
+     - * :ref:`Numeric literal <numbers>`
+       * :ref:`Operator <operators>`
+
+   * - * question mark (``?``)
+       * dollar (``$``)
+       *
+         .. (the following uses zero-width space characters to render
+         .. a literal backquote)
+
+         backquote (``​`​``)
+       * control character
+     - * Error (outside string literals and comments)
+
+   * - * other printing character
+     - * :ref:`Operator or delimiter <operators>`
+
+   * - * end of file
+     - * :ref:`End marker <endmarker-token>`
 
 
 .. _line-structure:
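
As a rough illustration of the summary table added in this hunk, the sketch below pairs the first character of each token with the token type reported by the standard-library `tokenize` module. It is not part of the commit. The helper names `first_char_token_types` and `raises_syntax_error` are hypothetical, and `tokenize` is used only as a convenient stand-in for the lexical analyzer: unlike the parser's tokenizer, it also reports comments as `COMMENT` tokens.

```python
import io
import token
import tokenize


def first_char_token_types(source):
    """Pair each token's first character with its token type name."""
    readline = io.StringIO(source).readline
    return [
        (tok.string[:1], token.tok_name[tok.type])
        for tok in tokenize.generate_tokens(readline)
    ]


def raises_syntax_error(source):
    """Return True if compiling the source raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return True
    return False


# A letter starts a name, a dot followed by a digit starts a number,
# a hash starts a comment, and the end of input yields the end marker.
pairs = first_char_token_types("x = .5 + 1  # note\n")
for first, name in pairs:
    print(repr(first), name)

# Characters such as "?" and "$" are errors outside strings and comments.
print(raises_syntax_error("a ? b\n"))
print(raises_syntax_error("a $ b\n"))
```

The error cases go through `compile()` rather than `tokenize` because how `tokenize` itself reports such characters has varied between Python versions.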
@@ -120,6 +184,8 @@ If an encoding is declared, the encoding name must be recognized by Python
 encoding is used for all lexical analysis, including string literals, comments
 and identifiers.
 
+.. _lexical-source-character:
+
 All lexical analysis, including string literals, comments
 and identifiers, works on Unicode text decoded using the source encoding.
 Any Unicode code point, except the NUL control character, can appear in
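
The decoding step described in this change can be observed from Python itself. The following is a minimal sketch, not part of the commit: it uses `tokenize.detect_encoding` to show the UTF-8 default and the effect of an encoding declaration, and `compile()` to show that undecodable source bytes raise `SyntaxError`. The helper names `declared_encoding` and `compiles_without_syntax_error` are hypothetical.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding that tokenize detects for the given source bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


def compiles_without_syntax_error(source_bytes):
    """Return False if compiling the source bytes raises SyntaxError."""
    try:
        compile(source_bytes, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# With no declaration, the source is treated as UTF-8.
print(declared_encoding(b"x = 1\n"))

# A declaration on the first line selects another encoding.
print(declared_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))

# The byte 0xFF is not valid UTF-8, so the default decoding fails ...
print(compiles_without_syntax_error(b"s = '\xff'\n"))

# ... but the same byte decodes fine once Latin-1 is declared.
print(compiles_without_syntax_error(b"# -*- coding: latin-1 -*-\ns = '\xff'\n"))
```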
