@@ -10,12 +10,76 @@ Lexical analysis

A Python program is read by a *parser*. Input to the parser is a stream of
:term:`tokens <token>`, generated by the *lexical analyzer* (also known as
the *tokenizer*).
- This chapter describes how the lexical analyzer breaks a file into tokens.
+ This chapter describes how the lexical analyzer produces these tokens.

- Python reads program text as Unicode code points; the encoding of a source file
- can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
- for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
- raised.
+ The lexical analyzer determines the program text's :ref:`encoding <encodings>`
+ (UTF-8 by default), and decodes the text into
+ :ref:`source characters <lexical-source-character>`.
+ If the text cannot be decoded, a :exc:`SyntaxError` is raised.
+
+ Next, the lexical analyzer uses the source characters to generate a stream of tokens.
+ The type of a generated token generally depends on the next source character to
+ be processed. Similarly, other special behavior of the analyzer depends on
+ the first source character that hasn't yet been processed.
+ The following table gives a quick summary of these source characters,
+ with links to sections that contain more information.
+
+ .. list-table::
+    :header-rows: 1
+
+    * - Character
+      - Next token (or other relevant documentation)
+
+    * - * space
+        * tab
+        * formfeed
+      - * :ref:`Whitespace <whitespace>`
+
+    * - * CR, LF
+      - * :ref:`New line <line-structure>`
+        * :ref:`Indentation <indentation>`
+
+    * - * backslash (``\``)
+      - * :ref:`Explicit line joining <explicit-joining>`
+        * (Also significant in :ref:`string escape sequences <escape-sequences>`)
+
+    * - * hash (``#``)
+      - * :ref:`Comment <comments>`
+
+    * - * quote (``'``, ``"``)
+      - * :ref:`String literal <strings>`
+
+    * - * ASCII letter (``a``-``z``, ``A``-``Z``)
+        * non-ASCII character
+      - * :ref:`Name <identifiers>`
+        * Prefixed :ref:`string or bytes literal <strings>`
+
+    * - * underscore (``_``)
+      - * :ref:`Name <identifiers>`
+        * (Can also be part of :ref:`numeric literals <numbers>`)
+
+    * - * number (``0``-``9``)
+      - * :ref:`Numeric literal <numbers>`
+
+    * - * dot (``.``)
+      - * :ref:`Numeric literal <numbers>`
+        * :ref:`Operator <operators>`
+
+    * - * question mark (``?``)
+        * dollar (``$``)
+        *
+          .. (the following uses zero-width space characters to render
+          .. a literal backquote)
+
+          backquote (``​`​``)
+        * control character
+      - * Error (outside string literals and comments)
+
+    * - * other printing character
+      - * :ref:`Operator or delimiter <operators>`
+
+    * - * end of file
+      - * :ref:`End marker <endmarker-token>`


.. _line-structure:
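The token stream this hunk describes can be observed directly with the standard :mod:`tokenize` module. A minimal sketch (shown for illustration, not part of the patch):

```python
import io
import tokenize

# Tokenize a small piece of decoded source text, mirroring the
# lexical analyzer's "stream of tokens" described above.
source = "x = 1  # a comment\n"

for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# Prints, in order: NAME, OP, NUMBER, COMMENT, NEWLINE, ENDMARKER
```

Note how each token's type follows from the next unprocessed source character: ``x`` starts a NAME, ``1`` a NUMBER, ``#`` a COMMENT, just as the summary table lists.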
@@ -120,6 +184,8 @@ If an encoding is declared, the encoding name must be recognized by Python
encoding is used for all lexical analysis, including string literals, comments
and identifiers.

+ .. _lexical-source-character:
+
All lexical analysis, including string literals, comments
and identifiers, works on Unicode text decoded using the source encoding.
Any Unicode code point, except the NUL control character, can appear in
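The decode-failure behaviour referenced above (a :exc:`SyntaxError` when the source bytes cannot be decoded) can be checked with the built-in :func:`compile`, which accepts source as bytes and decodes it like the lexical analyzer does. A minimal sketch, assuming CPython's default UTF-8 source encoding:

```python
# b"\xff" is not valid UTF-8 (the default source encoding), so
# compiling these bytes fails during lexical analysis.
bad_source = b"x = '\xff'\n"

try:
    compile(bad_source, "<string>", "exec")
except SyntaxError as exc:
    print("SyntaxError:", exc.msg)
```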