@@ -10,12 +10,76 @@ Lexical analysis
 A Python program is read by a *parser*. Input to the parser is a stream of
 :term:`tokens <token>`, generated by the *lexical analyzer* (also known as
 the *tokenizer*).
-This chapter describes how the lexical analyzer breaks a file into tokens.
+This chapter describes how the lexical analyzer produces these tokens.
 
-Python reads program text as Unicode code points; the encoding of a source file
-can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
-for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
-raised.
+The lexical analyzer determines the program text's :ref:`encoding <encodings>`
+(UTF-8 by default), and decodes the text into
+:ref:`source characters <lexical-source-character>`.
+If the text cannot be decoded, a :exc:`SyntaxError` is raised.
+
+Next, the lexical analyzer uses the source characters to generate a stream of tokens.
+The type of a generated token generally depends on the next source character to
+be processed. Similarly, other special behavior of the analyzer depends on
+the first source character that hasn't yet been processed.
+The following table gives a quick summary of these source characters,
+with links to sections that contain more information.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Character
+     - Next token (or other relevant documentation)
+
+   * - * space
+       * tab
+       * formfeed
+     - * :ref:`Whitespace <whitespace>`
+
+   * - * CR, LF
+     - * :ref:`New line <line-structure>`
+       * :ref:`Indentation <indentation>`
+
+   * - * backslash (``\``)
+     - * :ref:`Explicit line joining <explicit-joining>`
+       * (Also significant in :ref:`string escape sequences <escape-sequences>`)
+
+   * - * hash (``#``)
+     - * :ref:`Comment <comments>`
+
+   * - * quote (``'``, ``"``)
+     - * :ref:`String literal <strings>`
+
+   * - * ASCII letter (``a``-``z``, ``A``-``Z``)
+       * non-ASCII character
+     - * :ref:`Name <identifiers>`
+       * Prefixed :ref:`string or bytes literal <strings>`
+
+   * - * underscore (``_``)
+     - * :ref:`Name <identifiers>`
+       * (Can also be part of :ref:`numeric literals <numbers>`)
+
+   * - * number (``0``-``9``)
+     - * :ref:`Numeric literal <numbers>`
+
+   * - * dot (``.``)
+     - * :ref:`Numeric literal <numbers>`
+       * :ref:`Operator <operators>`
+
+   * - * question mark (``?``)
+       * dollar (``$``)
+       *
+         .. (the following uses zero-width space characters to render
+         .. a literal backquote)
+
+         backquote (``​`​``)
+       * control character
+     - * Error (outside string literals and comments)
+
+   * - * other printing character
+     - * :ref:`Operator or delimiter <operators>`
+
+   * - * end of file
+     - * :ref:`End marker <endmarker-token>`
 
 
 .. _line-structure:
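
The token stream this hunk's table summarizes can be observed directly with
the standard library's `tokenize` module. A minimal sketch for illustration,
not part of the patch itself:

```python
import io
import tokenize

# Token names mirror the table's rows: a letter starts a NAME, a digit
# a NUMBER, a hash a COMMENT, and end of input yields ENDMARKER.
source = "x = 42  # answer\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

This prints, in order: NAME, OP, NUMBER, COMMENT, NEWLINE, ENDMARKER.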
@@ -120,6 +184,8 @@ If an encoding is declared, the encoding name must be recognized by Python
 encoding is used for all lexical analysis, including string literals, comments
 and identifiers.
 
+.. _lexical-source-character:
+
 All lexical analysis, including string literals, comments
 and identifiers, works on Unicode text decoded using the source encoding.
 Any Unicode code point, except the NUL control character, can appear in
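
The encoding behavior described in this second hunk (declaration honored,
UTF-8 default, :exc:`SyntaxError` on undecodable input) can be checked
against the stdlib's `tokenize.detect_encoding` and `compile`. A sketch for
illustration, not part of the patch:

```python
import io
import tokenize

# With no BOM or coding declaration, the source encoding defaults to UTF-8.
default, _ = tokenize.detect_encoding(io.BytesIO(b"print('hi')\n").readline)
print(default)  # utf-8

# An explicit coding declaration is honored (names are normalized).
declared, _ = tokenize.detect_encoding(
    io.BytesIO(b"# -*- coding: latin-1 -*-\n").readline)
print(declared)  # iso-8859-1

# Source bytes that cannot be decoded raise SyntaxError at compile time.
try:
    compile(b"x = '\xff'\n", "<bad>", "exec")
except SyntaxError:
    print("SyntaxError raised")
```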