@@ -10,12 +10,81 @@ Lexical analysis
 A Python program is read by a *parser*. Input to the parser is a stream of
 :term:`tokens <token>`, generated by the *lexical analyzer* (also known as
 the *tokenizer*).
-This chapter describes how the lexical analyzer breaks a file into tokens.
+This chapter describes how the lexical analyzer produces these tokens.
 
-Python reads program text as Unicode code points; the encoding of a source file
-can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
-for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
-raised.
+.. note::
+
+   A ":dfn:`stream`" is a *sequence*, in the general sense of the word
+   (not necessarily a Python :term:`sequence object <sequence>`).
+
+The lexical analyzer determines the program text's :ref:`encoding <encodings>`
+(UTF-8 by default), and decodes the text into
+:ref:`source characters <lexical-source-character>`.
+If the text cannot be decoded, a :exc:`SyntaxError` is raised.
+
+The lexical analyzer then generates a stream of tokens from the source
+characters.
+The type of each generated token, or other special behavior of the analyzer,
+generally depends on the first source character that hasn't yet been processed.
+The following table gives a quick summary of these characters,
+with links to sections that contain more information.
+
+.. list-table::
+   :header-rows: 1
+
+   * * Character
+     * Next token (or other relevant documentation)
+
+   * * * space
+       * tab
+       * formfeed
+     * * :ref:`Whitespace <whitespace>`
+
+   * * * CR, LF
+     * * :ref:`New line <line-structure>`
+       * :ref:`Indentation <indentation>`
+
+   * * * backslash (``\``)
+     * * :ref:`Explicit line joining <explicit-joining>`
+       * (Also significant in :ref:`string escape sequences <escape-sequences>`)
+
+   * * * hash (``#``)
+     * * :ref:`Comment <comments>`
+
+   * * * quote (``'``, ``"``)
+     * * :ref:`String literal <strings>`
+
+   * * * ASCII letter (``a``-``z``, ``A``-``Z``)
+       * non-ASCII character
+     * * :ref:`Name <identifiers>`
+       * Prefixed :ref:`string or bytes literal <strings>`
+
+   * * * underscore (``_``)
+     * * :ref:`Name <identifiers>`
+       * (Can also be part of :ref:`numeric literals <numbers>`)
+
+   * * * number (``0``-``9``)
+     * * :ref:`Numeric literal <numbers>`
+
+   * * * dot (``.``)
+     * * :ref:`Numeric literal <numbers>`
+       * :ref:`Operator <operators>`
+
+   * * * question mark (``?``)
+       * dollar (``$``)
+       *
+         .. (the following uses zero-width-joiner characters to render
+         .. a literal backquote)
+
+         backquote (``‍`‍``)
+       * control character
+     * * Error (outside string literals and comments)
+
+   * * * other printing character
+     * * :ref:`Operator or delimiter <operators>`
+
+   * * * end of file
+     * * :ref:`End marker <endmarker-token>`
 
 
 .. _line-structure:
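The token stream sketched in the table above can be observed directly with the standard :mod:`tokenize` module, whose pure-Python tokenizer mirrors the behavior this chapter documents. A minimal sketch (the sample source string is invented for illustration):

```python
import io
import tokenize

# Tokenize a small program and print each token's type and text.
# Per the table above, the first unprocessed character largely
# determines the token: 'x' -> NAME, '4' -> NUMBER, '#' -> COMMENT,
# and end of file -> ENDMARKER.
source = "x = 42  # answer\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

This prints, in order: ``NAME``, ``OP``, ``NUMBER``, ``COMMENT``, ``NEWLINE``, and ``ENDMARKER``.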
@@ -120,6 +189,8 @@ If an encoding is declared, the encoding name must be recognized by Python
 encoding is used for all lexical analysis, including string literals, comments
 and identifiers.
 
+.. _lexical-source-character:
+
 All lexical analysis, including string literals, comments
 and identifiers, works on Unicode text decoded using the source encoding.
 Any Unicode code point, except the NUL control character, can appear in
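The encoding behavior this hunk describes can be checked from Python itself. A sketch using the stdlib :func:`tokenize.detect_encoding` helper (exact :exc:`SyntaxError` messages vary between versions, so only the exception type is checked):

```python
import io
import tokenize

# With no encoding declaration and no BOM, the source encoding
# defaults to UTF-8.
encoding, _lines = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
print(encoding)  # utf-8

# Source bytes that cannot be decoded in that encoding raise
# SyntaxError (b'\xff' is never valid UTF-8).
try:
    compile(b"x = '\xff'\n", "<test>", "exec")
except SyntaxError as exc:
    print("undecodable source:", type(exc).__name__)
```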