diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index e320eedfa67a27..1bbbe2c696973f 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -10,12 +10,76 @@ Lexical analysis A Python program is read by a *parser*. Input to the parser is a stream of :term:`tokens `, generated by the *lexical analyzer* (also known as the *tokenizer*). -This chapter describes how the lexical analyzer breaks a file into tokens. +This chapter describes how the lexical analyzer produces these tokens. -Python reads program text as Unicode code points; the encoding of a source file -can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120` -for details. If the source file cannot be decoded, a :exc:`SyntaxError` is -raised. +The lexical analyzer determines the program text's :ref:`encoding ` +(UTF-8 by default), and decodes the text into +:ref:`source characters `. +If the text cannot be decoded, a :exc:`SyntaxError` is raised. + +Next, the lexical analyzer uses the source characters to generate a stream of tokens. +The type of a generated token generally depends on the next source character to +be processed. Similarly, other special behavior of the analyzer depends on +the first source character that hasn't yet been processed. +The following table gives a quick summary of these source characters, +with links to sections that contain more information. + +.. list-table:: + :header-rows: 1 + + * - Character + - Next token (or other relevant documentation) + + * - * space + * tab + * formfeed + - * :ref:`Whitespace ` + + * - * CR, LF + - * :ref:`New line ` + * :ref:`Indentation ` + + * - * backslash (``\``) + - * :ref:`Explicit line joining ` + * (Also significant in :ref:`string escape sequences `) + + * - * hash (``#``) + - * :ref:`Comment ` + + * - * quote (``'``, ``"``) + - * :ref:`String literal ` + + * - * ASCII letter (``a``-``z``, ``A``-``Z``) + * non-ASCII character + - * :ref:`Name ` + * Prefixed :ref:`string or bytes literal ` + + * - * underscore (``_``) + - * :ref:`Name ` + * (Can also be part of :ref:`numeric literals `) + + * - * number (``0``-``9``) + - * :ref:`Numeric literal ` + + * - * dot (``.``) + - * :ref:`Numeric literal ` + * :ref:`Operator ` + + * - * question mark (``?``) + * dollar (``$``) + * + .. (the following uses zero-width space characters to render + .. a literal backquote) + + backquote (``​`​``) + * control character + - * Error (outside string literals and comments) + + * - * other printing character + - * :ref:`Operator or delimiter ` + + * - * end of file + - * :ref:`End marker ` .. _line-structure: @@ -120,6 +184,8 @@ If an encoding is declared, the encoding name must be recognized by Python encoding is used for all lexical analysis, including string literals, comments and identifiers. +.. _lexical-source-character: + All lexical analysis, including string literals, comments and identifiers, works on Unicode text decoded using the source encoding. Any Unicode code point, except the NUL control character, can appear in