
Commit 4f2b85b

gh-135676: Add a summary of source characters
1 parent 0dbbf61 commit 4f2b85b

File tree

1 file changed (+76, -5 lines)


Doc/reference/lexical_analysis.rst

Lines changed: 76 additions & 5 deletions
@@ -10,12 +10,81 @@ Lexical analysis
 A Python program is read by a *parser*. Input to the parser is a stream of
 :term:`tokens <token>`, generated by the *lexical analyzer* (also known as
 the *tokenizer*).
-This chapter describes how the lexical analyzer breaks a file into tokens.
+This chapter describes how the lexical analyzer produces these tokens.
 
-Python reads program text as Unicode code points; the encoding of a source file
-can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
-for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
-raised.
+.. note::
+
+   A ":dfn:`stream`" is a *sequence*, in the general sense of the word
+   (not necessarily a Python :term:`sequence object <sequence>`).
+
+The lexical analyzer determines the program text's :ref:`encoding <encodings>`
+(UTF-8 by default), and decodes the text into
+:ref:`source characters <lexical-source-character>`.
+If the text cannot be decoded, a :exc:`SyntaxError` is raised.
+
+The lexical analyzer then generates a stream of tokens from the source
+characters.
+The type of each generated token, or other special behavior of the analyzer,
+generally depends on the first source character that hasn't yet been processed.
+The following table gives a quick summary of these characters,
+with links to sections that contain more information.
+
+.. list-table::
+   :header-rows: 1
+
+   * * Character
+     * Next token (or other relevant documentation)
+
+   * * * space
+       * tab
+       * formfeed
+     * * :ref:`Whitespace <whitespace>`
+
+   * * * CR, LF
+     * * :ref:`New line <line-structure>`
+       * :ref:`Indentation <indentation>`
+
+   * * * backslash (``\``)
+     * * :ref:`Explicit line joining <explicit-joining>`
+       * (Also significant in :ref:`string escape sequences <escape-sequences>`)
+
+   * * * hash (``#``)
+     * * :ref:`Comment <comments>`
+
+   * * * quote (``'``, ``"``)
+     * * :ref:`String literal <strings>`
+
+   * * * ASCII letter (``a``-``z``, ``A``-``Z``)
+       * non-ASCII character
+     * * :ref:`Name <identifiers>`
+       * Prefixed :ref:`string or bytes literal <strings>`
+
+   * * * underscore (``_``)
+     * * :ref:`Name <identifiers>`
+       * (Can also be part of :ref:`numeric literals <numbers>`)
+
+   * * * number (``0``-``9``)
+     * * :ref:`Numeric literal <numbers>`
+
+   * * * dot (``.``)
+     * * :ref:`Numeric literal <numbers>`
+       * :ref:`Operator <operators>`
+
+   * * * question mark (``?``)
+       * dollar (``$``)
+       *
+         .. (the following uses zero-width-joiner characters to render
+         .. a literal backquote)
+
+         backquote (``‍`‍``)
+       * control character
+     * * Error (outside string literals and comments)
+
+   * * * other printing character
+     * * :ref:`Operator or delimiter <operators>`
+
+   * * * end of file
+     * * :ref:`End marker <endmarker-token>`
 
 
 .. _line-structure:
@@ -120,6 +189,8 @@ If an encoding is declared, the encoding name must be recognized by Python
 encoding is used for all lexical analysis, including string literals, comments
 and identifiers.
 
+.. _lexical-source-character:
+
 All lexical analysis, including string literals, comments
 and identifiers, works on Unicode text decoded using the source encoding.
 Any Unicode code point, except the NUL control character, can appear in