From 4f2b85b5b90f6221ab407b5258bcd1c57d8519ee Mon Sep 17 00:00:00 2001
From: Petr Viktorin
Date: Wed, 27 Aug 2025 17:41:15 +0200
Subject: [PATCH 1/5] gh-135676: Add a summary of source characters

---
 Doc/reference/lexical_analysis.rst | 81 ++++++++++++++++++++++++++++--
 1 file changed, 76 insertions(+), 5 deletions(-)

diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
index e320eedfa67a27..75e77bc1767a22 100644
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -10,12 +10,81 @@ Lexical analysis
 A Python program is read by a *parser*. Input to the parser is a stream of
 :term:`tokens `, generated by the *lexical analyzer* (also known as
 the *tokenizer*).
-This chapter describes how the lexical analyzer breaks a file into tokens.
+This chapter describes how the lexical analyzer produces these tokens.
 
-Python reads program text as Unicode code points; the encoding of a source file
-can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
-for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
-raised.
+.. note::
+
+   A ":dfn:`stream`" is a *sequence*, in the general sense of the word
+   (not necessarily a Python :term:`sequence object `).
+
+The lexical analyzer determines the program text's :ref:`encoding `
+(UTF-8 by default), and decodes the text into
+:ref:`source characters `.
+If the text cannot be decoded, a :exc:`SyntaxError` is raised.
+
+The lexical analyzer then generates a stream of tokens from the source
+characters.
+The type of each generated token, or other special behavior of the analyzer,
+generally depends on the first source character that hasn't yet been processed.
+The following table gives a quick summary of these characters,
+with links to sections that contain more information.
+
+.. list-table::
+   :header-rows: 1
+
+   * * Character
+     * Next token (or other relevant documentation)
+
+   * * * space
+       * tab
+       * formfeed
+     * * :ref:`Whitespace `
+
+   * * * CR, LF
+     * * :ref:`New line `
+       * :ref:`Indentation `
+
+   * * * backslash (``\``)
+     * * :ref:`Explicit line joining `
+       * (Also significant in :ref:`string escape sequences `)
+
+   * * * hash (``#``)
+     * * :ref:`Comment `
+
+   * * * quote (``'``, ``"``)
+     * * :ref:`String literal `
+
+   * * * ASCII letter (``a``-``z``, ``A``-``Z``)
+       * non-ASCII character
+     * * :ref:`Name `
+       * Prefixed :ref:`string or bytes literal `
+
+   * * * underscore (``_``)
+     * * :ref:`Name `
+       * (Can also be part of :ref:`numeric literals `)
+
+   * * * number (``0``-``9``)
+     * * :ref:`Numeric literal `
+
+   * * * dot (``.``)
+     * * :ref:`Numeric literal `
+       * :ref:`Operator `
+
+   * * * question mark (``?``)
+       * dollar (``$``)
+       *
+         .. (the following uses zero-width-joiner characters to render
+         .. a literal backquote)
+
+         backquote (``‍`‍``)
+       * control character
+     * * Error (outside string literals and comments)
+
+   * * * other printing character
+     * * :ref:`Operator or delimiter `
+
+   * * * end of file
+     * * :ref:`End marker `
 
 .. _line-structure:
 
@@ -120,6 +189,8 @@ If an encoding is declared, the encoding name must be recognized by Python
 encoding is used for all lexical analysis, including string literals, comments
 and identifiers.
 
+.. _lexical-source-character:
+
 All lexical analysis, including string literals, comments and identifiers,
 works on Unicode text decoded using the source encoding.
 Any Unicode code point, except the NUL control character, can appear in

From d9157bb923e4065158a3d1c52ff435681dfa1599 Mon Sep 17 00:00:00 2001
From: Petr Viktorin
Date: Wed, 3 Sep 2025 16:33:00 +0200
Subject: [PATCH 2/5] Use zero-width space instead of joiner

---
 Doc/reference/lexical_analysis.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
index 75e77bc1767a22..543ebc6bb83d73 100644
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -73,10 +73,10 @@ with links to sections that contain more information.
    * * * question mark (``?``)
        * dollar (``$``)
        *
-         .. (the following uses zero-width-joiner characters to render
+         .. (the following uses zero-width space characters to render
          .. a literal backquote)
 
-         backquote (``‍`‍``)
+         backquote (``​`​``)
        * control character
      * * Error (outside string literals and comments)
 

From f085358ebb6ce91e3e5a68ee8282e3162269aaee Mon Sep 17 00:00:00 2001
From: Petr Viktorin
Date: Wed, 8 Oct 2025 16:05:53 +0200
Subject: [PATCH 3/5] Update Doc/reference/lexical_analysis.rst

Co-authored-by: Carol Willing
---
 Doc/reference/lexical_analysis.rst | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
index 543ebc6bb83d73..242318bababac2 100644
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -22,11 +22,11 @@ The lexical analyzer determines the program text's :ref:`encoding `
 :ref:`source characters `.
 If the text cannot be decoded, a :exc:`SyntaxError` is raised.
 
-The lexical analyzer then generates a stream of tokens from the source
-characters.
-The type of each generated token, or other special behavior of the analyzer,
-generally depends on the first source character that hasn't yet been processed.
-The following table gives a quick summary of these characters,
+Next, the lexical analyzer uses the source characters to generate a stream of tokens.
+The type of a generated token generally depends on the next source character to
+be processed. Similarly, other special behavior of the analyzer depends on
+the first source character that hasn't yet been processed.
+The following table gives a quick summary of these source characters,
 with links to sections that contain more information.
 
 .. list-table::

From a30747fcb14f890d35c4effa47039af744780567 Mon Sep 17 00:00:00 2001
From: Petr Viktorin
Date: Wed, 8 Oct 2025 16:22:06 +0200
Subject: [PATCH 4/5] Remove note explaining *stream*

---
 Doc/reference/lexical_analysis.rst | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
index 242318bababac2..5bce92dd39b5a4 100644
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -12,11 +12,6 @@ A Python program is read by a *parser*. Input to the parser is a stream of
 the *tokenizer*).
 This chapter describes how the lexical analyzer produces these tokens.
 
-.. note::
-
-   A ":dfn:`stream`" is a *sequence*, in the general sense of the word
-   (not necessarily a Python :term:`sequence object `).
-
 The lexical analyzer determines the program text's :ref:`encoding `
 (UTF-8 by default), and decodes the text into
 :ref:`source characters `.

From 300cc8ccf795b0e84bd8ce79940272e9a4c51fcc Mon Sep 17 00:00:00 2001
From: Petr Viktorin
Date: Wed, 8 Oct 2025 16:22:39 +0200
Subject: [PATCH 5/5] Alternate list markers in list-table

---
 Doc/reference/lexical_analysis.rst | 52 +++++++++++++++---------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
index 5bce92dd39b5a4..1bbbe2c696973f 100644
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -27,45 +27,45 @@ with links to sections that contain more information.
 .. list-table::
    :header-rows: 1
 
-   * * Character
-     * Next token (or other relevant documentation)
+   * - Character
+     - Next token (or other relevant documentation)
 
-   * * * space
+   * - * space
        * tab
        * formfeed
-     * * :ref:`Whitespace `
+     - * :ref:`Whitespace `
 
-   * * * CR, LF
-     * * :ref:`New line `
+   * - * CR, LF
+     - * :ref:`New line `
        * :ref:`Indentation `
 
-   * * * backslash (``\``)
-     * * :ref:`Explicit line joining `
+   * - * backslash (``\``)
+     - * :ref:`Explicit line joining `
        * (Also significant in :ref:`string escape sequences `)
 
-   * * * hash (``#``)
-     * * :ref:`Comment `
+   * - * hash (``#``)
+     - * :ref:`Comment `
 
-   * * * quote (``'``, ``"``)
-     * * :ref:`String literal `
+   * - * quote (``'``, ``"``)
+     - * :ref:`String literal `
 
-   * * * ASCII letter (``a``-``z``, ``A``-``Z``)
+   * - * ASCII letter (``a``-``z``, ``A``-``Z``)
        * non-ASCII character
-     * * :ref:`Name `
+     - * :ref:`Name `
        * Prefixed :ref:`string or bytes literal `
 
-   * * * underscore (``_``)
-     * * :ref:`Name `
+   * - * underscore (``_``)
+     - * :ref:`Name `
        * (Can also be part of :ref:`numeric literals `)
 
-   * * * number (``0``-``9``)
-     * * :ref:`Numeric literal `
+   * - * number (``0``-``9``)
+     - * :ref:`Numeric literal `
 
-   * * * dot (``.``)
-     * * :ref:`Numeric literal `
+   * - * dot (``.``)
+     - * :ref:`Numeric literal `
        * :ref:`Operator `
 
-   * * * question mark (``?``)
+   * - * question mark (``?``)
        * dollar (``$``)
        *
          .. (the following uses zero-width space characters to render
@@ -73,13 +73,13 @@ with links to sections that contain more information.
 
          backquote (``​`​``)
        * control character
-     * * Error (outside string literals and comments)
+     - * Error (outside string literals and comments)
 
-   * * * other printing character
-     * * :ref:`Operator or delimiter `
+   * - * other printing character
+     - * :ref:`Operator or delimiter `
 
-   * * * end of file
-     * * :ref:`End marker `
+   * - * end of file
+     - * :ref:`End marker `
 
 .. _line-structure:
 