81 changes: 76 additions & 5 deletions Doc/reference/lexical_analysis.rst
@@ -10,12 +10,81 @@ Lexical analysis
A Python program is read by a *parser*. Input to the parser is a stream of
:term:`tokens <token>`, generated by the *lexical analyzer* (also known as
the *tokenizer*).
This chapter describes how the lexical analyzer breaks a file into tokens.
This chapter describes how the lexical analyzer produces these tokens.
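
For a concrete look at this token stream, the standard library's
:mod:`tokenize` module produces a close approximation of what the lexical
analyzer generates (a minimal sketch)::

    import io
    import tokenize

    # Tokenize a one-line program from an in-memory bytes buffer.
    source = b"x = 1\n"
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))

    # Prints ENCODING, NAME, OP, NUMBER, NEWLINE, ENDMARKER (one per line).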

Python reads program text as Unicode code points; the encoding of a source file
can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
raised.
.. note::

   A ":dfn:`stream`" is a *sequence*, in the general sense of the word
   (not necessarily a Python :term:`sequence object <sequence>`).
Member: I'm not sure this note is needed?

Contributor: I agree with @AA-Turner. Stream and sequence are both overloaded terms that may be better unpacked by the reader in context.

Member Author: OK; I've removed it


The lexical analyzer determines the program text's :ref:`encoding <encodings>`
(UTF-8 by default), and decodes the text into
:ref:`source characters <lexical-source-character>`.
If the text cannot be decoded, a :exc:`SyntaxError` is raised.
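
This detection step can be previewed with :func:`tokenize.detect_encoding`,
which applies the same rules (a BOM or an explicit encoding declaration,
falling back to UTF-8); a minimal sketch, assuming a source file named
``spam.py`` exists::

    import tokenize

    # detect_encoding reads at most the first two lines of the file.
    with open("spam.py", "rb") as f:
        encoding, first_lines = tokenize.detect_encoding(f.readline)
    print(encoding)  # e.g. 'utf-8'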

The lexical analyzer then generates a stream of tokens from the source
characters.
The type of each generated token, or other special behavior of the analyzer,
generally depends on the first source character that hasn't yet been processed.
The following table gives a quick summary of these characters,
with links to sections that contain more information.

.. list-table::
   :header-rows: 1
Comment on lines +27 to +28

Member: In general for list tables it can be useful to alternate list markers, e.g. using - to denote items of the second-level list. Not essential, though.

Member Author: All my list-tables will do that from now on :)


   * * Character
     * Next token (or other relevant documentation)

   * * * space
       * tab
       * formfeed
     * * :ref:`Whitespace <whitespace>`

   * * * CR, LF
     * * :ref:`New line <line-structure>`
       * :ref:`Indentation <indentation>`

   * * * backslash (``\``)
     * * :ref:`Explicit line joining <explicit-joining>`
       * (Also significant in :ref:`string escape sequences <escape-sequences>`)

   * * * hash (``#``)
     * * :ref:`Comment <comments>`

   * * * quote (``'``, ``"``)
     * * :ref:`String literal <strings>`

   * * * ASCII letter (``a``-``z``, ``A``-``Z``)
       * non-ASCII character
     * * :ref:`Name <identifiers>`
       * Prefixed :ref:`string or bytes literal <strings>`

Member: Is 'non-ASCII character' too broad here? Not all characters can form valid identifiers, especially if expanding to the full Unicode space!

Member Author: It is broad, but: if the tokenizer sees a non-ASCII character, the next token can only be a NAME (or error). (Except inside strings/comments, but then it's not deciding what the next token will be.)

If I remember correctly¹, the tokenizer implementation does lump non-ASCII characters with the letters, and only checks validity after it parses an identifier-like token.

¹ Maybe I don't, but it certainly could do that :)

   * * * underscore (``_``)
     * * :ref:`Name <identifiers>`
       * (Can also be part of :ref:`numeric literals <numbers>`)

   * * * number (``0``-``9``)
     * * :ref:`Numeric literal <numbers>`

   * * * dot (``.``)
     * * :ref:`Numeric literal <numbers>`
       * :ref:`Operator <operators>`

   * * * question mark (``?``)
       * dollar (``$``)
       *
         .. (the following uses zero-width-joiner characters to render
            a literal backquote)

         backquote (``‍`‍``)
       * control character
     * * Error (outside string literals and comments)

   * * * other printing character
     * * :ref:`Operator or delimiter <operators>`

   * * * end of file
     * * :ref:`End marker <endmarker-token>`
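
The dispatch in the table above can be checked empirically with the standard
library's :mod:`tokenize` module (a minimal sketch; note that the module
emits an ``ENCODING`` token first, so the token reflecting the first source
character is the second one)::

    import io
    import tokenize

    for src in (b"# comment\n", b"'text'\n", b"42\n", b"name\n"):
        toks = list(tokenize.tokenize(io.BytesIO(src).readline))
        # toks[0] is ENCODING; toks[1] is the first real token
        print(src, tokenize.tok_name[toks[1].type])

    # Expected token types: COMMENT, STRING, NUMBER, NAME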


.. _line-structure:
@@ -120,6 +189,8 @@ If an encoding is declared, the encoding name must be recognized by Python
encoding is used for all lexical analysis, including string literals, comments
and identifiers.
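
For example, a file encoded as Latin-1 can declare its encoding with a
comment in the form given in :pep:`263`, placed on the first or second line
of the file::

    # -*- coding: latin-1 -*-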

.. _lexical-source-character:

All lexical analysis, including string literals, comments
and identifiers, works on Unicode text decoded using the source encoding.
Any Unicode code point, except the NUL control character, can appear in