gh-135676: Add a summary of source characters #138194
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,12 +10,81 @@ Lexical analysis | |
A Python program is read by a *parser*. Input to the parser is a stream of | ||
:term:`tokens <token>`, generated by the *lexical analyzer* (also known as | ||
the *tokenizer*). | ||
This chapter describes how the lexical analyzer breaks a file into tokens. | ||
This chapter describes how the lexical analyzer produces these tokens. | ||
|
||
Python reads program text as Unicode code points; the encoding of a source file | ||
can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120` | ||
for details. If the source file cannot be decoded, a :exc:`SyntaxError` is | ||
raised. | ||
.. note:: | ||
|
||
A ":dfn:`stream`" is a *sequence*, in the general sense of the word | ||
(not necessarily a Python :term:`sequence object <sequence>`). | ||
|
||
The lexical analyzer determines the program text's :ref:`encoding <encodings>` | ||
(UTF-8 by default), and decodes the text into | ||
:ref:`source characters <lexical-source-character>`. | ||
If the text cannot be decoded, a :exc:`SyntaxError` is raised. | ||
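The decoding step described above can be approximated from Python code. This is a minimal sketch that uses the standard library's `tokenize.detect_encoding`, not the lexical analyzer itself; the helper name `declared_encoding` is hypothetical and exists only for this illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding name that tokenize.detect_encoding reports.

    Hypothetical helper for illustration; it inspects at most the first
    two lines of the given source bytes.
    """
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# No encoding declaration: the default applies.
print(declared_encoding(b"x = 1\n"))  # utf-8

# An explicit declaration on the first line is honored (and normalized).
print(declared_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))  # iso-8859-1

# Byte 0xff is never valid in UTF-8, so the first line cannot be decoded.
try:
    declared_encoding(b'x = "\xff"\n')
except SyntaxError as exc:
    print("SyntaxError:", exc)
```

The `tokenize` module is a separate, Python-level interface, so its error messages may differ from those produced when a file is actually compiled.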
|
||
|
||
The lexical analyzer then generates a stream of tokens from the source | ||
characters. | ||
The type of each generated token, or other special behavior of the analyzer, | ||
generally depends on the first source character that hasn't yet been processed. | ||
The following table gives a quick summary of these characters, | ||
with links to sections that contain more information. | ||
|
||
|
||
.. list-table:: | ||
:header-rows: 1 | ||
Review comment on lines +27 to +28: In general for list tables it can be useful to alternate list markers, e.g. using
Reply: All my list-tables will do that from now on :)
||
|
||
* * Character | ||
* Next token (or other relevant documentation) | ||
|
||
* * * space | ||
* tab | ||
* formfeed | ||
* * :ref:`Whitespace <whitespace>` | ||
|
||
* * * CR, LF | ||
|
||
* * :ref:`New line <line-structure>` | ||
* :ref:`Indentation <indentation>` | ||
|
||
* * * backslash (``\``) | ||
* * :ref:`Explicit line joining <explicit-joining>` | ||
* (Also significant in :ref:`string escape sequences <escape-sequences>`) | ||
|
||
* * * hash (``#``) | ||
* * :ref:`Comment <comments>` | ||
|
||
* * * quote (``'``, ``"``) | ||
* * :ref:`String literal <strings>` | ||
|
||
* * * ASCII letter (``a``-``z``, ``A``-``Z``) | ||
* non-ASCII character | ||
Review comment: Is 'non-ASCII character' too broad here? Not all characters can form valid identifiers, especially if expanding to the full Unicode space!
Reply: It is broad, but: if the tokenizer sees a non-ASCII character, the next token can only be a NAME (or error). (Except inside strings/comments, but then it's not deciding what the next token will be.) If I remember correctly¹, the tokenizer implementation does lump non-ASCII characters with the letters, and only checks validity after it parses an identifier-like token.
¹ Maybe I don't, but it certainly could do that :)
||
* * :ref:`Name <identifiers>` | ||
* Prefixed :ref:`string or bytes literal <strings>` | ||
|
||
* * * underscore (``_``) | ||
* * :ref:`Name <identifiers>` | ||
* (Can also be part of :ref:`numeric literals <numbers>`) | ||
|
||
* * * number (``0``-``9``) | ||
* * :ref:`Numeric literal <numbers>` | ||
|
||
* * * dot (``.``) | ||
* * :ref:`Numeric literal <numbers>` | ||
* :ref:`Operator <operators>` | ||
|
||
* * * question mark (``?``) | ||
* dollar (``$``) | ||
* | ||
.. (the following uses zero-width-joiner characters to render | ||
.. a literal backquote) | ||
|
||
backquote (`````) | ||
|
||
* control character | ||
* * Error (outside string literals and comments) | ||
|
||
* * * other printing character | ||
* * :ref:`Operator or delimiter <operators>` | ||
|
||
* * * end of file | ||
* * :ref:`End marker <endmarker-token>` | ||
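The table's mapping from first character to token type can be spot-checked with the standard library's `tokenize` module. This is a rough sketch under the assumption that `tokenize` mirrors the lexical analyzer for these simple inputs; the helper `first_token` and the sample strings are made up for illustration.

```python
import io
import tokenize


def first_token(source):
    """Return the type name of the first token tokenize reports for source.

    Hypothetical helper for illustration only.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return tokenize.tok_name[next(tokens).type]


# Each sample starts with a character from a different row of the table.
samples = {
    "spam\n": "ASCII letter",
    "_x\n": "underscore",
    "b'raw'\n": "ASCII letter used as a bytes prefix",
    "'text'\n": "quote",
    "42\n": "digit",
    ".5\n": "dot starting a numeric literal",
    "# note\n": "hash",
    "+ 1\n": "other printing character",
}

for source, description in samples.items():
    print(f"{description:40} {first_token(source)}")
```

Names and numeric literals come back as `NAME` and `NUMBER`, both quoted and prefixed literals as `STRING`, the comment as `COMMENT`, and the operator as the generic `OP` type.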
|
||
|
||
.. _line-structure: | ||
|
@@ -120,6 +189,8 @@ If an encoding is declared, the encoding name must be recognized by Python | |
encoding is used for all lexical analysis, including string literals, comments | ||
and identifiers. | ||
|
||
.. _lexical-source-character: | ||
|
||
All lexical analysis, including string literals, comments | ||
and identifiers, works on Unicode text decoded using the source encoding. | ||
Any Unicode code point, except the NUL control character, can appear in | ||
|
Review comment: I'm not sure this note is needed?
Reply: I agree with @AA-Turner. Stream and sequence are both overloaded terms that may be better unpacked by the reader in context.
Reply: OK; I've removed it