gh-135676: Add a summary of source characters #138194
Changes from all commits: 4f2b85b, d9157bb, f085358, a30747f, 300cc8c
@@ -10,12 +10,76 @@ Lexical analysis
 A Python program is read by a *parser*. Input to the parser is a stream of
 :term:`tokens <token>`, generated by the *lexical analyzer* (also known as
 the *tokenizer*).
-This chapter describes how the lexical analyzer breaks a file into tokens.
+This chapter describes how the lexical analyzer produces these tokens.

-Python reads program text as Unicode code points; the encoding of a source file
-can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
-for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
-raised.
+The lexical analyzer determines the program text's :ref:`encoding <encodings>`
+(UTF-8 by default), and decodes the text into
+:ref:`source characters <lexical-source-character>`.
+If the text cannot be decoded, a :exc:`SyntaxError` is raised.
+
+Next, the lexical analyzer uses the source characters to generate a stream of tokens.
+The type of a generated token generally depends on the next source character to
+be processed. Similarly, other special behavior of the analyzer depends on
+the first source character that hasn't yet been processed.
+The following table gives a quick summary of these source characters,
+with links to sections that contain more information.
+
+.. list-table::
+   :header-rows: 1
[Review thread on lines +27 to +28]
Reviewer: In general for list tables it can be useful to alternate list markers, e.g. using …
Author: All my list-tables will do that from now on :)
+
+   * - Character
+     - Next token (or other relevant documentation)
+
+   * - * space
+       * tab
+       * formfeed
+     - * :ref:`Whitespace <whitespace>`
+
+   * - * CR, LF
+     - * :ref:`New line <line-structure>`
+       * :ref:`Indentation <indentation>`
+
+   * - * backslash (``\``)
+     - * :ref:`Explicit line joining <explicit-joining>`
+       * (Also significant in :ref:`string escape sequences <escape-sequences>`)
+
+   * - * hash (``#``)
+     - * :ref:`Comment <comments>`
+
+   * - * quote (``'``, ``"``)
+     - * :ref:`String literal <strings>`
+
+   * - * ASCII letter (``a``-``z``, ``A``-``Z``)
+       * non-ASCII character
+     - * :ref:`Name <identifiers>`
+       * Prefixed :ref:`string or bytes literal <strings>`

[Review thread on the "non-ASCII character" row]
Reviewer: Is 'non-ASCII character' too broad here? Not all characters can form
valid identifiers, especially if expanding to the full Unicode space!
Author: It is broad, but: if the tokenizer sees a non-ASCII character, the next
token can only be a NAME (or error). (Except inside strings/comments, but then
it's not deciding what the next token will be.) If I remember correctly¹, the
tokenizer implementation does lump non-ASCII characters with the letters, and
only checks validity after it parses an identifier-like token.
¹ Maybe I don't, but it certainly could do that :)
[A tokenize sketch after this hunk illustrates this behavior.]

+   * - * underscore (``_``)
+     - * :ref:`Name <identifiers>`
+       * (Can also be part of :ref:`numeric literals <numbers>`)
+
+   * - * number (``0``-``9``)
+     - * :ref:`Numeric literal <numbers>`
+
+   * - * dot (``.``)
+     - * :ref:`Numeric literal <numbers>`
+       * :ref:`Operator <operators>`
+
+   * - * question mark (``?``)
+       * dollar (``$``)
+       *
+         .. (the following uses zero-width space characters to render
+            a literal backquote)
+
+         backquote (`````)
+       * control character
+     - * Error (outside string literals and comments)
+
+   * - * other printing character
+     - * :ref:`Operator or delimiter <operators>`
+
+   * - * end of file
+     - * :ref:`End marker <endmarker-token>`
+

 .. _line-structure:
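A note for readers of this diff: the behavior the new table summarizes can be
observed with the standard library's `tokenize` module. The following is an
illustrative sketch, not part of the proposed change; it also demonstrates the
claim from the review thread above that a non-ASCII source character (outside
strings and comments) begins a NAME token.

```python
import io
import tokenize

# Each snippet starts with a different source character; as the table
# summarizes, that first character determines the kind of the next token.
snippets = [
    "# a comment\n",  # hash        -> COMMENT
    "'text'\n",       # quote       -> STRING
    "héllo = 1\n",    # non-ASCII   -> NAME (validity is checked afterwards)
    "_x = 1\n",       # underscore  -> NAME
    ".5\n",           # dot         -> NUMBER
    "+ 1\n",          # other printing character -> OP
]
for source in snippets:
    readline = io.StringIO(source).readline
    first = next(tokenize.generate_tokens(readline))
    print(f"{source!r:15} -> {tokenize.tok_name[first.type]} {first.string!r}")
```

(`tokenize.generate_tokens` works on text that has already been decoded, so it
starts exactly at the "source characters" stage the new wording describes.)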
@@ -120,6 +184,8 @@ If an encoding is declared, the encoding name must be recognized by Python
 encoding is used for all lexical analysis, including string literals, comments
 and identifiers.

+.. _lexical-source-character:
+
 All lexical analysis, including string literals, comments
 and identifiers, works on Unicode text decoded using the source encoding.
 Any Unicode code point, except the NUL control character, can appear in