@@ -90,44 +90,122 @@ Notation
9090
9191.. index:: BNF, grammar, syntax, notation
9292
93- The descriptions of lexical analysis and syntax use a modified
94- `Backus–Naur form (BNF) <https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form>`_ grammar
95- notation. This uses the following style of definition:
96-
97- .. productionlist:: notation
98-    name: `lc_letter` (`lc_letter` | "_")*
99-    lc_letter: "a"..."z"
100-
101- The first line says that a ``name`` is an ``lc_letter`` followed by a sequence
102- of zero or more ``lc_letter``\ s and underscores. An ``lc_letter`` in turn is
103- any of the single characters ``'a'`` through ``'z'``. (This rule is actually
104- adhered to for the names defined in lexical and grammar rules in this document.)
105-
106- Each rule begins with a name (which is the name defined by the rule) and
107- ``::=``. A vertical bar (``|``) is used to separate alternatives; it is the
108- least binding operator in this notation. A star (``*``) means zero or more
109- repetitions of the preceding item; likewise, a plus (``+``) means one or more
110- repetitions, and a phrase enclosed in square brackets (``[ ]``) means zero or
111- one occurrences (in other words, the enclosed phrase is optional). The ``*``
112- and ``+`` operators bind as tightly as possible; parentheses are used for
113- grouping. Literal strings are enclosed in quotes. White space is only
114- meaningful to separate tokens. Rules are normally contained on a single line;
115- rules with many alternatives may be formatted alternatively with each line after
116- the first beginning with a vertical bar.
117-
118- .. index:: lexical definitions, ASCII
119-
120- In lexical definitions (as the example above), two more conventions are used:
121- Two literal characters separated by three dots mean a choice of any single
122- character in the given (inclusive) range of ASCII characters. A phrase between
123- angular brackets (``<...>``) gives an informal description of the symbol
124- defined; e.g., this could be used to describe the notion of 'control character'
125- if needed.
126-
127- Even though the notation used is almost the same, there is a big difference
128- between the meaning of lexical and syntactic definitions: a lexical definition
129- operates on the individual characters of the input source, while a syntax
130- definition operates on the stream of tokens generated by the lexical analysis.
131- All uses of BNF in the next chapter ("Lexical Analysis") are lexical
132- definitions; uses in subsequent chapters are syntactic definitions.
133-
93+ The descriptions of lexical analysis and syntax use a grammar notation that
94+ is a mixture of
95+ `EBNF <https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
96+ and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
97+ For example:
98+
99+ .. grammar-snippet::
100+    :group: notation
101+
102+    name: `letter` (`letter` | `digit` | "_")*
103+    letter: "a"..."z" | "A"..."Z"
104+    digit: "0"..."9"
105+
106+ In this example, the first line says that a ``name`` is a ``letter`` followed
107+ by a sequence of zero or more ``letter``\ s, ``digit``\ s, and underscores.
108+ A ``letter`` in turn is any single character from ``'a'`` through ``'z'``
109+ or from ``'A'`` through ``'Z'``; a ``digit`` is any single character from
110+ ``'0'`` through ``'9'``.
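
As an informal cross-check of the example rules, they can be transcribed into a regular expression. This is only a sketch: the constant names and the ``is_name`` helper are hypothetical and are not part of the grammar or of any Python API.

```python
import re

# Hypothetical transcription of the example rules (illustration only).
LETTER = r"[a-zA-Z]"   # letter: "a"..."z" | "A"..."Z"
DIGIT = r"[0-9]"       # digit: "0"..."9"

# name: letter (letter | digit | "_")*
NAME = re.compile(rf"{LETTER}(?:{LETTER}|{DIGIT}|_)*")


def is_name(text: str) -> bool:
    """Return True if the whole string matches the example ``name`` rule."""
    return NAME.fullmatch(text) is not None


print(is_name("spam_1"))  # starts with a letter, then letters/digits/underscores
print(is_name("1spam"))   # starts with a digit, so it is not a ``name``
```

Note that under this example rule a ``name`` cannot begin with an underscore, because the first item must be a ``letter``.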
111+
112+ Each rule begins with a name (which identifies the rule that's being defined)
113+ followed by a colon, ``:``.
114+ The definition to the right of the colon uses the following syntax elements:
115+
116+ * ``name``: A name refers to another rule.
117+   Where possible, it is a link to the rule's definition.
118+
119+ * ``TOKEN``: An uppercase name refers to a :term:`token`.
120+   For the purposes of grammar definitions, tokens are the same as rules.
121+
122+ * ``"text"``, ``'text'``: Text in single or double quotes must match literally
123+   (without the quotes). The type of quote is chosen according to the meaning
124+   of ``text``:
125+
126+   * ``'if'``: A name in single quotes denotes a :ref:`keyword <keywords>`.
127+   * ``"case"``: A name in double quotes denotes a
128+     :ref:`soft-keyword <soft-keywords>`.
129+   * ``'@'``: A non-letter symbol in single quotes denotes an
130+     :py:data:`~token.OP` token, that is, a :ref:`delimiter <delimiters>` or
131+     :ref:`operator <operators>`.
132+
133+ * ``e1 e2``: Items separated only by whitespace denote a sequence.
134+   Here, ``e1`` must be followed by ``e2``.
135+ * ``e1 | e2``: A vertical bar is used to separate alternatives.
136+   It denotes PEG's "ordered choice": if ``e1`` matches, ``e2`` is
137+   not considered.
138+   In traditional PEG notation, this is written as a slash, ``/``, rather than
139+   a vertical bar.
140+   See :pep:`617` for more background and details.
141+ * ``e*``: A star means zero or more repetitions of the preceding item.
142+ * ``e+``: Likewise, a plus means one or more repetitions.
143+ * ``[e]``: A phrase enclosed in square brackets means zero or
144+   one occurrence. In other words, the enclosed phrase is optional.
145+ * ``e?``: A question mark has exactly the same meaning as square brackets:
146+   the preceding item is optional.
147+ * ``(e)``: Parentheses are used for grouping.
148+ * ``"a"..."z"``: Two literal characters separated by three dots mean a choice
149+   of any single character in the given (inclusive) range of ASCII characters.
150+   This notation is only used in
151+   :ref:`lexical definitions <notation-lexical-vs-syntactic>`.
152+ * ``<...>``: A phrase between angle brackets gives an informal description
153+   of the matched symbol (for example, ``<any ASCII character except "\">``),
154+   or an abbreviation that is defined in nearby text (for example, ``<Lu>``).
155+   This notation is only used in
156+   :ref:`lexical definitions <notation-lexical-vs-syntactic>`.
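
The difference between PEG's ordered choice and an unordered alternative is easiest to see in a small experiment. The following is a minimal sketch of a few of the operators listed above, written as plain Python functions. All names here (``literal``, ``sequence``, ``choice``, ``optional``, ``zero_or_more``) are hypothetical helpers made up for this illustration; this is not how CPython's parser is implemented.

```python
from typing import Callable, Optional

# A parser takes (text, pos) and returns the position after the match,
# or None if it does not match at that position.
Parser = Callable[[str, int], Optional[int]]


def literal(s: str) -> Parser:
    """``"text"``: match the given text literally."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def sequence(*items: Parser) -> Parser:
    """``e1 e2``: every item must match, one after another."""
    def parse(text: str, pos: int) -> Optional[int]:
        current: Optional[int] = pos
        for item in items:
            current = item(text, current)
            if current is None:
                return None
        return current
    return parse


def choice(*alternatives: Parser) -> Parser:
    """``e1 | e2``: ordered choice; the first matching alternative wins."""
    def parse(text: str, pos: int) -> Optional[int]:
        for alternative in alternatives:
            end = alternative(text, pos)
            if end is not None:
                return end  # later alternatives are not considered
        return None
    return parse


def optional(item: Parser) -> Parser:
    """``[e]`` or ``e?``: zero or one occurrence."""
    def parse(text: str, pos: int) -> Optional[int]:
        end = item(text, pos)
        return pos if end is None else end
    return parse


def zero_or_more(item: Parser) -> Parser:
    """``e*``: repeat until the item stops matching (or stops advancing)."""
    def parse(text: str, pos: int) -> Optional[int]:
        while True:
            end = item(text, pos)
            if end is None or end == pos:
                return pos
            pos = end
    return parse


# ("a" | "ab") "c" fails on "abc": the choice commits to "a", then "c" != "b".
short_first = sequence(choice(literal("a"), literal("ab")), literal("c"))
# ("ab" | "a") "c" succeeds on "abc" because the longer alternative is tried first.
long_first = sequence(choice(literal("ab"), literal("a")), literal("c"))

print(short_first("abc", 0))
print(long_first("abc", 0))
```

In this sketch, the order of alternatives changes the result, which would not be the case for the unordered ``|`` of traditional BNF.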
157+
158+ The unary operators (``*``, ``+``, ``?``) bind as tightly as possible;
159+ the vertical bar (``|``) binds most loosely.
160+
161+ Whitespace is meaningful only to separate tokens.
162+
163+ Rules are normally contained on a single line, but rules that are too long
164+ may be wrapped:
165+
166+ .. grammar-snippet::
167+    :group: notation
168+
169+    literal: stringliteral | bytesliteral
170+           | integer | floatnumber | imagnumber
171+
172+ Alternatively, rules may be formatted with the first line ending at the colon,
173+ and each alternative beginning with a vertical bar on a new line.
174+ For example:
175+
176+
177+ .. grammar-snippet::
178+    :group: notation-alt
179+
180+    literal:
181+       | stringliteral
182+       | bytesliteral
183+       | integer
184+       | floatnumber
185+       | imagnumber
186+
187+ This does *not* mean that there is an empty first alternative.
188+
189+ .. index:: lexical definitions
190+
191+ .. _notation-lexical-vs-syntactic:
192+
193+ Lexical and Syntactic definitions
194+ ---------------------------------
195+
196+ *Lexical* and *syntactic* analysis differ in what they operate on:
197+ the :term:`lexical analyzer` operates on the individual characters of the
198+ input source, while the *parser* (syntactic analyzer) operates on the stream
199+ of :term:`tokens <token>` generated by the lexical analysis.
200+ However, in some cases the exact boundary between the two phases is a
201+ CPython implementation detail.
202+
203+ The practical difference between the two is that in *lexical* definitions,
204+ all whitespace is significant.
205+ The lexical analyzer :ref:`discards <whitespace>` all whitespace that is not
206+ converted to tokens like :data:`token.INDENT` or :data:`~token.NEWLINE`.
207+ *Syntactic* definitions then use these tokens, rather than source characters.
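
The token stream that syntactic definitions work with can be inspected with the standard library's ``tokenize`` module. The snippet below is a rough illustration only: the sample ``source`` string is made up, and because the exact boundary between the two phases is an implementation detail, the output of ``tokenize`` should be read as an approximation of what the parser sees.

```python
import io
import tokenize

# Hypothetical two-line sample program.
source = "if x:\n    y = 1\n"

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
names = [tokenize.tok_name[tok.type] for tok in tokens]

for tok in tokens:
    # Print each token's type name and the source text it covers.
    print(f"{tokenize.tok_name[tok.type]:<10} {tok.string!r}")

# Operator tokens only; the spaces around "=" do not appear as tokens.
operators = [tok.string for tok in tokens if tok.type == tokenize.OP]
```

The leading whitespace of the second line shows up as an ``INDENT`` token and the line endings as ``NEWLINE`` tokens, while the spaces around ``=`` produce no tokens of their own.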
208+
209+ This documentation uses the same BNF grammar for both styles of definitions.
210+ All uses of BNF in the next chapter (:ref:`lexical`) are lexical definitions;
211+ uses in subsequent chapters are syntactic definitions.