Improvements in regular expression doc #114357
base: main
@@ -33,7 +33,8 @@ usage of the backslash in string literals now generate a :exc:`SyntaxWarning`
 and in the future this will become a :exc:`SyntaxError`. This behaviour
 will happen even if it is a valid escape sequence for a regular expression.

-The solution is to use Python's raw string notation for regular expression
+The solution is to use Python's :ref:`raw string notation <raw-string-notation>`
+for regular expression
 patterns; backslashes are not handled in any special way in a string literal
 prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
 ``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
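A quick way to check the raw-string claim in the hunk above is the following minimal sketch. It uses only the standard library; the variable names are illustrative and not part of the PR.

```python
import re

# A raw string keeps the backslash: two characters, '\' and 'n'.
raw = r"\n"
# A normal string literal turns the escape into one newline character.
cooked = "\n"

raw_len = len(raw)        # 2
cooked_len = len(cooked)  # 1

# The re module still interprets the two-character pattern r"\n"
# as "match a newline", so both spellings match the same text.
matches_newline = re.fullmatch(raw, cooked) is not None
```

The raw form is usually preferred for patterns because it stays valid even for sequences such as ``\d`` that are not string-literal escapes.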
@@ -83,6 +84,12 @@ characters, so ``last`` matches the string ``'last'``. (In the rest of this
 section, we'll write RE's in ``this special style``, usually without quotes, and
 strings to be matched ``'in single quotes'``.)

+
+.. _re-special-characters:
+
+Special characters
+^^^^^^^^^^^^^^^^^^
+
 Some characters, like ``'|'`` or ``'('``, are special. Special
 characters either stand for classes of ordinary characters, or affect
 how the regular expressions around them are interpreted.
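To illustrate what "special" means for ``'|'`` and ``'('`` in the paragraph above, here is a small sketch. The sample strings are made up for illustration.

```python
import re

# '|' is special: it separates alternatives.
alt = re.fullmatch(r"cat|dog", "dog")

# Escaped with a backslash, '|' and '(' match themselves.
literal = re.search(r"a\|b\(", "x a|b( y")

# re.escape() backslash-escapes the special characters in arbitrary text.
escaped = re.escape("(a|b)")

# An unescaped '(' on its own is not a valid pattern.
try:
    re.compile("(")
    unbalanced_ok = True
except re.error:
    unbalanced_ok = False
```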
@@ -93,7 +100,6 @@ directly nested. This avoids ambiguity with the non-greedy modifier suffix
 repetition to an inner repetition, parentheses may be used. For example,
 the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.

-
 The special characters are:

 .. index:: single: . (dot); in regular expressions
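The ``(?:a{6})*`` example in the context lines above can be exercised with a short sketch. The helper name ``is_multiple_of_six`` is hypothetical and exists only for this illustration.

```python
import re

# Wrap the inner repetition in a non-capturing group, then repeat the group.
six_as = re.compile(r"(?:a{6})*")

def is_multiple_of_six(text):
    """Return True if text consists of a multiple of six 'a' characters."""
    return six_as.fullmatch(text) is not None

results = {n: is_multiple_of_six("a" * n) for n in (0, 6, 7, 12)}
```

``fullmatch`` is used so that a partial match such as the first six characters of a seven-character string does not count.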
@@ -114,31 +120,33 @@ The special characters are:
 ``$``
 Matches the end of the string or just before the newline at the end of the
 string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
-matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
-only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
-matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
+matches both ``'foo'`` and ``'foobar'``, while the regular expression ``foo$``
+matches
+only ``'foo'``. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
+matches ``'foo2'`` normally, but also ``'foo1'`` in :const:`MULTILINE` mode; searching

I thought the original was easier to read, with the full string being searched given in a different font from the substrings that are found

Firstly, it was inconsistent with the "(In the rest of this section, we’ll write RE’s in this special style, usually without quotes, and strings to be matched 'in single quotes'.)" However, you highlighted that 'strings to be matched' is different from 'the matches'. On the other hand, both are literal strings, and this is a common pattern around all docs. I would like some more opinions here.

+for
 a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
 the newline, and one at the end of the string.

 .. index:: single: * (asterisk); in regular expressions

 ``*``
 Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
-many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
-by any number of 'b's.
+many repetitions as are possible. ``ab*`` will match ``'a'``, ``'ab'``, or
+``'a'`` followed by any number of ``'b'`` s.
Suggested change
-``'a'`` followed by any number of ``'b'`` s.
+``'a'`` followed by any number of ``'b'``\ s.
Done
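For reference while reviewing the wording of the ``$`` and ``*`` entries, the behaviour they describe can be checked with the sketch below. It uses only the example strings already quoted in the hunk; the variable names are illustrative.

```python
import re

text = "foo1\nfoo2\n"

# Without MULTILINE, '$' only matches at the very end or before a final newline.
default_first = re.search(r"foo.$", text).group()

# With MULTILINE, '$' also matches before every newline, so the first hit moves.
multiline_first = re.search(r"foo.$", text, re.MULTILINE).group()
multiline_all = re.findall(r"foo.$", text, re.MULTILINE)

# A bare '$' finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]

# 'ab*' matches 'a' followed by zero or more 'b' characters.
star = [re.fullmatch(r"ab*", s) is not None for s in ("a", "ab", "abbb", "b")]
```

Note that ``re.search`` returns only the first match, so in :const:`MULTILINE` mode it reports ``'foo1'``, while ``re.findall`` reports both ``'foo1'`` and ``'foo2'``.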
-s disconnected again
Done
Can we call them all escape sequences? That would differentiate them better from the multi-character “special character” sequences above.
I suggest changing the next heading to something like “String literal escapes”, and changing this heading from “Special sequences” to “Escape sequences”.
These are the types of the special characters I can think of for REs:
- The single-character metacharacters: $, *, [, ], \, etc, as listed in the how-to https://cpython-previews--114357.org.readthedocs.build/en/114357/howto/regex.html#matching-characters
- Multicharacter syntax built with the metacharacters, like *?, {m,n} and the bracketed extension notation (?. . .)
- “Special sequences” a.k.a. escape sequences, which begin with a backslash. These could be subdivided into
  - Non-alphanumeric, for escaping metacharacters and other syntax: \$, \*, \\, \', \", etc
  - Group references \1–\99
  - Alphanumeric sequences that specify locations to match, or categories of characters: \A, \b, \d, etc
  - String literal escapes: \n, \\, \N{. . .}, \0–\777, etc. Excludes \b and \<newline>.
- Characters only special in “verbose” expressions: whitespace and #
- Additional backslash sequence for re.sub templates: \g<. . .>
- Special characters inside square-bracketed classes/sets [. . .], especially -, ^, ], \b, and reserved [, &&, etc
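A few of the categories in this list can be illustrated with one short sketch each. This is not an exhaustive classification; the sample strings and variable names are invented for the example.

```python
import re

# Non-alphanumeric escape: \$ matches a literal dollar sign.
price = re.search(r"\$\d+", "cost: $42").group()

# Group reference: \1 matches whatever group 1 matched.
doubled = re.search(r"(\w+) \1", "it is is fine").group()

# Alphanumeric sequence: \b is a word boundary in a pattern...
boundary = re.findall(r"\bcat\b", "cat concat cat.")
# ...but inside a character set it stands for the backspace character.
backspace = re.fullmatch(r"[\b]", "\x08") is not None

# Verbose mode: whitespace and '#' comments in the pattern are ignored.
verbose = re.compile(r"""
    \d+   # integer part
    \.    # literal dot
    \d+   # fractional part
""", re.VERBOSE)
number = verbose.search("pi ~ 3.14").group()

# Template-only escape: \g<name> refers to a named group in re.sub().
swapped = re.sub(r"(?P<first>\w+) (?P<second>\w+)",
                 r"\g<second> \g<first>", "hello world")

# Inside a set, a leading '^' negates and '-' forms a range.
outside_range = re.findall(r"[^a-c]", "abcd-")
```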