From 958009838ca7d289dce27d6cfa8f12421c6ccdad Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 25 Apr 2023 21:06:37 -0400
Subject: [PATCH 01/24] Remove most uses of the word 'obvious'

---
 Doc/howto/regex.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index c19c48301f5848..bf80235cae47be 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -180,7 +180,7 @@ engine will try to repeat it as many times as possible. If later portions of the
 pattern don't match, the matching engine will then back up and try again with
 fewer repetitions.

-A step-by-step example will make this more obvious. Let's consider the
+A step-by-step example will make this clearer. Let's consider the
 expression ``a[bcd]*b``. This matches the letter ``'a'``, zero or more letters
 from the class ``[bcd]``, and finally ends with a ``'b'``. Now imagine matching
 this RE against the string ``'abcbd'``.
@@ -926,7 +926,7 @@ A more significant feature is named groups: instead of referring to them by
 numbers, groups can be referenced by a name.

 The syntax for a named group is one of the Python-specific extensions:
-``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups
+``(?P<name>...)``. *name* can be used to refer to the group in other contexts. Named groups
 behave exactly like capturing groups, and additionally associate a name
 with a group. The :ref:`match object <match-objects>` methods that deal with
 capturing groups all accept either integers that refer to the group by number
@@ -958,7 +958,7 @@ module::
            r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
            r'"')

-It's obviously much easier to retrieve ``m.group('zonem')``, instead of having
+It's much easier to write ``m.group('zonem')``, instead of having
 to remember to retrieve group 9.
 The syntax for backreferences in an expression such as ``(...)\1`` refers to the

From 808e281569a4aadf4f04f22f922e1fda5bfd1305 Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 25 Apr 2023 21:21:02 -0400
Subject: [PATCH 02/24] Unrecognized escapes now raise a SyntaxWarning, not a DeprecationWarning. Remove use of undefined jargon 'cooked'.

---
 Doc/howto/regex.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index bf80235cae47be..b293d64caa5cda 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -332,7 +332,7 @@ expressions will often be written in Python code using this raw string notation.

 In addition, special escape sequences that are valid in regular expressions,
 but not valid as Python string literals, now result in a
-:exc:`DeprecationWarning` and will eventually become a :exc:`SyntaxError`,
+:exc:`SyntaxWarning` and will eventually become a :exc:`SyntaxError`,
 which means the sequences will be invalid if raw string notation or escaping
 the backslashes isn't used.

@@ -468,9 +468,9 @@ Two pattern methods return all of the matches for a pattern.
    ['12', '11', '10']

 The ``r`` prefix, making the literal a raw string literal, is needed in this
-example because escape sequences in a normal "cooked" string literal that are
-not recognized by Python, as opposed to regular expressions, now result in a
-:exc:`DeprecationWarning` and will eventually become a :exc:`SyntaxError`. See
+example because ``\d`` is not an escape sequence recognized in Python string literals.
+Such unrecognized sequences now produce a
+:exc:`SyntaxWarning` and will eventually become a :exc:`SyntaxError`. See
 :ref:`the-backslash-plague`.
 :meth:`~re.Pattern.findall` has to create the entire list before it can be returned as the

From 88bbe21c07867e09db5d9c28d390b368bb31f822 Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 2 May 2023 21:19:30 -0400
Subject: [PATCH 03/24] Add paragraph break

---
 Doc/howto/regex.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index b293d64caa5cda..51795da5d4328d 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -586,6 +586,7 @@ of each one.
    If your system is configured properly and a French locale is selected,
    certain C functions will tell the program that the byte corresponding to
    ``é`` should also be considered a letter.
+
    Setting the :const:`LOCALE` flag when compiling a regular expression will cause
    the resulting compiled object to use these C functions for ``\w``; this is
    slower, but also enables ``\w+`` to match French words as you'd expect.

From bdf44f20af2fd75dd0e54418fca75c6801c37dd4 Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 2 May 2023 21:20:11 -0400
Subject: [PATCH 04/24] Remove extra parathesis from an example

---
 Doc/howto/regex.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 51795da5d4328d..f0850218294d9e 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -935,7 +935,7 @@ or strings that contain the desired group's name. Named groups are still
 given numbers, so you can retrieve information about a group in two ways::

    >>> p = re.compile(r'(?P<word>\b\w+\b)')
-   >>> m = p.search( '(((( Lots of punctuation )))' )
+   >>> m = p.search( '((( Lots of punctuation )))' )
    >>> m.group('word')
    'Lots'
    >>> m.group(1)

From ce864e2c0ffa408d67b6495ff3c9a0ad6a164f8f Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 2 May 2023 21:32:51 -0400
Subject: [PATCH 05/24] Describe .fullmatch() method

---
 Doc/howto/regex.rst | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index f0850218294d9e..0f68cd74aef041 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -365,14 +365,18 @@ for a complete listing.
 | ``search()``     | Scan through a string, looking for any        |
 |                  | location where this RE matches.               |
 +------------------+-----------------------------------------------+
+| ``fullmatch()``  | Determine if the RE matches the entire string |
+|                  | exactly.                                      |
++------------------+-----------------------------------------------+
 | ``findall()``    | Find all substrings where the RE matches, and |
 |                  | returns them as a list.                       |
 +------------------+-----------------------------------------------+
-| ``finditer()``   | Find all substrings where the RE matches, and |
-|                  | returns them as an :term:`iterator`.          |
+| ``finditer()``   | Find all matches for the RE, and returns      |
+|                  | an :term:`iterator` of                        |
+|                  | :ref:`match objects <match-objects>`.         |
 +------------------+-----------------------------------------------+

-:meth:`~re.Pattern.match` and :meth:`~re.Pattern.search` return ``None`` if no match can be found. If
+:meth:`~re.Pattern.match`, :meth:`~re.Pattern.search`, and :meth:`~re.Pattern.fullmatch` return ``None`` if no match can be found. If
 they're successful, a :ref:`match object <match-objects>` instance is returned,
 containing information about the match: where it starts and ends, the substring
 it matched, and more.
@@ -449,6 +453,16 @@ case. ::
    >>> m.span()
    (4, 11)

+The :meth:`~re.Pattern.fullmatch` method checks if the RE matches the entire
+string exactly::
+
+   >>> p = re.compile('[a-z]+')
+   >>> p.search(' words ')
+   <re.Match object; span=(1, 6), match='words'>
+   >>> p.fullmatch(' textual ') # Fails to match and returns None
+   >>> p.fullmatch('textual')
+   <re.Match object; span=(0, 7), match='textual'>
+
 In actual programs, the most common style is to store the
 :ref:`match object <match-objects>` in a variable, and then check if it was
 ``None``. This usually looks like::
@@ -493,7 +507,7 @@ Module-Level Functions

 You don't have to create a pattern object and call its methods; the
 :mod:`re` module also provides top-level functions called :func:`~re.match`,
-:func:`~re.search`, :func:`~re.findall`, :func:`~re.sub`, and so forth. These functions
+:func:`~re.search`, :func:`~re.fullmatch`, :func:`~re.findall`, :func:`~re.sub`, and so forth. These functions
 take the same arguments as the corresponding pattern method with
 the RE string added as the first argument, and still return either ``None`` or a
 :ref:`match object <match-objects>` instance. ::

From 1eaa7ac6fb44cbb5ab3930e88b24e5ef1a246b35 Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 2 May 2023 21:48:34 -0400
Subject: [PATCH 06/24] Fix bug in doubled-word example, and try to clarify the explanation

---
 Doc/howto/regex.rst | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 0f68cd74aef041..c11fecb08aba8a 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -865,20 +865,29 @@ subgroups, from 1 up to however many there are. ::
    >>> m.groups()
    ('abc', 'b')

-Backreferences in a pattern allow you to specify that the contents of an earlier
-capturing group must also be found at the current location in the string. For
-example, ``\1`` will succeed if the exact contents of group 1 can be found at
-the current position, and fails otherwise. Remember that Python's string
-literals also use a backslash followed by numbers to allow including arbitrary
-characters in a string, so be sure to use a raw string when incorporating
-backreferences in a RE.
+Backreferences in a pattern allow you to specify that the contents of an
+earlier capturing group must also be found at the current location in the
+string. For example, ``\2`` will reference the substring matched by group 2,
+succeeding only if those exact contents are found at the current position
+within the string.
+
+(Remember that Python's string literals also use a backslash followed by
+numbers for including arbitrary characters in a string, so be sure to use a
+raw string when incorporating backreferences in a RE.)

 For example, the following RE detects doubled words in a string. ::

-   >>> p = re.compile(r'\b(\w+)\s+\1\b')
+   >>> p = re.compile(r'\b(\w+)\b\s+\1\b')
    >>> p.search('Paris in the the spring').group()
    'the the'

+The first part of the pattern, ``\b(\w+)\b``, will match an entire word and
+capture the word as group 1. The pattern then matches some whitespace with
+``\s+`` and checks for the word again with ``\1\b``. The second \b is
+necessary to ensure that the backreference is matching an entire word;
+without it, the pattern would match when word #2 contains word #1 as its
+beginning, as in the string "the theropod".
+
 Backreferences like this aren't often useful for just searching through a string
 --- there are few text formats which repeat data in this way --- but you'll soon
 find out that they're *very* useful when performing string substitutions.
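
Note on PATCH 06 (not part of the patch): the claim about the final ``\b`` can be checked with a minimal sketch. It assumes the pattern exactly as written in this patch; the variable names and the variant without the trailing ``\b`` are hypothetical and exist only for comparison.

```python
import re

# Pattern as written in PATCH 06.
doubled = re.compile(r'\b(\w+)\b\s+\1\b')
# Hypothetical variant without the final \b, for comparison only.
no_final_boundary = re.compile(r'\b(\w+)\b\s+\1')

# A genuinely doubled word is found by the patched pattern.
print(doubled.search('Paris in the the spring').group())

# "theropod" merely starts with "the": the final \b rejects it ...
print(doubled.search('the theropod'))

# ... while the variant without it reports a false positive.
print(no_final_boundary.search('the theropod').group())
```

The second call should print ``None``, which is the behaviour the new explanatory paragraph describes.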

From a4038d8cf0f537a8d0bb5433a233532bcdbf7bbd Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 2 May 2023 21:55:53 -0400
Subject: [PATCH 07/24] Clarify discussion of named groups

---
 Doc/howto/regex.rst | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index c11fecb08aba8a..3394b7629be69f 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -950,12 +950,14 @@ A more significant feature is named groups: instead of referring to them by
 numbers, groups can be referenced by a name.

 The syntax for a named group is one of the Python-specific extensions:
-``(?P<name>...)``. *name* can be used to refer to the group in other contexts. Named groups
-behave exactly like capturing groups, and additionally associate a name
-with a group. The :ref:`match object <match-objects>` methods that deal with
-capturing groups all accept either integers that refer to the group by number
-or strings that contain the desired group's name. Named groups are still
-given numbers, so you can retrieve information about a group in two ways::
+``(?P<name>...)``. Named groups behave exactly like capturing groups, and
+additionally associate *name* with the group so that *name* can be used to
+refer to the group in other contexts. Names should look like a Python
+identifier andonly contain letters, digits and underscores. The :ref:`match
+object <match-objects>` methods that deal with capturing groups all accept
+either integers that refer to the group by number or strings that contain the
+desired group's name. Named groups are still given numbers, so you can
+retrieve information about a group in two ways::

    >>> p = re.compile(r'(?P<word>\b\w+\b)')
    >>> m = p.search( '((( Lots of punctuation )))' )

From 662e6466d4bb9eeac16ab22eb9d11763279fceea Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Wed, 9 Aug 2023 19:41:56 -0400
Subject: [PATCH 08/24] Mention := operator

---
 Doc/howto/regex.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 3394b7629be69f..8102f30d395a69 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -474,6 +474,15 @@ In actual programs, the most common style is to store the
    else:
        print('No match')

+Python 3.8 added assignment expressions that shorten the above pattern
+by a line::
+
+   p = re.compile( ... )
+   if (m := p.match( 'string goes here' )):
+       print('Match found: ', m.group())
+   else:
+       print('No match')
+
 Two pattern methods return all of the matches for a pattern.
 :meth:`~re.Pattern.findall` returns a list of matching strings::

From 132b3e6ad8807f75c55c5a81ae5b9b8e45b4359e Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Wed, 9 Aug 2023 20:14:09 -0400
Subject: [PATCH 09/24] Describe how to use flags, and embedded modifiers such as (?x)

---
 Doc/howto/regex.rst | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 8102f30d395a69..9eced6132304f7 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -544,12 +544,22 @@ Compilation Flags
 .. currentmodule:: re

 Compilation flags let you modify some aspects of how regular expressions work.
-Flags are available in the :mod:`re` module under two names, a long name such as
-:const:`IGNORECASE` and a short, one-letter form such as :const:`I`. (If you're
-familiar with Perl's pattern modifiers, the one-letter forms use the same
-letters; the short form of :const:`re.VERBOSE` is :const:`re.X`, for example.)
-Multiple flags can be specified by bitwise OR-ing them; ``re.I | re.M`` sets
-both the :const:`I` and :const:`M` flags, for example.
+They can be passed as an argument to functions such as :func:`re.compile` and
+:func:`re.sub` or you can specify them in the regex pattern.
+
+Flags are available in the :mod:`re` module under two names, a long
+name such as :const:`IGNORECASE` and a short, one-letter form such as
+:const:`I`. Multiple flags can be specified by bitwise OR-ing them;
+``re.IGNORECASE | re.MULTILINE`` sets both the :const:`IGNORECASE` and
+:const:`MULTILINE` flags, for example.
+
+To specify them in the pattern, you can write them as an embedded
+modifier at the start of the pattern that uses the short one-letter
+form: `(?i)` for a single flag or `(?mxi)` to enable multiple flags.
+(If you're familiar with Perl's pattern modifiers, the one-letter
+forms use the same letters; the short form of :const:`re.VERBOSE` is
+:const:`re.X` because Perl calls these "extended regular expressions",
+for example.)

 Here's a table of the available flags, followed by a more detailed explanation
 of each one.

From 05555dff07324ef96ca63e789701d82fcb5599eb Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Wed, 9 Aug 2023 20:21:45 -0400
Subject: [PATCH 10/24] re.sub() now has a flags argument

---
 Doc/howto/regex.rst | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 9eced6132304f7..bdaf8dd13f21f0 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -1265,10 +1265,7 @@ hexadecimal::
    'Call 0xffd2 for printing, 0xc000 for user code.'

 When using the module-level :func:`re.sub` function, the pattern is passed as
-the first argument. The pattern may be provided as an object or as a string; if
-you need to specify regular expression flags, you must either use a
-pattern object as the first parameter, or use embedded modifiers in the
-pattern string, e.g. ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.
+the first argument. The pattern may be provided as an object or as a string.


 Common Problems

From acd1460e647dcce1cba12ce0ab8be6e6f43017db Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Wed, 9 Aug 2023 20:39:12 -0400
Subject: [PATCH 11/24] Make re.sub() and re.split() signature match the current module

---
 Doc/howto/regex.rst | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index bdaf8dd13f21f0..b07f712706cf92 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -1129,13 +1129,14 @@ whitespace or by a fixed string. As you'd expect, there's a module-level
 :func:`re.split` function, too.


-.. method:: .split(string [, maxsplit=0])
+.. method:: .split(string [, maxsplit=0, flags=0])
    :noindex:

    Split *string* by the matches of the regular expression. If capturing
    parentheses are used in the RE, then their contents will also be returned as part
    of the resulting list. If *maxsplit* is nonzero, at most *maxsplit* splits
-   are performed.
+   are performed. The *flags* argument is optional and may contain flag values such as
+   `re.MULTILINE` or `re.VERBOSE`.

 You can limit the number of splits made, by passing a value for *maxsplit*.
 When *maxsplit* is nonzero, at most *maxsplit* splits will be made, and the
@@ -1179,7 +1180,7 @@ Another common task is to find all the matches for a pattern, and replace them
 with a different string. The :meth:`~re.Pattern.sub` method takes a replacement value,
 which can be either a string or a function, and the string to be processed.

-.. method:: .sub(replacement, string[, count=0])
+.. method:: .sub(replacement, string[, count=0, flags=0])
    :noindex:

    Returns the string obtained by replacing the leftmost non-overlapping
@@ -1188,7 +1189,8 @@ which can be either a string or a function, and the string to be processed.

    The optional argument *count* is the maximum number of pattern occurrences to be
    replaced; *count* must be a non-negative integer. The default value of 0 means
-   to replace all occurrences.
+   to replace all occurrences. The *flags* argument is also optional and may contain
+   flag values such as `re.MULTILINE` or `re.VERBOSE`.

 Here's a simple example of using the :meth:`~re.Pattern.sub` method. It replaces
 colour names with the word ``colour``::

From f879c88301752f4fce1ff300bf49ddef658a32ba Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Wed, 9 Aug 2023 20:52:06 -0400
Subject: [PATCH 12/24] Move discussion of zero-width assertions, and clarify that repeating them is an error

---
 Doc/howto/regex.rst | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index b07f712706cf92..e92d0a190e8a5d 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -704,7 +704,7 @@ More Pattern Power
 ==================

 So far we've only covered a part of the features of regular expressions. In
-this section, we'll cover some new metacharacters, and how to use groups to
+this section, we'll cover some additional metacharacters and how to
 retrieve portions of the text that was matched.


@@ -713,16 +713,8 @@ retrieve portions of the text that was matched.
 More Metacharacters
 -------------------

-There are some metacharacters that we haven't covered yet. Most of them will be
-covered in this section.
-
-Some of the remaining metacharacters to be discussed are :dfn:`zero-width
-assertions`. They don't cause the engine to advance through the string;
-instead, they consume no characters at all, and simply succeed or fail. For
-example, ``\b`` is an assertion that the current position is located at a word
-boundary; the position isn't changed by the ``\b`` at all. This means that
-zero-width assertions should never be repeated, because if they match once at a
-given location, they can obviously be matched an infinite number of times.
+There are more metacharacters that provide different capabilities. The first one
+allows matching two possible sub-patterns.

 ``|``
    Alternation, or the "or" operator. If *A* and *B* are regular expressions,
@@ -734,6 +726,17 @@ given location, they can obviously be matched an infinite number of times.
    To match a literal ``'|'``, use ``\|``, or enclose it inside a character class,
    as in ``[|]``.

+The following metacharacters are all :dfn:`zero-width assertions`.
+They don't cause the engine to advance through the string;
+instead, they consume no characters at all and simply succeed or fail. For
+example, ``\b`` is an assertion that the current position is located at a word
+boundary; the position isn't changed by the ``\b`` at all.
+
+Zero-width assertions can't be repeated, because if they match once at
+a given location, they could be matched an infinite number of times,
+so it's meaningless to repeat them. A pattern such as `^*` will raise
+an exception when you try to compile it.
+
 ``^``
    Matches at the beginning of lines. Unless the :const:`MULTILINE` flag has been
    set, this will only match at the beginning of the string. In :const:`MULTILINE`

From d9e8ddf0585cac3b9582c6797c8af1c4a8c23094 Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Wed, 23 Aug 2023 20:40:52 -0400
Subject: [PATCH 13/24] Move fullmatch() above match(), and re-word this table a bit

---
 Doc/howto/regex.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index e92d0a190e8a5d..caded3039d8e86 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -359,21 +359,21 @@ for a complete listing.
 +------------------+-----------------------------------------------+
 | Method/Attribute | Purpose                                       |
 +==================+===============================================+
+| ``fullmatch()``  | Determine if the RE matches the entire string |
+|                  | exactly.                                      |
++------------------+-----------------------------------------------+
 | ``match()``      | Determine if the RE matches at the beginning  |
 |                  | of the string.                                |
 +------------------+-----------------------------------------------+
 | ``search()``     | Scan through a string, looking for any        |
-|                  | location where this RE matches.               |
-+------------------+-----------------------------------------------+
-| ``fullmatch()``  | Determine if the RE matches the entire string |
-|                  | exactly.                                      |
+|                  | location where the RE matches.                |
 +------------------+-----------------------------------------------+
 | ``findall()``    | Find all substrings where the RE matches, and |
 |                  | returns them as a list.                       |
 +------------------+-----------------------------------------------+
-| ``finditer()``   | Find all matches for the RE, and returns      |
-|                  | an :term:`iterator` of                        |
-|                  | :ref:`match objects <match-objects>`.         |
+| ``finditer()``   | Returns an :term:`iterator` yielding          |
+|                  | :ref:`match objects <match-objects>` for all  |
+|                  | matches of the RE.                            |
 +------------------+-----------------------------------------------+

 :meth:`~re.Pattern.match`, :meth:`~re.Pattern.search`, and :meth:`~re.Pattern.fullmatch` return ``None`` if no match can be found. If

From e3709811094447da81a47a46abeb71620686ef9b Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 24 Sep 2024 21:20:18 -0400
Subject: [PATCH 14/24] Fix some lint-detected markup issues

---
 Doc/howto/regex.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 76f5119bc2ddbe..5fa798f87fb734 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -558,7 +558,7 @@ name such as :const:`IGNORECASE` and a short, one-letter form such as

 To specify them in the pattern, you can write them as an embedded
 modifier at the start of the pattern that uses the short one-letter
-form: `(?i)` for a single flag or `(?mxi)` to enable multiple flags.
+form: ``(?i)`` for a single flag or ``(?mxi)`` to enable multiple flags.
 (If you're familiar with Perl's pattern modifiers, the one-letter
 forms use the same letters; the short form of :const:`re.VERBOSE` is
 :const:`re.X` because Perl calls these "extended regular expressions",
@@ -737,7 +737,7 @@ boundary; the position isn't changed by the ``\b`` at all.

 Zero-width assertions can't be repeated, because if they match once at
 a given location, they could be matched an infinite number of times,
-so it's meaningless to repeat them. A pattern such as `^*` will raise
+so it's meaningless to repeat them. A pattern such as ``^*`` will raise
 an exception when you try to compile it.

 ``^``
@@ -1142,7 +1142,7 @@ whitespace or by a fixed string. As you'd expect, there's a module-level
    parentheses are used in the RE, then their contents will also be returned as part
    of the resulting list. If *maxsplit* is nonzero, at most *maxsplit* splits
    are performed. The *flags* argument is optional and may contain flag values such as
-   `re.MULTILINE` or `re.VERBOSE`.
+   ``re.MULTILINE`` or ``re.VERBOSE``.

 You can limit the number of splits made, by passing a value for *maxsplit*.
 When *maxsplit* is nonzero, at most *maxsplit* splits will be made, and the
@@ -1196,7 +1196,7 @@ which can be either a string or a function, and the string to be processed.
    The optional argument *count* is the maximum number of pattern occurrences to be
    replaced; *count* must be a non-negative integer. The default value of 0 means
    to replace all occurrences. The *flags* argument is also optional and may contain
-   flag values such as `re.MULTILINE` or `re.VERBOSE`.
+   flag values such as ``re.MULTILINE`` or ``re.VERBOSE``.

 Here's a simple example of using the :meth:`~re.Pattern.sub` method. It replaces
 colour names with the word ``colour``::

From e1b084ca3f77c8da0e93847f886bc68166f60e2d Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 24 Sep 2024 21:29:35 -0400
Subject: [PATCH 15/24] Remove \b from double-word example

---
 Doc/howto/regex.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 5fa798f87fb734..caa0eb1f8ad22c 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -902,11 +902,11 @@ raw string when incorporating backreferences in a RE.)

 For example, the following RE detects doubled words in a string. ::

-   >>> p = re.compile(r'\b(\w+)\b\s+\1\b')
+   >>> p = re.compile(r'\b(\w+)\s+\1\b')
    >>> p.search('Paris in the the spring').group()
    'the the'

-The first part of the pattern, ``\b(\w+)\b``, will match an entire word and
+The first part of the pattern, ``\b(\w+)``, will match an entire word and
 capture the word as group 1. The pattern then matches some whitespace with
 ``\s+`` and checks for the word again with ``\1\b``. The second \b is
 necessary to ensure that the backreference is matching an entire word;

From 0e855ea874fc24c900e371013a867aa8d66273e1 Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 24 Sep 2024 21:38:07 -0400
Subject: [PATCH 16/24] Add comments listing future work

---
 Doc/howto/regex.rst | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index caa0eb1f8ad22c..045ffec63e6c66 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -4,6 +4,14 @@
   Regular Expression HOWTO
 ****************************

+.. missing items: re.DEBUG
+
+.. New in 3.11: possessive quantifiers (*+, ++, ?+), {m,n}+, (?>...): atomic match
+
+.. New in 3.12: maxsplit, count, and flags will become keyword-only; examples should be updated
+
+.. (?aiLmsux-aiLmsux: ... ): modifier spans restricting pattern changes
+
 :Author: A.M. Kuchling

 .. TODO:
@@ -755,6 +763,8 @@ an exception when you try to compile it.

    To match a literal ``'^'``, use ``\^``.

+.. clarification: only matches any location in re.MULTILINE mode
+
 ``$``
    Matches at the end of a line, which is defined as either the end of the string,
    or any location followed by a newline character. ::
@@ -998,6 +1008,8 @@ Additionally, you can retrieve named groups as a dictionary with
    >>> m.groupdict()
    {'first': 'Jane', 'last': 'Doe'}

+.. describe .groupindex attribute here
+
 Named groups are handy because they let you use easily remembered names, instead
 of having to remember numbers. Here's an example RE from the :mod:`imaplib`
 module::
@@ -1439,3 +1451,6 @@ and doesn't contain any Python material at all, so it won't be useful as a
 reference for programming in Python. (The first edition covered Python's
 now-removed :mod:`!regex` module, which won't help you much.) Consider checking
 it out from your library.
+
+.. look for more references (regex builders; modern books)
+.. re-examples in the LibRef

From 7a97af3f4481caa99e7f85e88d8bc16896cddd3b Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 24 Sep 2024 21:43:15 -0400
Subject: [PATCH 17/24] Break long line

---
 Doc/howto/regex.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 045ffec63e6c66..0c08cd8375e5c1 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -527,7 +527,8 @@ Module-Level Functions

 You don't have to create a pattern object and call its methods; the
 :mod:`re` module also provides top-level functions called :func:`~re.match`,
-:func:`~re.search`, :func:`~re.fullmatch`, :func:`~re.findall`, :func:`~re.sub`, and so forth. These functions
+:func:`~re.search`, :func:`~re.fullmatch`, :func:`~re.findall`,
+:func:`~re.sub`, and so forth. These functions
 take the same arguments as the corresponding pattern method with
 the RE string added as the first argument, and still return either ``None`` or a
 :ref:`match object <match-objects>` instance. ::

From 961a4ef7a3d5edb34de747c94639dead7bc35e25 Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 24 Sep 2024 21:43:25 -0400
Subject: [PATCH 18/24] Use same word in example

---
 Doc/howto/regex.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 0c08cd8375e5c1..6ca8e775ab7eda 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -468,8 +468,8 @@ The :meth:`~re.Pattern.fullmatch` method checks if the RE matches the entire
 string exactly::

    >>> p = re.compile('[a-z]+')
-   >>> p.search(' words ')
-   <re.Match object; span=(1, 6), match='words'>
+   >>> p.search(' textual ')
+   <re.Match object; span=(1, 8), match='textual'>
    >>> p.fullmatch(' textual ') # Fails to match and returns None
    >>> p.fullmatch('textual')
    <re.Match object; span=(0, 7), match='textual'>

From b0258f6e6ecf625daeab0382305213a76537ae1a Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 24 Sep 2024 21:47:40 -0400
Subject: [PATCH 19/24] Update Doc/howto/regex.rst

Co-authored-by: Guido van Rossum
---
 Doc/howto/regex.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 6ca8e775ab7eda..e9404a4a7f88e9 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -1022,8 +1022,8 @@ module::
            r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
            r'"')

-It's much easier to write ``m.group('zonem')``, instead of having
-to remember to retrieve group 9.
+It's much easier to write ``m.group('zonem')`` instead of having
+to count groups so as to verify we must retrieve group 9.

 The syntax for backreferences in an expression such as ``(...)\1`` refers to the
 number of the group. There's naturally a variant that uses the group name

From bb9497d2411a1453fbd01a4489606a2f67b8c2dc Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 24 Sep 2024 21:48:10 -0400
Subject: [PATCH 20/24] PEP8

Co-authored-by: Guido van Rossum
---
 Doc/howto/regex.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index e9404a4a7f88e9..1164be0f338686 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -996,7 +996,7 @@ desired group's name. Named groups are still given numbers, so you can
 retrieve information about a group in two ways::

    >>> p = re.compile(r'(?P<word>\b\w+\b)')
-   >>> m = p.search( '((( Lots of punctuation )))' )
+   >>> m = p.search('((( Lots of punctuation )))')
    >>> m.group('word')
    'Lots'
    >>> m.group(1)

From 23c2934d23ea99541cc6bb7a00503ce9ad430521 Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 24 Sep 2024 21:49:48 -0400
Subject: [PATCH 21/24] Typo fix

Co-authored-by: Guido van Rossum
---
 Doc/howto/regex.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 1164be0f338686..9e7b5c887507d7 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -989,7 +989,7 @@ The syntax for a named group is one of the Python-specific extensions:
 ``(?P<name>...)``. Named groups behave exactly like capturing groups, and
 additionally associate *name* with the group so that *name* can be used to
 refer to the group in other contexts. Names should look like a Python
-identifier andonly contain letters, digits and underscores. The :ref:`match
+identifier and only contain letters, digits and underscores. The :ref:`match
 object <match-objects>` methods that deal with capturing groups all accept
 either integers that refer to the group by number or strings that contain the
 desired group's name. Named groups are still given numbers, so you can

From 4752488c17f3af132d95aeec4d5a6e881b877dc3 Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 24 Sep 2024 21:50:13 -0400
Subject: [PATCH 22/24] PEP8
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
---
 Doc/howto/regex.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 9e7b5c887507d7..2745174cab1a57 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -488,7 +488,7 @@ In actual programs, the most common style is to store the
 Python 3.8 added assignment expressions that shorten the above pattern
 by a line::

-   p = re.compile( ... )
+   p = re.compile(...)
    if (m := p.match( 'string goes here' )):
        print('Match found: ', m.group())
    else:

From 05c04ec28de482c9599fb59ee112046e414be80a Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 24 Sep 2024 21:50:48 -0400
Subject: [PATCH 23/24] PEP8

---
 Doc/howto/regex.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 6ca8e775ab7eda..1ad5be9e9b02af 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -489,7 +489,7 @@ Python 3.8 added assignment expressions that shorten the above pattern
 by a line::

    p = re.compile( ... )
-   if (m := p.match( 'string goes here' )):
+   if (m := p.match('string goes here')):
        print('Match found: ', m.group())
    else:
        print('No match')

From 900c50bc25cedc09a8143e4537b5068a4bb14223 Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Tue, 24 Sep 2024 21:55:08 -0400
Subject: [PATCH 24/24] Remove somewhat off-topic sentence

---
 Doc/howto/regex.rst | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 6c4ba1d54c631d..d1c9a897062fd0 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -503,9 +503,6 @@ Two pattern methods return all of the matches for a pattern.

 The ``r`` prefix, making the literal a raw string literal, is needed in this
 example because ``\d`` is not an escape sequence recognized in Python string literals.
-Such unrecognized sequences now produce a
-:exc:`SyntaxWarning` and will eventually become a :exc:`SyntaxError`. See
-:ref:`the-backslash-plague`.

 :meth:`~re.Pattern.findall` has to create the entire list before it can be returned as the
 result. The :meth:`~re.Pattern.finditer` method returns a sequence of
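
Note on PATCH 24 (not part of the patch): the context lines above contrast ``findall()`` with ``finditer()`` and explain why the ``r`` prefix is needed for ``\d``. A minimal sketch of that passage follows; the sample text and variable names are illustrative assumptions, not taken from the patch.

```python
import re

# Raw string literal: the backslash in \d reaches the re module unchanged.
p = re.compile(r'\d+')
text = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# findall() builds the complete list of matching strings up front.
numbers = p.findall(text)

# finditer() yields match objects one at a time, so positions are available too.
spans = [m.span() for m in p.finditer(text)]

print(numbers)
print(spans)
```

Both calls visit the same three matches; ``finditer()`` is the better fit when the match positions matter or when the list of results would be large.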