Description
Bug report
Bug description:
This bug report is about capture groups in the following regex appearing to return characters that the group's own pattern cannot match.
>>> import re
>>> re.sub(r".*(c|cxx|fortran)", r"\1", ' - {{ compiler("fortran") }} # [win]')
'fortran") }} # [win]' # does not match (c|cxx|fortran)
>>> re.sub(r".*(c|cxx|fortran).*", r"\1", ' - {{ compiler("fortran") }} # [win]') # note trailing `.*`
'fortran' # correct capture group content
The first call should either not match at all or, if it does match, r"\1" must never include characters that violate the pattern of the capture group. In more detail, the docs for re.sub say:
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
The first question is whether the pattern matches the string without a trailing .*. Based on my reading of "leftmost non-overlapping occurrences of pattern", the answer is yes, which matches the behaviour shown above.
Once it is considered a match, a capture group must never spuriously pick up characters that go against its definition, i.e. the content between the (...) brackets corresponding to the named or numbered group.
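One way to check this independently of the re.sub output is to look at the match object directly. The following is a minimal sketch and was not part of the original session; the names `text` and `m` are introduced here for illustration. It prints the content of group 1, the slice of the input covered by the overall match, and the input that lies after the match.

```python
import re

text = ' - {{ compiler("fortran") }} # [win]'

# Same pattern as the first call above, without the trailing `.*`.
m = re.search(r".*(c|cxx|fortran)", text)

# Content of capture group 1 on its own.
print(repr(m.group(1)))

# Slice of the input covered by the whole match.
print(repr(text[m.start():m.end()]))

# Input that lies after the end of the match.
print(repr(text[m.end():]))
```

Comparing these three values with the re.sub result shows which characters come from group 1 and which come from input outside the match, which re.sub copies through unchanged.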
Another pretty egregious example is where the group isn't even at the end of the pattern, but still ends up picking up unrelated characters:
>>> re.sub(r"^(?P<indent>\s*).*(c|cxx|fortran)", r"\g<indent>", ' - {{ compiler("fortran") }} # [win]')
' ") }} # [win]'
>>> re.sub(r"^(?P<indent>\s*).*(c|cxx|fortran).*", r"\g<indent>", ' - {{ compiler("fortran") }} # [win]')
' '
The result should contain only whitespace in the first call as well, since (?P<indent>\s*) clearly should not capture anything else.
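The named group can be checked in the same way. This is a sketch under the assumption that the input has a single leading space, as in the string literal shown above; `text`, `m`, `indent`, and `tail` are names introduced here. It separates the content of the indent group from the input that follows the overall match, then rebuilds the result by hand.

```python
import re

text = ' - {{ compiler("fortran") }} # [win]'

# Same pattern as the first call above, without the trailing `.*`.
m = re.match(r"^(?P<indent>\s*).*(c|cxx|fortran)", text)

# Content of the named group by itself.
indent = m.group("indent")
print(repr(indent))

# Input that lies after the end of the overall match.
tail = text[m.end():]
print(repr(tail))

# Replacement for the matched part, followed by the unmatched tail.
print(repr(indent + tail))
```

If the last value equals the re.sub output, the non-whitespace characters in that output come from `tail` and not from the indent group.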
CPython versions tested on:
3.11, 3.13
Operating systems tested on:
Linux, Windows