Regex group captures content against its specification #126343

@h-vetinari

Bug report

Bug description:

In the following regex, the capture group appears to return characters that explicitly violate the pattern defined for that group.

>>> import re
>>> re.sub(r".*(c|cxx|fortran)", r"\1", '    - {{ compiler("fortran") }}   # [win]')
'fortran") }}   # [win]'                                                                # does not match (c|cxx|fortran)
>>> re.sub(r".*(c|cxx|fortran).*", r"\1", '    - {{ compiler("fortran") }}   # [win]')  # note trailing `.*`
'fortran'                                                                               # correct capture group content

The first call should either not match at all or, if it does, r"\1" must never include characters that violate the pattern of the capture group. In more detail, the docs for re.sub say:

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.

The first question is whether the pattern matches the string without a trailing .*. Based on my reading of "leftmost non-overlapping occurrences of pattern", the answer is yes, which is consistent with the behaviour shown above.

Once it is considered a match, a capture group must never spuriously pick up characters that go against the respective definition, i.e. the content between the (...) brackets corresponding to the named or numbered group.
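To separate "what did the group capture" from "what does re.sub return", the match object can be inspected directly. The following is a minimal sketch, not part of the original report; it uses re.search on the same pattern and string, and the variable names `s` and `m` are only illustrative. Text outside the matched span is not touched by re.sub, so it may be useful to look at that remainder as well.

```python
import re

s = '    - {{ compiler("fortran") }}   # [win]'

# Same pattern as the first re.sub call, without the trailing `.*`.
m = re.search(r".*(c|cxx|fortran)", s)

# Content of capture group 1 on its own.
print(repr(m.group(1)))    # 'fortran'

# Start and end offsets of the whole match within `s`.
print(m.span())

# Text after the end of the match; re.sub copies unmatched text through as-is.
print(repr(s[m.end():]))   # '") }}   # [win]'
```

Comparing `m.group(1) + s[m.end():]` with the re.sub result shows where each part of the output comes from.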

Another pretty egregious example is where the group isn't even at the end of the pattern, but still ends up picking up unrelated characters:

>>> re.sub(r"^(?P<indent>\s*).*(c|cxx|fortran)", r"\g<indent>", '    - {{ compiler("fortran") }}   # [win]')
'    ") }}   # [win]'
>>> re.sub(r"^(?P<indent>\s*).*(c|cxx|fortran).*", r"\g<indent>", '    - {{ compiler("fortran") }}   # [win]')
'    '

The result should contain only whitespace in the first case as well, since (?P<indent>\s*) clearly should not capture anything else.
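The same kind of check can be applied to the named group. This is again a hedged sketch with illustrative variable names (`s2`, `m2`, `indent`, `tail`), assuming only the standard re module; it prints the content of the `indent` group separately from the part of the string that lies after the match.

```python
import re

s2 = '    - {{ compiler("fortran") }}   # [win]'

# Same pattern as the first named-group call, without the trailing `.*`.
m2 = re.search(r"^(?P<indent>\s*).*(c|cxx|fortran)", s2)

# Content of the named group alone.
indent = m2.group("indent")
print(repr(indent))         # '    '

# Portion of the string that lies after the match.
tail = s2[m2.end():]
print(repr(tail))           # '") }}   # [win]'

# Group content followed by the unmatched remainder.
print(repr(indent + tail))  # '    ") }}   # [win]'
```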

CPython versions tested on:

3.11, 3.13

Operating systems tested on:

Linux, Windows
