Merge pull request #877 from ossf/python-adds-z

david-a-wheeler · web-flow · commit 3fdb6d800b7b · 2025-05-20T10:59:20.000-04:00
Python also made changes, let's note that
diff --git a/docs/Correctly-Using-Regular-Expressions-Rationale.md b/docs/Correctly-Using-Regular-Expressions-Rationale.md
@@ -146,6 +146,8 @@ In both BRE and ERE notation, by default “^” means beginning-of-string and 
 
 The [regcomp function (which compiles regular expressions)](https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/) accepts a “REG_NEWLINE” flag, to help text editors search many lines. If REG_NEWLINE is set, the interpretation changes: a “^” matches the zero-length string immediately after a &lt;newline> in string, and “$” matches the zero-length string immediately before a &lt;newline> in string. There’s no way in the POSIX specification to separately match the beginning of a string or the end of a string when REG_NEWLINE is enabled, which is why \A, \Z, and \z were later created by Perl. When validating input from untrusted users the REG_NEWLINE option is normally not used.
 
+In 2025 the Austin Group (which maintains the POSIX specification) [added \A and \z to POSIX for EREs](https://www.austingroupbugs.net/view.php?id=1919) and recommended that BREs also implement them.
+
 ### Perl
 
 [Perl documentation for perlre (perl regular expressions)](https://perldoc.perl.org/perlre) describes its support for regular expressions. Version 5.38.2 documents the following, where “/m” is the “multiple lines” modifier (the multiple lines modifier is _not_ enabled by default):
@@ -185,6 +187,8 @@ Python3’s regular expression library “re” has the method “fullmatch” w
 
 As of 2024-03-24, [Tutorialspoint incorrectly claims that “$ matches the end of a string” in Python](https://www.tutorialspoint.com/How-to-match-at-the-end-of-string-in-python-using-Regular-Expression#). StackOverflow answer [12187839](https://stackoverflow.com/a/12187839) is also incorrect.
 
+In 2025 Python decided to add support for [\z as end-of-string](https://github.com/python/cpython/issues/133306) and modified various libraries to use it.
+
 ### RE2
 
 [RE2](https://github.com/google/re2) is a regular expression library using a non-backtracking implementation approach. Such implementations don’t have catastrophic cases and are sometimes orders of magnitude faster, but they’re less featureful (e.g., they don’t support backreferences). RE2’s speed is compelling in many cases, so RE2 ended up being used in many places.
@@ -506,15 +510,19 @@ be nearly universal:
   [Regular Expression Buffer Boundaries for ECMAScript](https://github.com/tc39/proposal-regexp-buffer-boundaries)
   to add \A and \z to ECMAScript/JavaScript, and it advanced to stage 2,
   but it seems to be stuck there. We intend to see if we can help it advance.
-* Python: Python supports \A, but it uses the unique \Z instead of the
-  \z used everywhere else for end-of-string.
-  We'll ask to see if \z could be supported in addition to \Z for end-of-string.
-  We'll probably start with a minor git request (as this is a really
-  small change), otherwise we'll create a PEP, depending on the desires
-  of the Python community.
+* Python: Python supports \A, but historically it has used
+  the rare \Z instead of the \z used almost everywhere else for end-of-string.
   In current versions of Python3 a \z in a regex raises an exception, so
   adding \z for end-of-string would be a backwards-compatible addition.
-  See [CPython issue 133306](https://github.com/python/cpython/issues/133306).
+  In [CPython issue 133306](https://github.com/python/cpython/issues/133306)
+  it was agreed to add \z in addition to \Z to match end-of-string,
+  which was implemented in
+  [PR 133314](https://github.com/python/cpython/pull/133314).
+  They noted that Tcl also uses \Z instead of \z (another group to contact).
+  Our thanks to the Python community!
+* Tcl: Tcl uses `\A` and `\Z`. It currently leaves `\z` undefined. A
+  [proposal to add support for `\z`](https://core.tcl-lang.org/tcl/tktview/fbc56b259e989230e54a4053feeecf7aa765f61d)
+  has been submitted.
 
 ## Authors and contributors
 
diff --git a/docs/Correctly-Using-Regular-Expressions.md b/docs/Correctly-Using-Regular-Expressions.md
@@ -34,7 +34,7 @@ When using regexes for secure validation of untrusted input, do the following so
 | Python                                            | “^” or “\A”    | “\Z” (not “$” nor “\z”)                                                                             | Yes                |
 | Ruby                                              | “\A” (not “^”) | “\z” (not “$”)                                                                                      | Yes                |
 
-For example, to validate in JavaScript that the input is only “ab” or “de”, use the regex “<tt>^(ab&#x7c;de)$</tt>”. To validate the same thing in Python, use “<tt>^(ab&#x7c;de)\Z</tt>” or “<tt>\A(ab&#x7c;de)\Z</tt>”. Note that the “$” anchor has different meanings among platforms and is often misunderstood; on many platforms it’s permissive by default and doesn’t match only the end of the input. Instead of using “$” on a platform if $ is permissive, consider using an explicit form instead (e.g., “`\n?\z`”). Consider preferring “\A” and “\z” where it’s supported (this is necessary when using Ruby).
+For example, to validate in JavaScript that the input is only “ab” or “de”, use the regex “<tt>^(ab&#x7c;de)$</tt>”. To validate the same thing in Python, use “<tt>^(ab&#x7c;de)\Z</tt>” or “<tt>\A(ab&#x7c;de)\Z</tt>”. Note that the “$” anchor has different meanings among platforms and is often misunderstood; on many platforms it’s permissive by default and doesn’t match only the end of the input. On a platform where “$” is permissive, consider using an explicit form instead (e.g., “`\n?\z`”). Consider preferring “\A” and “\z” where they’re supported (this is necessary when using Ruby). [POSIX EREs](https://www.austingroupbugs.net/view.php?id=1919) and [Python](https://github.com/python/cpython/issues/133306) are being changed to support `\A`...`\z`.
 
 In addition, ensure your regex is not vulnerable to a Regular Expression Denial of Service (ReDoS) attack. A ReDoS “[is a Denial of Service attack, that exploits the fact that most Regular Expression implementations may reach extreme situations that cause them to work very slowly (exponentially related to input size)](https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS)”. Many regex implementations are “backtracking” implementations, that is, they try all possible matches. In these implementations,  a poorly-written regular expression can be exploited by an attacker to take a vast amount of time.
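
As an illustration of the Python row in the anchor table above, here is a minimal sketch (not part of the patched guide) contrasting the permissive “$” with “\Z” and with `re.fullmatch`. The helper names `is_valid_permissive`, `is_valid_strict`, and `is_valid_fullmatch` are hypothetical and chosen only for this example.

```python
import re

# Hypothetical patterns for illustration only.
# "$" (without re.MULTILINE) also matches just before a trailing newline.
PERMISSIVE = re.compile(r"^(ab|de)$")
# "\Z" matches only at the very end of the string.
STRICT = re.compile(r"\A(ab|de)\Z")


def is_valid_permissive(text: str) -> bool:
    """Validate with "^...$"; accepts one trailing newline."""
    return PERMISSIVE.search(text) is not None


def is_valid_strict(text: str) -> bool:
    """Validate with "\\A...\\Z"; rejects a trailing newline."""
    return STRICT.search(text) is not None


def is_valid_fullmatch(text: str) -> bool:
    """Validate with re.fullmatch, which requires the whole string to match."""
    return re.fullmatch(r"(?:ab|de)", text) is not None


if __name__ == "__main__":
    for candidate in ["ab", "de", "ab\n", "abc"]:
        print(
            repr(candidate),
            is_valid_permissive(candidate),
            is_valid_strict(candidate),
            is_valid_fullmatch(candidate),
        )
```

With the input `"ab\n"`, the “$” version accepts the string while the “\Z” and `fullmatch` versions reject it, which is the difference the table warns about.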