From 0203d9fad20ffd6d6c53648c7b9491c3600a2289 Mon Sep 17 00:00:00 2001
From: "David A. Wheeler" <dwheeler@dwheeler.com>
Date: Mon, 23 Sep 2024 11:40:33 -0400
Subject: [PATCH 1/6] Fix "Correctly Using Regex" table for Python and Ruby

The detailed rationale already explains why "$" is permissive in
Python3 and Ruby, but the roll-up table said otherwise.
Fix the table, and add source code that verifies the behavior.

Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
---
 docs/Correctly-Using-Regular-Expressions-Rationale.md | 2 +-
 docs/Correctly-Using-Regular-Expressions.md           | 4 ++--
 docs/src/regex.py                                     | 8 ++++++++
 docs/src/regex.rb                                     | 6 ++++++
 4 files changed, 17 insertions(+), 3 deletions(-)
 create mode 100755 docs/src/regex.py
 create mode 100755 docs/src/regex.rb
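
Reviewer note: as a complement to docs/src/regex.py, the following is a minimal sketch (not part of this patch) contrasting "$" with "\Z" in Python 3. The variable names are illustrative assumptions.

```python
import re

# "$" also matches just before a newline at the end of the string,
# so it is permissive by default.
permissive = bool(re.search(r'^hello$', "hello\n"))

# "\Z" matches only at the very end of the string, so the trailing
# newline causes the match to fail.
strict_with_newline = bool(re.search(r'^hello\Z', "hello\n"))
strict_exact = bool(re.search(r'^hello\Z', "hello"))

print(permissive, strict_with_newline, strict_exact)  # True False True
```

This matches the corrected table entry: in Python, append "\Z" rather than "$" when the whole input must match.
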
diff --git a/docs/Correctly-Using-Regular-Expressions-Rationale.md b/docs/Correctly-Using-Regular-Expressions-Rationale.md
index d4f05d0d..4e16badd 100644
--- a/docs/Correctly-Using-Regular-Expressions-Rationale.md
+++ b/docs/Correctly-Using-Regular-Expressions-Rationale.md
@@ -175,7 +175,7 @@ Setting both PCRE2_ANCHORED and PCRE2_ENDANCHORED forces a full-string match, bu
 The [Python3 language documentation on re](https://docs.python.org/3/library/re.html) notes that its operations are “similar to those found in Perl” - but note that they are _similar_ not _identical_. In this library:
 
 * ^ (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
-* $ Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
+* $ Matches the end of the string or just before the newline at the end of the string (it is *permissive*), and in MULTILINE mode it also matches before a newline.
 * \A Matches only at the start of the string.
 * \Z Matches only at the end of the string. Note that this is spelled \Z not \z, and there is no \z.
 
diff --git a/docs/Correctly-Using-Regular-Expressions.md b/docs/Correctly-Using-Regular-Expressions.md
index 30bdf2a6..f3cc8fda 100644
--- a/docs/Correctly-Using-Regular-Expressions.md
+++ b/docs/Correctly-Using-Regular-Expressions.md
@@ -102,7 +102,7 @@ Platform
    </td>
    <td>“\Z” (not “$” nor “\z”)
    </td>
-   <td>No
+   <td>Yes
    </td>
   </tr>
   <tr>
@@ -112,7 +112,7 @@ Platform
    </td>
    <td>“\z” (not “$”)
    </td>
-   <td>No
+   <td>Yes
    </td>
   </tr>
 </table>
diff --git a/docs/src/regex.py b/docs/src/regex.py
new file mode 100755
index 00000000..bca3b3f0
--- /dev/null
+++ b/docs/src/regex.py
@@ -0,0 +1,8 @@
+#!/usr/bin/env python3
+
+import re
+
+print('Test Python regex')
+print("Must be false: ", bool(re.search(r'^wrong$', "hello")))
+print("Must be true: ", bool(re.search(r'^hello$', "hello")))
+print("True if permissive: ", bool(re.search(r'^hello$', "hello\n")))
diff --git a/docs/src/regex.rb b/docs/src/regex.rb
new file mode 100755
index 00000000..c4f395b7
--- /dev/null
+++ b/docs/src/regex.rb
@@ -0,0 +1,6 @@
+#!/usr/bin/env ruby
+
+puts('Test Ruby regex')
+puts("Must be false: ", !! /^wrong$/.match("hello"))
+puts("Must be true: ", !! /^hello$/.match("hello"))
+puts("True if permissive: ", !! /^hello$/.match("hello\n"))

From 153416a1ca8fa92f2b2c92c925c7e17bac93d1bd Mon Sep 17 00:00:00 2001
From: "David A. Wheeler" <dwheeler@dwheeler.com>
Date: Mon, 23 Sep 2024 11:52:32 -0400
Subject: [PATCH 2/6] Fix markdownlint error

Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
---
 docs/Correctly-Using-Regular-Expressions-Rationale.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/Correctly-Using-Regular-Expressions-Rationale.md b/docs/Correctly-Using-Regular-Expressions-Rationale.md
index 4e16badd..54a756a2 100644
--- a/docs/Correctly-Using-Regular-Expressions-Rationale.md
+++ b/docs/Correctly-Using-Regular-Expressions-Rationale.md
@@ -175,7 +175,7 @@ Setting both PCRE2_ANCHORED and PCRE2_ENDANCHORED forces a full-string match, bu
 The [Python3 language documentation on re](https://docs.python.org/3/library/re.html) notes that its operations are “similar to those found in Perl” - but note that they are _similar_ not _identical_. In this library:
 
 * ^ (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
-* $ Matches the end of the string or just before the newline at the end of the string (it is *permissive*), and in MULTILINE mode it also matches before a newline.
+* $ Matches the end of the string or just before the newline at the end of the string (it is _permissive_), and in MULTILINE mode it also matches before a newline.
 * \A Matches only at the start of the string.
 * \Z Matches only at the end of the string. Note that this is spelled \Z not \z, and there is no \z.
 

From d4c159e84431fdaaf6e3276c26b37ca84adae893 Mon Sep 17 00:00:00 2001
From: "David A. Wheeler" <dwheeler@dwheeler.com>
Date: Mon, 23 Sep 2024 12:26:35 -0400
Subject: [PATCH 3/6] Add more tests for regex

Add more tests, showing the differences between Python and Ruby.

Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
---
 docs/src/regex.py | 1 +
 docs/src/regex.rb | 1 +
 2 files changed, 2 insertions(+)
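
Reviewer note: the new Ruby test shows that Ruby's "$" always behaves as a line anchor. A rough Python analogue, sketched here under the assumption that re.MULTILINE is the closest equivalent, shows the same effect and confirms that "\A" and "\Z" are unaffected by that flag. The variable names are illustrative.

```python
import re

text = "hello\nthere"

# Default mode: "$" matches only at the end of the string (or just
# before a final newline), so this does not match.
default_match = bool(re.search(r'^hello$', text))

# MULTILINE mode: "$" also matches before any newline, which resembles
# the behavior Ruby applies to "$" at all times.
multiline_match = bool(re.search(r'^hello$', text, re.MULTILINE))

# "\A" and "\Z" keep anchoring to the whole string even in MULTILINE mode.
strict_match = bool(re.search(r'\Ahello\Z', text, re.MULTILINE))

print(default_match, multiline_match, strict_match)  # False True False
```
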

diff --git a/docs/src/regex.py b/docs/src/regex.py
index bca3b3f0..1bff1098 100755
--- a/docs/src/regex.py
+++ b/docs/src/regex.py
@@ -6,3 +6,4 @@
 print("Must be false: ", bool(re.search(r'^wrong$', "hello")))
 print("Must be true: ", bool(re.search(r'^hello$', "hello")))
 print("True if permissive: ", bool(re.search(r'^hello$', "hello\n")))
+print("Should be false: ", bool(re.search(r'^hello$', "hello\nthere")))
diff --git a/docs/src/regex.rb b/docs/src/regex.rb
index c4f395b7..73b3b069 100755
--- a/docs/src/regex.rb
+++ b/docs/src/regex.rb
@@ -4,3 +4,4 @@
 puts("Must be false: ", !! /^wrong$/.match("hello"))
 puts("Must be true: ", !! /^hello$/.match("hello"))
 puts("True if permissive: ", !! /^hello$/.match("hello\n"))
+puts("Should be true ($ always multi): ", !! /^hello$/.match("hello\nthere"))

From c1b09a1b9c833f646fa4bab830905215278b1a10 Mon Sep 17 00:00:00 2001
From: "David A. Wheeler" <dwheeler@dwheeler.com>
Date: Mon, 23 Sep 2024 12:31:05 -0400
Subject: [PATCH 4/6] Note that we focus on defaults

Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
---
 docs/Correctly-Using-Regular-Expressions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
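
Reviewer note: the “ab” or “de” example in the changed paragraph could be exercised in Python roughly as follows. This is a sketch only; the helper name is_ab_or_de is hypothetical and does not appear in the guide.

```python
import re

# Anchor with \A and \Z so that nothing, not even a trailing newline,
# may precede or follow the allowed values.
ALLOWED = re.compile(r'\A(ab|de)\Z')


def is_ab_or_de(text: str) -> bool:
    """Return True only if text is exactly "ab" or "de"."""
    return ALLOWED.search(text) is not None


print(is_ab_or_de("ab"), is_ab_or_de("de"))      # True True
print(is_ab_or_de("ab\n"), is_ab_or_de("abde"))  # False False
```

In Python, re.fullmatch(r'ab|de', text) is another way to require that the entire input matches, without writing the anchors by hand.
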

diff --git a/docs/Correctly-Using-Regular-Expressions.md b/docs/Correctly-Using-Regular-Expressions.md
index f3cc8fda..8e373155 100644
--- a/docs/Correctly-Using-Regular-Expressions.md
+++ b/docs/Correctly-Using-Regular-Expressions.md
@@ -117,7 +117,7 @@ Platform
   </tr>
 </table>
 
-For example, to validate in JavaScript that the input is only “ab” or “de”, use the regex “<tt>^(ab&#x7c;de)$</tt>”. To validate the same thing in Python, use “<tt>^(ab&#x7c;de)\Z</tt>” or “<tt>\A(ab&#x7c;de)\Z</tt>”. Note that the “$” anchor has different meanings among platforms and is often misunderstood; on many platforms it’s permissive and doesn’t match only the end of the input. Instead of using “$” on a platform if $ is permissive, consider using an explicit form instead (e.g., “`\n?\z`”). Consider preferring “\A” and “\z” where it’s supported (this is necessary when using Ruby).
+For example, to validate in JavaScript that the input is only “ab” or “de”, use the regex “<tt>^(ab&#x7c;de)$</tt>”. To validate the same thing in Python, use “<tt>^(ab&#x7c;de)\Z</tt>” or “<tt>\A(ab&#x7c;de)\Z</tt>”. Note that the “$” anchor has different meanings among platforms and is often misunderstood; on many platforms it’s permissive by default and doesn’t match only the end of the input. On a platform where “$” is permissive, consider using an explicit form instead (e.g., “`\n?\z`”). Consider preferring “\A” and “\z” where they’re supported (this is necessary when using Ruby).
 
 In addition, ensure your regex is not vulnerable to a Regular Expression Denial of Service (ReDoS) attack. A ReDoS “[is a Denial of Service attack, that exploits the fact that most Regular Expression implementations may reach extreme situations that cause them to work very slowly (exponentially related to input size)](https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS)”. Many regex implementations are “backtracking” implementations, that is, they try all possible matches. In these implementations,  a poorly-written regular expression can be exploited by an attacker to take a vast amount of time.
 

From e317578e16ccb2b22873bf8aaed174e07a28e2e2 Mon Sep 17 00:00:00 2001
From: "David A. Wheeler" <dwheeler@dwheeler.com>
Date: Mon, 23 Sep 2024 12:35:03 -0400
Subject: [PATCH 5/6] Note that limiting the number of repetitions helps ReDoS

Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
---
 docs/Correctly-Using-Regular-Expressions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/Correctly-Using-Regular-Expressions.md b/docs/Correctly-Using-Regular-Expressions.md
index 8e373155..a50819ea 100644
--- a/docs/Correctly-Using-Regular-Expressions.md
+++ b/docs/Correctly-Using-Regular-Expressions.md
@@ -123,7 +123,7 @@ In addition, ensure your regex is not vulnerable to a Regular Expression Denial
 
 1. One solution is to use a regex implementation that does not have this vulnerability because it never backtracks. E.g., use Go’s default regex system, RE2, or on .NET enable the RegexOptions.NonBacktracking option. Non-backtracking implementations can sometimes be orders of magnitude faster, but they also omit some features (e.g., backreferences).
 2. Alternatively, create regexes that require no or little backtracking. Where a branch (“&#x7c;”) occurs, the next character should select one branch. Where there is optional repetition (e.g., “&#x2a;”), the next character should determine if there is a repetition or not. One common cause of unnecessary backtracking are poorly-written regexes with repetitions in repetitions, e.g., “(a+)&#x2a;”. Some tools can help find these defects.
-3. A partial countermeasure is to greatly limit the length of the untrusted input. This can limit the impact of a vulnerability.
+3. A partial countermeasure is to greatly limit the length of the untrusted input and/or the number of repetitions. This can limit the impact of a vulnerability.
 
 ## Detailed Rationale
 

From ff74d1b067c2703b91ca93deb2970c11b6673806 Mon Sep 17 00:00:00 2001
From: "David A. Wheeler" <dwheeler@dwheeler.com>
Date: Mon, 23 Sep 2024 12:47:35 -0400
Subject: [PATCH 6/6] Add example of limiting repetitions

Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
---
 docs/Correctly-Using-Regular-Expressions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
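
Reviewer note: a small sketch of the countermeasure described in item 3, combining an input length limit with bounded repetition counts. The limit MAX_LEN, the pattern, and the helper is_valid_label are all hypothetical choices made for illustration, not recommendations from the guide.

```python
import re

MAX_LEN = 64  # hypothetical cap on untrusted input length

# Every repetition has an explicit upper bound ({1,8} and {0,4})
# instead of an unbounded "+" or "*".
LABEL = re.compile(r'\A[a-z]{1,8}(-[a-z]{1,8}){0,4}\Z')


def is_valid_label(text: str) -> bool:
    """Reject overlong input first, then apply the bounded regex."""
    if len(text) > MAX_LEN:
        return False
    return LABEL.search(text) is not None


print(is_valid_label("ab-cd"))        # True
print(is_valid_label("a-b-c-d-e-f"))  # False: five "-x" groups exceed {0,4}
print(is_valid_label("a" * 100))      # False: longer than MAX_LEN
```

Checking the length before running the regex keeps the worst-case work small even if the pattern is later edited into something that backtracks more.
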

diff --git a/docs/Correctly-Using-Regular-Expressions.md b/docs/Correctly-Using-Regular-Expressions.md
index a50819ea..cb36afe5 100644
--- a/docs/Correctly-Using-Regular-Expressions.md
+++ b/docs/Correctly-Using-Regular-Expressions.md
@@ -123,7 +123,7 @@ In addition, ensure your regex is not vulnerable to a Regular Expression Denial
 
 1. One solution is to use a regex implementation that does not have this vulnerability because it never backtracks. E.g., use Go’s default regex system, RE2, or on .NET enable the RegexOptions.NonBacktracking option. Non-backtracking implementations can sometimes be orders of magnitude faster, but they also omit some features (e.g., backreferences).
 2. Alternatively, create regexes that require no or little backtracking. Where a branch (“&#x7c;”) occurs, the next character should select one branch. Where there is optional repetition (e.g., “&#x2a;”), the next character should determine if there is a repetition or not. One common cause of unnecessary backtracking are poorly-written regexes with repetitions in repetitions, e.g., “(a+)&#x2a;”. Some tools can help find these defects.
-3. A partial countermeasure is to greatly limit the length of the untrusted input and/or the number of repetitions. This can limit the impact of a vulnerability.
+3. A partial countermeasure is to greatly limit the length of the untrusted input and/or the number of repetitions. This can limit the impact of a vulnerability. For example, in a regex, use “{0,4}” (0 through 4 repetitions inclusive) instead of “&#x2a;” (0 or more repetitions, with no maximum).
 
 ## Detailed Rationale