From 0203d9fad20ffd6d6c53648c7b9491c3600a2289 Mon Sep 17 00:00:00 2001
From: "David A. Wheeler"
Date: Mon, 23 Sep 2024 11:40:33 -0400
Subject: [PATCH 1/6] Fix "Correctly Using Regex" table for Python and Ruby

The detailed rationale explained why "$" is permissive in Python3 and Ruby, but the roll-up table is wrong (!). Fix the table, and provide source code for verifying it.

Signed-off-by: David A. Wheeler
---
 docs/Correctly-Using-Regular-Expressions-Rationale.md | 2 +-
 docs/Correctly-Using-Regular-Expressions.md | 4 ++--
 docs/src/regex.py | 8 ++++++++
 docs/src/regex.rb | 6 ++++++
 4 files changed, 17 insertions(+), 3 deletions(-)
 create mode 100755 docs/src/regex.py
 create mode 100755 docs/src/regex.rb

diff --git a/docs/Correctly-Using-Regular-Expressions-Rationale.md b/docs/Correctly-Using-Regular-Expressions-Rationale.md
index d4f05d0d..4e16badd 100644
--- a/docs/Correctly-Using-Regular-Expressions-Rationale.md
+++ b/docs/Correctly-Using-Regular-Expressions-Rationale.md
@@ -175,7 +175,7 @@ Setting both PCRE2_ANCHORED and PCRE2_ENDANCHORED forces a full-string match, bu
 The [Python3 language documentation on re](https://docs.python.org/3/library/re.html) notes that its operations are “similar to those found in Perl” - but note that they are _similar_ not _identical_. In this library:
 * ^ (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
-* $ Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
+* $ Matches the end of the string or just before the newline at the end of the string (it is *permissive*), and in MULTILINE mode it also matches before a newline.
 * \A Matches only at the start of the string.
 * \Z Matches only at the end of the string. Note that this is spelled \Z not \z, and there is no \z.
diff --git a/docs/Correctly-Using-Regular-Expressions.md b/docs/Correctly-Using-Regular-Expressions.md
index 30bdf2a6..f3cc8fda 100644
--- a/docs/Correctly-Using-Regular-Expressions.md
+++ b/docs/Correctly-Using-Regular-Expressions.md
@@ -102,7 +102,7 @@ Platform
 “\Z” (not “$” nor “\z”)
- No
+ Yes
@@ -112,7 +112,7 @@ Platform
 “\z” (not “$”)
- No
+ Yes
diff --git a/docs/src/regex.py b/docs/src/regex.py
new file mode 100755
index 00000000..bca3b3f0
--- /dev/null
+++ b/docs/src/regex.py
@@ -0,0 +1,8 @@
+#!/usr/bin/env python3
+
+import re
+
+print('Test Python regex')
+print("Must be false: ", bool(re.search(r'^wrong$', "hello")))
+print("Must be true: ", bool(re.search(r'^hello$', "hello")))
+print("True if permissive: ", bool(re.search(r'^hello$', "hello\n")))
diff --git a/docs/src/regex.rb b/docs/src/regex.rb
new file mode 100755
index 00000000..c4f395b7
--- /dev/null
+++ b/docs/src/regex.rb
@@ -0,0 +1,6 @@
+#!/usr/bin/env ruby
+
+puts('Test Ruby regex')
+puts("Must be false: ", !! /^wrong$/.match("hello"))
+puts("Must be true: ", !! /^hello$/.match("hello"))
+puts("True if permissive: ", !! /^hello$/.match("hello\n"))

From 153416a1ca8fa92f2b2c92c925c7e17bac93d1bd Mon Sep 17 00:00:00 2001
From: "David A. Wheeler"
Date: Mon, 23 Sep 2024 11:52:32 -0400
Subject: [PATCH 2/6] Fix markdownlint error

Signed-off-by: David A. Wheeler
---
 docs/Correctly-Using-Regular-Expressions-Rationale.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/Correctly-Using-Regular-Expressions-Rationale.md b/docs/Correctly-Using-Regular-Expressions-Rationale.md
index 4e16badd..54a756a2 100644
--- a/docs/Correctly-Using-Regular-Expressions-Rationale.md
+++ b/docs/Correctly-Using-Regular-Expressions-Rationale.md
@@ -175,7 +175,7 @@ Setting both PCRE2_ANCHORED and PCRE2_ENDANCHORED forces a full-string match, bu
 The [Python3 language documentation on re](https://docs.python.org/3/library/re.html) notes that its operations are “similar to those found in Perl” - but note that they are _similar_ not _identical_. In this library:
 * ^ (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
-* $ Matches the end of the string or just before the newline at the end of the string (it is *permissive*), and in MULTILINE mode it also matches before a newline.
+* $ Matches the end of the string or just before the newline at the end of the string (it is _permissive_), and in MULTILINE mode it also matches before a newline.
 * \A Matches only at the start of the string.
 * \Z Matches only at the end of the string. Note that this is spelled \Z not \z, and there is no \z.

From d4c159e84431fdaaf6e3276c26b37ca84adae893 Mon Sep 17 00:00:00 2001
From: "David A. Wheeler"
Date: Mon, 23 Sep 2024 12:26:35 -0400
Subject: [PATCH 3/6] Add more tests for regex

Add more tests, showing the differences between Python and Ruby.

Signed-off-by: David A. Wheeler
---
 docs/src/regex.py | 1 +
 docs/src/regex.rb | 1 +
 2 files changed, 2 insertions(+)

diff --git a/docs/src/regex.py b/docs/src/regex.py
index bca3b3f0..1bff1098 100755
--- a/docs/src/regex.py
+++ b/docs/src/regex.py
@@ -6,3 +6,4 @@
 print("Must be false: ", bool(re.search(r'^wrong$', "hello")))
 print("Must be true: ", bool(re.search(r'^hello$', "hello")))
 print("True if permissive: ", bool(re.search(r'^hello$', "hello\n")))
+print("Should be false: ", bool(re.search(r'^hello$', "hello\nthere")))
diff --git a/docs/src/regex.rb b/docs/src/regex.rb
index c4f395b7..73b3b069 100755
--- a/docs/src/regex.rb
+++ b/docs/src/regex.rb
@@ -4,3 +4,4 @@
 puts("Must be false: ", !! /^wrong$/.match("hello"))
 puts("Must be true: ", !! /^hello$/.match("hello"))
 puts("True if permissive: ", !! /^hello$/.match("hello\n"))
+puts("Should be true ($ always multi): ", !! /^hello$/.match("hello\nthere"))

From c1b09a1b9c833f646fa4bab830905215278b1a10 Mon Sep 17 00:00:00 2001
From: "David A. Wheeler"
Date: Mon, 23 Sep 2024 12:31:05 -0400
Subject: [PATCH 4/6] Note that we focus on defaults

Signed-off-by: David A. Wheeler
---
 docs/Correctly-Using-Regular-Expressions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/Correctly-Using-Regular-Expressions.md b/docs/Correctly-Using-Regular-Expressions.md
index f3cc8fda..8e373155 100644
--- a/docs/Correctly-Using-Regular-Expressions.md
+++ b/docs/Correctly-Using-Regular-Expressions.md
@@ -117,7 +117,7 @@ Platform
-For example, to validate in JavaScript that the input is only “ab” or “de”, use the regex “^(ab|de)$”. To validate the same thing in Python, use “^(ab|de)\Z” or “\A(ab|de)\Z”. Note that the “$” anchor has different meanings among platforms and is often misunderstood; on many platforms it’s permissive and doesn’t match only the end of the input. Instead of using “$” on a platform if $ is permissive, consider using an explicit form instead (e.g., “`\n?\z`”). Consider preferring “\A” and “\z” where it’s supported (this is necessary when using Ruby).
+For example, to validate in JavaScript that the input is only “ab” or “de”, use the regex “^(ab|de)$”. To validate the same thing in Python, use “^(ab|de)\Z” or “\A(ab|de)\Z”. Note that the “$” anchor has different meanings among platforms and is often misunderstood; on many platforms it’s permissive by default and doesn’t match only the end of the input. Instead of using “$” on a platform if $ is permissive, consider using an explicit form instead (e.g., “`\n?\z`”). Consider preferring “\A” and “\z” where it’s supported (this is necessary when using Ruby).
 In addition, ensure your regex is not vulnerable to a Regular Expression Denial of Service (ReDoS) attack. A ReDoS “[is a Denial of Service attack, that exploits the fact that most Regular Expression implementations may reach extreme situations that cause them to work very slowly (exponentially related to input size)](https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS)”. Many regex implementations are “backtracking” implementations, that is, they try all possible matches. In these implementations, a poorly-written regular expression can be exploited by an attacker to take a vast amount of time.

From e317578e16ccb2b22873bf8aaed174e07a28e2e2 Mon Sep 17 00:00:00 2001
From: "David A. Wheeler"
Date: Mon, 23 Sep 2024 12:35:03 -0400
Subject: [PATCH 5/6] Note that limiting the number of repetitions helps ReDoS

Signed-off-by: David A. Wheeler
---
 docs/Correctly-Using-Regular-Expressions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/Correctly-Using-Regular-Expressions.md b/docs/Correctly-Using-Regular-Expressions.md
index 8e373155..a50819ea 100644
--- a/docs/Correctly-Using-Regular-Expressions.md
+++ b/docs/Correctly-Using-Regular-Expressions.md
@@ -123,7 +123,7 @@ In addition, ensure your regex is not vulnerable to a Regular Expression Denial
 1. One solution is to use a regex implementation that does not have this vulnerability because it never backtracks. E.g., use Go’s default regex system, RE2, or on .NET enable the RegexOptions.NonBacktracking option. Non-backtracking implementations can sometimes be orders of magnitude faster, but they also omit some features (e.g., backreferences).
 2. Alternatively, create regexes that require no or little backtracking. Where a branch (“|”) occurs, the next character should select one branch. Where there is optional repetition (e.g., “*”), the next character should determine if there is a repetition or not. One common cause of unnecessary backtracking are poorly-written regexes with repetitions in repetitions, e.g., “(a+)*”. Some tools can help find these defects.
-3. A partial countermeasure is to greatly limit the length of the untrusted input. This can limit the impact of a vulnerability.
+3. A partial countermeasure is to greatly limit the length of the untrusted input and/or the number of repetitions. This can limit the impact of a vulnerability.
 ## Detailed Rationale

From ff74d1b067c2703b91ca93deb2970c11b6673806 Mon Sep 17 00:00:00 2001
From: "David A. Wheeler"
Date: Mon, 23 Sep 2024 12:47:35 -0400
Subject: [PATCH 6/6] Add example of limiting repetitions

Signed-off-by: David A. Wheeler
---
 docs/Correctly-Using-Regular-Expressions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/Correctly-Using-Regular-Expressions.md b/docs/Correctly-Using-Regular-Expressions.md
index a50819ea..cb36afe5 100644
--- a/docs/Correctly-Using-Regular-Expressions.md
+++ b/docs/Correctly-Using-Regular-Expressions.md
@@ -123,7 +123,7 @@ In addition, ensure your regex is not vulnerable to a Regular Expression Denial
 1. One solution is to use a regex implementation that does not have this vulnerability because it never backtracks. E.g., use Go’s default regex system, RE2, or on .NET enable the RegexOptions.NonBacktracking option. Non-backtracking implementations can sometimes be orders of magnitude faster, but they also omit some features (e.g., backreferences).
 2. Alternatively, create regexes that require no or little backtracking. Where a branch (“|”) occurs, the next character should select one branch. Where there is optional repetition (e.g., “*”), the next character should determine if there is a repetition or not. One common cause of unnecessary backtracking are poorly-written regexes with repetitions in repetitions, e.g., “(a+)*”. Some tools can help find these defects.
-3. A partial countermeasure is to greatly limit the length of the untrusted input and/or the number of repetitions. This can limit the impact of a vulnerability.
+3. A partial countermeasure is to greatly limit the length of the untrusted input and/or the number of repetitions. This can limit the impact of a vulnerability. For example, in a regex, use “{0,4}” (0 through 4 repetitions inclusive) instead of “*” (0 or more repetitions, with no maximum).
 ## Detailed Rationale
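
Reviewer note: the guidance touched by this series (prefer “\A” and “\Z” over “$” in Python, and bound repetitions with “{0,4}”) could be checked with something like the sketch below. It is not part of the patches; the names `PERMISSIVE`, `STRICT`, `BOUNDED`, and `is_full_match` are hypothetical and chosen only for illustration, and it assumes Python's default flags (no MULTILINE).

```python
import re

# Hypothetical patterns for illustration only; not part of the patch series.
PERMISSIVE = re.compile(r'^(ab|de)$')   # "$" also matches just before a trailing newline
STRICT = re.compile(r'\A(ab|de)\Z')     # "\Z" matches only at the very end of the string
BOUNDED = re.compile(r'\Aa{0,4}\Z')     # 0 through 4 repetitions instead of an unbounded "*"


def is_full_match(pattern: re.Pattern, text: str) -> bool:
    """Return True if the compiled pattern matches somewhere in text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for candidate in ["ab", "ab\n", "ab\nde", "abc"]:
        print(repr(candidate),
              "permissive:", is_full_match(PERMISSIVE, candidate),
              "strict:", is_full_match(STRICT, candidate))
    for candidate in ["", "aaaa", "aaaaa"]:
        print(repr(candidate), "bounded:", is_full_match(BOUNDED, candidate))
```

With default flags, the only input where the two anchored forms should disagree is the one ending in a single trailing newline, which is the case the corrected table describes.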