diff --git a/docs/Correctly-Using-Regular-Expressions-Rationale.md b/docs/Correctly-Using-Regular-Expressions-Rationale.md
index d4f05d0d..54a756a2 100644
--- a/docs/Correctly-Using-Regular-Expressions-Rationale.md
+++ b/docs/Correctly-Using-Regular-Expressions-Rationale.md
@@ -175,7 +175,7 @@ Setting both PCRE2_ANCHORED and PCRE2_ENDANCHORED forces a full-string match, bu
 The [Python3 language documentation on re](https://docs.python.org/3/library/re.html) notes that its operations are “similar to those found in Perl” - but note that they are _similar_ not _identical_. In this library:
 * ^ (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
-* $ Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
+* $ Matches the end of the string or just before the newline at the end of the string (it is _permissive_), and in MULTILINE mode it also matches before a newline.
 * \A Matches only at the start of the string.
 * \Z Matches only at the end of the string. Note that this is spelled \Z not \z, and there is no \z.
diff --git a/docs/Correctly-Using-Regular-Expressions.md b/docs/Correctly-Using-Regular-Expressions.md
index 30bdf2a6..cb36afe5 100644
--- a/docs/Correctly-Using-Regular-Expressions.md
+++ b/docs/Correctly-Using-Regular-Expressions.md
@@ -102,7 +102,7 @@ Platform
 “\Z” (not “$” nor “\z”)
- No
+ Yes
@@ -112,18 +112,18 @@ Platform
 “\z” (not “$”)
- No
+ Yes
-For example, to validate in JavaScript that the input is only “ab” or “de”, use the regex “^(ab|de)$”. To validate the same thing in Python, use “^(ab|de)\Z” or “\A(ab|de)\Z”. Note that the “$” anchor has different meanings among platforms and is often misunderstood; on many platforms it’s permissive and doesn’t match only the end of the input. Instead of using “$” on a platform if $ is permissive, consider using an explicit form instead (e.g., “`\n?\z`”). Consider preferring “\A” and “\z” where it’s supported (this is necessary when using Ruby).
+For example, to validate in JavaScript that the input is only “ab” or “de”, use the regex “^(ab|de)$”. To validate the same thing in Python, use “^(ab|de)\Z” or “\A(ab|de)\Z”. Note that the “$” anchor has different meanings among platforms and is often misunderstood; on many platforms it’s permissive by default and doesn’t match only the end of the input. Instead of using “$” on a platform if $ is permissive, consider using an explicit form instead (e.g., “`\n?\z`”). Consider preferring “\A” and “\z” where it’s supported (this is necessary when using Ruby).
 In addition, ensure your regex is not vulnerable to a Regular Expression Denial of Service (ReDoS) attack. A ReDoS “[is a Denial of Service attack, that exploits the fact that most Regular Expression implementations may reach extreme situations that cause them to work very slowly (exponentially related to input size)](https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS)”. Many regex implementations are “backtracking” implementations, that is, they try all possible matches. In these implementations, a poorly-written regular expression can be exploited by an attacker to take a vast amount of time.
 1. One solution is to use a regex implementation that does not have this vulnerability because it never backtracks. E.g., use Go’s default regex system, RE2, or on .NET enable the RegexOptions.NonBacktracking option. Non-backtracking implementations can sometimes be orders of magnitude faster, but they also omit some features (e.g., backreferences).
 2. Alternatively, create regexes that require no or little backtracking. Where a branch (“|”) occurs, the next character should select one branch. Where there is optional repetition (e.g., “*”), the next character should determine if there is a repetition or not. One common cause of unnecessary backtracking are poorly-written regexes with repetitions in repetitions, e.g., “(a+)*”. Some tools can help find these defects.
-3. A partial countermeasure is to greatly limit the length of the untrusted input. This can limit the impact of a vulnerability.
+3. A partial countermeasure is to greatly limit the length of the untrusted input and/or the number of repetitions. This can limit the impact of a vulnerability. For example, in a regex, use “{0,4}” (0 through 4 repetitions inclusive) instead of “*” (0 or more repetitions, with no maximum).
 ## Detailed Rationale
diff --git a/docs/src/regex.py b/docs/src/regex.py
new file mode 100755
index 00000000..1bff1098
--- /dev/null
+++ b/docs/src/regex.py
@@ -0,0 +1,9 @@
+#!/usr/bin/env python3
+
+import re
+
+print('Test Python regex')
+print("Must be false: ", bool(re.search(r'^wrong$', "hello")))
+print("Must be true: ", bool(re.search(r'^hello$', "hello")))
+print("True if permissive: ", bool(re.search(r'^hello$', "hello\n")))
+print("Should be false: ", bool(re.search(r'^hello$', "hello\nthere")))
diff --git a/docs/src/regex.rb b/docs/src/regex.rb
new file mode 100755
index 00000000..73b3b069
--- /dev/null
+++ b/docs/src/regex.rb
@@ -0,0 +1,7 @@
+#!/usr/bin/env ruby
+
+puts('Test Ruby regex')
+puts("Must be false: ", !! /^wrong$/.match("hello"))
+puts("Must be true: ", !! /^hello$/.match("hello"))
+puts("True if permissive: ", !! /^hello$/.match("hello\n"))
+puts("Should be true ($ always multi): ", !! /^hello$/.match("hello\nthere"))
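
Possible companion to docs/src/regex.py: the patch's script shows that Python's “$” is permissive but does not exercise the recommended “\Z” form or the “{0,4}” bounded repetition mentioned in the new text. The sketch below is one way that could be demonstrated. The names PERMISSIVE, STRICT, BOUNDED and the two helper functions are hypothetical and are not part of this patch.

```python
#!/usr/bin/env python3
# Minimal sketch, not part of the patch: contrasts "$" with "\Z" in Python
# and shows a bounded repetition. All names here are illustrative only.

import re

# "$" also matches just before a trailing newline (permissive).
PERMISSIVE = re.compile(r'^(ab|de)$')
# "\A" and "\Z" match only at the very start and very end of the string.
STRICT = re.compile(r'\A(ab|de)\Z')
# "{0,4}" caps the repetition count instead of the unbounded "*".
BOUNDED = re.compile(r'\A[a-z]{0,4}\Z')


def is_valid_permissive(text):
    """Return True if text matches the "$"-anchored pattern."""
    return PERMISSIVE.search(text) is not None


def is_valid_strict(text):
    """Return True if text matches the "\\Z"-anchored pattern."""
    return STRICT.search(text) is not None


if __name__ == '__main__':
    print("Strict, must be true:     ", is_valid_strict("ab"))
    print("Strict, must be false:    ", is_valid_strict("ab\n"))
    print("True if $ is permissive:  ", is_valid_permissive("ab\n"))
    print("Bounded, must be true:    ", bool(BOUNDED.search("abcd")))
    print("Bounded, must be false:   ", bool(BOUNDED.search("abcde")))
```

Using re.fullmatch would be another way to require a whole-string match in Python; the sketch keeps explicit anchors so that it mirrors the wording of the guidance text.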