-
Notifications
You must be signed in to change notification settings - Fork 81
Description
I've been thinking about how to overcome the obstacles that prevent inclusion of something like M.regex for matching strings in use cases like "starts with match:" and "is an EVM address" and "is hexadecimal/base64/etc.". I think the pieces are in place to do so after making some decisions about convenience.
Problem: Syntax
There are many dialects of regular expression, each with their own syntax and behavioral quirks.
But RFC 9485 formally defines "I-Regexp", an interoperable subset with corresponding semantics. It relies upon Unicode as exposed in \p{…} and \P{…} patterns (which we would presumably exclude because data for a code point can change over time) and is missing case-sensitivity flags (which is probably for the best, since case mappings can also change over time), but is otherwise excellent and unquestionably simple. From a JavaScript perspective, the only quirks to be aware of are implicit anchoring (e.g., pattern a matches "a" but not "aa") and . excluding both U+000A LINE FEED \n and U+000D CARRIAGE RETURN \r (e.g., the pattern for matching any single code point is (.|[\n\r])).
Problem: ReDoS
Many JavaScript implementations, including V8, suffer from exponential backtracking with patterns like (a*)*$, ([0-9]+)*$, (a|aa){10}$, or (a|a?)+$, in which repetition is applied to a subexpression that can either match the same input in multiple ways or can match a prefix of a subsequent repetition. Similarly, but less catastrophically, they also suffer from polynomial backtracking with patterns like a*b?a*c or a.*?b.*?c, in which repetition is applied to multiple elements in a sequence with potential overlap.
This suggests that a restriction on valid patterns can avoid such backtracking, and indeed I believe it can—XML type declarations require content models to be "deterministic"/"unambiguous"/"one-unambiguous", such that matching can always proceed without lookahead because the next symbol uniquely determines at most one transition from any given state, which seems to be recognizable from a Glushkov construction variant that preserves the identity of transitions from distinct repetition quantifiers (e.g., in (a*)*$, every "a" transitions to the same state but that state itself has self-transition on "a" for the inner * and another one for the outer *). There's a decent amount of literature on this, including linear classification of any given pattern as deterministic ([GMS2012] and summary slides) and extension to so-called "k-lookahead" and "block determinism" families ([CMM2015]). Things are functional but possibly awkward without the latter, e.g. (foo|bar|baz)__[0-9]+ is not determinstic because an initial "b" might be part of "bar" or "baz" but the equivalent (foo|ba(r|z))__[0-9]+ is deterministic, a gap which is bridged in block determinism by collapsing fixed sequences into generalized alphabet expansion, basically something like («foo»|«bar»|«baz»)«__»[0-9]+ in which symbols can have length > 1.
The relaxations are clearly more convenient, but it would take a little more work to precisely define. Note also that of my examples above, only base64 is really affected:
- starts with
match::match:(.|[\n\r])* - is an EVM address:
0x[0-9a-fA-F]{40} - is [non-empty] hexadecimal:
([0-9a-fA-F]{2})+ - is [non-empty] base64:
([A-Za-z0-9+/]{4})+([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?is 4-lookahead deterministic but not 1-deterministic;([A-Za-z0-9+/]{2}([A-Za-z0-9+/]([A-Za-z0-9+/]|=$)|==$))+is deterministic but somewhat clumsy
Other problems
Have I missed anything?
Proposal
M.dregex(pattern, limits) (or M.safeRegex(…) or 🖌️…) emits a "match:$name" instance, and limits can constrain the upper bound for range quantifiers (defaulting to 9 such that . We require the pattern to be an RFC 9485 I-Regexp, Unicode-independent (i.e., free of .{10} and .{1,10} are invalid) and also repetition depth (defaulting to 2 such that the above base64 pattern is valid but (([0-9a-f]{2})+.)+ is not)\p{…} and \P{…} atoms), and block deterministic (i.e., deterministic if treating consecutive static symbols atomically), but do not disambiguate with lookahead (e.g., the first base64 pattern above is invalid).
What would it take to commit to either this approach or a similar alternative? Formal proof may be hard to come by, but is there some lower threshold that we might deem sufficient?