Skip to content

Can we add a limited regular expression matcher? #3079

@gibson042

Description

@gibson042

I've been thinking about how to overcome the obstacles that prevent inclusion of something like M.regex for matching strings in use cases like "starts with match:" and "is an EVM address" and "is hexadecimal/base64/etc.". I think the pieces are in place to do so after making some decisions about convenience.

Problem: Syntax

There are many dialects of regular expression, each with their own syntax and behavioral quirks.

But RFC 9485 formally defines "I-Regexp", an interoperable subset with corresponding semantics. It relies upon Unicode as exposed in \p{…} and \P{…} patterns (which we would presumably exclude because data for a code point can change over time) and is missing case-sensitivity flags (which is probably for the best, since case mappings can also change over time), but is otherwise excellent and unquestionably simple. From a JavaScript perspective, the only quirks to be aware of are implicit anchoring (e.g., pattern a matches "a" but not "aa") and . excluding both U+000A LINE FEED \n and U+000D CARRIAGE RETURN \r (e.g., the pattern for matching any single code point is (.|[\n\r])).

Problem: ReDoS

Many JavaScript implementations, including V8, suffer from exponential backtracking with patterns like (a*)*$, ([0-9]+)*$, (a|aa){10}$, or (a|a?)+$, in which repetition is applied to a subexpression that can either match the same input in multiple ways or can match a prefix of a subsequent repetition. Similarly, but less catastrophically, they also suffer from polynomial backtracking with patterns like a*b?a*c or a.*?b.*?c, in which repetition is applied to multiple elements in a sequence with potential overlap.

This suggests that a restriction on valid patterns can avoid such backtracking, and indeed I believe it can—XML type declarations require content models to be "deterministic"/"unambiguous"/"one-unambiguous", such that matching can always proceed without lookahead because the next symbol uniquely determines at most one transition from any given state, which seems to be recognizable from a Glushkov construction variant that preserves the identity of transitions from distinct repetition quantifiers (e.g., in (a*)*$, every "a" transitions to the same state but that state itself has self-transition on "a" for the inner * and another one for the outer *). There's a decent amount of literature on this, including linear classification of any given pattern as deterministic ([GMS2012] and summary slides) and extension to so-called "k-lookahead" and "block determinism" families ([CMM2015]). Things are functional but possibly awkward without the latter, e.g. (foo|bar|baz)__[0-9]+ is not determinstic because an initial "b" might be part of "bar" or "baz" but the equivalent (foo|ba(r|z))__[0-9]+ is deterministic, a gap which is bridged in block determinism by collapsing fixed sequences into generalized alphabet expansion, basically something like («foo»|«bar»|«baz»)«__»[0-9]+ in which symbols can have length > 1.

The relaxations are clearly more convenient, but it would take a little more work to precisely define. Note also that of my examples above, only base64 is really affected:

  • starts with match:: match:(.|[\n\r])*
  • is an EVM address: 0x[0-9a-fA-F]{40}
  • is [non-empty] hexadecimal: ([0-9a-fA-F]{2})+
  • is [non-empty] base64: ([A-Za-z0-9+/]{4})+([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)? is 4-lookahead deterministic but not 1-deterministic; ([A-Za-z0-9+/]{2}([A-Za-z0-9+/]([A-Za-z0-9+/]|=$)|==$))+ is deterministic but somewhat clumsy

Other problems

Have I missed anything?

Proposal

M.dregex(pattern, limits) (or M.safeRegex(…) or 🖌️…) emits a "match:$name" instance, and limits can constrain the upper bound for range quantifiers (defaulting to 9 such that .{10} and .{1,10} are invalid) and also repetition depth (defaulting to 2 such that the above base64 pattern is valid but (([0-9a-f]{2})+.)+ is not). We require the pattern to be an RFC 9485 I-Regexp, Unicode-independent (i.e., free of \p{…} and \P{…} atoms), and block deterministic (i.e., deterministic if treating consecutive static symbols atomically), but do not disambiguate with lookahead (e.g., the first base64 pattern above is invalid).

What would it take to commit to either this approach or a similar alternative? Formal proof may be hard to come by, but is there some lower threshold that we might deem sufficient?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions