Can we add a limited regular expression matcher?

I've been thinking about how to overcome the obstacles that prevent inclusion of something like `M.regex` for matching strings in use cases like "_starts with `match:`_" and "_is an EVM address_" and "_is hexadecimal/base64/etc._". I think the pieces are in place to do so after making some decisions about convenience.

## Problem: Syntax

There are many dialects of regular expression, each with their own syntax and behavioral quirks.

But [RFC 9485](https://www.rfc-editor.org/rfc/rfc9485) formally defines "I-Regexp", an interoperable subset with corresponding semantics. It relies upon Unicode as exposed in `\p{…}` and `\P{…}` patterns (which we would presumably exclude because data for a code point can change over time) and is missing case-sensitivity flags (which is probably for the best, since case mappings can _also_ change over time), but is otherwise excellent and unquestionably simple. From a JavaScript perspective, the only quirks to be aware of are implicit anchoring (_e.g., pattern `a` matches "a" but not "aa"_) and `.` excluding both U+000A LINE FEED `\n` and U+000D CARRIAGE RETURN `\r` (_e.g., the pattern for matching any single code point is `(.|[\n\r])`_).
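As a minimal sketch of bridging those two quirks on a JavaScript engine, the translation below rewrites each unescaped `.` outside a character class to `[^\n\r]`, wraps the result in `^(?:…)$`, and compiles it with the `u` flag. The name `iregexpToRegExp` is hypothetical. The helper assumes its input is already a syntactically valid I-Regexp and does not validate it beyond rejecting `\p{…}` and `\P{…}`.

```javascript
// Hypothetical helper: translate a Unicode-independent I-Regexp into an
// ECMAScript RegExp. Assumes the input is already a valid I-Regexp.
function iregexpToRegExp(pattern) {
  let out = '';
  let inClass = false;
  for (let i = 0; i < pattern.length; i += 1) {
    const ch = pattern[i];
    if (ch === '\\') {
      // Copy escapes through verbatim, except for the excluded property escapes.
      const next = pattern[i + 1];
      if (next === undefined) throw new SyntaxError('dangling backslash');
      if (next === 'p' || next === 'P') {
        throw new SyntaxError('\\p{…} and \\P{…} are excluded');
      }
      out += ch + next;
      i += 1;
    } else if (inClass) {
      // Inside [...], a dot is an ordinary character.
      if (ch === ']') inClass = false;
      out += ch;
    } else if (ch === '[') {
      inClass = true;
      out += ch;
    } else if (ch === '.') {
      // I-Regexp dot: any code point except LINE FEED and CARRIAGE RETURN.
      out += '[^\\n\\r]';
    } else {
      out += ch;
    }
  }
  // Implicit anchoring: the whole input must match.
  return new RegExp(`^(?:${out})$`, 'u');
}

console.log(iregexpToRegExp('a').test('a')); // true
console.log(iregexpToRegExp('a').test('aa')); // false
console.log(iregexpToRegExp('.').test('\n')); // false
console.log(iregexpToRegExp('(.|[\\n\\r])').test('\n')); // true
```

The `u` flag makes the rewritten dot consume a whole code point rather than a single UTF-16 code unit.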

## Problem: ReDoS

Many JavaScript implementations, including V8, suffer from exponential backtracking with patterns like `(a*)*$`, `([0-9]+)*$`, `(a|aa){10}$`, or `(a|a?)+$`, in which repetition is applied to a subexpression that can either match the same input in multiple ways or can match a prefix of a subsequent repetition. Similarly, but less catastrophically, they also suffer from polynomial backtracking with patterns like `a*b?a*c` or `a.*?b.*?c`, in which repetition is applied to multiple elements in a sequence with potential overlap.

This suggests that a restriction on valid patterns can avoid such backtracking, and indeed I believe it can: [XML type declarations require content models to be "deterministic"/"unambiguous"/"one-unambiguous"](https://www.w3.org/TR/xml11/#determinism), such that matching can always proceed without lookahead because the next symbol determines at most one transition from any given state. That property seems to be recognizable from a [Glushkov construction](https://en.wikipedia.org/wiki/Glushkov%27s_construction_algorithm) variant that preserves the identity of transitions from distinct repetition quantifiers (_e.g., in `(a*)*$`, every "a" transitions to the same state but that state itself has one self-transition on "a" for the inner `*` and another for the outer `*`_).

There's a decent amount of literature on this, including linear-time classification of any given pattern as deterministic ([[GMS2012]](https://inria.hal.science/inria-00618451/document/) and [summary slides](https://www.lri.fr/~groz/documents/pods12-talk.pdf)) and extension to so-called "k-lookahead" and "block determinism" families ([[CMM2015]](https://arxiv.org/abs/1512.05475)). Things are functional but possibly awkward without the latter, e.g. `(foo|bar|baz)__[0-9]+` is not deterministic because an initial "b" might be part of "bar" or "baz" but the equivalent `(foo|ba(r|z))__[0-9]+` _is_ deterministic. Block determinism bridges that gap by collapsing fixed sequences into symbols of a generalized alphabet, basically something like `(«foo»|«bar»|«baz»)«__»[0-9]+` in which symbols can have length \> 1.

The relaxations are clearly more convenient, but they would take a little more work to define precisely. Note also that of my examples above, only base64 is really affected:
* **starts with `match:`**: `match:(.|[\n\r])*`
* **is an EVM address**: `0x[0-9a-fA-F]{40}`
* **is [non-empty] hexadecimal**: `([0-9a-fA-F]{2})+`
* **is [non-empty] base64**: `([A-Za-z0-9+/]{4})+([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?` is 4-lookahead deterministic but not 1-deterministic; `([A-Za-z0-9+/]{2}([A-Za-z0-9+/]([A-Za-z0-9+/]|=$)|==$))+` is deterministic but somewhat clumsy
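
The determinism condition discussed in this section can be sketched with a plain Glushkov position analysis: number each symbol occurrence, compute the first set and the follow sets, and reject the pattern if any of those sets contains two positions that can consume the same character. Everything here is an assumption for illustration: the AST constructors (`sym`, `cat`, `alt`, `star`, `plus`, `opt`, `lit`, `times`) are hypothetical stand-ins for a real parser, counted repetition is handled by naive expansion, and this is the classic check rather than the linear-time algorithm from the literature. It also does _not_ preserve the identity of transitions from distinct quantifiers, so it accepts `(a*)*`.

```javascript
// Hypothetical AST constructors; a real implementation would parse the pattern.
const sym = chars => ({ type: 'sym', chars: new Set(chars) });
const cat = (...parts) => parts.reduce((left, right) => ({ type: 'cat', left, right }));
const alt = (...parts) => parts.reduce((left, right) => ({ type: 'alt', left, right }));
const rep = (body, optional, loop) => ({ type: 'rep', body, optional, loop });
const star = body => rep(body, true, true);
const plus = body => rep(body, false, true);
const opt = body => rep(body, true, false);
const lit = text => cat(...[...text].map(ch => sym(ch)));
const times = (n, body) => cat(...Array(n).fill(body)); // naive {n} expansion

// Assign a position to each symbol occurrence and fill in follow sets.
function analyze(node, positions, follow) {
  if (node.type === 'sym') {
    const p = positions.length;
    positions.push(node.chars);
    follow.push(new Set());
    return { nullable: false, first: [p], last: [p] };
  }
  if (node.type === 'rep') {
    const b = analyze(node.body, positions, follow);
    // A loop lets every last position be followed by every first position.
    if (node.loop) for (const p of b.last) for (const q of b.first) follow[p].add(q);
    return { nullable: node.optional || b.nullable, first: b.first, last: b.last };
  }
  const l = analyze(node.left, positions, follow);
  const r = analyze(node.right, positions, follow);
  if (node.type === 'alt') {
    return {
      nullable: l.nullable || r.nullable,
      first: [...l.first, ...r.first],
      last: [...l.last, ...r.last],
    };
  }
  // Concatenation: the left side's last positions lead into the right side.
  for (const p of l.last) for (const q of r.first) follow[p].add(q);
  return {
    nullable: l.nullable && r.nullable,
    first: l.nullable ? [...l.first, ...r.first] : l.first,
    last: r.nullable ? [...l.last, ...r.last] : r.last,
  };
}

function isDeterministic(ast) {
  const positions = [];
  const follow = [];
  const { first } = analyze(ast, positions, follow);
  const overlaps = (a, b) => [...a].some(ch => b.has(ch));
  // A set is fine when no two of its positions can consume the same character.
  const conflictFree = set => {
    const ps = [...set];
    return ps.every((p, i) =>
      ps.slice(i + 1).every(q => !overlaps(positions[p], positions[q])));
  };
  return conflictFree(first) && follow.every(conflictFree);
}

const digit = sym('0123456789');
const hex = sym('0123456789abcdefABCDEF');
const b64 = sym('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/');

// (foo|bar|baz)__[0-9]+ versus (foo|ba(r|z))__[0-9]+
const unfactored = cat(alt(lit('foo'), lit('bar'), lit('baz')), lit('__'), plus(digit));
const factored = cat(alt(lit('foo'), cat(lit('ba'), alt(sym('r'), sym('z')))), lit('__'), plus(digit));
// ([0-9a-fA-F]{2})+ and the first base64 pattern from the list above
const hexPairs = plus(times(2, hex));
const base64 = cat(plus(times(4, b64)),
  opt(alt(cat(times(2, b64), lit('==')), cat(times(3, b64), lit('=')))));

console.log(isDeterministic(unfactored)); // false
console.log(isDeterministic(factored)); // true
console.log(isDeterministic(hexPairs)); // true
console.log(isDeterministic(base64)); // false
```

Classifying `(a*)*` correctly would need the transition-preserving variant, and accepting `unfactored` would need the block-determinism relaxation.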

## Other problems

Have I missed anything?

## Proposal

`M.dregex(pattern, limits)` (or `M.safeRegex(…)` or 🖌️…) emits a "match:$name" instance, ~~and limits can constrain the upper bound for range quantifiers (_defaulting to 9 such that `.{10}` and `.{1,10}` are invalid_) and also repetition depth (_defaulting to 2 such that the above base64 pattern is valid but `(([0-9a-f]{2})+.)+` is not_)~~. We require the pattern to be an RFC 9485 I-Regexp, Unicode-independent (i.e., free of `\p{…}` and `\P{…}` atoms), and block deterministic (_i.e., deterministic if treating consecutive static symbols atomically_), but do not disambiguate with lookahead (_e.g., the first base64 pattern above is invalid_).

What would it take to commit to either this approach or a similar alternative? Formal proof may be hard to come by, but is there some lower threshold that we might deem sufficient?
