Actually document the regex dialect and semantics #594

@masklinn

While many regex dialects and implementations use similar symbols, they don't necessarily ascribe the same semantics to them. For example, \d, \w, \s and their negations (\D, \W, \S) may be ASCII-only, or partially or fully Unicode-aware; the latter would be a lot more expensive than the former, possibly unnecessarily.
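To make the ambiguity concrete, here is a minimal sketch using Python's standard `re` module, where `\d` on `str` patterns is Unicode-aware by default and `re.ASCII` restricts it to `[0-9]`. The helper name `matches` and the sample character are illustrative only; other engines may default the other way.

```python
import re

# U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit outside the ASCII range.
ARABIC_INDIC_THREE = "\u0663"

# Same pattern text, two different semantics depending on the flag.
unicode_digit = re.compile(r"\d")          # Unicode-aware (Python 3 default for str)
ascii_digit = re.compile(r"\d", re.ASCII)  # restricted to [0-9]


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None


# Both agree on ASCII digits...
print(matches(unicode_digit, "3"), matches(ascii_digit, "3"))
# ...but disagree on non-ASCII digits.
print(matches(unicode_digit, ARABIC_INDIC_THREE),
      matches(ascii_digit, ARABIC_INDIC_THREE))
```

Without a documented dialect, an implementation can't tell which of these two behaviours the regexes were written against.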

Furthermore, from a performance and memory standpoint, 6e65445 modified the regexes to limit ReDoS risk. However, it did so inconsistently, so it's not entirely clear whether, and under which rules, non-backtracking engines that are not sensitive to catastrophic backtracking (e.g. re2, regex, regexp, ...) may convert the regexes back to unbounded repetition, since bounded repetitions are also used in semantically relevant contexts. Having a well-defined and consistent substitute for * and + (and maybe some rules ensuring new ones don't get added improperly) would allow engines to track and substitute them on the fly, which can positively impact their memory use and runtime as they no longer need to track the number of iterations.
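As a rough sketch of what such a substitution could look like if a single sentinel bound were reserved for "this used to be * or +": the value `100` and the function name `unbound_repetitions` below are assumptions made for illustration, not something the project currently defines. Python is only used here to rewrite the pattern text; the result would be handed to a non-backtracking engine.

```python
import re

# Hypothetical sentinel bound, assumed for illustration only.
SENTINEL = 100


def unbound_repetitions(pattern: str, sentinel: int = SENTINEL) -> str:
    """Rewrite {0,sentinel} to * and {1,sentinel} to +, leaving other bounds alone.

    Escape sequences are consumed first so that a literal brace such as
    \\{1,100} is not mistaken for a quantifier. Limitation of this sketch:
    braces inside character classes are not handled specially.
    """
    token = re.compile(r"\\.|\{([01])," + str(sentinel) + r"\}", re.DOTALL)

    def swap(m: re.Match) -> str:
        if m.group(1) is None:
            # An escape sequence: keep it unchanged.
            return m.group(0)
        return "*" if m.group(1) == "0" else "+"

    return token.sub(swap, pattern)


print(unbound_repetitions(r"Foo/(\d{1,100})\.(\d{0,100})"))
print(unbound_repetitions(r"Bar/(\d{1,3})"))  # semantically relevant bound, untouched
```

This only works if the sentinel is never used for a bound that actually matters, which is why the substitute needs to be documented and enforced.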
