Skip to content

Conversation

@414owen
Copy link
Contributor

@414owen 414owen commented Jan 27, 2025

In posix lex, '\b' represents the backspace character. In the rust 'regex' crate, it represents a word boundary assertion.

This patch adds a test that all posix escapes are interpreted correctly, and a fix for the backspace escape incongruity.

@ratmice
Copy link
Collaborator

ratmice commented Jan 28, 2025

I guess I have two questions,

Whether people can currently use/rely on the \b as word boundary in regex, (i.e. does the \b as word boundary escape sequence code operate correctly such that people could be using it)?

Similarly, when we make \b as backspace, currently people who want backspace can match the hex code manually.
But if people want word boundary, there is no hex code for that, (i.e. regardless of whether it currently works, should it be something configurable that we should allow to be used).

I'm actually a bit surprised that this isn't something configurable by RegexBuilder but perhaps it would be nice if we could just add a flag to CTLexerBuilder such as fn escape_b_as_backspace(...) and then test that flag in the character replacement code you added? At least I believe that should work correctly in the fall through case once added to the RE_LEX_ESC_LITERAL.

I'm not certain if we actually have a CTLexerBuilder at that point so we can access it's fields, or whether making that possible is going to be too painful a task to be worthwhile though.

@ltratt
Copy link
Member

ltratt commented Jan 28, 2025

This PR is getting at a fairly deep point: because I was lazy, lrlex uses the regex library and hopes for the best in terms of "it does what you expect" for people who aren't aware of that. That means that there are incompatibilities, and seemingly random extensions, from POSiX Lex. I expected more complaints about this by now, but I suppose most lex files only use a small subset of regular expressions (certainly that's the case for me).

I expected that ultimately we would implement a full Lex-compatible matcher. But I'm unlikely to get to any year/decade soon, and we now have a fairly substantial userbase who, whether they intended to or not, are probably now relying on regex compatibility. I would be surprised if someone isn't using \b for "word boundary" already. That means that changing this behaviour will cause surprises for existing users. My laziness has come back to bite us!

Arguably what we should do is allow lrlex to support different flavours of regexes. lrpar does this tolerably well with https://docs.rs/lrpar/0.13.8/lrpar/struct.CTParserBuilder.html#method.yacckind. lrlex should probably grow lexkind and that could take PosixLex or RustRegex (etc) as options, or perhaps more honestly PosixIshLexIsh if we take the "restrict rustregex approach"!

[Note: lrlex has a lexerkind method but that's really quite different, currently meaningless (the accompanying enum has one variant, and arguably not a very good name. For the next major release we should probably remove this method until we work out what it, or some variant, should really be.]

So summary: we can't merge this PR as-is for backwards compatibility reasons, but I would love to merge a variant of it! It would need to do something like:

  1. Add a lexkind method to CTLexerBuilder.
  2. Add an enum similar to https://docs.rs/cfgrammar/0.13.8/cfgrammar/yacc/enum.YaccKind.html. The default we use should, for backwards compatibility, be LexKind::RustRegex but we should allow LexKind::PosixLex (or similar).
  3. Use information from that enum when parsing the lex file to do different things (similarly to https://docs.rs/cfgrammar/0.13.8/src/cfgrammar/yacc/parser.rs.html#625).
  4. [Optional but nice] Allow users of nimbleparse to specify the lex variant they're using.

I know that being asked to do more work is never fun, so I apologise for putting up barriers -- I would like to think, though, that this isn't too much work, and it would make & keep lots of people happy!

@ratmice
Copy link
Collaborator

ratmice commented Jan 28, 2025

4. [Optional but nice] Allow users of nimbleparse to specify the lex variant they're using.

I hadn't really thought of nimbleparse in my earlier reply, it seems like it is also already an issue with the lrlex::lexer::RegexOptions settings I had mentioned.

nimbleparse will always use the default dot_matches_newline behavior (true), there is the fn LrNonStreamingLexerDef::new_with_options which takes a RegexOptions param,
but nimbleparse will always use the default parameters.

I had originally added these for similar reasons as are given by this patch because the default dot_matches_newline behavior doesn't match posix, but (currently) there is no code to alter the setting from nimbleparse, though there is some in the cttests code as can be seen in the lex_flags field and it's implementation though.

  • lex_flags: ['!dot_matches_new_line', 'octal']
  • if let Some(flag) = check_flag(lex_flags, "allow_missing_terms_in_lexer") {
    cl_build = cl_build.allow_missing_terms_in_lexer(flag)
    }
    if let Some(flag) = check_flag(lex_flags, "allow_missing_tokens_in_parser") {
    cl_build = cl_build.allow_missing_tokens_in_parser(flag)
    }
    if let Some(flag) = check_flag(lex_flags, "dot_matches_new_line") {
    cl_build = cl_build.dot_matches_new_line(flag)
    }
    if let Some(flag) = check_flag(lex_flags, "case_insensitive") {
    cl_build = cl_build.case_insensitive(flag)
    }
    if let Some(flag) = check_flag(lex_flags, "multi_line") {
    cl_build = cl_build.multi_line(flag)
    }
    if let Some(flag) = check_flag(lex_flags, "swap_greed") {
    cl_build = cl_build.swap_greed(flag)
    }
    if let Some(flag) = check_flag(lex_flags, "ignore_whitespace") {
    cl_build = cl_build.ignore_whitespace(flag)
    }
    if let Some(flag) = check_flag(lex_flags, "unicode") {
    cl_build = cl_build.unicode(flag)
    }
    if let Some(flag) = check_flag(lex_flags, "octal") {
    cl_build = cl_build.octal(flag)
    }

So perhaps those code snippets should also make their way towards nimbleparse?
Anyhow apologies if we are giving mixed signals towards how to approach solving the problem.

@ltratt
Copy link
Member

ltratt commented Jan 28, 2025

@ratmice On reflection, you're right: I've muddled two things up. Mea culpa!

Perhaps the right thing is to make this PR use the existing RegexOptions facility and then think, in another PR, about how we could do the "bigger" LexKind change I'm suggesting. Another way of looking at this is that we could bundle up RegexOptions into LexKind::RustRegex rather than making it a first-class citizen of lrlex.

@ltratt
Copy link
Member

ltratt commented Jan 30, 2025

@414owen Did our discussion make any sense? I think we're proposing a route with a relatively small change to this PR. [@ratmice can correct me if I'm wrong!]

@ltratt ltratt mentioned this pull request Feb 5, 2025
@414owen
Copy link
Contributor Author

414owen commented Feb 12, 2025

@ltratt Sure thing. Introducing this in a non-breaking way makes sense, even though using word break assertions in a lexer is of dubious merit.

I've added the option, defaulted it to off, and added a test case that it's respected. Let me know if I can help any further :)

@ltratt
Copy link
Member

ltratt commented Feb 13, 2025

@414owen This looks great -- thanks! If you can address @ratmice's lifetime suggestion + squash everything down to one commit, we'll be ready to merge!

In posix lex, '\b' represents the backspace character.
In the rust 'regex' crate, it represents a word boundary assertion.

This patch adds an option to maintain posix-lex semantics, and tests that
all escapes are interpreted correctly, under both sets of semantics.
@414owen 414owen force-pushed the os/fix-backspace-lexing branch from c3b6c27 to b2b65ec Compare February 13, 2025 11:36
@ltratt
Copy link
Member

ltratt commented Feb 13, 2025

Thanks!

@ltratt ltratt added this pull request to the merge queue Feb 13, 2025
Merged via the queue into softdevteam:master with commit 4e939af Feb 13, 2025
2 checks passed
@ratmice ratmice mentioned this pull request Feb 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants