Fix backspace lexeme escaping #483

414owen · 2025-01-27T22:43:02Z

In posix lex, '\b' represents the backspace character. In the rust 'regex' crate, it represents a word boundary assertion.

This patch adds a test that all posix escapes are interpreted correctly, and a fix for the backspace escape incongruity.

ratmice · 2025-01-28T00:36:38Z

I guess I have two questions,

Whether people can currently use/rely on the \b as word boundary in regex, (i.e. does the \b as word boundary escape sequence code operate correctly such that people could be using it)?

Similarly, when we make \b as backspace, currently people who want backspace can match the hex code manually.
But if people want word boundary, there is no hex code for that, (i.e. regardless of whether it currently works, should it be something configurable that we should allow to be used).

I'm actually a bit surprised that this isn't something configurable by RegexBuilder but perhaps it would be nice if we could just add a flag to CTLexerBuilder such as fn escape_b_as_backspace(...) and then test that flag in the character replacement code you added? At least I believe that should work correctly in the fall through case once added to the RE_LEX_ESC_LITERAL.

I'm not certain if we actually have a CTLexerBuilder at that point so we can access it's fields, or whether making that possible is going to be too painful a task to be worthwhile though.

ltratt · 2025-01-28T08:08:27Z

This PR is getting at a fairly deep point: because I was lazy, lrlex uses the regex library and hopes for the best in terms of "it does what you expect" for people who aren't aware of that. That means that there are incompatibilities, and seemingly random extensions, from POSiX Lex. I expected more complaints about this by now, but I suppose most lex files only use a small subset of regular expressions (certainly that's the case for me).

I expected that ultimately we would implement a full Lex-compatible matcher. But I'm unlikely to get to any year/decade soon, and we now have a fairly substantial userbase who, whether they intended to or not, are probably now relying on regex compatibility. I would be surprised if someone isn't using \b for "word boundary" already. That means that changing this behaviour will cause surprises for existing users. My laziness has come back to bite us!

Arguably what we should do is allow lrlex to support different flavours of regexes. lrpar does this tolerably well with https://docs.rs/lrpar/0.13.8/lrpar/struct.CTParserBuilder.html#method.yacckind. lrlex should probably grow lexkind and that could take PosixLex or RustRegex (etc) as options, or perhaps more honestly PosixIshLexIsh if we take the "restrict rustregex approach"!

[Note: lrlex has a lexerkind method but that's really quite different, currently meaningless (the accompanying enum has one variant, and arguably not a very good name. For the next major release we should probably remove this method until we work out what it, or some variant, should really be.]

So summary: we can't merge this PR as-is for backwards compatibility reasons, but I would love to merge a variant of it! It would need to do something like:

Add a lexkind method to CTLexerBuilder.
Add an enum similar to https://docs.rs/cfgrammar/0.13.8/cfgrammar/yacc/enum.YaccKind.html. The default we use should, for backwards compatibility, be LexKind::RustRegex but we should allow LexKind::PosixLex (or similar).
Use information from that enum when parsing the lex file to do different things (similarly to https://docs.rs/cfgrammar/0.13.8/src/cfgrammar/yacc/parser.rs.html#625).
[Optional but nice] Allow users of nimbleparse to specify the lex variant they're using.

I know that being asked to do more work is never fun, so I apologise for putting up barriers -- I would like to think, though, that this isn't too much work, and it would make & keep lots of people happy!

ratmice · 2025-01-28T08:55:56Z

4. [Optional but nice] Allow users of nimbleparse to specify the lex variant they're using.

I hadn't really thought of nimbleparse in my earlier reply, it seems like it is also already an issue with the lrlex::lexer::RegexOptions settings I had mentioned.

nimbleparse will always use the default dot_matches_newline behavior (true), there is the fn LrNonStreamingLexerDef::new_with_options which takes a RegexOptions param,
but nimbleparse will always use the default parameters.

I had originally added these for similar reasons as are given by this patch because the default dot_matches_newline behavior doesn't match posix, but (currently) there is no code to alter the setting from nimbleparse, though there is some in the cttests code as can be seen in the lex_flags field and it's implementation though.

grmtools/lrpar/cttests/src/regex_opt.test

Line 3 in 6967d50

lex_flags: ['!dot_matches_new_line', 'octal']

grmtools/lrpar/cttests/src/cgen_helper.rs

Lines 113 to 139 in 6967d50

    
           if let Some(flag) = check_flag(lex_flags, "allow_missing_terms_in_lexer") { 
        
               cl_build = cl_build.allow_missing_terms_in_lexer(flag) 
        
           } 
        
           if let Some(flag) = check_flag(lex_flags, "allow_missing_tokens_in_parser") { 
        
               cl_build = cl_build.allow_missing_tokens_in_parser(flag) 
        
           } 
        
           if let Some(flag) = check_flag(lex_flags, "dot_matches_new_line") { 
        
               cl_build = cl_build.dot_matches_new_line(flag) 
        
           } 
        
           if let Some(flag) = check_flag(lex_flags, "case_insensitive") { 
        
               cl_build = cl_build.case_insensitive(flag) 
        
           } 
        
           if let Some(flag) = check_flag(lex_flags, "multi_line") { 
        
               cl_build = cl_build.multi_line(flag) 
        
           } 
        
           if let Some(flag) = check_flag(lex_flags, "swap_greed") { 
        
               cl_build = cl_build.swap_greed(flag) 
        
           } 
        
           if let Some(flag) = check_flag(lex_flags, "ignore_whitespace") { 
        
               cl_build = cl_build.ignore_whitespace(flag) 
        
           } 
        
           if let Some(flag) = check_flag(lex_flags, "unicode") { 
        
               cl_build = cl_build.unicode(flag) 
        
           } 
        
           if let Some(flag) = check_flag(lex_flags, "octal") { 
        
               cl_build = cl_build.octal(flag) 
        
           }

So perhaps those code snippets should also make their way towards nimbleparse?
Anyhow apologies if we are giving mixed signals towards how to approach solving the problem.

ltratt · 2025-01-28T09:10:56Z

@ratmice On reflection, you're right: I've muddled two things up. Mea culpa!

Perhaps the right thing is to make this PR use the existing RegexOptions facility and then think, in another PR, about how we could do the "bigger" LexKind change I'm suggesting. Another way of looking at this is that we could bundle up RegexOptions into LexKind::RustRegex rather than making it a first-class citizen of lrlex.

ltratt · 2025-01-30T10:56:50Z

@414owen Did our discussion make any sense? I think we're proposing a route with a relatively small change to this PR. [@ratmice can correct me if I'm wrong!]

414owen · 2025-02-12T22:28:21Z

@ltratt Sure thing. Introducing this in a non-breaking way makes sense, even though using word break assertions in a lexer is of dubious merit.

I've added the option, defaulted it to off, and added a test case that it's respected. Let me know if I can help any further :)

lrlex/src/lib/parser.rs

ltratt · 2025-02-13T08:16:42Z

@414owen This looks great -- thanks! If you can address @ratmice's lifetime suggestion + squash everything down to one commit, we'll be ready to merge!

In posix lex, '\b' represents the backspace character. In the rust 'regex' crate, it represents a word boundary assertion. This patch adds an option to maintain posix-lex semantics, and tests that all escapes are interpreted correctly, under both sets of semantics.

ltratt · 2025-02-13T11:45:47Z

Thanks!

ratmice assigned ratmice and ltratt Jan 28, 2025

ltratt mentioned this pull request Feb 5, 2025

Lexing unicode characters #486

Closed

ratmice reviewed Feb 12, 2025

View reviewed changes

lrlex/src/lib/parser.rs Outdated Show resolved Hide resolved

414owen force-pushed the os/fix-backspace-lexing branch from c3b6c27 to b2b65ec Compare February 13, 2025 11:36

ltratt added this pull request to the merge queue Feb 13, 2025

Merged via the queue into softdevteam:master with commit 4e939af Feb 13, 2025
2 checks passed

ratmice mentioned this pull request Feb 21, 2025

Add LexKind #490

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix backspace lexeme escaping #483

Fix backspace lexeme escaping #483

Uh oh!

414owen commented Jan 27, 2025

Uh oh!

ratmice commented Jan 28, 2025

Uh oh!

ltratt commented Jan 28, 2025

Uh oh!

ratmice commented Jan 28, 2025

Uh oh!

ltratt commented Jan 28, 2025

Uh oh!

ltratt commented Jan 30, 2025

Uh oh!

414owen commented Feb 12, 2025

Uh oh!

Uh oh!

ltratt commented Feb 13, 2025

Uh oh!

ltratt commented Feb 13, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix backspace lexeme escaping #483

Fix backspace lexeme escaping #483

Uh oh!

Conversation

414owen commented Jan 27, 2025

Uh oh!

ratmice commented Jan 28, 2025

Uh oh!

ltratt commented Jan 28, 2025

Uh oh!

ratmice commented Jan 28, 2025

Uh oh!

ltratt commented Jan 28, 2025

Uh oh!

ltratt commented Jan 30, 2025

Uh oh!

414owen commented Feb 12, 2025

Uh oh!

Uh oh!

ltratt commented Feb 13, 2025

Uh oh!

ltratt commented Feb 13, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants