Conversation

@ratmice (Collaborator) commented May 21, 2025

Here is an initial attempt at missing_from_lexer and missing_from_parser for CTLexerBuilder, including source snippets.
I'm marking this as a draft because this patch is pretty awful.

For instance, it calls set_rule_ids and set_rule_ids_spanned when the lrpar_config callback is used, but just set_rule_ids otherwise. One thing to note is that the cttests tests don't use the lrpar_config callback, but instead build CTParserBuilder and CTLexerBuilder separately via the older method, tying the Gordian knot via set_rule_ids.
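
For context, the two wiring styles look roughly like this; this is a rough sketch from memory of the grmtools book, so treat the exact method names and signatures as assumptions rather than verbatim API (a real build.rs would use one style or the other):

    use cfgrammar::yacc::YaccKind;
    use lrlex::CTLexerBuilder;
    use lrpar::CTParserBuilder;

    fn main() {
        // Newer style: the lrpar_config callback hands the lexer builder the
        // parser builder, which is where the span info can come from.
        CTLexerBuilder::new()
            .lrpar_config(|ctp| {
                ctp.yacckind(YaccKind::Grmtools)
                    .grammar_in_src_dir("grammar.y")
                    .unwrap()
            })
            .lexer_in_src_dir("lexer.l")
            .unwrap()
            .build()
            .unwrap();

        // Older style: build parser and lexer separately, tying the Gordian
        // knot by handing the parser's token map to the lexer builder.
        let ctp = CTParserBuilder::new()
            .yacckind(YaccKind::Grmtools)
            .grammar_in_src_dir("grammar.y")
            .unwrap()
            .build()
            .unwrap();
        CTLexerBuilder::new()
            .rule_ids_map(ctp.token_map())
            .lexer_in_src_dir("lexer.l")
            .unwrap()
            .build()
            .unwrap();
    }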

It is only when lrpar_config is used that we have the span info to produce these nice errors.
The other thing to note is that values like token_map and rule_ids_map of CTParser and CTLexerBuilder are now somewhat superfluous with the inclusion of YaccGrammar in the return value.

So while this functions, it really seems like it needs some design work to reduce all these separate but related code paths and their slightly different duplicates of values.

As I started trying out options, it all seemed to snowball, so it seemed best to stop here before going too far down any one path.

    StorageT: Eq + Hash,
{
    regenerated: bool,
    rule_ids: HashMap<String, StorageT>,
@ratmice (Collaborator, Author) May 21, 2025

So I guess I would suggest that a good first step is to focus on what to do about the duplication of rule_ids here, since it actually impacts semver and also contributes to the other code duplication.

The thing to note is that this patch adds a yacc_grammar field to this struct, and the rule_ids field is just:

self.rule_ids = self
    .yacc_grammar
    .tokens_map()
    .iter()
    .map(|(&n, &i)| (n.to_owned(), i.as_storaget()))
    .collect::<HashMap<_, _>>();

So we are essentially storing two copies of the rule_ids, one with the &str refs converted via to_owned(). It is worth noting that in lrlex we make yet another owned_map clone of this value in CTLexerBuilder::build(), as part of the existing code path:

        let (missing_from_lexer, missing_from_parser) = match self.rule_ids_map {
            Some(ref rim) => {
                // Convert from HashMap<String, _> to HashMap<&str, _>
                let owned_map = rim
                    .iter()
                    .map(|(x, y)| (&**x, *y))
                    .collect::<HashMap<_, _>>();

The main issue I see is the semver changes to CTParser::token_map: it isn't actually possible to reimplement the token_map function using self.yacc_grammar, because of the ref-to-owned vs ref-to-ref differences.

    /// Returns a [HashMap] from lexeme string types to numeric types (e.g. `INT: 2`), suitable for
    /// handing to a lexer to coordinate the IDs of lexer and parser.
    pub fn token_map(&self) -> &HashMap<String, StorageT> {
        &self.rule_ids
    }

Member

Dumb idea (off the top of my head): does this suggest that rule_ids should be wrapped in something like an Rc?

@ratmice (Collaborator, Author)

The alternative to changing this is just keeping the duplication of the rule_ids and yacc_grammar fields, and rewriting the set_rule_ids_spanned code path from nimbleparse to use the rule_ids field, making this new code path conform to the existing CTLexerBuilder code paths instead.

@ratmice (Collaborator, Author)

I think if we were going to use Rc while avoiding duplication, that Rc would still need to be within YaccGrammar; the particular thing we want out of the YaccGrammar is the ability to call token_span.

Then, because we need that, we also already have YaccGrammar::token_map, with token_map itself based upon iter_tidxs and also allocating a HashMap.

I suppose that would leave us with:

  1. Allocating a HashMap in YaccGrammar::token_map
  2. A rule_ids: Rc<HashMap<...>> shared by CTParserBuilder/CTLexerBuilder

So I guess this is also a variation of keeping both the rule_ids and yacc_grammar fields.
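
A minimal sketch of what that sharing might look like, with hypothetical simplified stand-ins for the two builders (u32 standing in for StorageT):

    use std::collections::HashMap;
    use std::rc::Rc;

    // Hypothetical, simplified stand-ins for CTParserBuilder/CTLexerBuilder.
    struct ParserBuilderSketch {
        rule_ids: Rc<HashMap<String, u32>>,
    }
    struct LexerBuilderSketch {
        rule_ids: Rc<HashMap<String, u32>>,
    }

    fn main() {
        // The single HashMap allocation from point 1...
        let ids: Rc<HashMap<String, u32>> =
            Rc::new(HashMap::from([("INT".to_owned(), 2u32)]));
        // ...shared by both builders: Rc::clone only bumps a refcount.
        let _parser = ParserBuilderSketch { rule_ids: Rc::clone(&ids) };
        let _lexer = LexerBuilderSketch { rule_ids: Rc::clone(&ids) };
        assert_eq!(Rc::strong_count(&ids), 3); // one map, three handles
    }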

@ratmice (Collaborator, Author) May 21, 2025

So, I guess another thing to take into account: we have these two functions:

https://docs.rs/lrlex/latest/lrlex/trait.LexerDef.html#tymethod.set_rule_ids
https://docs.rs/lrlex/latest/lrlex/trait.LexerDef.html#tymethod.set_rule_ids_spanned

These are the ultimate destination for both of these variations. So another question I have: if we had versions of those functions which accepted an IntoIterator<Item = (&str, LexerTypesT::StorageT)> as a parameter instead of the &HashMap<&str, LexerTypesT::StorageT>, maybe it would be possible to get rid of all these allocations entirely?

Edit: This may be a dumb idea; it looks like both set_rule_ids variations use the hashmaps to actually do lookups, so the simplest implementation of these functions taking an IntoIterator would actually use HashMap::from_iter().
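
A quick sketch of the shape I mean (illustrative names, not lrlex's actual API):

    use std::collections::HashMap;

    // Illustrative stand-in for an iterator-taking set_rule_ids variant.
    // Per the edit above: since the real implementations do lookups anyway,
    // the simplest body just rebuilds a map with HashMap::from_iter.
    fn set_rule_ids_sketch<'a, StorageT>(
        rule_ids: impl IntoIterator<Item = (&'a str, StorageT)>,
    ) {
        let map: HashMap<&str, StorageT> = HashMap::from_iter(rule_ids);
        // ... perform the per-rule id lookups against `map` here ...
        let _ = map;
    }

    fn main() {
        // Callers can pass any iterator of pairs without pre-building a map.
        set_rule_ids_sketch([("INT", 2u8), ("ID", 3u8)]);
    }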

@ratmice (Collaborator, Author)

In f4a42cb I tried messing with the iterator API...

Returning a Box<HashMap> from token_map in place of the rule_ids field, I'm not sure the iterator-based API really pulls its weight: we still end up with clones for backwards compatibility of return values, and for the CTLexerBuilder::rule_ids_map function.

That said, it seems like a bit of an improvement over the original PR implementation, but maybe it isn't worth trying to remove the duplicated values, which we can't seem to benefit from fully without more drastic semver breakage.

Member

In general, where we break semver, I'd hope to keep it to "nice" parts of the API. This one does feel like it will intrude on pretty much every compile-time user? I suspect, though, there is a nice API here, and it's just eluding us right now...

@ratmice (Collaborator, Author) May 21, 2025

I think the actual breakage is pretty small, because the signature of CTLexerBuilder::rule_ids_map takes std::borrow::Borrow<HashMap<String, LexerTypesT::StorageT>>.

The only change I had to make to any compile-time stuff was the manual lexer, because ct_token_map uses &HashMap in its signature rather than the Borrow<...>. Perhaps we can switch that to use Borrow<...> to further reduce breakage.
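
A minimal sketch of that Borrow<...> widening (illustrative function, u32 standing in for StorageT, not ct_token_map's real body):

    use std::borrow::Borrow;
    use std::collections::HashMap;

    // Widening a &HashMap parameter to impl Borrow keeps existing &HashMap
    // call sites compiling (Borrow<T> is implemented for &T) while also
    // accepting owned maps.
    fn ct_token_map_sketch(map: impl Borrow<HashMap<String, u32>>) {
        for (name, id) in map.borrow() {
            println!("{name}: {id}");
        }
    }

    fn main() {
        let owned: HashMap<String, u32> =
            HashMap::from([("INT".to_owned(), 2u32)]);
        ct_token_map_sketch(&owned); // old-style call site still works
        ct_token_map_sketch(owned); // owned values now work too
    }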

Also, I hadn't noticed that all of CTLexer's fields and methods are private, so changing the types of those fields isn't actually a semver break like I had thought.

@ratmice (Collaborator, Author) May 21, 2025

So I did 8b973f2, but I'm starting to think that in this PR I should retreat from removing duplicate fields and reducing allocations.

I think if we allow the duplicate rule_ids field, we can maybe get something working without any semver breakage or code duplication, very similar to how this patch is now, but without the token_map or LexerDef::set_rule_ids_spanned_iter changes.

Then we can potentially revisit all that in a separate patch?

Member

[Oops, s/nice/niche/. That changed the meaning of that sentence quite profoundly. Sorry!]

@ltratt ltratt self-assigned this May 21, 2025
ratmice added 2 commits May 21, 2025 10:39
We can avoid using `HashMap` entirely by avoiding calling `token_map`.
This allows the `lrpar_config` set_rule_ids path to rely entirely
upon an iterator.

By deprecating the API we can also avoid needing to allocate them for the return value.

Changing the `ct_token_map` parameter to `Borrow<HashMap>` increases compatibility with previous versions.
@ratmice (Collaborator, Author) commented May 22, 2025

I'll probably close this one for now in favor of #571 (take 2). I think I'll at least make a separate PR for the Borrow<_> change to ct_token_map, but I'm not really convinced the rest is worth the API changes and the complexity of the additional trait stuff, so I may skip the rest unless asked, or unless it becomes desirable for some reason.

@ratmice ratmice closed this May 22, 2025
