Conversation

@ratmice (Collaborator) commented May 21, 2025

Here is an initial attempt at missing_from_lexer and missing_from_parser for CTLexerBuilder, including source snippets.
I'm marking this as a draft because this patch is pretty awful.

For instance, it calls set_rule_ids and set_rule_ids_spanned when the lrpar_config callback is used, but just set_rule_ids otherwise. One thing to note is that the cttests tests don't use the lrpar_config callback, but instead build CTParserBuilder and CTLexerBuilder separately via the older method, tying the Gordian knot via set_rule_ids.
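
For context, the two wiring styles look roughly like this; this is a rough sketch from memory of the grmtools book, so treat the exact method names and signatures as assumptions rather than verbatim API (a real build.rs would use one style or the other):

    use cfgrammar::yacc::YaccKind;
    use lrlex::CTLexerBuilder;
    use lrpar::CTParserBuilder;

    fn main() {
        // Newer style: the lrpar_config callback hands the lexer builder the
        // parser builder, which is where the span info can come from.
        CTLexerBuilder::new()
            .lrpar_config(|ctp| {
                ctp.yacckind(YaccKind::Grmtools)
                    .grammar_in_src_dir("grammar.y")
                    .unwrap()
            })
            .lexer_in_src_dir("lexer.l")
            .unwrap()
            .build()
            .unwrap();

        // Older style: build parser and lexer separately, tying the Gordian
        // knot by handing the parser's token map to the lexer builder.
        let ctp = CTParserBuilder::new()
            .yacckind(YaccKind::Grmtools)
            .grammar_in_src_dir("grammar.y")
            .unwrap()
            .build()
            .unwrap();
        CTLexerBuilder::new()
            .rule_ids_map(ctp.token_map())
            .lexer_in_src_dir("lexer.l")
            .unwrap()
            .build()
            .unwrap();
    }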

It is only when lrpar_config is used that we have the span info to produce these nice errors.
The other thing to note is that values like token_map and rule_ids_map of CTParser and CTLexerBuilder are now somewhat superfluous with the inclusion of YaccGrammar in the return value.

So while this functions, it really seems like it needs some design work to reduce all these separate but related code paths and their slightly different duplicates of values.

As I started trying out options, it all seemed to snowball, so it seemed best to stop here before going too far down any one path.

    StorageT: Eq + Hash,
{
    regenerated: bool,
    rule_ids: HashMap<String, StorageT>,
@ratmice (Collaborator, Author) May 21, 2025

So I guess I would suggest that a good first step is to focus on what to do about the duplication of rule_ids here, since it actually impacts semver and also contributes to the other code duplication.

The thing to note is that this patch adds a yacc_grammar field to this struct, and the rule_ids field is just:

self.rule_ids = self
    .yacc_grammar
    .tokens_map()
    .iter()
    .map(|(&n, &i)| (n.to_owned(), i.as_storaget()))
    .collect::<HashMap<_, _>>();

So we are essentially storing two copies of the rule_ids, one with the &str refs converted via to_owned(). It is worth noting that in lrlex we make yet another owned_map clone of this value in CTLexerBuilder::build(), as part of the existing code path:

        let (missing_from_lexer, missing_from_parser) = match self.rule_ids_map {
            Some(ref rim) => {
                // Convert from HashMap<String, _> to HashMap<&str, _>
                let owned_map = rim
                    .iter()
                    .map(|(x, y)| (&**x, *y))
                    .collect::<HashMap<_, _>>();

The main issue I see is the semver changes to CTParser::token_map: it isn't actually possible to reimplement the token_map function using self.yacc_grammar, because of the ref-to-owned vs ref-to-ref differences.

    /// Returns a [HashMap] from lexeme string types to numeric types (e.g. `INT: 2`), suitable for
    /// handing to a lexer to coordinate the IDs of lexer and parser.
    pub fn token_map(&self) -> &HashMap<String, StorageT> {
        &self.rule_ids
    }

Member

Dumb idea (off the top of my head): does this suggest that rule_ids should be wrapped in something like an Rc?

@ratmice (Collaborator, Author)

The alternative to changing this is just keeping the duplication of the rule_ids and yacc_grammar fields, and rewriting the set_rule_ids_spanned code path from nimbleparse to use the rule_ids field, making this new code path conform to the existing CTLexerBuilder code paths instead.

@ratmice (Collaborator, Author)

I think if we were going to use Rc while avoiding duplication, that Rc would still need to be within YaccGrammar; the particular thing we want out of the YaccGrammar is the ability to call token_span.

Then, because we need that, we also already have YaccGrammar::token_map, with token_map itself based upon iter_tidxs and also allocating a HashMap.

I suppose that would leave us with:

  1. Allocating a HashMap in YaccGrammar::token_map
  2. A rule_ids: Rc<HashMap<...>> shared by CTParserBuilder/CTLexerBuilder

So I guess this is also a variation of keeping both the rule_ids and yacc_grammar fields.
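
A minimal sketch of what that sharing might look like, with hypothetical simplified stand-ins for the two builders (u32 standing in for StorageT):

    use std::collections::HashMap;
    use std::rc::Rc;

    // Hypothetical, simplified stand-ins for CTParserBuilder/CTLexerBuilder.
    struct ParserBuilderSketch {
        rule_ids: Rc<HashMap<String, u32>>,
    }
    struct LexerBuilderSketch {
        rule_ids: Rc<HashMap<String, u32>>,
    }

    fn main() {
        // The single HashMap allocation from point 1...
        let ids: Rc<HashMap<String, u32>> =
            Rc::new(HashMap::from([("INT".to_owned(), 2u32)]));
        // ...shared by both builders: Rc::clone only bumps a refcount.
        let _parser = ParserBuilderSketch { rule_ids: Rc::clone(&ids) };
        let _lexer = LexerBuilderSketch { rule_ids: Rc::clone(&ids) };
        assert_eq!(Rc::strong_count(&ids), 3); // one map, three handles
    }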

@ratmice (Collaborator, Author) May 21, 2025

So, I guess another thing to take into account: we have these two functions:

https://docs.rs/lrlex/latest/lrlex/trait.LexerDef.html#tymethod.set_rule_ids
https://docs.rs/lrlex/latest/lrlex/trait.LexerDef.html#tymethod.set_rule_ids_spanned

These are the ultimate destination for both of these variations. So another question I have: if we had versions of those functions which accepted an IntoIterator<Item = (&str, LexerTypesT::StorageT)> as a parameter instead of the &HashMap<&str, LexerTypesT::StorageT>, maybe it would be possible to get rid of all these allocations entirely?

Edit: This may be a dumb idea; it looks like both set_rule_ids variations use the hashmaps to actually do lookups, so the simplest implementation of these functions taking an IntoIterator would actually use HashMap::from_iter().
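
A quick sketch of the shape I mean (illustrative names, not lrlex's actual API):

    use std::collections::HashMap;

    // Illustrative stand-in for an iterator-taking set_rule_ids variant.
    // Per the edit above: since the real implementations do lookups anyway,
    // the simplest body just rebuilds a map with HashMap::from_iter.
    fn set_rule_ids_sketch<'a, StorageT>(
        rule_ids: impl IntoIterator<Item = (&'a str, StorageT)>,
    ) {
        let map: HashMap<&str, StorageT> = HashMap::from_iter(rule_ids);
        // ... perform the per-rule id lookups against `map` here ...
        let _ = map;
    }

    fn main() {
        // Callers can pass any iterator of pairs without pre-building a map.
        set_rule_ids_sketch([("INT", 2u8), ("ID", 3u8)]);
    }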

@ratmice (Collaborator, Author)

In f4a42cb I tried messing with the iterator API...

Returning a Box<HashMap> from token_map in place of the rule_ids field, I'm not sure the iterator-based API really pulls its weight: we still end up with clones for backwards compatibility of return values, and for the CTLexerBuilder::rule_ids_map function.

That said, it seems like a bit of an improvement over the original PR implementation, but maybe it isn't worth trying to remove the duplicated values, which we can't seem to benefit from fully without more drastic semver breakage.

Member

In general, where we break semver, I'd hope to keep it to "nice" parts of the API. This one does feel like it will intrude on pretty much every compile-time user? I suspect, though, there is a nice API here, and it's just eluding us right now...

@ratmice (Collaborator, Author) May 21, 2025

I think the actual breakage is pretty small, because the signature of CTLexerBuilder::rule_ids_map takes std::borrow::Borrow<HashMap<String, LexerTypesT::StorageT>>.

The only change I had to make to any compile-time stuff was the manual lexer, because ct_token_map uses &HashMap in its signature rather than the Borrow<...>. Perhaps we can switch that to use Borrow<...> to further reduce breakage.
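
A minimal sketch of that Borrow<...> widening (illustrative function, u32 standing in for StorageT, not ct_token_map's real body):

    use std::borrow::Borrow;
    use std::collections::HashMap;

    // Widening a &HashMap parameter to impl Borrow keeps existing &HashMap
    // call sites compiling (Borrow<T> is implemented for &T) while also
    // accepting owned maps.
    fn ct_token_map_sketch(map: impl Borrow<HashMap<String, u32>>) {
        for (name, id) in map.borrow() {
            println!("{name}: {id}");
        }
    }

    fn main() {
        let owned: HashMap<String, u32> =
            HashMap::from([("INT".to_owned(), 2u32)]);
        ct_token_map_sketch(&owned); // old-style call site still works
        ct_token_map_sketch(owned); // owned values now work too
    }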

Also, I hadn't noticed that all of CTLexer's fields and methods are private, so changing the types of those fields isn't actually a semver break like I had thought.

@ratmice (Collaborator, Author) May 21, 2025

So I did 8b973f2, but I'm starting to think that in this PR I should retreat from removing duplicate fields and reducing allocations.

I think if we allow the duplicate rule_ids field, we can maybe get something working without any semver breakage or code duplication, very similar to how this patch is now, but without the token_map or LexerDef::set_rule_ids_spanned_iter changes.

Then we can potentially revisit all that in a separate patch?

Member

[Oops, s/nice/niche/. That changed the meaning of that sentence quite profoundly. Sorry!]

@ltratt ltratt self-assigned this May 21, 2025
ratmice added 2 commits May 21, 2025 10:39
We can avoid using `HashMap` entirely by avoiding calling `token_map`.
This allows the `lrpar_config` set_rule_ids path to rely entirely
upon an iterator.

By deprecating the API we can also avoid needing to allocate them for the return value.

Changing the `ct_token_map` parameter to `Borrow<HashMap>` increases compatibility with previous versions.
@ratmice (Collaborator, Author) commented May 22, 2025

I'll probably close this one for now in favor of #571 (take 2). I think I'll at least make a separate PR for the Borrow<_> change to ct_token_map, but I'm not really convinced the rest is worth the API changes and the complexity of the additional trait stuff, so I may skip the rest unless asked, or unless it becomes desirable for some reason.

@ratmice ratmice closed this May 22, 2025
