Add source snippets to missing_from_lexer/parser errors #570
Conversation
lrpar/src/lib/ctbuilder.rs
Outdated
StorageT: Eq + Hash,
{
    regenerated: bool,
    rule_ids: HashMap<String, StorageT>,
So I guess I would suggest that a good first step is to focus on what to do about the duplication of rule_ids here, since it actually impacts semver and also contributes to the other code duplication.
The thing to note is that this patch is adding a yacc_grammar field to this struct, and the rule_ids field is just:
self.rule_ids = self.yacc_grammar
.tokens_map()
.iter()
.map(|(&n, &i)| (n.to_owned(), i.as_storaget()))
.collect::<HashMap<_, _>>();
So we are essentially storing two copies of the rule_ids, one with the &str refs converted via to_owned().
It is worth noting that in lrlex we make another owned_map clone of this value in CTLexerBuilder::build(), as part of the existing code path:
let (missing_from_lexer, missing_from_parser) = match self.rule_ids_map {
    Some(ref rim) => {
        // Convert from HashMap<String, _> to HashMap<&str, _>
        let owned_map = rim
            .iter()
            .map(|(x, y)| (&**x, *y))
            .collect::<HashMap<_, _>>();
The main issue I see is the semver change to CTParser::token_map: it isn't actually possible to reimplement the token_map function using self.yacc_grammar, because of the ref-to-owned vs ref-to-ref differences.
/// Returns a [HashMap] from lexeme string types to numeric types (e.g. `INT: 2`), suitable for
/// handing to a lexer to coordinate the IDs of lexer and parser.
pub fn token_map(&self) -> &HashMap<String, StorageT> {
    &self.rule_ids
}
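A minimal sketch of that constraint (hypothetical; it just reuses the conversion shown earlier): rebuilding the map from self.yacc_grammar allocates a fresh HashMap, so the function could no longer hand out a reference.

pub fn token_map(&self) -> HashMap<String, StorageT> {
    // Hypothetical: every call re-allocates, so the &HashMap return
    // type of the current API cannot be preserved this way.
    self.yacc_grammar
        .tokens_map()
        .iter()
        .map(|(&n, &i)| (n.to_owned(), i.as_storaget()))
        .collect()
}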
Dumb idea (off the top of my head): does this suggest that rule_ids should be wrapped in something like an Rc?
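A minimal sketch of that idea (hypothetical field layout, assuming std::rc::Rc): since Rc<T> derefs to T, the existing &HashMap return type could stay unchanged.

use std::collections::HashMap;
use std::rc::Rc;

struct CTParserBuilder<StorageT> {
    // Shared with CTLexerBuilder instead of each side owning a copy.
    rule_ids: Rc<HashMap<String, StorageT>>,
}

impl<StorageT> CTParserBuilder<StorageT> {
    // Deref coercion turns &Rc<HashMap<..>> into &HashMap<..>, so the
    // current public signature survives.
    pub fn token_map(&self) -> &HashMap<String, StorageT> {
        &self.rule_ids
    }
}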
The alternative to changing this is just keeping both the rule_ids and yacc_grammar field duplication,
and rewriting the set_rule_ids_spanned codepath from nimbleparse to use the rule_ids field,
making this new code path conform to the existing CTLexerBuilder code paths instead.
I think if we were going to use Rc while avoiding duplication, that Rc would still need to be within YaccGrammar; the particular thing we want out of the YaccGrammar is that we need to call token_span.
Then because we need that, it means we also already have YaccGrammar::token_map, with token_map itself based upon iter_tidxs and also allocating a HashMap.
I suppose that would leave us with:
- Alloc a HashMap in YaccGrammar::token_map
- A rule_ids: Rc<HashMap<...>> shared by CTParserBuilder/CTLexerBuilder
So I guess this is also a variation of keeping both the rule_ids and yacc_grammar fields.
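As a hedged sketch of that variation (share_token_map is an invented helper, not a real grmtools function), the flow is one allocation plus cheap Rc clones:

use std::collections::HashMap;
use std::rc::Rc;

// Build the owned map once (as YaccGrammar::token_map would) and share
// it: one Rc handed to CTParserBuilder, one to CTLexerBuilder.
fn share_token_map<StorageT: Copy>(
    tokens_map: &HashMap<&str, StorageT>,
) -> (Rc<HashMap<String, StorageT>>, Rc<HashMap<String, StorageT>>) {
    let owned: HashMap<String, StorageT> = tokens_map
        .iter()
        .map(|(&n, &i)| (n.to_owned(), i))
        .collect();
    let rc = Rc::new(owned);
    (Rc::clone(&rc), rc)
}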
So, I guess another thing to take into account: we have these two functions here:
https://docs.rs/lrlex/latest/lrlex/trait.LexerDef.html#tymethod.set_rule_ids
https://docs.rs/lrlex/latest/lrlex/trait.LexerDef.html#tymethod.set_rule_ids_spanned
Which are the ultimate destination for both these variations. So another question I have is: if we had versions of those functions which accepted an IntoIterator<Item = (&str, LexerTypesT::StorageT)> as a parameter instead of the &HashMap<&str, LexerTypesT::StorageT>, maybe it would be possible to get rid of all these allocations entirely?
Edit: This may be a dumb idea; it looks like both set_rule_ids variations use the hashmaps to actually do lookups, so the simplest implementation of these functions taking an IntoIterator would actually use HashMap::from_iter().
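A minimal sketch of such a variant (hypothetical trait and method names, not the real lrlex API); as the edit above notes, the simplest implementation still collects into a map, because the lookups need one:

use std::collections::HashMap;

trait SetRuleIds<StorageT: Copy> {
    // Stand-in for the existing map-taking entry point.
    fn set_rule_ids(&mut self, rule_ids_map: &HashMap<&str, StorageT>);

    // Iterator-taking variant: callers no longer build a map themselves,
    // but this default implementation just rebuilds one internally.
    fn set_rule_ids_iter<'a, I>(&mut self, rule_ids: I)
    where
        I: IntoIterator<Item = (&'a str, StorageT)>,
    {
        let map = rule_ids.into_iter().collect::<HashMap<_, _>>();
        self.set_rule_ids(&map);
    }
}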
In f4a42cb I tried messing with the iterator API, returning a Box<HashMap> from token_map in place of the rule_ids field. I'm not sure the iterator-based API pulls its weight really; we still end up with clones for backwards compatibility of return values, and for the CTLexerBuilder::rule_ids_map function.
That said, it seems like a bit of an improvement over the original PR implementation, but maybe it's not worth trying to remove the duplicated values, which we can't seem to benefit from fully without more drastic semver breakage.
In general, where we break semver, I'd hope to keep it to "nice" parts of the API. This one does feel like it will intrude on pretty much every compile-time user? I suspect, though, there is a nice API here, and it's just eluding us right now...
I think the actual breakage is pretty small, because the signature for CTLexerBuilder::rule_ids_map has std::borrow::Borrow<HashMap<String, LexerTypesT::StorageT>>. The only change I had to make to any compile-time stuff was the manual lexer, because ct_token_map uses &HashMap in its signature rather than the Borrow<...>. Perhaps we can switch that to use Borrow<> to further reduce breakage.
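A hedged sketch of that switch (the name ct_token_map_sketch and the empty body are placeholders, not the real lrlex function): an impl Borrow<HashMap<..>> parameter accepts a &HashMap, an owned HashMap, or an Rc<HashMap>, which is why the Borrow-based rule_ids_map signature absorbs most of the breakage.

use std::borrow::Borrow;
use std::collections::HashMap;

fn ct_token_map_sketch<StorageT>(rule_ids: impl Borrow<HashMap<String, StorageT>>) {
    // Normalise whatever the caller passed back to a plain reference.
    let map: &HashMap<String, StorageT> = rule_ids.borrow();
    // ...generate the token-map module from `map`...
    let _ = map;
}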
Also, I hadn't noticed that all of CTLexer's fields and methods are private, so changing the types of those fields isn't actually a semver break like I had thought.
So I did 8b973f2, but I'm starting to think that in this PR I should retreat from removing duplicate fields and reducing allocations.
I think if we allow the duplicate rule_ids field, we can maybe get something working without any semver breakage or code duplication, very similar to how this patch is now, but without the token_map or LexerDef::set_rule_ids_spanned_iter changes.
Then we can potentially revisit all that in a separate patch?
[Oops, s/nice/niche/. That changed the meaning of that sentence quite profoundly. Sorry!]
We can avoid using `HashMap` entirely by avoiding calls to `token_map`: the `lrpar_config` path's set_rule_ids can rely entirely upon an iterator. By deprecating that API we can also avoid needing to allocate the maps for return values. Changing the `ct_token_map` parameter to `Borrow<HashMap>` increases compatibility with previous versions.
I'll probably close this one for now in favor of #571 (take 2). I think I'll at least make a separate PR for the …
Here is an initial attempt at missing_from_lexer and missing_from_parser for CTLexerBuilder, including source snippets. I'm marking this as draft, because this patch is pretty awful.
For instance, it calls set_rule_ids and set_rule_ids_spanned when the lrpar_config callback is used, but just set_rule_ids otherwise. One thing to note is that the cttests tests don't use the lrpar_config callback, but build CTParserBuilder and CTLexerBuilder separately via the older method, tying the Gordian knot via set_rule_ids. It is only when lrpar_config is used that we have the span info to produce these nice errors.
The other thing to note is that values like token_map and rule_ids_map of CTParser and CTLexerBuilder are now kind of superfluous with the inclusion of YaccGrammar in the return value. So while this functions, it really seems like it needs some design work to reduce all these separate but related code paths and slightly different duplicates of values.
As I started trying out options, it all seemed to snowball, so it seemed best to just stop here before going too far down any one path.