Added StatefulSpan trait and WithSpanState span interning#950
Added StatefulSpan trait and WithSpanState span interning#950zesterer merged 1 commit intozesterer:mainfrom
Conversation
zesterer
left a comment
There was a problem hiding this comment.
Hello. While I appreciate the PR, I think aspects of it may be misplaced. I believe what you're trying to achieve is already possible with chumsky's state features. In fact, I've changed one of the examples to show how this might be done. Granted, the change is a little noisy, but put that down to this being hacky temporary code.
I did use Claude to help write this PR, and I've reviewed and edited it and I think it's ready for review.
Thank you for your honesty. For what it's worth chumsky does have a soft no-LLM policy, although it's not a strict line in the sand and I am happy to accept LLM-assisted contributions provided I see clear evidence that there's 'a human in the driving seat'.
In this case, I think the use of the LLM was misplaced: I suspect that both your time and my time would have been better spent if you'd opened an issue or discussion first instead of diving head-long into implementation and giving me a very large PR to review. I'm not sure whether my suggested solution above fulfils all your criteria exactly, but I'm sure that a solution is to be found without performing such invasive surgery on the crate.
Thanks for taking the time to review it, and for the example code. I tried a number of different ways to make this work. In you're example code you only used span interning on the final output, but not on the lexer or errors.
Yeah I'm on the same page. I find LLM's useful for brainstorming, and having them do the mechanical work of typing when they can do it faster than me.
I definitely didn't put enough into the example to show the actual underlying issue, sorry. I've been using the same span type for lexing and parsing for a while, as it's helped keep other things simple, like error reporting across the lexer, cst parser, and other passes. So I've built up quite a lot of code that eventually broke with interned spans, but having seen chumsky's flexibilty with different inputs and parsing with state I thought there would probably be a way to make it work. The blocker is that it's not possible to use interned spans in an It looks like the approach wihtout modifying chumsky is instead to only generate the interned spans on the ast parser, and for errors emitted from the lexer. I've spent some time refactoring my parsers to do this and while I've got it working it's quite a lot more verbose. I can't use custom errors directly becuase they require the same span type, and I can't convert the span with While I can make it work, it's messy. I've explored what the absolute minimum change to chumsky might be to enable support for this, where the implementation is external to the crate, and I think it can be done with just adding the I can then implement the rest outside the crate. I lose access to I've added an example implementation here But it definitely seems to be working as expected. I also added some tests at the bottom to confirm error reporting works, and spans are being interned with really large source files. |
You could do exactly the same thing for the lexer without touching chumsky itself. For the errors, I don't think this is necessary: the error state is very small and likely never even leaves the L1 cache during a hot parse. There might be some minor advantage when it comes to codegen (fitting a span in a register, say) but if you're interested in that sort of performance I'd recommend performing parsing once with Even if your concern is the heterogeneous span types, a better approach would be to add a
That is unfortunate. If we take your initial assumption (i.e: that all spans stored anywhere should immediately be interned) as a given (I really don't think you should take it as a given, and I think there are strong engineering reasons to question it), then all you're really looking for is a way to 'inject' some function that has access to the parser state into span generation. Mapping the span between types is easy, there's already precedent for it: my_input.map_span(|s| ...)But what you need is something more akin to the following, so that interning can occur at the point of span generation: my_input.map_span_with_state(|s, state| state.intern(s))This isn't currently possible: One possibility would be to keep As I said at the beginning, I think you should question your assumptions about what exactly you need to have your parser do before jumping head-first into writing code for it because, respectfully, I think the LLM is giving you a bias toward append-only programming over maintainability. |
dfd0ebc to
7854d02
Compare
Intersting, it sounds like custom error types in the parsers are something best avoided? Maybe custom errors are best as a variant of say a
Yep if I drop custom errors from the parser, and use
Yeah with a What I think I need is to use
The intent is to be able to encode a
Yep, and I'm pretty sure I don't need it to know about it now. One thing I was trying to avoid is the verbosity and ceremony For example stuff like this: .map_with(|node, e|{
let simple_span: SimpleSpan = e.span();
let state: State = e.state();
let span = state.intern_span(simple_span);
state.add_node(Cst:Node{
node,
span,
})
})Which got me thinking, about how to solve this external to chumsky. I've come up with an extension trait to pub trait SpanInterner {
fn create_span(&mut self, span: SimpleSpan) -> Span;
}
pub struct ParserState<'src> {
span_cache: &'src mut SpanCache,
}
impl SpanInterner for ParserState<'_> {
fn create_span(&mut self, span: SimpleSpan) -> Span {
self.span_cache.from_simple(span)
}
}
pub trait InternSpanExt<'src, 'b, I, E>
where
I: Input<'src, Span = SimpleSpan>,
E: ParserExtra<'src, I>,
{
fn intern_span(&mut self) -> Span
where
E::State: SpanInterner;
fn intern_spanned<V>(&mut self, inner: V) -> Spanned<V, Span>
where
E::State: SpanInterner;
}
impl<'src, 'b, I, E> InternSpanExt<'src, 'b, I, E> for MapExtra<'src, 'b, I, E>
where
I: Input<'src, Span = SimpleSpan>,
E: ParserExtra<'src, I>,
{
fn intern_span(&mut self) -> Span
where
E::State: SpanInterner,
{
let simple_span = self.span();
self.state().create_span(simple_span)
}
fn intern_spanned<V>(&mut self, inner: V) -> Spanned<V, Span>
where
E::State: SpanInterner,
{
let simple_span = self.span();
Spanned {
inner,
span: self.state().create_span(simple_span),
}
}
}I can use it like this: .map_with(|node, e|{
let span: SimpleSpan = e.intern_span();
let state: State = e.state();
state.add_node(Cst:Node {
node,
span,
})
})So far rust hasn't failed to infer the types, so I'm thinking most repeatable logic I currently have in .map_with(|node, e| {
e.add_node(|span| Cst:Node {
node,
span,
})
})Thanks for your patience in helping me understand how to untangle this. I have the updated example on a different branch of my fork if you or anyone else is interested in how it all pieces together. |
You'll probably want both for a production parser. For example, in Tao (note: using an old version of chumsky, don't treat its parser as idiomatic!) I have an error node for each of the fundamental AST types (expressions, types, patterns, etc.) and error recovery during parsing will spit out such an error node. In later stages of compilation, this error node gets assigned a dedicated 'error type' which unifies with everything without generating an error, preventing the 'original sin' producing errors again and again as it moves through the compiler.
Fwiw, you can go a level deeper! If you define an extension trait for my_parser
.add_node()There's even dedicated support for extending chumsky with custom parsers, although in this case I think the extension trait can be implemented entirely in terms of the 'surface-level' chumsky API, so shouldn't be necessary. Fwiw, this is also some prior work when it comes to CST integration that you might be interested in. |
|
Thanks for the PR! |
I'm developing an incrementally computed LSP server, and I've found span interning to be quite helpful. The primary reason is by finding a structure to allow embedding a
file_id,version,offsetandlengthinto a singleu64.I haven't found any other compilers using interned spans than
rustc.I tried to keep the impact of this PR as small as possible, and I don't think it requires any existing users of the crate to change anything.
I also explored if this could be done as an extension trait, or with the existing crate, but I couldn't get either to work (at least not without an unreasonable performance hit).
Introduced a new StatefulSpan trait that allows span implementations to access mutable state during span creation. This enables patterns like span interning where large spans are stored in a cache while small spans remain inline.
Key changes:
SpanLiketrait as a minimal requirement for span types, replacing the Span bound on Input::SpanStatefulSpantrait withnew_with_state,start_with_state, andend_with_statemethodsInputStatefulandExactSizeInputStatefultraits for inputs that support stateful span creationWithSpanStateinput wrapper and with_span_state constructor functionMappedInputStatefulfor mapping token inputs with stateful spansInput::map_statefulmethod for creatingMappedInputStatefuljson_interned_spansexample demonstrating the span interning patternThe example shows a span type that stores small spans inline (8 bytes) and interns large spans in a separate cache, reducing memory usage for typical parsing while still supporting edge cases with large offsets.
I did use Claude to help write this PR, and I've reviewed and edited it and I think it's ready for review.