
Added StatefulSpan trait and WithSpanState span interning #950

Merged: zesterer merged 1 commit into zesterer:main from ojkelly:stateful-spans on Feb 5, 2026

@ojkelly
Contributor

@ojkelly ojkelly commented Jan 19, 2026

I'm developing an incrementally computed LSP server, and I've found span interning to be quite helpful. The primary reason is that it lets me embed a file_id, version, offset and length into a single u64.

I haven't found any compilers other than rustc that use interned spans.

I tried to keep the impact of this PR as small as possible, and I don't think it requires any existing users of the crate to change anything.

I also explored if this could be done as an extension trait, or with the existing crate, but I couldn't get either to work (at least not without an unreasonable performance hit).


Introduced a new StatefulSpan trait that allows span implementations to access mutable state during span creation. This enables patterns like span interning where large spans are stored in a cache while small spans remain inline.

Key changes:

  • Added SpanLike trait as a minimal requirement for span types, replacing the Span bound on Input::Span
  • Added StatefulSpan trait with new_with_state, start_with_state, and end_with_state methods
  • Added InputStateful and ExactSizeInputStateful traits for inputs that support stateful span creation
  • Added WithSpanState input wrapper and with_span_state constructor function
  • Added MappedInputStateful for mapping token inputs with stateful spans
  • Added Input::map_stateful method for creating MappedInputStateful
  • Added json_interned_spans example demonstrating the span interning pattern

The example shows a span type that stores small spans inline (8 bytes) and interns large spans in a separate cache, reducing memory usage for typical parsing while still supporting edge cases with large offsets.
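
To make the inline-versus-interned idea concrete, here is a minimal sketch. The names `CompactSpan` and `SpanCache`, and the 32/31-bit split, are assumptions chosen for illustration; they are not the types or layout from the PR's example. For brevity the cache appends large spans without deduplicating them.

```rust
use std::ops::Range;

/// Hypothetical 8-byte span: either an inline (start, len) pair or,
/// when the top bit is set, an index into `SpanCache::large`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CompactSpan(u64);

const TAG_INTERNED: u64 = 1 << 63;
const LEN_BITS: u32 = 31;
const LEN_MAX: u64 = (1 << LEN_BITS) - 1;
const START_MAX: u64 = (1 << 32) - 1;

/// Side table holding the spans that do not fit inline.
#[derive(Default)]
pub struct SpanCache {
    large: Vec<Range<u64>>,
}

impl SpanCache {
    /// Store the span inline when it fits, otherwise push it to the table.
    /// Assumes `range.start <= range.end`.
    pub fn intern(&mut self, range: Range<u64>) -> CompactSpan {
        debug_assert!(range.start <= range.end);
        let len = range.end - range.start;
        if range.start <= START_MAX && len <= LEN_MAX {
            CompactSpan((range.start << LEN_BITS) | len)
        } else {
            let index = self.large.len() as u64;
            self.large.push(range);
            CompactSpan(TAG_INTERNED | index)
        }
    }

    /// Recover the original range; only interned spans touch the table.
    pub fn resolve(&self, span: CompactSpan) -> Range<u64> {
        if span.0 & TAG_INTERNED != 0 {
            self.large[(span.0 & !TAG_INTERNED) as usize].clone()
        } else {
            let start = span.0 >> LEN_BITS;
            let len = span.0 & LEN_MAX;
            start..start + len
        }
    }

    /// Number of spans that had to be interned.
    pub fn interned_count(&self) -> usize {
        self.large.len()
    }
}
```

Resolving a span needs access to the cache, which is why span creation and lookup both need some form of state.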


I did use Claude to help write this PR, and I've reviewed and edited it and I think it's ready for review.

Copy link
Owner

@zesterer zesterer left a comment

Hello. While I appreciate the PR, I think aspects of it may be misplaced. I believe what you're trying to achieve is already possible with chumsky's state features. In fact, I've changed one of the examples to show how this might be done. Granted, the change is a little noisy, but put that down to this being hacky temporary code.

> I did use Claude to help write this PR, and I've reviewed and edited it and I think it's ready for review.

Thank you for your honesty. For what it's worth chumsky does have a soft no-LLM policy, although it's not a strict line in the sand and I am happy to accept LLM-assisted contributions provided I see clear evidence that there's 'a human in the driving seat'.

In this case, I think the use of the LLM was misplaced: I suspect that both your time and my time would have been better spent if you'd opened an issue or discussion first instead of diving head-long into implementation and giving me a very large PR to review. I'm not sure whether my suggested solution above fulfils all your criteria exactly, but I'm sure that a solution is to be found without performing such invasive surgery on the crate.

@ojkelly
Contributor Author

ojkelly commented Jan 28, 2026

> Hello. While I appreciate the PR, I think aspects of it may be misplaced. I believe what you're trying to achieve is already possible with chumsky's state features. In fact, I've changed one of the examples to show how this might be done. Granted, the change is a little noisy, but put that down to this being hacky temporary code.

Thanks for taking the time to review it, and for the example code. I tried a number of different ways to make this work. In your example code you only used span interning on the final output, but not on the lexer or errors.

> I did use Claude to help write this PR, and I've reviewed and edited it and I think it's ready for review.

> Thank you for your honesty. For what it's worth chumsky does have a soft no-LLM policy, although it's not a strict line in the sand and I am happy to accept LLM-assisted contributions provided I see clear evidence that there's 'a human in the driving seat'.

Yeah, I'm on the same page. I find LLMs useful for brainstorming, and for doing the mechanical work of typing when they can do it faster than me.

> In this case, I think the use of the LLM was misplaced: I suspect that both your time and my time would have been better spent if you'd opened an issue or discussion first instead of diving head-long into implementation and giving me a very large PR to review. I'm not sure whether my suggested solution above fulfils all your criteria exactly, but I'm sure that a solution is to be found without performing such invasive surgery on the crate.

I definitely didn't put enough into the example to show the actual underlying issue, sorry.

I've been using the same span type for lexing and parsing for a while, as it's helped keep other things simple, like error reporting across the lexer, cst parser, and other passes.

So I've built up quite a lot of code that eventually broke with interned spans, but having seen chumsky's flexibility with different inputs and parsing with state, I thought there would probably be a way to make it work.

The blocker is that it's not possible to use interned spans in an Input, because we need some state to resolve the small percent that are actually interned.

It looks like the approach without modifying chumsky is instead to only generate the interned spans in the AST parser, and for errors emitted from the lexer.

I've spent some time refactoring my parsers to do this, and while I've got it working, it's quite a lot more verbose. I can't use custom errors directly because they require the same span type, and I can't convert the span with map_err_with_state. So there's a lot of code to manage making spans in all the parsers, and to remap the errors from chumsky's builtins to a custom error type.

While I can make it work, it's messy.

I've explored what the absolute minimum change to chumsky might be to enable support for this, where the implementation is external to the crate, and I think it can be done with just adding the SpanLike trait. (we can probably find a better name?)

I can then implement the rest outside the crate. I lose access to StrInput and the methods on it, but I'm happy to implement what I need of them.

I've added an example implementation here, examples/mini_ml_interned_spans.rs, that I think we would remove before merging, as it's far more complex than any of the other examples.

But it definitely seems to be working as expected. I also added some tests at the bottom to confirm error reporting works, and spans are being interned with really large source files.

@zesterer
Owner

> In your example code you only used span interning on the final output, but not on the lexer or errors.

You could do exactly the same thing for the lexer without touching chumsky itself.

For the errors, I don't think this is necessary: the error state is very small and likely never even leaves the L1 cache during a hot parse. There might be some minor advantage when it comes to codegen (fitting a span in a register, say) but if you're interested in that sort of performance I'd recommend performing parsing once with EmptyErr and then again with a more substantial error type only in the failure case - that'll have far better amortised performance, especially when you take into account the cost of actually performing the interning to begin with.
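
The two-pass idea can be sketched without chumsky at all. The toy parser below (comma-separated `u32` values) is a stand-in: `parse_fast` plays the role of a parse with an empty error type, and `parse_detailed` the role of a parse with a richer one. All names here are hypothetical.

```rust
/// Detailed error, only ever built on the slow path.
#[derive(Debug, PartialEq)]
pub struct DetailedError {
    /// Byte offset of the start of the offending item.
    pub offset: usize,
    pub message: String,
}

/// Fast path: reports success or failure only, with no error details.
pub fn parse_fast(src: &str) -> Option<Vec<u32>> {
    src.split(',')
        .map(|item| item.trim().parse::<u32>().ok())
        .collect()
}

/// Slow path: same grammar, but tracks offsets and builds messages.
pub fn parse_detailed(src: &str) -> Result<Vec<u32>, DetailedError> {
    let mut offset = 0;
    let mut out = Vec::new();
    for item in src.split(',') {
        match item.trim().parse::<u32>() {
            Ok(n) => out.push(n),
            Err(e) => {
                return Err(DetailedError {
                    offset,
                    message: format!("invalid number {:?}: {}", item.trim(), e),
                })
            }
        }
        // Advance past this item and the comma that follows it.
        offset += item.len() + 1;
    }
    Ok(out)
}

/// Try the cheap parse first; re-parse with diagnostics only on failure.
pub fn parse(src: &str) -> Result<Vec<u32>, DetailedError> {
    match parse_fast(src) {
        Some(values) => Ok(values),
        None => parse_detailed(src),
    }
}
```

Valid input never pays for error bookkeeping; invalid input is parsed twice, which is usually an acceptable cost because failures are the rare case.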

Even if your concern is the heterogeneous span types, a better approach would be to add a map_span method to Rich (akin to map_token) so that the spans can be immediately interned after parsing (I would be happy to accept a PR for this, for what it's worth). That's going to be substantially faster because then you're only paying the interning cost for true parser errors, not just transient ones generated during parsing.
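
As a rough illustration of what a span-mapping method on an error type could look like, here is a self-contained sketch. `RichLike`, `Interner` and `SpanId` are hypothetical stand-ins; this is not chumsky's `Rich` and says nothing about how a real `map_span` would be implemented there.

```rust
use std::ops::Range;

/// Hypothetical error type, generic over its span type.
#[derive(Debug, PartialEq)]
pub struct RichLike<S> {
    pub span: S,
    pub message: String,
}

impl<S> RichLike<S> {
    /// Convert the span to another type, leaving the rest untouched.
    pub fn map_span<S2>(self, f: impl FnOnce(S) -> S2) -> RichLike<S2> {
        RichLike {
            span: f(self.span),
            message: self.message,
        }
    }
}

/// Handle to an interned span.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SpanId(pub usize);

/// Minimal deduplicating interner (linear search, for brevity).
#[derive(Default)]
pub struct Interner {
    spans: Vec<Range<usize>>,
}

impl Interner {
    pub fn intern(&mut self, span: Range<usize>) -> SpanId {
        if let Some(i) = self.spans.iter().position(|s| *s == span) {
            return SpanId(i);
        }
        self.spans.push(span);
        SpanId(self.spans.len() - 1)
    }
}

/// Intern spans only for the errors that survived parsing.
pub fn intern_errors(
    errors: Vec<RichLike<Range<usize>>>,
    interner: &mut Interner,
) -> Vec<RichLike<SpanId>> {
    errors
        .into_iter()
        .map(|e| e.map_span(|s| interner.intern(s)))
        .collect()
}
```

Because the conversion runs after parsing, transient errors discarded during backtracking never reach the interner.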

> I can then implement the rest outside the crate. I lose access to StrInput and the methods on it, but I'm happy to implement what I need of them.

That is unfortunate. StrInput is currently sealed because it's unclear exactly what guarantees implementers need to make to fulfil the requirements of the string parsers. That said, the fact that the desired change (interning spans) is having an impact on an entirely unrelated concept (your ability to use StrInput) is, to me, a red flag that indicates that this is all being implemented in the wrong place.

If we take your initial assumption (i.e: that all spans stored anywhere should immediately be interned) as a given (I really don't think you should take it as a given, and I think there are strong engineering reasons to question it), then all you're really looking for is a way to 'inject' some function that has access to the parser state into span generation. Mapping the span between types is easy, there's already precedent for it:

my_input.map_span(|s| ...)

But what you need is something more akin to the following, so that interning can occur at the point of span generation:

my_input.map_span_with_state(|s, state| state.intern(s))

This isn't currently possible: Input does not know about the state type, and they are deliberately kept separate. Adding a type parameter to Input (trait Input<S: Inspector>) and passing the state to Input::span would do the job, but it requires changes for implementers of Input.

One possibility would be to keep Input as-is and have a trait StateAwareInput<S: Inspector> (name pending), and then a blank impl like impl<I: Input, S: Inspector> StateAwareInput<S> for I, but this also starts to get a bit ugly. If I really felt like I needed to open that can of worms, that's probably the path I'd start walking down.
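
The blanket-impl shape described here can be sketched with heavily simplified traits. Nothing below is chumsky's real `Input`: the traits, `PlainInput`, `InterningInput` and `Interner` are all hypothetical, and the `Inspector` bound on the state is omitted.

```rust
use std::ops::Range;

/// Simplified stand-in for an input trait that knows nothing about state.
pub trait Input {
    type Span;
    fn span(&self, range: Range<usize>) -> Self::Span;
}

/// Hypothetical state-aware counterpart: span creation may use the state.
pub trait StateAwareInput<S> {
    type Span;
    fn span_with_state(&self, range: Range<usize>, state: &mut S) -> Self::Span;
}

/// Blanket impl: every plain `Input` is state-aware by ignoring the state,
/// so existing implementers need no changes.
impl<I: Input, S> StateAwareInput<S> for I {
    type Span = I::Span;
    fn span_with_state(&self, range: Range<usize>, _state: &mut S) -> Self::Span {
        self.span(range)
    }
}

/// An ordinary input whose spans are plain ranges.
pub struct PlainInput;

impl Input for PlainInput {
    type Span = Range<usize>;
    fn span(&self, range: Range<usize>) -> Range<usize> {
        range
    }
}

/// State that owns the interned spans.
#[derive(Default)]
pub struct Interner {
    pub spans: Vec<Range<usize>>,
}

/// An input that is *only* state-aware: it does not implement `Input`,
/// so this impl does not overlap with the blanket impl above.
pub struct InterningInput;

impl StateAwareInput<Interner> for InterningInput {
    type Span = usize;
    fn span_with_state(&self, range: Range<usize>, state: &mut Interner) -> usize {
        state.spans.push(range);
        state.spans.len() - 1
    }
}
```

The awkward part shows up immediately: a type cannot implement both traits, so wrappers that want custom state-aware behaviour have to opt out of the plain trait.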

As I said at the beginning, I think you should question your assumptions about what exactly you need to have your parser do before jumping head-first into writing code for it because, respectfully, I think the LLM is giving you a bias toward append-only programming over maintainability.

@ojkelly
Contributor Author

ojkelly commented Jan 29, 2026

> For the errors, I don't think this is necessary: the error state is very small and likely never even leaves the L1 cache during a hot parse. There might be some minor advantage when it comes to codegen (fitting a span in a register, say) but if you're interested in that sort of performance I'd recommend performing parsing once with EmptyErr and then again with a more substantial error type only in the failure case - that'll have far better amortised performance, especially when you take into account the cost of actually performing the interning to begin with.

Interesting, it sounds like custom error types in the parsers are something best avoided? Maybe custom errors are best as a variant of, say, a Token or Node?

> Even if your concern is the heterogeneous span types, a better approach would be to add a map_span method to Rich (akin to map_token) so that the spans can be immediately interned after parsing (I would be happy to accept a PR for this, for what it's worth). That's going to be substantially faster because then you're only paying the interning cost for true parser errors, not just transient ones generated during parsing.

Yep, if I drop custom errors from the parser and use Rich, the only issue is the span. I'll update this branch to have map_span (let me know if you would prefer it in a new PR).

> That said, the fact that the desired change (interning spans) is having an impact on an entirely unrelated concept (your ability to use StrInput) is, to me, a red flag that indicates that this is all being implemented in the wrong place.

Yeah, with a StrInput typically being source text, there would never be any interned spans on it, only spans derived from it. It's the inputs after this that need state for interned spans. But from this discussion it's becoming clear that I've been a bit lost in the type signatures of Parser etc.

What I think I need is to use SimpleSpan for any span chumsky needs, and to use the internable span only on output that chumsky won't see. So: embedding the internable span in a token during lexing, but emitting a Vec<(Token, SimpleSpan)> or similar as input to the next parsing phase.

> If we take your initial assumption (i.e: that all spans stored anywhere should immediately be interned) as a given (I really don't think you should take it as a given, and I think there are strong engineering reasons to question it),

The intent is to be able to encode a file_id, a version and the span into a u64. With 22 bits for the offset, files of 4 MiB and under never need interning. I did some data analysis with GitHub's BigQuery dataset and this was a pretty good tradeoff: I'm expecting that under one percent of files will need any spans to be interned. The goal is to allow for larger interned spans in the rare cases they're needed, as the alternatives are not parsing, or panicking.
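
A packing scheme along these lines might look like the sketch below. Only the 22-bit offset comes from the comment above; the remaining field widths (16-bit file_id, 9-bit version, 16-bit length, one reserved tag bit) and all names are assumptions picked so the fields add up to 64 bits.

```rust
// Field widths. Only OFFSET_BITS is taken from the discussion;
// the others are assumed for illustration (16 + 9 + 22 + 16 = 63,
// leaving the top bit free as an "interned" tag).
const FILE_BITS: u32 = 16;
const VERSION_BITS: u32 = 9;
const OFFSET_BITS: u32 = 22;
const LEN_BITS: u32 = 16;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SpanParts {
    pub file_id: u64,
    pub version: u64,
    pub offset: u64,
    pub len: u64,
}

/// Largest value representable in `bits` bits.
const fn mask(bits: u32) -> u64 {
    (1u64 << bits) - 1
}

/// Pack into a u64, or return `None` when any field is too large.
/// A `None` is the case that would fall back to interning.
pub fn pack(p: SpanParts) -> Option<u64> {
    let fits = p.file_id <= mask(FILE_BITS)
        && p.version <= mask(VERSION_BITS)
        && p.offset <= mask(OFFSET_BITS)
        && p.len <= mask(LEN_BITS);
    if !fits {
        return None;
    }
    Some(
        (p.file_id << (VERSION_BITS + OFFSET_BITS + LEN_BITS))
            | (p.version << (OFFSET_BITS + LEN_BITS))
            | (p.offset << LEN_BITS)
            | p.len,
    )
}

/// Inverse of `pack` for values it produced.
pub fn unpack(raw: u64) -> SpanParts {
    SpanParts {
        file_id: (raw >> (VERSION_BITS + OFFSET_BITS + LEN_BITS)) & mask(FILE_BITS),
        version: (raw >> (OFFSET_BITS + LEN_BITS)) & mask(VERSION_BITS),
        offset: (raw >> LEN_BITS) & mask(OFFSET_BITS),
        len: raw & mask(LEN_BITS),
    }
}
```

With 22 offset bits the largest inline offset is 4,194,303, so any span starting at byte 4,194,304 or later is the kind that needs the interned fallback.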

> This isn't currently possible: Input does not know about the state type, and they are deliberately kept separate.

Yep, and I'm pretty sure I don't need it to know about it now.

One thing I was trying to avoid is the verbosity and ceremony map_with needs, in particular when rust can't infer the types.

For example stuff like this:

.map_with(|node, e| {
  // Every step needs an explicit type when inference gives up.
  let simple_span: SimpleSpan = e.span();
  let state: &mut State = e.state();
  let span = state.intern_span(simple_span);

  state.add_node(Cst::Node {
    node,
    span,
  })
})

Which got me thinking about how to solve this external to chumsky.

I've come up with an extension trait to MapExtra that works well.

pub trait SpanInterner {
    fn create_span(&mut self, span: SimpleSpan) -> Span;
}

pub struct ParserState<'src> {
    span_cache: &'src mut SpanCache,
}

impl SpanInterner for ParserState<'_> {
    fn create_span(&mut self, span: SimpleSpan) -> Span {
        self.span_cache.from_simple(span)
    }
}

pub trait InternSpanExt<'src, 'b, I, E>
where
    I: Input<'src, Span = SimpleSpan>,
    E: ParserExtra<'src, I>,
{
    fn intern_span(&mut self) -> Span
    where
        E::State: SpanInterner;

    fn intern_spanned<V>(&mut self, inner: V) -> Spanned<V, Span>
    where
        E::State: SpanInterner;
}

impl<'src, 'b, I, E> InternSpanExt<'src, 'b, I, E> for MapExtra<'src, 'b, I, E>
where
    I: Input<'src, Span = SimpleSpan>,
    E: ParserExtra<'src, I>,
{
    fn intern_span(&mut self) -> Span
    where
        E::State: SpanInterner,
    {
        let simple_span = self.span();
        self.state().create_span(simple_span)
    }

    fn intern_spanned<V>(&mut self, inner: V) -> Spanned<V, Span>
    where
        E::State: SpanInterner,
    {
        let simple_span = self.span();
        Spanned {
            inner,
            span: self.state().create_span(simple_span),
        }
    }
}

I can use it like this:

.map_with(|node, e| {
  // intern_span() returns the interned Span, not a SimpleSpan.
  let span: Span = e.intern_span();
  let state: &mut State = e.state();

  state.add_node(Cst::Node {
    node,
    span,
  })
})

So far Rust hasn't failed to infer the types, so I'm thinking most of the repeatable logic I currently have in map_with can be simplified in a similar way. Maybe to something like this, where the closure is passed an internable span:

.map_with(|node, e| {
  e.add_node(|span| Cst::Node {
    node,
    span,
  })
})

Thanks for your patience in helping me understand how to untangle this.

I have the updated example on a different branch of my fork if you or anyone else is interested in how it all pieces together.

@zesterer
Owner

zesterer commented Feb 5, 2026

> Interesting, it sounds like custom error types in the parsers are something best avoided? Maybe custom errors are best as a variant of, say, a Token or Node?

You'll probably want both for a production parser. For example, in Tao (note: using an old version of chumsky, don't treat its parser as idiomatic!) I have an error node for each of the fundamental AST types (expressions, types, patterns, etc.) and error recovery during parsing will spit out such an error node. In later stages of compilation, this error node gets assigned a dedicated 'error type' which unifies with everything without generating an error, preventing the 'original sin' producing errors again and again as it moves through the compiler.
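
The "error type that unifies with everything" trick can be shown in a few lines. This is a generic sketch with a made-up `Ty` enum, not Tao's actual type checker.

```rust
/// Tiny type language with a dedicated error type.
#[derive(Clone, Debug, PartialEq)]
pub enum Ty {
    Int,
    Bool,
    List(Box<Ty>),
    /// Assigned to error nodes produced by parser recovery.
    Error,
}

/// Unify two types. `Ty::Error` unifies with anything and never
/// reports a mismatch, so one parse error is not reported again
/// by every later check that touches the same node.
pub fn unify(a: &Ty, b: &Ty) -> Result<Ty, String> {
    match (a, b) {
        // The error type absorbs whatever it meets.
        (Ty::Error, _) | (_, Ty::Error) => Ok(Ty::Error),
        (Ty::Int, Ty::Int) => Ok(Ty::Int),
        (Ty::Bool, Ty::Bool) => Ok(Ty::Bool),
        (Ty::List(x), Ty::List(y)) => unify(x, y).map(|t| Ty::List(Box::new(t))),
        _ => Err(format!("cannot unify {:?} with {:?}", a, b)),
    }
}
```

A genuine mismatch such as `Int` against `Bool` still produces an error; only nodes that already carry `Ty::Error` stay silent.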

> I've come up with an extension trait to MapExtra that works well.

Fwiw, you can go a level deeper! If you define an extension trait for Parser, you can have the method work directly on the parser itself, like

my_parser
  .add_node()

There's even dedicated support for extending chumsky with custom parsers, although in this case I think the extension trait can be implemented entirely in terms of the 'surface-level' chumsky API, so shouldn't be necessary.
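
The extension-trait pattern itself is plain Rust and can be sketched against a toy `Parser` trait. Everything below (`Parser`, `State`, `Keyword`, `AddNode`, `ParserExt`) is hypothetical and far simpler than chumsky's real trait; it only shows how a blanket impl makes `.add_node()` available on every parser.

```rust
use std::ops::Range;

/// Parser state that collects (node text, span) pairs.
#[derive(Default)]
pub struct State {
    pub nodes: Vec<(String, Range<usize>)>,
}

/// Toy parser trait standing in for a real combinator library's trait.
pub trait Parser: Sized {
    fn parse(&self, src: &str, state: &mut State) -> Option<(String, Range<usize>)>;
}

/// Matches a fixed keyword at the start of the input.
pub struct Keyword(pub &'static str);

impl Parser for Keyword {
    fn parse(&self, src: &str, _state: &mut State) -> Option<(String, Range<usize>)> {
        src.starts_with(self.0)
            .then(|| (self.0.to_string(), 0..self.0.len()))
    }
}

/// Combinator that records each successful parse in the state.
pub struct AddNode<P>(P);

impl<P: Parser> Parser for AddNode<P> {
    fn parse(&self, src: &str, state: &mut State) -> Option<(String, Range<usize>)> {
        let (out, span) = self.0.parse(src, state)?;
        state.nodes.push((out.clone(), span.clone()));
        Some((out, span))
    }
}

/// Extension trait: adds `.add_node()` to anything that is a `Parser`.
pub trait ParserExt: Parser {
    fn add_node(self) -> AddNode<Self> {
        AddNode(self)
    }
}

// Blanket impl so callers only need `ParserExt` in scope.
impl<P: Parser> ParserExt for P {}
```

Call sites then read `Keyword("let").add_node()`, with the span handling hidden inside the combinator rather than repeated in every closure.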

Fwiw, this is also some prior work when it comes to CST integration that you might be interested in.

@zesterer zesterer merged commit 2531947 into zesterer:main Feb 5, 2026
4 checks passed
@zesterer
Owner

zesterer commented Feb 5, 2026

Thanks for the PR!
