Understanding Suggester Function for SpanCategorizer #8753
Replies: 1 comment 1 reply
-
I don't think we have any specific plans for future span candidate suggesters. A basic ngram suggester is fine for many models: if you have a large network and a good loss function, it can quickly learn to weed out bad candidates. But many applications will require completely different, often custom, approaches. The ngram suggester is just provided as an example and a starting point. We'll probably add more examples of different kinds of span candidate generators later, whether in core or in example projects, but again, there are no specific plans at the moment.
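To illustrate what a custom approach might look like, here's a minimal sketch of a suggester that proposes noun chunks instead of ngrams. The registry name is made up for this example, and it assumes the docs passing through have a parser (or equivalent) so that `doc.noun_chunks` is populated:

```python
from typing import List, Optional

from thinc.api import Ops, get_current_ops
from thinc.types import Ragged
from spacy.tokens import Doc
from spacy.util import registry


@registry.misc("noun_chunk_suggester.v1")  # hypothetical name, not a built-in
def build_noun_chunk_suggester():
    def suggest(docs: List[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []    # flattened (start, end) token offsets across all docs
        lengths = []  # number of candidate spans per doc
        for doc in docs:
            # Assumes noun_chunks is available, i.e. the doc has been parsed.
            doc_spans = [(chunk.start, chunk.end) for chunk in doc.noun_chunks]
            spans.extend(doc_spans)
            lengths.append(len(doc_spans))
        lengths_array = ops.asarray(lengths, dtype="i")
        if spans:
            data = ops.xp.asarray(spans, dtype="i")
        else:
            data = ops.xp.zeros((0, 0), dtype="i")
        return Ragged(data, lengths_array)

    return suggest
```

From there it could be referenced in the `[components.spancat.suggester]` block of the training config via `@misc = "noun_chunk_suggester.v1"`.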
If you use a suggester that generates spans up to 3 tokens long but your gold spans are longer, the model will be unable to learn the true annotations and you will get bad performance. On the other hand, supporting very long spans can use too much memory, so when configuring a model you have to decide what maximum length is right for your data (and it will typically be a lot longer than 3).
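A practical way to choose is to check the lengths of the gold spans in your training corpus first. A minimal sketch, assuming a serialized corpus at `train.spacy` (a hypothetical path) and the default spans key `"sc"`:

```python
from collections import Counter

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
# The path and spans key below are assumptions; adjust them to your setup.
doc_bin = DocBin().from_disk("train.spacy")

lengths = Counter()
for doc in doc_bin.get_docs(nlp.vocab):
    for span in doc.spans.get("sc", []):
        lengths[len(span)] += 1

# The longest gold span puts a lower bound on the sizes the suggester needs.
print(lengths.most_common())
```

If the distribution has a long tail, that's often a sign that enumerating ngrams will be wasteful and a more targeted candidate generator is worth considering.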
-
First of all, I'd like to thank the spaCy team for this wonderful project. I'm very excited to use the new SpanCategorizer component, because having probabilities for predicted entities is something I have needed. I have explored the example project, but am somewhat confused by the built-in suggester function (`spacy.ngram_suggester.v1`). It seems that this will just create n-grams (defaulting to 1-grams, 2-grams, and 3-grams) over all tokens, and then have the model predict whether each of these spans is actually an entity?
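For concreteness, here is a small sketch of my understanding, building the suggester through the public registry (please correct me if I've misread it):

```python
import spacy
from spacy.util import registry

nlp = spacy.blank("en")
doc = nlp("The quick brown fox jumps over the lazy dog")

# Build the built-in suggester with its default ngram sizes.
build_suggester = registry.misc.get("spacy.ngram_suggester.v1")
suggester = build_suggester(sizes=[1, 2, 3])

# The suggester returns a Ragged array of (start, end) token offsets;
# the model then scores each of these candidate spans.
candidates = suggester([doc])
for start, end in candidates.data.tolist():
    print(doc[start:end])
```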
If my understanding above is right, how does it work with annotations from prodi.gy? The picture from the prodi.gy nightly build announcement shows spans that are many tokens long, and it seems that using `spacy.ngram_suggester.v1` to enumerate many possible token lengths would not only hurt performance but also cost some generalization.

I realize that this is an experimental feature, and it seems that `spacy.ngram_suggester.v1` is a simple baseline (with fairly good results on the experimental project), but I am very interested to know if the team could share the direction they are going with this. Is the expectation that creating the suggester function is a task-specific job, and therefore there are no plans to go beyond the `spacy.ngram_suggester.v1` model, or can we expect more to come from the team? Thanks in advance for clarifying any misconceptions I have.