Paragraph vs sentence context in NER #10745
-
Hi, I have seen some comments in the discussion threads along the lines of "paragraphs give better context for NER prediction than sentences", and I was wondering why that is the case. If I understand spaCy's MaxoutWindowEncoder.v2 model correctly, it (by default) uses a window size of 1, meaning context is derived from one word on each side. It also has 4 layers, so in total context would be derived from four neighbouring words on each side. It then seems to me that the NER prediction should only depend on those surrounding words, and any additional text is irrelevant?

One thing that might break this reasoning is that in the attention layer, the model considers the previously tagged entity. This introduces dependencies beyond just the surrounding words. Is this the reason why paragraphs are expected to give better results than sentences?

Does this also depend on how the training data was split? E.g. if my training data was one sentence per example, does that affect how the model will respond to longer texts?

I have trained a custom NER model, and a peculiar thing happens at prediction time (a made-up example below). Let's say I want to predict Names and my input sentence is "My name is John and I live in a small town called Sutherland". The model correctly tags John as a Name. However, if I include additional text, like "My name is John and I live in a small town called Sutherland. I love spaCy!", the model is no longer able to identify a named entity. I can't see why this would happen.

Any help clarifying these points would be really appreciated.
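To make the receptive-field reasoning above concrete, here is a minimal sketch of the arithmetic (my own illustration, not spaCy code): stacking `depth` convolutional layers, each seeing `window_size` tokens on either side, grows the receptive field additively.

```python
# Receptive-field arithmetic for a stacked window encoder such as
# spaCy's MaxoutWindowEncoder: each layer sees `window_size` tokens
# on either side of a position, and stacking `depth` layers grows
# that reach additively.
def receptive_field(window_size: int, depth: int) -> int:
    """Tokens visible on each side of the target token."""
    return window_size * depth

# Settings discussed above: window_size=1, depth=4.
each_side = receptive_field(window_size=1, depth=4)
print(each_side)      # tokens of context on each side
print(2 * each_side)  # total surrounding tokens, both sides combined
```

So under this reading the encoder alone sees at most a fixed number of surrounding tokens, which is what makes the paragraph-vs-sentence difference puzzling.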
Replies: 1 comment
-
My intuition about paragraphs vs sentences is that using paragraphs makes the model less sensitive to potentially ambiguous sentence boundaries. A simple example: if you only train on sentences, then periods will only ever appear as end-of-sentence tokens in final position, which isn't true in longer text. There are probably other things going on that are hard to pin down, but even if we can't explain everything, it has been our experience that paragraphs work better for training. This also matches the principle that training text should resemble input text. If you're unsure whether it helps, I recommend trying both approaches and measuring performance.
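To make the boundary point concrete, here's a toy sketch (naive whitespace tokenization and made-up example texts, purely for illustration): with one sentence per training example, every period token lands in final position, whereas a paragraph-sized example also contains mid-text periods.

```python
# With one sentence per example, "." is always the final token;
# in a paragraph-sized example it also occurs mid-text, so the
# model must learn sentence boundaries from context, not position.
def period_positions(text: str):
    # Naive tokenization, just for this sketch.
    tokens = text.replace(".", " .").split()
    return [i for i, tok in enumerate(tokens) if tok == "."], len(tokens)

sentence_examples = ["My name is John.", "I live in Sutherland."]
paragraph_example = "My name is John. I live in Sutherland."

# Sentence-level examples: the period is always the last token.
for s in sentence_examples:
    positions, n = period_positions(s)
    assert positions == [n - 1]

# Paragraph-level example: the first period occurs mid-text.
positions, n = period_positions(paragraph_example)
assert positions[0] < n - 1
```

A model trained only on the sentence-level examples could get away with keying on absolute position rather than context, and that shortcut breaks on longer inputs.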
That is strange; I'm honestly not sure what could be happening there. Especially given it's a made-up example, I can't really speculate on the cause; sometimes models just make weird errors. See #3052.