Paragraph vs sentence context in NER #10745
-
Hi, I have seen some comments in the discussion threads along the lines of "paragraphs give better context for NER prediction than sentences", and I was wondering why that is the case. If I understand spaCy's MaxoutWindowEncoder.v2 model correctly, it (by default) uses a window size of 1, meaning context is derived from one word on each side. It also has 4 layers, so in total context would be derived from four neighbouring words on each side. It then seems to me that the NER prediction should only depend on those surrounding words, and any additional text is irrelevant?

One thing that might break this reasoning is that in the attention layer, the model considers the previously tagged entity. This introduces dependencies beyond just the surrounding words. Is this the reason why paragraphs are expected to give better results than sentences?

Does this also depend on how the training data was split? E.g. if my training data was one sentence per example, does that affect how the model will respond to longer texts?

I have trained a custom NER model, and a peculiar thing happens at prediction time (a made-up example below). Let's say I want to predict Names and my input sentence is "My name is John and I live in a small town called Sutherland". The model correctly tags John as a Name. However, if I include additional text, like "My name is John and I live in a small town called Sutherland. I love spaCy!", the model is no longer able to identify a named entity. I can't see why this would happen.

Any help clarifying these points would be really appreciated.
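To make the receptive-field reasoning above concrete, here is a minimal sketch of the arithmetic (my own illustration, not spaCy code): stacking `depth` convolutional layers, each seeing `window_size` tokens on either side, grows the receptive field additively.

```python
# Receptive-field arithmetic for a stacked window encoder such as
# spaCy's MaxoutWindowEncoder: each layer sees `window_size` tokens
# on either side of a position, and stacking `depth` layers grows
# that reach additively.
def receptive_field(window_size: int, depth: int) -> int:
    """Tokens visible on each side of the target token."""
    return window_size * depth

# Settings discussed above: window_size=1, depth=4.
each_side = receptive_field(window_size=1, depth=4)
print(each_side)      # tokens of context on each side
print(2 * each_side)  # total surrounding tokens, both sides combined
```

So under this reading the encoder alone sees at most a fixed number of surrounding tokens, which is what makes the paragraph-vs-sentence difference puzzling.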
Replies: 1 comment
-
My intuition about paragraphs vs sentences is that using paragraphs makes the model less sensitive to potentially ambiguous sentence boundaries. A simple example: if you only train on sentences, then periods will only ever appear as end-of-sentence tokens in final position, which isn't true in longer text. There are probably other things going on that are hard to pin down, but even if we can't explain everything, it has been our experience that paragraphs work better for training. This also matches the principle that training text should resemble input text. If you're unsure whether it helps, I recommend trying both approaches and measuring performance.
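To make the boundary point concrete, here's a toy sketch (naive whitespace tokenization and made-up example texts, purely for illustration): with one sentence per training example, every period token lands in final position, whereas a paragraph-sized example also contains mid-text periods.

```python
# With one sentence per example, "." is always the final token;
# in a paragraph-sized example it also occurs mid-text, so the
# model must learn sentence boundaries from context, not position.
def period_positions(text: str):
    # Naive tokenization, just for this sketch.
    tokens = text.replace(".", " .").split()
    return [i for i, tok in enumerate(tokens) if tok == "."], len(tokens)

sentence_examples = ["My name is John.", "I live in Sutherland."]
paragraph_example = "My name is John. I live in Sutherland."

# Sentence-level examples: the period is always the last token.
for s in sentence_examples:
    positions, n = period_positions(s)
    assert positions == [n - 1]

# Paragraph-level example: the first period occurs mid-text.
positions, n = period_positions(paragraph_example)
assert positions[0] < n - 1
```

A model trained only on the sentence-level examples could get away with keying on absolute position rather than context, and that shortcut breaks on longer inputs.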
That is strange; I'm honestly not sure what could be happening there. Especially given it's a made-up example, I can't really speculate on the cause; sometimes models just make weird errors. See #3052.