My intuition about paragraphs vs. sentences is that training on paragraphs makes the model less sensitive to potentially ambiguous sentence boundaries. A simple example: if you only train on sentences, a sentence-final period is always the last token of the input, which isn't true in longer text. There are probably other things going on, but they're hard to pin down. Even if we can't explain everything, it has been our experience that paragraphs work better for training. This also matches the general principle that training text should look like the text you'll run the model on.
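To make the period example concrete, here is a small sketch in plain Python. The paragraph, the crude regex tokenizer, and the naive sentence splitter are all made up for illustration and are not how spaCy tokenizes or segments text. It just measures how often a period is the last token when the same text is presented as sentences versus as one paragraph.

```python
import re

# Made-up example text; any multi-sentence paragraph would do.
paragraph = (
    "The model was trained in Berlin. "
    "It was evaluated in Paris. "
    "Results were mixed."
)

def tokenize(text):
    # Crude illustrative tokenizer: runs of word characters,
    # or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def period_is_final_ratio(texts):
    # Fraction of "." tokens that are the very last token of their text.
    final = 0
    total = 0
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == ".":
                total += 1
                if i == len(tokens) - 1:
                    final += 1
    return final / total

# Naive sentence split: break on whitespace that follows a period.
sentences = re.split(r"(?<=\.)\s+", paragraph)

sentence_ratio = period_is_final_ratio(sentences)
paragraph_ratio = period_is_final_ratio([paragraph])

print(f"sentence-level:  {sentence_ratio:.2f}")
print(f"paragraph-level: {paragraph_ratio:.2f}")
```

With sentence-level inputs every period is in final position, so a model never sees a sentence-ending period followed by more text. With the paragraph, most periods are followed by more tokens.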

If you're unsure whether it helps, I recommend you try both approaches and measure performance.
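One way to set up that comparison is to derive a sentence-level copy of your paragraph-level annotations, train on each, and evaluate both on the same held-out data. Below is a minimal sketch, assuming annotations are stored as `(start, end, label)` character offsets. The helper names `sentence_spans` and `to_sentence_examples`, the example text, and the naive splitting rule are hypothetical. The output would still need converting into whatever training format you use.

```python
import re

def sentence_spans(text):
    # Naive rule (an assumption for this sketch): a sentence ends at
    # ".", "!" or "?" followed by whitespace. Real segmentation is harder.
    start = 0
    for match in re.finditer(r"(?<=[.!?])\s+", text):
        yield start, match.start()
        start = match.end()
    if start < len(text):
        yield start, len(text)

def to_sentence_examples(text, entities):
    # Split one paragraph-level example into sentence-level examples,
    # shifting entity offsets so they are relative to each sentence.
    # Entities that cross a sentence boundary are dropped.
    examples = []
    for s_start, s_end in sentence_spans(text):
        ents = [
            (a - s_start, b - s_start, label)
            for a, b, label in entities
            if a >= s_start and b <= s_end
        ]
        examples.append((text[s_start:s_end], ents))
    return examples

# Made-up paragraph-level example with character-offset annotations.
paragraph = "Acme opened an office in Oslo. Jane Doe will lead it."
entities = [(0, 4, "ORG"), (25, 29, "GPE"), (31, 39, "PERSON")]

paragraph_examples = [(paragraph, entities)]
sentence_examples = to_sentence_examples(paragraph, entities)

for text, ents in sentence_examples:
    print(repr(text), [(text[a:b], label) for a, b, label in ents])
```

Keeping the evaluation set identical for both runs (ideally in the same shape as your real input) is what makes the accuracy numbers comparable.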

Let's say I want…

Answer selected by polm
Labels: feat / ner (Feature: Named Entity Recognizer), perf / accuracy (Performance: accuracy)