Do I need null entities in my training dataset for NER? #9272
jhutton1121
started this conversation in
Help: Best practices
-
The model also learns which kinds of tokens aren't entities, so it is usually helpful to include training examples without entities, especially if your future input data may contain examples without entities. (Usually you want your training data to be representative of your future input, so if you expect both texts with and without entities, also train on texts with and without. However, if you're also doing the same filtering on your future input, then you might have better results by only training on similarly filtered data.)
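As a rough illustration of keeping the training mix representative, here is a minimal sketch that adds entity-free texts to an existing set of labeled examples. The helper name `mix_examples` and the 30% share of entity-free texts are assumptions made for the example, not values from this thread; the right share depends on what your real input looks like.

```python
import random


def mix_examples(with_entities, without_entities, negative_fraction=0.3, seed=0):
    """Return a shuffled list in which roughly `negative_fraction` of the
    examples have no entities. Hypothetical helper, for illustration only."""
    if not 0 <= negative_fraction < 1:
        raise ValueError("negative_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # How many entity-free texts are needed to reach the requested share.
    wanted = round(len(with_entities) * negative_fraction / (1 - negative_fraction))
    negatives = rng.sample(without_entities, min(wanted, len(without_entities)))
    # Entity-free texts get an explicit empty annotation.
    mixed = list(with_entities) + [(text, {"entities": []}) for text in negatives]
    rng.shuffle(mixed)
    return mixed


# Made-up data: 7 labeled comments and 10 comments without tickers.
positives = [(f"Holding OPK, day {i}", {"entities": [(8, 11, "TICK")]}) for i in range(7)]
plain_texts = [f"Just a comment with no ticker, number {i}" for i in range(10)]
train = mix_examples(positives, plain_texts, negative_fraction=0.3)
```

If the same entity ruler filter is applied to future input before the model sees it, setting `negative_fraction=0` would match the "similarly filtered" case described above.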
-
I am trying to train a model that can pull out stock tickers from text on r/wallstreetbets.
My process so far for gathering training data has been creating a large entity ruler from a somewhat filtered list of stocks from the NYSE and Nasdaq. From there, I am querying the Pushshift API to get comments that mention each stock in the entity ruler.
Then when I have a bunch of comments that mention stocks, I use the entity ruler and get ~40,000 training examples of labeled data like so:
['Dude I’ve been holding OPK and AOBC for like a year now.. plus AAOI for months lmao Down like 8k in total ', {'entities': [(23, 26, 'TICK'), (64, 68, 'TICK')]}]
These are then converted into the spaCy binary format. This works pretty well for creating labeled training data, though not perfectly. My initial model works well on the data it has been exposed to, but it can't seem to generalize and pull out things it hasn't seen, like index funds, e.g. SPY, ARKK, etc.
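For reference, the conversion step can be sketched roughly as follows with spaCy v3's `DocBin`. This is a minimal sketch, not the exact script used here: the helper name `to_docbin`, the sample texts, and the output path in the note below are made up for illustration.

```python
import spacy
from spacy.tokens import DocBin


def to_docbin(examples, nlp):
    """Convert (text, {"entities": [(start, end, label), ...]}) pairs
    into a DocBin. Hypothetical helper, for illustration only."""
    db = DocBin()
    for text, annotations in examples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is None:
                # Offsets do not line up with token boundaries; skip the span.
                continue
            spans.append(span)
        # An empty list is fine here: the doc is stored without entities.
        doc.ents = spans
        db.add(doc)
    return db


nlp = spacy.blank("en")
examples = [
    ("Been holding OPK for like a year now", {"entities": [(13, 16, "TICK")]}),
    ("This is an example of a post with no entities", {"entities": []}),
]
db = to_docbin(examples, nlp)
```

Calling `db.to_disk("./train.spacy")` afterwards would write the file that `spacy train` reads. Counting how often `char_span` returns `None` is also a cheap way to spot labels whose character offsets have drifted.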
My process for the model was loading pretrained gensim vectors into a blank English (`en`) spaCy model with an NER pipe. The vectors aren't perfect and I am working on refining them, but I want to improve my training dataset.
Do I need to include "null" entities in my dataset, like this?
['This is an example of a post with no entities', {'entities':[]}]
Would this help my model generalize more?