Do I need null entities in my training dataset for NER? #9272

jhutton1121 · 2021-09-22T15:22:35Z

I am trying to train a model that can pull out stock tickers from text on r/wallstreetbets.

My process so far for gathering training data has been creating a large entity ruler from a somewhat filtered list of stocks from the NYSE and Nasdaq. From there, I am querying the Pushshift API to get comments that mention each stock in the entity ruler.

Then when I have a bunch of comments that mention stocks, I use the entity ruler and get ~40,000 training examples of labeled data like so:

['Dude I’ve been holding OPK and AOBC for like a year now.. plus AAOI for months lmao Down like 8k in total ', {'entities': [(23, 26, 'TICK'), (64, 68, 'TICK')]}]
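For readers who want to see how character offsets like these come about, here is a minimal sketch of dictionary-based labeling in plain Python. It is a stand-in for the entity ruler, not the code used in this post: the `TICKERS` set, the `label_text` helper, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical ticker list; the real one came from filtered NYSE/Nasdaq listings.
TICKERS = {"OPK", "AAOI", "SPY"}

# Match whole, case-sensitive words only, so "SPY" matches but "spy" does not.
TICKER_RE = re.compile(
    r"\b(" + "|".join(sorted(re.escape(t) for t in TICKERS)) + r")\b"
)


def label_text(text):
    """Return (text, {"entities": [(start, end, "TICK"), ...]})."""
    entities = [(m.start(), m.end(), "TICK") for m in TICKER_RE.finditer(text)]
    return (text, {"entities": entities})


labeled = label_text("Holding OPK and AAOI for months")
unlabeled = label_text("Nothing to see here")
print(labeled)
print(unlabeled)
```

A text with no matches comes back with an empty `entities` list, which is the same shape as the "null" example asked about further down.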

These are then converted into the spaCy binary format. This works pretty well for creating labeled training data, though not perfectly. My initial model works well on the data it has been exposed to, but it can't seem to generalize and pull out tickers it hasn't seen, such as index funds (e.g. SPY, ARKK).

My process for the model was loading pretrained gensim vectors into a blank `en` spaCy model with an NER pipe. The vectors aren't perfect and I am working on refining them, but I also want to improve my training dataset.

Do I need to include "null" entities in my dataset, like

['This is an example of a post with no entities', {'entities':[]}]

Would this help my model generalize more?

adrianeboyd · 2021-09-23T08:12:27Z

The model also learns which kinds of tokens aren't entities, so it is usually helpful to include training examples without entities, especially if your future input data may contain examples without entities. (Usually you want your training data to be representative of your future input, so if you expect both texts with and without entities, also train on texts with and without. However, if you're also doing the same filtering on your future input, then you might have better results by only training on similarly filtered data.)
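As a rough illustration of that advice, the sketch below mixes a sample of entity-free texts into the labeled ones before conversion to the training format. The `mix_examples` helper, the sample texts, and the choice of `n_without` are assumptions made for the example; the right proportion depends on what the future input looks like.

```python
import random


def mix_examples(with_entities, without_entities, n_without, seed=0):
    """Combine labeled examples with a random sample of entity-free ones."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n = min(n_without, len(without_entities))
    sampled = rng.sample(without_entities, n)
    combined = list(with_entities) + sampled
    rng.shuffle(combined)  # avoid grouping all entity-free texts together
    return combined


positives = [("Buying OPK today", {"entities": [(7, 10, "TICK")]})]
negatives = [
    ("This is an example of a post with no entities", {"entities": []}),
    ("Markets are closed on Monday", {"entities": []}),
]

train = mix_examples(positives, negatives, n_without=1)
print(len(train), "examples in total")
```

If the future input is filtered the same way as the training data, `n_without` can simply be set to 0.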

4 replies

jhutton1121 Sep 23, 2021
Author

Thanks. It will encounter texts as they come from Reddit, basically, so there's no guarantee of entities. Do you know how much data I need? I have 38,000 or so examples with entities. Would several hundred suffice or should I aim for several thousand? It isn’t much work to get them.

polm Sep 26, 2021

Since this is a relatively simple case you should be able to learn a model with a few hundred examples, but in general more data is better, so definitely make more if it's not hard.

Actually, one thing I'm not clear about is what the NER model gives you over the EntityRuler. Does it help you weed out stuff like USA or HTML that isn't a stock symbol?

jhutton1121 Sep 26, 2021
Author

I thought about that at first and tried it, but quickly realized that with IPOs, foreign stocks, coins, etc. that come and go, as well as delisted stocks, it’s really hard to maintain a good enough entity ruler. Generalization is what I’m going for, so that it can, for example, encounter something like “I think SHIB will go to the moon again!” and pull out SHIB even though it hasn’t seen it.

Finally, some tickers are also ordinary words, like SPY. While most people on wsb are talking about the SPY index, I have seen it used to refer to geopolitical spies as well. That’s just one example of homonyms, which are much more common with thematic ETFs.

polm Sep 27, 2021

Huh, I'm surprised, but glad it's working!
