Some questions on best practices for NER #7949

satbalak · 2021-04-29T12:00:50Z


Hello Gurus,
I am just starting off with NLP and have been getting my feet wet over the last few weeks with spaCy and transformers. Prior to this, I completed the Deep Learning specialization on Coursera by Andrew Ng, so I do have some background. Apart from this, I have done other ML work, but not in NLP.
Our goal: To do information extraction from financial documents
Questions:
The financial documents have statements like "Number of shares sold: 4500" or "Number of shares purchased: 300". Here we are interested in extracting two pieces of information: that the document contains "Number of shares sold", and that the actual number sold is 4500. For this, we are considering one of two options:

1. Annotate (using labelstudio or ubiai or whatever) that 4500 with a label "NUM_SHARES_SOLD" and 300 with "NUM_SHARES_PURCH". Then run it through an NER using en_core_web_trf and get the two pieces of information (or one or none, depending on whether they are present in the doc).
2. Annotate the pieces of text "Number of shares sold" and "Number of shares purchased", get these named entities, and fetch the cardinal following these entities.
Any advice on which approach is better or am I getting it completely wrong?

If I train with multiple documents and there is a totally new document with a different heading like, let's say, "Stocks sold" or "Stocks bought", ideally the NER should be able to correctly identify these differences, right?

When I refer to the best practices that Andrew Ng talked about in the "Improving Deep NN and Hyperparameter tuning" course, I recall that the following hyperparameters can be tuned:

Reduce bias and variance by either getting more data or by changing the NN architecture. When I used the spaCy NER example project, I was not able to figure out how to do this. Any good videos or blog posts that explain this well?
Regularization to prevent overfitting: In this case, I do see dropout in the filled config file under [training]. I believe this is for the same purpose?
Weight initialization: Since we are using a pre-trained roberta-base transformer model, I am guessing this is already done?
Mini batch size and epochs: I see eval_frequency under [training] set to 200 and when I train I see the output as below

E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0         669.65    704.49    0.37    0.19    5.56    0.00
 21     200       61600.22  36868.68    9.27    8.86    9.72    0.09
 42     400        7553.26   4897.89   11.19   11.27   11.11    0.11

I trust this is the 200 that controls this? And E is Epochs? Not sure how in 200 mini batches there are 21 epochs?? Is this because I do not have more than 200 sentences in each training example?
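
A rough way to read the table above, assuming (as in the spaCy v3 training log) that the E column counts epochs and the # column counts training steps, i.e. batches processed so far: dividing steps by epochs estimates how many batches make up one pass over the training data. The helper name below is hypothetical and only illustrates the arithmetic.

```python
def batches_per_epoch(steps: int, epochs: int) -> float:
    """Estimate how many batches one pass over the training data takes."""
    return steps / epochs


# (E, #) pairs copied from the two non-zero rows of the log above.
log_rows = [(21, 200), (42, 400)]

for epoch, step in log_rows:
    estimate = round(batches_per_epoch(step, epoch), 1)
    print(f"epoch {epoch}, step {step}: about {estimate} batches per epoch")
```

Under that reading, an epoch here is only about 9 to 10 batches, which would suggest a small training set rather than anything wrong with eval_frequency.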

I did read this, but could not figure out the above.
Any help would be greatly appreciated!
Thanks,
Satya

polm · 2021-05-03T06:21:38Z


Hey, you have a lot of questions here, so I'm going to address just a few points. I also suggest you read these slides.

The financial documents have statements like "Number of shares sold: 4500"

We need more information about the kinds of variations in format you expect to see to give advice on this. In general I do not expect that tagging the number would allow you to get meaningful labels. Things like "number of shares sold" and "number of shares purchased" also aren't really good named entities.

If your documents are always formatted like Explanation: 12345 then you should consider just extracting the number and explanation and seeing how well simple token matching works. If you want to improve on that I would try textcat on the explanation. But from your description it's not clear if that's actually the format of all your examples or not.
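
As a minimal sketch of the "just extract the explanation and the number" idea, assuming every line of interest really has the form Explanation: 12345, a plain regular expression may already be enough. The pattern, the function name extract_fields, and the sample text are illustrative assumptions, not part of spaCy or of the original documents.

```python
import re

# Assumed line format: some label text, a colon, then an integer
# (thousands separators such as "4,500" are tolerated).
LINE_PATTERN = re.compile(
    r"^[ \t]*(?P<label>[^:\n]+?)[ \t]*:[ \t]*(?P<value>\d[\d,]*)[ \t]*$",
    re.MULTILINE,
)


def extract_fields(text: str) -> dict[str, int]:
    """Map each lower-cased label to the integer that follows its colon."""
    fields = {}
    for match in LINE_PATTERN.finditer(text):
        label = match.group("label").strip().lower()
        value = int(match.group("value").replace(",", ""))
        fields[label] = value
    return fields


sample = "Number of shares sold: 4500\nNumber of shares purchased: 300\n"
print(extract_fields(sample))
```

If the labels vary between documents ("Stocks sold", "Stocks bought"), the extracted label strings are what a text classifier could then be tried on, as suggested above.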

Annotate (using labelstudio or ubiai or whatever)

I would encourage you to consider Prodigy.

When I refer to the best practices that Andrew Ng talked about in the "Improving Deep NN and Hyperparameter tuning" course, I recall that the following hyper parameters can be tuned

Get a working system before you worry about hyperparameters. Tuning a model is one of the last things you do to get a little more improvement out of a running system.

2 replies

satbalak May 4, 2021
Author

Thanks for your response. I have looked at the 3 NER tutorials in this github (ner_drugs, ner_fashion_brands, ner_food_ingredients) and I also saw one more example https://towardsdatascience.com/how-to-fine-tune-bert-transformer-with-spacy-3-6a90bfe57647 (where he extracts entities like SKILLS, DIPLOMA, DIPLOMA_MAJOR, EXPERIENCE from job descriptions).
I also notice that you are saying --> Things like "number of shares sold" and "number of shares purchased" also aren't really good named entities.
Would it be a fair statement to say Named Entities would be proper nouns of a particular type? (like drugs, fashion brands, food ingredients, skills etc).
We do have an alternate approach of getting required information out of these documents using Matcher and regular entities like CARDINAL, MONEY and getting the information out.

I do take your point on hyperparameter tuning and the slide deck was incredibly helpful! Thank you very much!

polm May 4, 2021

Would it be a fair statement to say Named Entities would be proper nouns of a particular type?

Traditionally yes, the main Named Entity types are personal names, place names, organizational names. Dates and numbers are kind of shoe-horned in since they're easy to detect.

Some things that make named entities easier to detect are clear orthographic properties (like capitalization), clear boundaries, being short. So something like "personal name" is pretty easy for many cases, while something like "cause of the problem" is not.

Even if something seems hard you can certainly try it, but if you have a hard time getting annotators to agree on boundaries that can be a sign it won't make a good named entity. (There are some learning strategies that try to deal with ambiguous entity boundaries, but spaCy actually uses a strategy that's quite strict because it's helpful on traditional entities.)

This NER flowchart by Ines might also be helpful.

https://twitter.com/_inesmontani/status/1130784652594229248
