The golden rule of training data is that the more it is like your real data, the better your results will be.

When reading a sentence, the context around the entities can be as important as the entities themselves. For example, if you say, "I went to XXX for vacation", we can guess that XXX is a location. If your training data doesn't have any of that context, it will be hard to learn from. Building sentences out of "random words" will not help.
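
For illustration, here's a minimal sketch of what a context-rich annotation looks like in spaCy v3 (the sentence, the character offsets, and the "LOC" label are placeholder assumptions, not anything from your data):

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

# The entity appears in a natural sentence, so the surrounding words
# ("went to ... for vacation") carry the signal the model learns from.
text = "I went to Kyoto for vacation."
doc = nlp.make_doc(text)
example = Example.from_dict(doc, {"entities": [(10, 15, "LOC")]})

# By contrast, dropping the same entity into a string of random filler
# words gives the model no usable context, even if the span is correct.
```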

If you only have a list of keywords, you may be able to use a rule-based matcher or do weak supervision.
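
If you go the rule-based route, spaCy's EntityRuler is one way to turn a keyword list into entity annotations. Below is a minimal sketch; the "LOC" label and the keywords are placeholder assumptions:

```python
import spacy

nlp = spacy.blank("en")

# Turn a plain keyword list into matcher patterns.
keywords = ["Kyoto", "Lake Tahoe"]  # placeholder keyword list
patterns = [{"label": "LOC", "pattern": kw} for kw in keywords]

ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

doc = nlp("I went to Kyoto for vacation.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Expected: [('Kyoto', 'LOC')]
```

One common way to do weak supervision is to run rules like these over raw text from your own domain and treat the matches as (noisy) training annotations, which keeps the natural sentence context around each entity.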
