Training Data Sample for NER #9495
-
I am trying to build a custom NER model. The data is scraped from the internet and cleaned. I have several thousand documents of this type for training. I annotated the data with an annotation tool, and one sample in spaCy training format is given.
I would like to know whether NER will work if I train on this much larger, unstructured data. Any suggestions for improving the annotated data above are also welcome.
Replies: 2 comments 17 replies
-
You can use documents of that length, but in general it's easier to work with documents if you cut them down to paragraph length. In your case there don't seem to be real paragraphs, but it looks like you could split the data into lines without losing information.
There are other issues with your data. A lot of lines are irrelevant and could be pre-filtered, which would make the rest of your task much easier. You can filter lines by removing set phrases or overly short lines.
The show title is the first line, but there's not any useful context for the model to learn, or any useful keywords really. So if most of your data looks like that, it won't help.
You're labelling the show start time and date, but those are the only dates in the text except for years (season and copyright), which can be filtered out easily. So you should be able to just use the date label in the pretrained model.
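The line splitting and pre-filtering suggested above could look roughly like the sketch below. It uses only the Python standard library. The function name `split_and_filter`, the phrases in `SET_PHRASES`, the `MIN_TOKENS` threshold and the sample text are all hypothetical placeholders, not values taken from the original data, so adjust them to whatever boilerplate actually shows up in your scrape.

```python
import re

# Hypothetical boilerplate phrases; replace with the set phrases in your own data.
SET_PHRASES = ("all rights reserved", "sign in", "subscribe")
# Assumed threshold for what counts as an "overly short" line.
MIN_TOKENS = 3


def split_and_filter(document, set_phrases=SET_PHRASES, min_tokens=MIN_TOKENS):
    """Split a scraped document into lines and drop lines that look irrelevant."""
    kept = []
    for raw_line in document.splitlines():
        # Collapse runs of whitespace so token counts are stable.
        line = re.sub(r"\s+", " ", raw_line).strip()
        # Drop empty and overly short lines.
        if len(line.split()) < min_tokens:
            continue
        # Drop lines containing a known set phrase (case-insensitive).
        lowered = line.lower()
        if any(phrase in lowered for phrase in set_phrases):
            continue
        kept.append(line)
    return kept


# Made-up sample text, for illustration only.
sample = (
    "Some Show Title Here\n\nSign in\n"
    "New episode airs Friday at 9 pm\nOK\n(c) 2021 All rights reserved"
)
print(split_and_filter(sample))
```

Two caveats with this approach: a short show title on the first line would be dropped by the length threshold, so you may want to handle that line separately, and character offsets in existing annotations refer to the full document, so they would need to be recomputed for each line after splitting.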
-
Hi. The dev data supplied when training the NER model, which is used to compute the model's accuracy, should also be tagged data, right?