Training Data Sample for NER #9495

aniyyanz08 · 2021-10-18T12:43:57Z

aniyyanz08
Oct 18, 2021

I am trying to build a custom NER model. The data is scraped from the internet and cleaned. I have several thousand documents of this type for training. I annotated the data with an annotation tool; one sample in spaCy's training format is given below.

[('A Very SoNA Christmas\nView SoNA’s Covid Safety Policies\nSkip to Content\nAbout\nHistory Mission\nStaff Board\nMusic Director\nMusicians\nSoNA Singers\nAuditions\nHire Ensembles\nContact\n2021-22 Season\nSubscriber Series\nTicketed Performances\nSoNA Beyond Series\nVirtual Performances\nVirtual Performances\nSolos from Home\nSpecial Events\nFireworks at the Farm\nReimagined Celebration\nDonate\nGallery\nEducation\nBlog\nOpen Menu\nClose Menu\nAbout\nHistory Mission\nStaff Board\nMusic Director\nMusicians\nSoNA Singers\nAuditions\nHire Ensembles\nContact\n2021-22 Season\nSubscriber Series\nTicketed Performances\nSoNA Beyond Series\nVirtual Performances\nVirtual Performances\nSolos from Home\nSpecial Events\nFireworks at the Farm\nReimagined Celebration\nDonate\nGallery\nEducation\nBlog\nOpen Menu\nClose Menu\nFolder:\nAbout\nFolder:\n2021-22 Season\nSoNA Beyond Series\nFolder:\nVirtual Performances\nFolder:\nSpecial Events\nDonate\nGallery\nEducation\nBlog\nBack\nHistory Mission\nStaff Board\nMusic Director\nMusicians\nSoNA Singers\nAuditions\nHire Ensembles\nContact\nBack\nSubscriber Series\nTicketed Performances\nBack\nVirtual Performances\nSolos from Home\nBack\nFireworks at the Farm\nReimagined Celebration\nA Very SoNA Christmas\nJul 10, 2021\nWritten By SoNA\nSaturday, December 11, 2021 2PM 7:30PM Walton Arts Center, Fayetteville\nA mix of sacred and secular holiday favorites with local guest soloists, The SoNA Singers, and area high school and collegiate choruses. Saturday, December 11, 2021 2PM Matinee Performance Saturday, December 11, 2021 7:30PM Evening Performance\nBuy Tickets\nBuy Tickets\nSingle Tickets: 35, 45, 57 Under 18 FREE with purchase of adult ticket limited quantities Interested in a full season subscription Learn more here . Concert sponsored by Bogle Family Foundation\nWe are committed to ensuring that audiences can experience music safely in person at our performances. 
Until further notice, patrons, staff, and volunteers are required to wear masks. Learn more about our safety policy here .\nSoNA\nPrevious\nPrevious\nMozart and Beethoven\nNext\nNext\nSoNA Walton Arts Center present The Snowman: A Family Concert\nReceive the latest updates\nEmail Address\nSign Up\nThank you for joining our email list You should receive a verification email shortly to confirm.\nOffice: 479.521.4166 Tickets: 479.443.5600 infosonamusic.org\nCopyright 2021, SoNA. All rights reserved.\nSupport SoNA',
  {'entities': [(1958, 1962, 'organization'),
    (1230, 1236, 'performance_starttime'),
    (1343, 1359, 'organization'),
    (1208, 1225, 'performance_date'),
    (1237, 1255, 'auditorium'),
    (0, 21, 'production_name'),
    (1226, 1229, 'performance_starttime')]})]

I would like to know whether NER will work on unstructured documents this large. Any suggestions for improving the annotated data above would also be appreciated.

polm · 2021-10-19T03:34:46Z

polm
Oct 19, 2021

You can use documents of that length, but in general it's easier to work with documents if you cut them down to paragraph length. In your case there don't seem to be real paragraphs but it looks like you could split the data into lines without losing information.
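Splitting an annotated document into lines also means shifting the entity offsets so they are relative to each line. The following is a minimal sketch of that, assuming the `(text, {"entities": [...]})` tuple format from the question; the helper name `split_example_by_line` is made up for illustration, and entities that cross a line break are simply dropped.

```python
def split_example_by_line(text, entities):
    """Split one (text, entities) example into per-line examples.

    Offsets are shifted so they are relative to the start of their line.
    Entities that cross a line break are dropped.
    """
    examples = []
    line_start = 0
    for line in text.split("\n"):
        line_end = line_start + len(line)
        line_entities = [
            (start - line_start, end - line_start, label)
            for start, end, label in entities
            if start >= line_start and end <= line_end
        ]
        if line.strip():  # skip empty lines
            examples.append((line, {"entities": sorted(line_entities)}))
        line_start = line_end + 1  # step over the "\n" separator
    return examples


# Shortened, illustrative version of the sample document.
text = "A Very SoNA Christmas\nSkip to Content\nWalton Arts Center, Fayetteville"
entities = [(0, 21, "production_name"), (38, 56, "auditorium")]

for line, annotations in split_example_by_line(text, entities):
    print(repr(line), annotations)
```

It is worth checking after the split that `line[start:end]` still gives the expected entity text for every annotation.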

There are other issues with your data. A lot of lines are irrelevant and could be pre-filtered, which would make the rest of your task much easier. You can filter lines by removing set phrases or overly short lines.
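As a rough sketch of that kind of pre-filtering: the phrase list and the three-word minimum below are assumptions picked from the sample page, not recommended values, and `filter_lines` is a hypothetical helper name.

```python
# Set phrases seen in the sample page's navigation; extend for your own sites.
BOILERPLATE_PHRASES = {
    "skip to content",
    "open menu",
    "close menu",
    "back",
    "buy tickets",
    "donate",
}
MIN_WORDS = 3  # assumed threshold for "overly short" lines


def filter_lines(text, boilerplate=BOILERPLATE_PHRASES, min_words=MIN_WORDS):
    """Return the lines of text that are neither set phrases nor too short."""
    kept = []
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.lower() in boilerplate:
            continue  # known navigation or button text
        if len(stripped.split()) < min_words:
            continue  # too short to carry useful context
        kept.append(stripped)
    return kept


page = (
    "A Very SoNA Christmas\nSkip to Content\nAbout\nHistory Mission\n"
    "Saturday, December 11, 2021 2PM 7:30PM Walton Arts Center, Fayetteville\n"
    "Buy Tickets"
)
print(filter_lines(page))
```

A short-line filter can also remove short titles, so the threshold needs checking against real data. The same filter has to run before annotation and again at prediction time, so that the model sees the same kind of input in both cases.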

The show title is the first line, but there's not any useful context for the model to learn, or any useful keywords really. So if most of your data looks like that it won't help.

You're labelling the show start time and date, but those are the only dates in the text except for years (season and copyright), which can be filtered out easily. So you should be able to just use the date label in the pretrained model.
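Filtering out the year-only dates could look something like the sketch below. It assumes the entities arrive as `(text, label)` pairs, for example built from a pretrained spaCy pipeline with `[(ent.text, ent.label_) for ent in doc.ents]`; the input list here is written by hand for illustration and is not real model output.

```python
import re

# Matches a bare year ("2021") or a season range ("2021-22").
YEAR_ONLY = re.compile(r"^(?:\d{4}|\d{4}-\d{2})$")


def drop_year_only_dates(ents):
    """Remove DATE entities that consist only of a year or a season range."""
    return [
        (text, label)
        for text, label in ents
        if not (label == "DATE" and YEAR_ONLY.match(text.strip()))
    ]


# Hand-written example pairs, not actual pipeline predictions.
ents = [
    ("Saturday, December 11, 2021", "DATE"),
    ("2021-22", "DATE"),
    ("2021", "DATE"),
    ("SoNA", "ORG"),
]
print(drop_year_only_dates(ents))
```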

15 replies

aniyyanz08 Nov 4, 2021
Author

@polm I have annotated a sample document. I will share an image of it for easy understanding.

Here you can see the tagged data. The tags mark the items we need from that page. (This data comes from a single URL, as discussed in the thread above.)
My questions are:

1. "Rococo Variations" is tagged as production_name in the first line. It appears again in the second line and may appear elsewhere in the same document. Should we tag every occurrence of "Rococo Variations" in the document as production_name, or is one tag enough?
2. "Steven Jarvi, Conductor" is not associated with this data. But "Steven Jarvi" would get a name tag and "Conductor" would get another tag called creative team role. Should we tag these as well, even though they are not relevant to this document but would be relevant in other data?
3. "Ann Arbor Orchestra" also repeats, which brings me back to question 1.

polm Nov 4, 2021

The model will learn to reproduce the data it sees in training. That almost always means you should annotate every instance of something, so you should mark "Rococo Variations" in all locations. If you don't do that then your model will have conflicting information - sometimes "Rococo Variations" should be labelled, sometimes it shouldn't. How will the model tell the difference? Is there a reason only the first one is tagged in your data?

If you're lucky, the model will learn something like "production names occur on a line by themselves". If you're unlucky it will just behave strangely and unreliably.
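One way to catch mentions that were missed during annotation is to search the text for every occurrence of a phrase that has already been tagged. This is a minimal sketch using plain string matching; `tag_all_occurrences` is a hypothetical helper and the sample text is made up.

```python
import re


def tag_all_occurrences(text, phrase, label):
    """Return (start, end, label) for every whole-word occurrence of phrase."""
    # The lookarounds stop the phrase from matching inside a longer word.
    pattern = re.compile(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)")
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


# Made-up text for illustration.
text = (
    "Rococo Variations\n"
    "Rococo Variations with the Ann Arbor Orchestra\n"
    "Tickets for Rococo Variations"
)
spans = tag_all_occurrences(text, "Rococo Variations", "production_name")
for start, end, label in spans:
    print(start, end, label, repr(text[start:end]))
```

Blind string matching is only safe for phrases that are an entity wherever they appear, so the suggested spans are best reviewed by hand before they go into the training data.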

Regarding the name/role, it is not clear to me why it is not relevant in this document. If it's not clear to a person, how will it be clear to the model? It's not impossible for the model to make those kinds of distinctions but it sounds hard.

aniyyanz08 Nov 4, 2021
Author

There is no particular reason I tagged "Rococo Variations" only once. I thought tagging it once would be enough.
This page gives more details about Rococo Variations. Because the data is scraped, a page can point to other pages or include details from them, so Steven Jarvi/Conductor is associated with another page. But as I understand from your reply, we have to tag that too. So the conclusion is: whatever the data, anything in it that matches one of our tags has to be tagged, every time it appears.

aniyyanz08 Nov 9, 2021
Author

@polm Following your suggestion, I have retagged the data. Could you take a quick look at the tagged data and comment on the new tagging?

One more question: if we tag Steven Jarvi as a person, the same name may be mentioned elsewhere as just Steven or Jarvi. If we tag Steven Jarvi, Steven, and Jarvi all as persons, will that cause any issues?

polm Nov 9, 2021

Those annotations look good.

One more question: if we tag Steven Jarvi as a person, the same name may be mentioned elsewhere as just Steven or Jarvi. If we tag Steven Jarvi, Steven, and Jarvi all as persons, will that cause any issues?

It's fine to tag all those as PERSON.

The most important thing is that your input should be tagged the way you want the model to tag your output. Since "Steven Jarvi", "Steven", and "Jarvi" all refer to a person, they should all be tagged PERSON. This is not always true of substrings - for example, "Twelve Days of Christmas" could be a PRODUCTION, but "Christmas", "Days", and even "of" are of course not PRODUCTION on their own.
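One practical detail when tagging name variants by string search: "Steven" and "Jarvi" also match inside "Steven Jarvi", and spaCy's NER expects entity spans that do not overlap. A sketch of keeping only the longest span wherever candidates overlap is shown below; `keep_longest_spans` is a hypothetical helper and the sentence is made up.

```python
def keep_longest_spans(spans):
    """Drop any span that overlaps a longer span (earlier span wins ties)."""
    chosen = []
    # Visit longer spans first so they take priority over their substrings.
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= c_start or start >= c_end for c_start, c_end, _ in chosen):
            chosen.append((start, end, label))
    return sorted(chosen)


# Made-up sentence for illustration.
text = "Steven Jarvi, Conductor. Jarvi returns next season."
candidates = [
    (0, 12, "PERSON"),   # "Steven Jarvi"
    (0, 6, "PERSON"),    # "Steven" inside the full name
    (7, 12, "PERSON"),   # "Jarvi" inside the full name
    (25, 30, "PERSON"),  # standalone "Jarvi"
]
print(keep_longest_spans(candidates))
```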

shrinidhin · 2022-01-09T12:01:13Z

shrinidhin
Jan 9, 2022

Hi. The dev data given while training the NER model, which is used to compute the model's accuracy, should also be tagged data, right?

2 replies

polm Jan 11, 2022

Yes, the dev data needs labels.
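A simple way to get labelled dev data is to hold out part of the annotated examples. This is a minimal sketch, assuming the examples are already in a list; the 20% fraction and the fixed seed are arbitrary choices, and `train_dev_split` is a hypothetical helper.

```python
import random


def train_dev_split(examples, dev_fraction=0.2, seed=0):
    """Shuffle annotated examples and hold out a fraction as dev data."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder examples standing in for (text, {"entities": [...]}) tuples.
examples = [(f"document {i}", {"entities": []}) for i in range(10)]
train, dev = train_dev_split(examples)
print(len(train), len(dev))
```

If documents are split into lines first, it is probably safer to split by page, so that lines from one page do not end up in both the training and dev sets.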

I'm not sure why you commented on this issue instead of opening a new one. If in doubt about whether your question is related to an existing discussion, it's often better to open a new thread than commenting on another one, especially if the older thread isn't currently active. (Keep in mind that when you comment on a thread, other participants in the thread usually get notifications.)

shrinidhin Jan 13, 2022

Sure. I'll remember that. Thanks!
