Finding the similarity for web scraped data #9867

imhans33 · 2021-12-15T12:03:28Z

I am trying to compute document similarity in my application. The documents I am working with are text scraped from relevant webpages. I have a few questions about how to apply this. Valid and invalid sample data is attached for a quick look.

I am only interested in data from music-related websites, so the data is mostly about musical events (concerts, opera, etc.). In that case, if data from other areas comes in, is it possible to detect it?
I already have a model trained for our custom NER application. When loading a model to compute similarity, should I use the custom-trained model or a spaCy pretrained model to get better results?
Also, the data for a single URL can describe an event occurring at a future date and time (https://msorchestra.com/event/41st-annual-pepsi-pops-a-blast-in-the-park-3/), and that event can later be cancelled or postponed. Is there any way to detect such a change in the data, using comparison or any other solution available in spaCy?

Sample data is attached as files:
not_valid.txt
valid_text_1.txt
valid_text_2.txt

Answered by polm

polm · 2021-12-20T03:26:07Z

I am only interested in data from music-related websites, so the data is mostly about musical events (concerts, opera, etc.). In that case, if data from other areas comes in, is it possible to detect it?

This is pretty hard. Technically you can train a classifier, but since there's an infinite number of things that are "not music" it's not really guaranteed to work. Maybe you can filter your incoming data using keywords like "concert"?
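A keyword filter of that kind could look roughly like the sketch below. The keyword list, the function names, and the `min_hits` threshold are all assumptions for illustration, not something taken from the attached data; you would tune them on your own pages.

```python
import re

# Hypothetical keyword list; extend it from your own valid pages.
MUSIC_KEYWORDS = {"concert", "opera", "orchestra", "symphony", "recital"}


def music_keyword_hits(text):
    """Return the music keywords that appear as whole words in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return MUSIC_KEYWORDS & tokens


def looks_like_music_page(text, min_hits=1):
    """Keep a page only if it mentions at least `min_hits` distinct keywords."""
    return len(music_keyword_hits(text)) >= min_hits


if __name__ == "__main__":
    print(looks_like_music_page("The orchestra plays a summer concert in the park."))
    print(looks_like_music_page("Quarterly earnings rose by five percent."))
```

Note that this matches exact word forms only, so "concerts" would not count as a hit for "concert"; matching on lemmas or adding plural forms to the list would be one way around that.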

I already have a model trained for our custom NER application. When loading a model to compute similarity, should I use the custom-trained model or a spaCy pretrained model to get better results?

It would make sense for your custom model to do better, but you should try both and compare, since that shouldn't be difficult to do.
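One way to "try both" is to score the same hand-labeled text pairs with each candidate and see which one separates similar from dissimilar pairs more clearly. The sketch below is only a harness: `jaccard` is a stand-in similarity so the example runs without any model, and `PAIRS` and `separation` are hypothetical names with made-up example texts. With spaCy you would pass something like `lambda a, b: nlp(a).similarity(nlp(b))` once per pipeline instead. Keep in mind that `Doc.similarity` relies on vectors, which a pipeline trained only for NER may not include.

```python
from statistics import mean


def jaccard(a, b):
    """Stand-in similarity: token overlap between two texts (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def separation(similarity, pairs):
    """Mean score of pairs labeled similar minus mean score of pairs labeled different."""
    same = [similarity(a, b) for a, b, label in pairs if label]
    diff = [similarity(a, b) for a, b, label in pairs if not label]
    return mean(same) - mean(diff)


# Hypothetical hand-labeled pairs: (text_a, text_b, should_be_similar)
PAIRS = [
    ("orchestra concert tonight", "symphony orchestra concert", True),
    ("opera gala in the park", "opera gala tickets", True),
    ("orchestra concert tonight", "used car sale this weekend", False),
]

if __name__ == "__main__":
    # A larger value means the similarity function separates the two groups better.
    print(separation(jaccard, PAIRS))
```

Running `separation` once per model on the same pairs gives two numbers you can compare directly; a few dozen pairs drawn from the valid and invalid samples would likely be enough for a first impression.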

Also, the data for a single URL can describe an event occurring at a future date and time (https://msorchestra.com/event/41st-annual-pepsi-pops-a-blast-in-the-park-3/), and that event can later be cancelled or postponed. Is there any way to detect such a change in the data, using comparison or any other solution available in spaCy?

You can just check if the data at the URL changed, right? That's the first thing I would check.
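Checking whether the data at a URL changed can be as simple as storing a hash of the scraped text and comparing it on the next crawl. This is a minimal sketch using only the standard library; the function names are made up for illustration, and it assumes you already have the page text as a string from your scraper.

```python
import hashlib


def fingerprint(text):
    """Hash the page text after collapsing whitespace, so layout noise is ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, current_text):
    """True when the newly scraped text differs from the stored snapshot."""
    return fingerprint(current_text) != previous_fingerprint


if __name__ == "__main__":
    stored = fingerprint("Summer concert\n  in the park")
    print(has_changed(stored, "Summer concert in the park"))        # whitespace only
    print(has_changed(stored, "Summer concert in the park: CANCELLED"))
```

A hash only tells you that something changed, not what. If pages also contain rotating content such as ads or timestamps, you would probably need to hash only the extracted event fields rather than the whole page.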

In spaCy you could define an NER entity type for change/cancel events, I guess, but I would first see how effective a simple keyword check is.
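A simple keyword check for this could compare two snapshots of the same page and report status words that are new in the later one. The term list and function names below are assumptions for illustration; real pages may phrase cancellations differently, so treat this as a starting point.

```python
import re

# Hypothetical status terms; extend to match the wording used on your sites.
STATUS_PATTERN = re.compile(r"\b(cancell?ed|postponed|rescheduled)\b", re.IGNORECASE)


def status_terms(text):
    """Return the lowercased status terms found in the text."""
    return {match.group(1).lower() for match in STATUS_PATTERN.finditer(text)}


def new_status_terms(old_text, new_text):
    """Status terms present in the new snapshot but absent from the old one."""
    return status_terms(new_text) - status_terms(old_text)


if __name__ == "__main__":
    before = "Concert in the park. Gates open at 6 PM."
    after = "Concert in the park. This event has been POSTPONED."
    print(new_status_terms(before, after))
```

Comparing against the earlier snapshot avoids flagging pages that always mention a cancellation policy, since a term only counts when it newly appears.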
