NER extraction from website, with context sentences #9894

hmltn-0 · 2021-12-17T10:00:09Z

I would like to extract all terms qualifying as “places of interest” (specific zoos, water parks, specific tourist attractions) from this website: https://www.villaplus.com, and get an ideal context sentence/passage to show the use of the term and/or what it means.

Can anyone recommend a standard way to do this?

Is there a smart crawler that, like Google, crawls, searches and extracts at the same time?

Or should I crawl and dump the plain text first (maybe with Scrapy?), then pass that plain text to spaCy for Named Entity Recognition?

Thank you very much.

polm · 2021-12-20T05:55:01Z

Have you tried training an NER model? Your definition of a place of interest sounds kind of vague, and may be hard to learn, but NER should be a decent start.

Your scraping question is kind of out of scope for spaCy, but I would recommend you separate scraping from extracting the data - that allows you to modify your extractor without having to re-run the scraping part.
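One way to keep the two stages apart is to cache each downloaded page on disk, so the extractor can be re-run against the saved files. The following is only a minimal sketch: the helper name `cached_html`, the cache layout and the `fetch` callable are assumptions for illustration, not part of spaCy or Scrapy.

```python
import hashlib
from pathlib import Path


def cached_html(url, cache_dir, fetch):
    """Return the HTML for `url`, downloading it only if it is not cached yet.

    `fetch` is any callable that takes a URL and returns the page as bytes
    (hypothetical here; in practice it could wrap requests or a crawler).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so the file name is always valid and of fixed length.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if not path.exists():
        path.write_bytes(fetch(url))
    return path.read_bytes()
```

With this in place, changing the extraction code never triggers a new download; only pages missing from the cache directory are fetched.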

DuyguA · 2021-12-27T13:59:10Z

I'd divide this task into three steps:

1. Crawling the website
2. Cleaning the text
3. Extracting the entities

Scrape
For this part, I recommend using a crawler and dumping the data as a dataset. I think Scrapy would do the job here 😉
For our small example, I picked a single URL and will simply fetch it with requests:

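import requests

url = 'https://www.villaplus.com/destinations/villas-in-spain/villas-in-balearic-islands/menorca/cala-galdana'
# Fetch the page and fail loudly on HTTP errors instead of parsing an error page.
res = requests.get(url, timeout=10)
res.raise_for_status()
html_page = res.content  # raw bytes; BeautifulSoup works out the encoding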

Now we can process our small dataset of just one document 🙂

Clean

After fetching a URL, one needs to strip out the HTML parts and keep the useful text. BeautifulSoup is a great library for this purpose:

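from bs4 import BeautifulSoup

soup = BeautifulSoup(html_page, 'html.parser')
# Collect every text node in the document.
text = soup.find_all(string=True)

# Parent tags whose text is markup or metadata rather than page content.
bad_list = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script',
]
# Keep only the text nodes whose parent tag is not in bad_list.
output = ' '.join(t for t in text if t.parent.name not in bad_list)
print(output)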

You can clean this output further until you end up with just the text you want.
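As one possible next cleaning pass, the scraped text usually contains long runs of whitespace and empty lines. The sketch below collapses them using only the standard library; the function name `tidy_text` is hypothetical, and real pages will likely need site-specific rules on top of this.

```python
import re


def tidy_text(raw):
    """Collapse whitespace runs and drop empty lines from scraped text."""
    lines = []
    for line in raw.splitlines():
        # Replace any run of whitespace (including non-breaking spaces)
        # with a single space, then trim both ends.
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)
```

Calling `tidy_text(output)` on the string built above would give one non-empty line of text per original line.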

Extract entities

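Once we have the cleaned text, we can play with spaCy NER. The labels we are interested in are FAC, which covers specific buildings, monuments, highways etc. (most probably tourist attractions), and LOC, which covers non-political locations such as coasts, mountain ranges and bodies of water. Note that city and country names are tagged GPE in the English models, so include GPE if you want those too. Here are some example passages from our text: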

>>> import spacy
>>> nlp = spacy.load("en_core_web_md")
>>> doc = nlp("A little further afield, 15 minutes’ drive away, on the far side of Ferreries is the popular Binissues Museum and restaurant. The museum itself is much like a stately home, which you can explore and learn about rural farming methods, watch the folk dancing displays and browse the exhibits.")
>>> doc.ents
(15 minutes, Ferreries, Binissues Museum)
>>> doc.ents[-1].label_
'FAC'

>>> doc = nlp("Located on the South Coast of the island")
>>> doc.ents
(the South Coast,)
>>> doc.ents[0].label_
'LOC'

That's it, here are the locations 😎
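The original question also asks for a context sentence per term. A minimal sketch of that, assuming the pipeline sets sentence boundaries (the parser in en_core_web_md does), is to read `ent.sent` for each entity of interest. The helper name `entities_with_context` and the default label set are assumptions for illustration.

```python
def entities_with_context(doc, labels=("FAC", "LOC")):
    """Return (entity text, label, containing sentence) for matching entities.

    `doc` is expected to be a spaCy Doc with sentence boundaries, e.g.
    nlp = spacy.load("en_core_web_md"); doc = nlp(text)
    """
    rows = []
    for ent in doc.ents:
        if ent.label_ in labels:
            # Span.sent is the sentence span that contains the entity.
            rows.append((ent.text, ent.label_, ent.sent.text.strip()))
    return rows
```

Each returned row pairs a place name with the full sentence it appeared in, which can serve as a first candidate for the "context sentence"; picking the most informative sentence among several mentions would need extra logic.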

I recommend playing with BeautifulSoup a bit, then start compiling the corpus. Then you can try spaCy NER on your corpus and see whether you need custom training or not. Good luck ✋
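To judge whether the pretrained model is good enough, it can help to tally which entities it finds across the whole corpus and skim the most frequent ones. This is only a rough sketch; `count_entities` is a hypothetical helper, and `docs` is assumed to be an iterable of spaCy Doc objects, for example from `nlp.pipe(texts)`.

```python
from collections import Counter


def count_entities(docs, labels=("FAC", "LOC")):
    """Count how often each (entity text, label) pair occurs across `docs`."""
    counts = Counter()
    for doc in docs:
        for ent in doc.ents:
            if ent.label_ in labels:
                counts[(ent.text, ent.label_)] += 1
    return counts
```

Something like `count_entities(nlp.pipe(texts)).most_common(20)` then shows the top candidates; if many obvious attractions are missing or mislabeled, that is a hint that custom training may be worthwhile.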
