NER extraction from website, with context sentences #9894
Replies: 2 comments
-
Have you tried training an NER model? Your definition of a place of interest sounds kind of vague, and may be hard to learn, but NER should be a decent start. Your scraping question is somewhat out of scope for spaCy, but I would recommend keeping the scraping step separate from the extraction step: that lets you modify your extractor without having to re-run the scraping part.
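As a rough illustration of keeping the two steps apart, one option is to write each page's raw HTML to disk once and have the extractor read only from that folder. This is a minimal sketch using the standard library; the helper names `save_page` and `load_pages` and the hashed-filename layout are assumptions made for the example, not part of spaCy or any scraper.

```python
import hashlib
from pathlib import Path


def save_page(cache_dir: Path, url: str, html: str) -> Path:
    """Store the raw HTML of one page under a filename derived from its URL."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: a short hash of the URL keeps filenames safe.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = cache_dir / name
    path.write_text(html, encoding="utf-8")
    return path


def load_pages(cache_dir: Path):
    """Yield (path, html) for every cached page, without touching the network."""
    for path in sorted(cache_dir.glob("*.html")):
        yield path, path.read_text(encoding="utf-8")
```

With this split, the extractor can be changed and re-run over `load_pages(...)` as often as needed, while the site is only fetched once.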
-
I'd divide this task into three steps:
Scrape
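A minimal sketch of the fetching step, using only the standard library. The function name `fetch_html`, the timeout, and the User-Agent string are assumptions for illustration; a real crawl over many pages would more likely use a dedicated tool such as Scrapy and should respect the site's robots.txt.

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one URL and return its body decoded as text."""
    # The User-Agent value is a placeholder chosen for this example.
    request = Request(url, headers={"User-Agent": "poi-extractor-example/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html("https://www.villaplus.com")` would return the HTML of the landing page as a single string.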
Now we can process our small dataset of just one document 🙂
Clean
After fetching a URL, one needs to strip out the HTML parts and get the useful text.
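One possible way to do the cleaning with the standard library's `html.parser` is sketched below. The names `html_to_text` and `_TextExtractor` and the list of block-level tags are assumptions for the example; libraries such as BeautifulSoup are a common alternative and usually cope better with messy markup.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style> and <noscript> content."""

    _SKIPPED = {"script", "style", "noscript"}
    # Tags treated as paragraph boundaries (an illustrative, incomplete list).
    _BLOCK = {"p", "div", "li", "br", "tr", "section", "article",
              "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1
        elif tag in self._BLOCK:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self._SKIPPED:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self._BLOCK:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one block per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.chunks).splitlines())
    return "\n".join(line for line in lines if line)
```

Inline tags such as `<b>` do not break a line here, so sentences stay intact for the NER step.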
You can parse this output further until you end up with the text you want.
Extract entities
Now that we have the cleaned text, we can play with spaCy NER. The labels we are interested in are
That's it, here are the locations 😎 I recommend playing with
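
The entity step, together with the context sentence asked for in the question, could look roughly like the sketch below. It assumes the pretrained `en_core_web_sm` pipeline is installed and that the place-like labels of interest are `GPE`, `LOC` and `FAC` from that pipeline's label scheme. That label choice and the helper names `load_doc` and `places_with_context` are assumptions for this example. Specific attractions such as individual zoos or water parks may well need a custom-trained model to be recognised reliably.

```python
# Assumed set of place-like labels; adjust to whatever the task needs.
PLACE_LABELS = frozenset({"GPE", "LOC", "FAC"})


def load_doc(text: str, model: str = "en_core_web_sm"):
    """Run a spaCy pipeline over the cleaned text and return the Doc."""
    import spacy  # imported lazily so the helper below has no hard dependency

    nlp = spacy.load(model)
    return nlp(text)


def places_with_context(doc, labels=PLACE_LABELS):
    """Return (entity text, label, containing sentence) for place-like entities.

    Only the first occurrence of each (text, label) pair is kept.
    """
    results = []
    seen = set()
    for ent in doc.ents:
        if ent.label_ not in labels:
            continue
        key = (ent.text, ent.label_)
        if key in seen:
            continue
        seen.add(key)
        # ent.sent needs sentence boundaries, which the parser in
        # en_core_web_sm provides.
        results.append((ent.text, ent.label_, ent.sent.text.strip()))
    return results
```

For example, `places_with_context(load_doc(cleaned_text))` yields one tuple per distinct place, with the sentence in which it first appears as its context.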
-
I would like to extract all terms qualifying as “places of interest” (specific zoos, water parks, specific tourist attractions) from this website: https://www.villaplus.com, and get an ideal context sentence/passage to show the use of the term and/or what it means.
Can anyone recommend a standard way to do this?
Is there a smart crawler that works like Google, crawling, searching, and extracting at the same time?
Or should I crawl and dump plain text first (maybe with Scrapy?), then pass that plain text to spaCy for Named Entity Recognition?
Thank you very much.