-
In the Contributing section, four methods of scraping are provided.
I'd like to check that I understand how each works. Beautiful Soup lets me parse the contents of an HTML page. Is that right?
-
Basically yes. 1 and 2 are used in tandem (you request the page, then scrape with BS4). 2 is also useful if you want to reverse-engineer a council's process for displaying the data (you can use curlconverter to do the heavy lifting). Also, for external files, it's mainly just CSV or text-based files - PDFs aren't really workable due to their unreliable structures.
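To make "1 and 2 in tandem" concrete, here's a minimal sketch: requests fetches the page and BeautifulSoup parses it. The URL and CSS selector are hypothetical placeholders, not a real council endpoint.

```python
import requests
from bs4 import BeautifulSoup

# Method 1: fetch the page (URL is a made-up example).
response = requests.get("https://example-council.gov.uk/bin-collections")
response.raise_for_status()

# Method 2: parse the HTML with BS4.
soup = BeautifulSoup(response.text, "html.parser")

# Pull out whatever elements hold the schedule data - the selector
# here is illustrative; each council's markup differs.
for row in soup.select("table.collection-dates tr"):
    print(row.get_text(strip=True))
```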
-
Selenium is just an automation engine. It takes commands via an API ("go to page x") - it's the closest thing to a human doing it. The councils go to great lengths to protect their infrastructure from web scraping. Are we web scraping? Yes, but it's being done by the user of that council (a customer) to get at their schedule data, much like if you did it manually. Are we using it to mass-harvest data? No. Nice work on diving deep @davida72
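As a rough sketch of that "commands via an API" idea, Selenium drives a real browser through the same steps a person would. The URL, element IDs, and postcode below are hypothetical placeholders.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    # "Go to page x" - the browser loads the page just as a user would.
    driver.get("https://example-council.gov.uk/bin-collections")

    # Fill in the form and submit, mimicking a customer looking up
    # their own schedule (IDs and postcode are illustrative).
    driver.find_element(By.ID, "postcode").send_keys("AB1 2CD")
    driver.find_element(By.ID, "submit").click()

    # Read back the rendered schedule data.
    print(driver.find_element(By.ID, "results").text)
finally:
    driver.quit()
```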
-
Thanks both. Getting there...