Example project that runs the Scrapy framework from code, handles code and request errors, and exports the extracted data to a CSV file.
Execution starts in the app.py file, which initializes the crawler with the options from the spider_nest/settings.py file, loads the spiders, and executes them.
The spiders extend the SpiderHandler class in the spider_nest/spider_handler.py file, which provides methods to handle code and request errors, plus some variables for generating statistics.
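The base class could look roughly like this (method and attribute names are hypothetical; the real SpiderHandler would subclass scrapy.Spider and register these handlers as errbacks):

```python
class SpiderHandler:
    """Sketch of a base class that tracks errors and statistics."""

    def __init__(self):
        # counters used later to build the execution summary
        self.items_scraped = 0
        self.request_errors = 0
        self.code_errors = 0

    def handle_request_error(self, failure):
        # errback for failed requests (HTTP errors, timeouts, DNS, ...)
        self.request_errors += 1

    def handle_code_error(self, exc):
        # called when parsing code raises an exception
        self.code_errors += 1
```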
When a spider returns an item, it is caught by the process_item() function of the spider_nest/pipelines.py file, where it is written to a CSV file in the root of the project.
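Such a pipeline could be sketched as follows (file name and field names are assumptions, not the ones used in the repository):

```python
import csv

class CsvExportPipeline:
    """Sketch of an item pipeline that appends each item to a CSV file."""

    def __init__(self, path="results.csv", fieldnames=("title", "url")):
        self.path = path
        self.fieldnames = fieldnames
        self._header_written = False

    def process_item(self, item, spider=None):
        with open(self.path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=self.fieldnames)
            if not self._header_written:
                writer.writeheader()
                self._header_written = True
            writer.writerow({k: item.get(k, "") for k in self.fieldnames})
        return item  # Scrapy pipelines must return the item for later stages
```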
If a spider raises an error, it is handled by the SpiderHandler, and all subsequent requests are rejected by the DownloaderMiddleware of the spider_nest/middlewares.py file, to avoid extracting incomplete results (this behavior can change according to the needs of each spider).
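The reject-after-failure logic could look roughly like this (names and the 'failed' flag are hypothetical; a real Scrapy downloader middleware would raise scrapy.exceptions.IgnoreRequest rather than the stand-in exception defined here):

```python
class IgnoreRequest(Exception):
    """Stand-in for scrapy.exceptions.IgnoreRequest."""

class AbortOnErrorMiddleware:
    """Sketch of a downloader middleware that drops every request
    once its spider has flagged a failure."""

    def process_request(self, request, spider):
        # the 'failed' flag is an assumption; the error handler in the
        # spider base class could set it when something goes wrong
        if getattr(spider, "failed", False):
            raise IgnoreRequest("spider already failed, dropping request")
        return None  # let the request continue through the chain
```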
When the spider finishes its execution, the close_spider() function of the spider_nest/pipelines.py file is executed, where the statistics of the spider execution are printed.
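The summary step could be sketched like this (the counter attribute names are assumptions, matching nothing guaranteed in the repository):

```python
class StatsPipeline:
    """Sketch of printing an execution summary when the spider closes."""

    def close_spider(self, spider):
        # read the counters kept by the (hypothetical) spider base class
        print("Items scraped:  %d" % getattr(spider, "items_scraped", 0))
        print("Request errors: %d" % getattr(spider, "request_errors", 0))
        print("Code errors:    %d" % getattr(spider, "code_errors", 0))
```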
Clone the repository
git clone https://github.com/dmarcosl/scrapy-playground
Create a virtual environment and activate it
cd scrapy_playground
python3 -m venv venv
. venv/bin/activate
Install the Scrapy library
pip3 install -r requirements.txt
or
pip3 install scrapy==1.6.0
Execute it
python3 app.py