(Tip: Change .com to .dev in the GitHub URL to browse this repository with the VS Code interface in your web browser.
Terminal commands cannot be run in the browser, though.)
Clone this repository:
git clone https://github.com/ismaildawoodjee/product-catalogue
cd product-catalogue

Set up a Python virtual environment and activate it. On Windows:

python -m venv .venv
.venv/Scripts/activate

On Linux and macOS:

python3 -m venv .venv
source .venv/bin/activate

Install dependencies:

python -m pip install -U pip
python -m pip install -U wheel setuptools
pip install -r requirements.txt

At this point, the simple scraping scripts can be run to get data from the main equipment page, or the Scrapy spider can be used to extract equipment specification data from each equipment's page.
Go inside the first catalogue folder (to be able to run Scrapy commands):

cd catalogue

The current directory should now be product-catalogue/catalogue, not
product-catalogue/catalogue/catalogue:

$ ls
catalogue/ process_data.py scrapy.cfg

Let the komatsu spider crawl each equipment's page for the specification data:

scrapy crawl komatsu

When this is done, check inside the data directory to see that equipment_specifications_raw.csv
has been extracted and that 103 images have been downloaded into the images directory.
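The post-crawl check can also be scripted. Below is a minimal sketch using only the standard library; the directory names and the image count come from the steps above, but the helper itself is hypothetical, not part of the repository:

```python
from pathlib import Path


def crawl_output_ok(data_dir="data", images_dir="images", expected_images=103):
    """Return True if the raw specs CSV exists and all images were downloaded."""
    csv_ok = (Path(data_dir) / "equipment_specifications_raw.csv").is_file()
    image_count = sum(1 for p in Path(images_dir).glob("*") if p.is_file())
    return csv_ok and image_count == expected_images
```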
Process the specifications data with:
python process_data.py

Several smaller files, one for each type of industrial equipment, will be produced.
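The splitting step can be sketched roughly as follows. This is a minimal standard-library version, not the actual contents of process_data.py, and the equipment_type column name is an assumption:

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_by_type(raw_csv, out_dir, type_column="equipment_type"):
    """Split one raw CSV into one smaller CSV per equipment type."""
    groups = defaultdict(list)
    with open(raw_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            groups[row[type_column]].append(row)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for equipment_type, rows in groups.items():
        with open(out / f"{equipment_type}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    return sorted(groups)
```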
Run the Python scripts in this order:
python main.py
python data_preparation.py

Wait until the scripts have finished running, then check inside the data directory
to see that the raw CSV data has been extracted (with the name equipment_data_raw.csv),
and that several smaller files for each type of industrial equipment have also been produced.
When running the data preparation script, four warnings will say that some strings cannot be converted to a float; this is expected and they can be ignored.
Check inside the images directory to see that 103 images with their proper names
(equipment_type underscore equipment_id) have been downloaded.
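The naming convention can be expressed as a small helper. This is an illustration of the equipment_type underscore equipment_id pattern described above; the lowercasing and the file extension are assumptions, not necessarily what the scripts do:

```python
def image_filename(equipment_type, equipment_id, extension="jpg"):
    """Build an image filename following the equipment_type_equipment_id convention."""
    # Lowercasing the type and defaulting to .jpg are illustrative choices.
    return f"{equipment_type.lower()}_{equipment_id}.{extension}"
```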
Setup: Start a Scrapy project and go inside the folder using the commands:
scrapy startproject catalogue
cd catalogue

Create: Generate a Scrapy spider called komatsu (with a start URL) using the command:

scrapy genspider komatsu 'www.komatsu.com.au/equipment'

This will generate boilerplate code in the spiders/komatsu.py script, which can
then be modified accordingly.
List: To list all spiders, use the command:
scrapy list

Crawl: To start crawling using the spider, run the command scrapy crawl spider_name:

scrapy crawl komatsu

Headers: To specify headers and parse them as a dictionary, use the scraper-helper
module, and set DEFAULT_REQUEST_HEADERS in the settings.py file. Using headers
makes requests look like they come from a real browser instead of a bot.
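A sketch of what that settings.py fragment might look like; the header values below are illustrative placeholders, not the project's actual headers. The scraper-helper package can parse raw headers copied from the browser's network tab into a dict of this shape, which is then assigned to DEFAULT_REQUEST_HEADERS:

```python
# settings.py (fragment) -- example values only, assumed for illustration.
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/114.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}
```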
Shell: To explore the website within Scrapy Shell, enter the shell with the command
scrapy shell website_url:
scrapy shell 'www.komatsu.com.au/equipment/excavators'