This project represents a practical scenario in which a major shoe company aims to analyze the athletic-shoe market on Mercado Livre to develop a competitive pricing strategy. The objectives are:
- Identify the most prominent brands across the first 20 pages of search results.
- Calculate the average price of athletic shoes by brand.
- Assess customer satisfaction for each brand (based on reviews).
To achieve these objectives, we'll implement an ETL (Extract, Transform, Load) pipeline using Python and widely used open-source libraries:
- Extraction: Utilize the `Scrapy` framework to scrape product data from Mercado Livre.
- Transformation: Use `pandas` to clean and format the raw data into structured DataFrames.
- Loading: Store the processed data in a local `SQLite3` database.
- Visualization: Present insights through an interactive dashboard built with `Streamlit`.
- Create and activate a conda environment, then install the required packages:

```shell
conda create -n scrapy-env
conda activate scrapy-env
conda install scrapy pandas streamlit
```

- Start a Scrapy project with the following command:

```shell
scrapy startproject collect
```

- Move into the project folder and create a spider for scraping:

```shell
cd collect
scrapy genspider mercadolivre https://lista.mercadolivre.com.br/tenis-corrida-masculino
```

The spider will be responsible for sending requests, parsing the necessary data, and handling pagination.
To ensure proper access, set up your user agent in `settings.py`:

```python
USER_AGENT = 'your_user_agent'
```

You can find your user agent by searching "my user agent" online.

Disable the robots.txt rule to allow scraping multiple pages:

```python
ROBOTSTXT_OBEY = False
```

Scrapy allows testing commands directly in the terminal:

```shell
scrapy shell
```

For example:

```python
fetch('https://lista.mercadolivre.com.br/tenis-corrida-masculino')
```

This will fetch the HTML content of the page.
Some useful commands for parsing:

- To capture the HTML block containing relevant product information:

```python
response.css('div.ui-search-result__content')
```

- To count the number of items within the block:

```python
len(response.css('div.ui-search-result__content'))
```

- Assign the content block to a `products` variable for further parsing:

```python
products = response.css('div.ui-search-result__content')
```

- Extract brand names from the content (note the leading dots: these are CSS class selectors):

```python
products.css('.ui-search-item__brand-discoverability.ui-search-item__group__element::text').get()
```

- To exit the shell:

```python
exit()
```

After performing the terminal tests, proceed with writing the scraping logic within the parse method in the mercadolivre.py spider.
Execute the spider to gather data and save the results in a JSON Lines file:
```shell
scrapy crawl mercadolivre -o .data\data.jsonl
```

Use the main.py script located in the transformation directory to handle data transformation and loading.
First, read the data.jsonl file generated by Scrapy using pandas.
The transformation process includes handling missing values (converting NULL to 0), adjusting data types, and combining price values (e.g., combining reais and cents into a single price column).
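A minimal sketch of these transformation steps, assuming the raw file has separate `reais` and `cents` columns (the actual column names depend on what the spider yields):

```python
import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean the raw scraped data: fill NULLs, fix dtypes, combine price parts."""
    df = df.copy()
    # Handle missing values (NULL -> 0) and adjust types so arithmetic works.
    for col in ("reais", "cents"):
        df[col] = df[col].fillna(0).astype(float)
    # Combine reais and cents into a single price column.
    df["price"] = df["reais"] + df["cents"] / 100
    return df


# In main.py this would be fed by the Scrapy output, e.g.:
# df = transform(pd.read_json("data/data.jsonl", lines=True))
```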
After cleaning the data, establish a connection to the SQLite3 database (quotes.db) and load the processed data into a table (mercadolivre_itens). SQLite operates directly on your machine, which simplifies the setup since no external server is needed.
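A minimal sketch of the loading step, using the names mentioned above (`quotes.db`, `mercadolivre_itens`):

```python
import sqlite3

import pandas as pd


def load(df: pd.DataFrame, db_path: str = "quotes.db") -> None:
    """Write the cleaned DataFrame into the mercadolivre_itens table."""
    conn = sqlite3.connect(db_path)  # creates the file if it doesn't exist
    try:
        # Replace any previous run's table with the fresh data.
        df.to_sql("mercadolivre_itens", conn, if_exists="replace", index=False)
    finally:
        conn.close()
```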
Run the transformation and loading steps with:
```shell
python transformation/main.py
```

We'll develop the dashboard in app.py. First, connect to the SQLite database to retrieve the scraped and transformed data.
Next, design the layout of the Streamlit dashboard. The dashboard includes 3 Key Performance Indicators (KPIs) to summarize important insights for the client:
- Total number of products;
- Count of unique brands;
- Average price (new price).
Additionally, charts and tables are created to answer key business questions:
- What brands dominate the listings across multiple pages?
- What is the average price by brand?
- How does customer satisfaction vary across brands?
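The aggregations behind those charts and tables can be sketched with `pandas` (column names such as `brand`, `price`, and `rating` are assumptions; each resulting Series can then be rendered in Streamlit with `st.bar_chart` or `st.dataframe`):

```python
import pandas as pd


def brand_insights(df: pd.DataFrame) -> dict:
    """Answer the three business questions with simple group-bys."""
    return {
        # Which brands dominate the listings?
        "listings_per_brand": df["brand"].value_counts(),
        # What is the average price by brand?
        "avg_price": df.groupby("brand")["price"].mean().sort_values(ascending=False),
        # How does customer satisfaction vary? (assumes a review-rating column)
        "avg_rating": df.groupby("brand")["rating"].mean().sort_values(ascending=False),
    }
```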
Run the Streamlit app to view the dashboard:
```shell
streamlit run dashboard/app.py
```

All of the following commands must be run from within the `src` folder:

- Data extraction:

```shell
scrapy crawl mercadolivre -o data.jsonl
```

- Data transformation and loading:

```shell
python transformation/main.py
```

- Dashboard visualization:

```shell
streamlit run dashboard/app.py
```
This project demonstrates a complete workflow from data extraction to transformation, loading, and visualization. It uses open-source tools and Python to carry out the entire ETL process and create an insightful dashboard.
