🛒 Web Scraping Portfolio Project

ScrapingStore Logo


A complete end-to-end data engineering portfolio project demonstrating web scraping, data cleaning, visualization, and Power BI integration.

📸 Dashboard Preview

Modern Dashboard

Made by Simone – Student Project


✨ Features & Skills Demonstrated

| Category | Technologies & Techniques |
| --- | --- |
| Web Scraping | Playwright (headless browser), BeautifulSoup, async/await, pagination handling |
| Data Cleaning | Pandas, NumPy, duplicate removal, data normalization |
| Visualization | Chart.js, Grid.js, Alpine.js, Jinja2 HTML dashboards (modern + terminal) |
| Database | SQLModel ORM, SQLite, upsert logic |
| Data Export | Power BI-ready CSV (UTF-8 BOM), automated pipeline |
| DevOps | Docker, GitHub Actions CI, pre-commit hooks, pytest |

πŸ—οΈ Architecture

graph TD
    User[User] --> CLI["CLI (main.py)"]
    CLI --> Scraper[Scraper Module]
    Scraper -->|Structured Products| Cleaner[Cleaner Module]
    Cleaner -->|Validated Products| DB["Database (SQLModel)"]
    DB -->|Query| Dashboard[Dashboard Generator]
    DB -->|Export| CSV[CSV File]
    Dashboard -->|HTML| Browser[Browser View]

🎯 Project Overview

This project scrapes product data from the Oxylabs Sandbox E-commerce website and processes it through a complete data pipeline:

  1. Web Scraping - Extract ~3000 products using Playwright browser automation
  2. Data Cleaning - Normalize and deduplicate data with Pandas
  3. Visualization - Interactive dashboards with Chart.js and Grid.js
  4. Power BI Export - Generate analysis-ready CSV files

πŸ“ Project Structure

ScrapingStore/
├── scraper/
│   ├── __init__.py
│   ├── base.py                     # Abstract base scraper class
│   ├── product_scraper.py          # BeautifulSoup scraper (static HTML)
│   └── product_scraper_browser.py  # Playwright scraper (JS-rendered pages)
├── cleaning/
│   ├── __init__.py
│   └── data_cleaner.py             # Pandas data cleaning pipeline
├── visualization/
│   ├── __init__.py
│   ├── dashboard_generator.py      # Modern dashboard (Tailwind/Chart.js)
│   ├── terminal_dashboard_generator.py  # Retro terminal dashboard
│   └── templates/                  # Jinja2 HTML templates
├── tests/                          # pytest test suite
│   ├── conftest.py                 # Shared fixtures
│   ├── test_scraper.py
│   ├── test_cleaner.py
│   ├── test_database.py
│   ├── test_models.py
│   └── test_cli.py
├── data/                           # Output directory (gitignored)
├── config.py                       # Centralized configuration
├── database.py                     # SQLModel database manager
├── models.py                       # Pydantic/SQLModel data models with validation
├── logger.py                       # Logging configuration (Rich)
├── main.py                         # CLI pipeline orchestrator (Typer)
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── README.md

🚀 Quick Start

Prerequisites

  • Python 3.9 or higher
  • pip package manager

Installation

# Clone the repository
git clone https://github.com/tzii/ScrapingStore.git
cd ScrapingStore

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium

Running the Pipeline

# Quick test: scrape 2 pages (~64 products)
python main.py scrape --pages 2

# Default: scrape 10 pages (~320 products)
python main.py scrape

# Scrape all pages (~3000 products)
python main.py scrape --all

# Custom delay between requests (be respectful!)
python main.py scrape --pages 10 --delay 2.0

# Use browser scraper for JS-rendered pages
python main.py scrape --type browser --pages 5

Other Commands

# Export existing data to Power BI CSV
python main.py export

# Regenerate dashboards from existing data
python main.py generate-report

Configuration

You can configure the scraper using a .env file (copy from .env.example):

BASE_URL="https://sandbox.oxylabs.io/products"
MAX_RETRIES=3
DEFAULT_TIMEOUT=30
DB_NAME="products.db"
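Internally, these values could be picked up along the following lines (a sketch using plain `os.getenv`; the actual config.py may use python-dotenv or another loader):

```python
import os

# Defaults mirror the .env.example values; environment variables override them
BASE_URL = os.getenv("BASE_URL", "https://sandbox.oxylabs.io/products")
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
DEFAULT_TIMEOUT = int(os.getenv("DEFAULT_TIMEOUT", "30"))
DB_NAME = os.getenv("DB_NAME", "products.db")
```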

Running Tests

pytest

# With coverage report
pytest --cov=scraper --cov=cleaning --cov=visualization --cov-report=term-missing

📊 Output Files

| File | Description |
| --- | --- |
| products_powerbi.csv | Power BI-ready export (UTF-8 BOM) |
| dashboard.html | Interactive modern dashboard |
| dashboard_terminal.html | Terminal-style dashboard |

🔧 Module Details

Web Scraper (scraper/)

  • Two scraper implementations: Static (BeautifulSoup) and Browser (Playwright)
  • Structured data extraction (price, availability, images) at scrape time
  • Async/await with concurrency limiting (semaphore) for browser scraper
  • Rate limiting and configurable delay between requests
  • Automatic pagination with consecutive-empty-page detection
  • Retry logic with exponential backoff (static scraper)
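Retry with exponential backoff follows a standard pattern; a generic sketch (the helper name is hypothetical, not the project's actual API):

```python
import time


def with_retries(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying on any exception with exponential backoff.

    Waits base_delay * 2**attempt between attempts (1s, 2s, 4s, ...),
    and re-raises the last exception once max_retries is exhausted.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

In the static scraper this would wrap the `requests.get(...)` call for a single page fetch.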

Data Cleaner (cleaning/data_cleaner.py)

  • Availability status normalization (In Stock / Out of Stock / Unknown)
  • Duplicate detection and removal by product name
  • Name whitespace trimming
  • Vectorized operations via Pandas + NumPy for performance
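A minimal sketch of these steps, assuming `name` and `availability` columns (the actual data_cleaner.py may differ):

```python
import pandas as pd


def clean_products(df: pd.DataFrame) -> pd.DataFrame:
    """Trim names, normalize availability, and drop duplicates by name."""
    df = df.copy()
    # Trim whitespace around product names
    df["name"] = df["name"].str.strip()
    # Normalize availability to In Stock / Out of Stock / Unknown
    raw = df["availability"].str.lower().fillna("")
    df["availability"] = "Unknown"
    df.loc[raw.str.contains("out"), "availability"] = "Out of Stock"
    df.loc[raw.str.contains("in stock"), "availability"] = "In Stock"
    # Keep the first occurrence of each product name
    return df.drop_duplicates(subset="name").reset_index(drop=True)
```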

Visualization (visualization/)

  • Modern Dashboard: Tailwind CSS, Chart.js (price distribution, segment doughnut), Grid.js (searchable/sortable product table), Alpine.js (dark mode toggle)
  • Terminal Dashboard: Retro CRT-style with ASCII bar charts, auto-calculated KPIs
  • Auto-detected franchise/keyword analysis (no hardcoded keywords)
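The auto-detected keyword analysis could work roughly like this sketch (function name hypothetical): count recurring words across product names instead of matching a hardcoded franchise list.

```python
from collections import Counter


def top_keywords(names: list[str], k: int = 5) -> list[tuple[str, int]]:
    """Return the k most common words across product names."""
    words = Counter()
    for name in names:
        for word in name.lower().split():
            if len(word) > 3:  # skip very short, stopword-like tokens
                words[word] += 1
    return words.most_common(k)
```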

Data Models (models.py)

  • SQLModel/Pydantic hybrid with field validators
  • Price must be non-negative; name must not be empty
  • Automatic UTC timestamps on creation
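The validation rules above can be sketched with pydantic-style validators (a simplified standalone model, not the project's actual models.py; SQLModel supports the same validators):

```python
from datetime import datetime, timezone

from pydantic import BaseModel, Field, field_validator


class Product(BaseModel):
    name: str
    price: float
    # UTC timestamp assigned automatically at creation
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

    @field_validator("name")
    @classmethod
    def name_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("name must not be empty")
        return v.strip()

    @field_validator("price")
    @classmethod
    def price_non_negative(cls, v: float) -> float:
        if v < 0:
            raise ValueError("price must be non-negative")
        return v
```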

Terminal Dashboard Mode

The project also includes a retro-style terminal dashboard for CLI enthusiasts:

Terminal Dashboard


📈 Power BI Integration

The products_powerbi.csv file is formatted for seamless Power BI import:

  1. Open Power BI Desktop
  2. Click Get Data β†’ Text/CSV
  3. Select data/products_powerbi.csv
  4. Data types will be auto-detected
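The UTF-8 BOM mentioned above comes down to pandas' `utf-8-sig` encoding; a minimal sketch (function name hypothetical):

```python
import pandas as pd


def export_for_powerbi(df: pd.DataFrame, path: str) -> None:
    # "utf-8-sig" prefixes a UTF-8 BOM so Power BI/Excel detect the encoding
    df.to_csv(path, index=False, encoding="utf-8-sig")
```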

⚠️ Known Limitations

  • Static scraper vs. JS-rendered sites: The static scraper uses requests + BeautifulSoup, which cannot execute JavaScript. The target sandbox site is JS-rendered, so use --type browser for actual scraping. The static scraper is included to demonstrate the pattern and works with server-rendered HTML.
  • Upsert by name: Products are matched by name during upsert. If two genuinely different products share the same name, only the latest will be kept.
  • Sandbox-specific: The CSS selectors (div.product-card, h4) are tailored to the Oxylabs sandbox. Adapting to a different site would require updating the selectors.

πŸ› οΈ Technologies

  • Python 3.9+
  • Playwright - Browser automation for JS-rendered sites
  • BeautifulSoup4 - HTML parsing
  • Requests - HTTP client
  • Pandas / NumPy - Data manipulation
  • SQLModel / Pydantic - ORM and data validation
  • Typer / Rich - CLI interface
  • Chart.js / Grid.js / Alpine.js - Frontend visualization
  • Jinja2 - HTML templating
  • Docker - Containerization
  • GitHub Actions - CI/CD

πŸ“ License

MIT License - see LICENSE for details.


Made by Simone • Student Project • 2025
