A complete end-to-end data engineering portfolio project demonstrating web scraping, data cleaning, visualization, and Power BI integration.
Live Demo | Terminal View
Made by Simone • Student Project
| Category | Technologies & Techniques |
|---|---|
| Web Scraping | Playwright (headless browser), BeautifulSoup, async/await, pagination handling |
| Data Cleaning | Pandas, numpy, duplicate removal, data normalization |
| Visualization | Chart.js, Grid.js, Alpine.js, Jinja2 HTML dashboards (modern + terminal) |
| Database | SQLModel ORM, SQLite, upsert logic |
| Data Export | Power BI-ready CSV (UTF-8 BOM), automated pipeline |
| DevOps | Docker, GitHub Actions CI, pre-commit hooks, pytest |
```mermaid
graph TD
    User[User] --> CLI["CLI (main.py)"]
    CLI --> Scraper[Scraper Module]
    Scraper -->|Structured Products| Cleaner[Cleaner Module]
    Cleaner -->|Validated Products| DB["Database (SQLModel)"]
    DB -->|Query| Dashboard[Dashboard Generator]
    DB -->|Export| CSV[CSV File]
    Dashboard -->|HTML| Browser[Browser View]
```
This project scrapes product data from the Oxylabs Sandbox E-commerce website and processes it through a complete data pipeline:
- Web Scraping - Extract ~3000 products using Playwright browser automation
- Data Cleaning - Normalize and deduplicate data with Pandas
- Visualization - Interactive dashboards with Chart.js and Grid.js
- Power BI Export - Generate analysis-ready CSV files
```
ScrapingStore/
├── scraper/
│   ├── __init__.py
│   ├── base.py                          # Abstract base scraper class
│   ├── product_scraper.py               # BeautifulSoup scraper (static HTML)
│   └── product_scraper_browser.py       # Playwright scraper (JS-rendered pages)
├── cleaning/
│   ├── __init__.py
│   └── data_cleaner.py                  # Pandas data cleaning pipeline
├── visualization/
│   ├── __init__.py
│   ├── dashboard_generator.py           # Modern dashboard (Tailwind/Chart.js)
│   ├── terminal_dashboard_generator.py  # Retro terminal dashboard
│   └── templates/                       # Jinja2 HTML templates
├── tests/                               # pytest test suite
│   ├── conftest.py                      # Shared fixtures
│   ├── test_scraper.py
│   ├── test_cleaner.py
│   ├── test_database.py
│   ├── test_models.py
│   └── test_cli.py
├── data/                                # Output directory (gitignored)
├── config.py                            # Centralized configuration
├── database.py                          # SQLModel database manager
├── models.py                            # Pydantic/SQLModel data models with validation
├── logger.py                            # Logging configuration (Rich)
├── main.py                              # CLI pipeline orchestrator (Typer)
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── README.md
```
- Python 3.9 or higher
- pip package manager
```bash
# Clone the repository
git clone https://github.com/tzii/ScrapingStore.git
cd ScrapingStore

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers
playwright install chromium
```

```bash
# Quick test: scrape 2 pages (~64 products)
python main.py scrape --pages 2

# Default: scrape 10 pages (~320 products)
python main.py scrape

# Scrape all pages (~3000 products)
python main.py scrape --all

# Custom delay between requests (be respectful!)
python main.py scrape --pages 10 --delay 2.0

# Use browser scraper for JS-rendered pages
python main.py scrape --type browser --pages 5
```

```bash
# Export existing data to Power BI CSV
python main.py export

# Regenerate dashboards from existing data
python main.py generate-report
```

You can configure the scraper using a `.env` file (copy from `.env.example`):

```env
BASE_URL="https://sandbox.oxylabs.io/products"
MAX_RETRIES=3
DEFAULT_TIMEOUT=30
DB_NAME="products.db"
```

```bash
pytest

# With coverage report
pytest --cov=scraper --cov=cleaning --cov=visualization --cov-report=term-missing
```

| File | Description |
|---|---|
| `products_powerbi.csv` | Power BI-ready export (UTF-8 BOM) |
| `dashboard.html` | Interactive modern dashboard |
| `dashboard_terminal.html` | Terminal-style dashboard |
- Two scraper implementations: Static (BeautifulSoup) and Browser (Playwright)
- Structured data extraction (price, availability, images) at scrape time
- Async/await with concurrency limiting (semaphore) for browser scraper
- Rate limiting and configurable delay between requests
- Automatic pagination with consecutive-empty-page detection
- Retry logic with exponential backoff (static scraper)
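The concurrency limiting and retry patterns listed above can be sketched roughly as follows. This is illustrative only, not the project's actual code: `fetch_page` is a stand-in for a real Playwright page load, and the constants and function names are assumptions.

```python
import asyncio
import random

MAX_CONCURRENT = 5  # cap on simultaneous page fetches
MAX_RETRIES = 3     # attempts before giving up on a page

async def fetch_page(url: str) -> str:
    # Stand-in for a real browser fetch; simulates network latency.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def fetch_with_retry(url: str, semaphore: asyncio.Semaphore) -> str:
    async with semaphore:  # at most MAX_CONCURRENT fetches in flight
        for attempt in range(MAX_RETRIES):
            try:
                return await fetch_page(url)
            except Exception:
                if attempt == MAX_RETRIES - 1:
                    raise
                # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
                await asyncio.sleep(2 ** attempt + random.random())

async def scrape_all(urls: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(fetch_with_retry(u, semaphore) for u in urls))

pages = asyncio.run(
    scrape_all([f"https://example.com/products?page={i}" for i in range(1, 4)])
)
```

The semaphore keeps the browser scraper from opening too many pages at once, while the backoff loop handles transient failures without hammering the server.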
- Availability status normalization (`In Stock`/`Out of Stock`/`Unknown`)
- Duplicate detection and removal by product name
- Name whitespace trimming
- Vectorized operations via Pandas + NumPy for performance
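A hedged sketch of the cleaning steps described above (the column names here are assumptions, not the project's actual schema): trim whitespace from names, normalize availability strings, and drop duplicates keyed on the product name.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["  Zelda ", "Zelda", "Mario Kart"],
    "availability": ["in stock", "IN STOCK", "out of stock"],
    "price": [59.99, 59.99, 49.99],
})

# Name whitespace trimming (vectorized string op)
df["name"] = df["name"].str.strip()

# Availability normalization: anything unrecognized falls back to "Unknown"
df["availability"] = (
    df["availability"].str.strip().str.lower()
      .map({"in stock": "In Stock", "out of stock": "Out of Stock"})
      .fillna("Unknown")
)

# Duplicate removal by product name
df = df.drop_duplicates(subset="name", keep="first")
```

After cleaning, the two `Zelda` rows collapse into one and availability values are in canonical form.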
- Modern Dashboard: Tailwind CSS, Chart.js (price distribution, segment doughnut), Grid.js (searchable/sortable product table), Alpine.js (dark mode toggle)
- Terminal Dashboard: Retro CRT-style with ASCII bar charts, auto-calculated KPIs
- Auto-detected franchise/keyword analysis (no hardcoded keywords)
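One way the auto-detected keyword analysis could work, sketched with the standard library (purely illustrative; the stopword list and helper name are not from the project): count token frequencies across product names and keep the most common ones, with no hardcoded franchise list.

```python
from collections import Counter

STOPWORDS = {"the", "of", "for", "and", "a"}

def top_keywords(names: list[str], n: int = 5) -> list[tuple[str, int]]:
    words: Counter[str] = Counter()
    for name in names:
        for w in name.lower().split():
            w = w.strip(":,.!?")  # drop trailing punctuation
            if w not in STOPWORDS and len(w) > 2:
                words[w] += 1
    return words.most_common(n)

top = top_keywords([
    "The Legend of Zelda",
    "Zelda: Breath of the Wild",
    "Super Mario Odyssey",
])
```

Tokens that recur across many product names (here `zelda`) surface automatically as candidate franchises.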
- SQLModel/Pydantic hybrid with field validators
- Price must be non-negative; name must not be empty
- Automatic UTC timestamps on creation
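The validation rules above might look roughly like this as a plain Pydantic model (assuming Pydantic v2; the project uses a SQLModel/Pydantic hybrid, and the field names here are assumptions):

```python
from datetime import datetime, timezone
from pydantic import BaseModel, Field, field_validator

class Product(BaseModel):
    name: str
    price: float = Field(ge=0)  # price must be non-negative
    availability: str = "Unknown"
    # Automatic UTC timestamp on creation
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))

    @field_validator("name")
    @classmethod
    def name_not_empty(cls, v: str) -> str:
        v = v.strip()
        if not v:
            raise ValueError("name must not be empty")
        return v
```

Constructing `Product(name="X", price=-1)` or `Product(name="  ", price=1)` raises a validation error, so bad rows are rejected before they reach the database.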
The project also includes a retro-style terminal dashboard for CLI enthusiasts:
The `products_powerbi.csv` file is formatted for seamless Power BI import:
- Open Power BI Desktop
- Click Get Data → Text/CSV
- Select `data/products_powerbi.csv`
- Data types will be auto-detected
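Why the export uses UTF-8 with BOM: the byte-order mark lets Power BI (and Excel) detect the encoding automatically, so accented characters survive the import. A minimal stdlib illustration (with pandas, the equivalent is presumably `df.to_csv(path, encoding="utf-8-sig")`):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "price", "availability"])
writer.writerow(["Pokémon Sword", 59.99, "In Stock"])

# "utf-8-sig" prepends the BOM (b"\xef\xbb\xbf") to the encoded output
data = buf.getvalue().encode("utf-8-sig")
```

Without the BOM, some tools default to a legacy codepage and render `Pokémon` as mojibake.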
- Static scraper vs. JS-rendered sites: The `static` scraper uses `requests` + BeautifulSoup, which cannot execute JavaScript. The target sandbox site is JS-rendered, so use `--type browser` for actual scraping. The static scraper is included to demonstrate the pattern and works with server-rendered HTML.
- Upsert by name: Products are matched by `name` during upsert. If two genuinely different products share the same name, only the latest will be kept.
- Sandbox-specific: The CSS selectors (`div.product-card`, `h4`) are tailored to the Oxylabs sandbox. Adapting to a different site would require updating the selectors.
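The name-keyed upsert behavior can be sketched with plain `sqlite3` (the project uses SQLModel, but the underlying idea is the same): a unique key on `name` plus `INSERT ... ON CONFLICT DO UPDATE`, so re-scraping refreshes existing rows instead of duplicating them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT PRIMARY KEY, price REAL)")

def upsert(name: str, price: float) -> None:
    # Insert a new product, or update the price if the name already exists
    conn.execute(
        "INSERT INTO product (name, price) VALUES (?, ?) "
        "ON CONFLICT(name) DO UPDATE SET price = excluded.price",
        (name, price),
    )

upsert("Zelda", 59.99)
upsert("Zelda", 49.99)  # same name: updates in place, no duplicate row
rows = conn.execute("SELECT name, price FROM product").fetchall()
```

This is also why the limitation above exists: two genuinely different products sharing a name collapse into one row, with the latest write winning.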
- Python 3.9+
- Playwright - Browser automation for JS-rendered sites
- BeautifulSoup4 - HTML parsing
- Requests - HTTP client
- Pandas / NumPy - Data manipulation
- SQLModel / Pydantic - ORM and data validation
- Typer / Rich - CLI interface
- Chart.js / Grid.js / Alpine.js - Frontend visualization
- Jinja2 - HTML templating
- Docker - Containerization
- GitHub Actions - CI/CD
MIT License - see LICENSE for details.
Made by Simone • Student Project • 2025


