
🚀 Dynamic Web Scraper - Enterprise-Grade Data Intelligence Platform

📋 Overview

Dynamic Web Scraper is an enterprise-grade Python platform for extracting, analyzing, and visualizing data from dynamic websites. It has evolved from a basic scraper into a comprehensive data intelligence platform with intelligent site detection, automated data enrichment, price analysis, comparative analysis, and interactive dashboards.


✨ Key Features

🧠 Intelligent Site Detection & Adaptation

  • Automatic site type detection (e-commerce, blog, news, etc.; see the sketch below)
  • Dynamic CSS selector generation based on site patterns
  • Smart product element detection (titles, prices, images, links)
  • Site-specific rule caching for improved performance
  • Anti-bot measure detection and adaptive responses
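
As an illustration of the site type detection described above, a minimal keyword-and-structure heuristic might look like the following. This is a sketch only, assuming BeautifulSoup (bs4) is installed; the hint list and thresholds are invented for the example and do not reflect the project's actual rules in site_detector.py:

from bs4 import BeautifulSoup

ECOMMERCE_HINTS = ("add to cart", "buy now", "checkout", "price")

def detect_site_type(html):
    """Classify a page with simple keyword and structure heuristics."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ", strip=True).lower()
    hint_score = sum(text.count(hint) for hint in ECOMMERCE_HINTS)
    # Class names like "product-card" or "price-tag" are a strong signal.
    if soup.select('[class*="product"], [class*="price"]') or hint_score >= 3:
        return "ecommerce"
    if soup.find("article"):
        return "blog"
    return "unknown"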

🎯 Advanced Data Processing & Enrichment

  • Automatic data cleaning and normalization
  • Price normalization across different currencies and formats (sketched below)
  • Contact information extraction from product descriptions
  • Category classification using intelligent algorithms
  • Quality scoring and outlier detection
  • Data validation and integrity checks
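
Price normalization has to reconcile formats such as $1,299.99 and 1.299,99. A minimal sketch of that logic follows; the heuristics are illustrative and not the project's actual parser:

import re
from typing import Optional

def normalize_price(raw: str) -> Optional[float]:
    """Parse '$1,299.99', '1.299,99 EUR', 'TL 999' and similar into a float."""
    digits = re.sub(r"[^\d.,]", "", raw)
    if not digits:
        return None
    if "," in digits and "." in digits:
        # Whichever separator comes last is the decimal mark.
        if digits.rfind(",") > digits.rfind("."):
            digits = digits.replace(".", "").replace(",", ".")
        else:
            digits = digits.replace(",", "")
    elif "," in digits:
        # A lone comma followed by exactly two digits is a decimal mark.
        if re.search(r",\d{2}$", digits):
            digits = digits.replace(",", ".")
        else:
            digits = digits.replace(",", "")
    try:
        return float(digits)
    except ValueError:
        return None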

📊 Comprehensive Analytics & Visualization

  • Interactive data visualization with Plotly charts
  • Price distribution analysis with statistical insights
  • Trend detection with time-series analysis
  • Comparative analysis across sources and categories
  • Heatmap visualizations for pattern recognition
  • Summary dashboards with comprehensive metrics
  • Export capabilities for reports and presentations

πŸ›‘οΈ Advanced Anti-Bot Evasion

  • Multiple stealth profiles (stealth, mobile, aggressive)
  • Browser fingerprint spoofing and header manipulation
  • Human-like timing delays and behavior simulation
  • Undetected ChromeDriver integration for maximum stealth
  • Session persistence and cookie management
  • CAPTCHA detection and handling capabilities
  • Automatic browser automation fallback for JavaScript-heavy sites

🔄 Distributed Scraping & Processing

  • Job Queue System with priority-based scheduling (see the sketch after this list)
  • Worker Pool Management for parallel processing
  • Thread-safe Operations with persistent storage
  • Real-time Monitoring and statistics
  • Automatic Retry Logic and error recovery
  • Scalable Architecture for enterprise use
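
A rough sketch of the priority-based queue and worker pool, built on the standard library only. The class and method names are hypothetical and far simpler than the project's job_queue.py and worker_pool.py:

import itertools
import queue
import threading

class SimpleJobQueue:
    """Priority queue plus a pool of worker threads; lower number runs first."""

    def __init__(self):
        self._jobs = queue.PriorityQueue()
        self._order = itertools.count()  # tie-breaker keeps submission order

    def submit(self, url, priority=5):
        self._jobs.put((priority, next(self._order), url))

    def _worker(self, scrape):
        while True:
            _, _, url = self._jobs.get()
            try:
                scrape(url)
            finally:
                self._jobs.task_done()

    def start(self, scrape, workers=4):
        for _ in range(workers):
            threading.Thread(target=self._worker, args=(scrape,), daemon=True).start()

    def wait(self):
        self._jobs.join()

Usage would be along the lines of pool.start(scrape_fn), then pool.submit(url, priority=1) for each job, and pool.wait() to block until the queue drains.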

🎨 Plugin System & Extensibility

  • Plugin System with multiple plugin types (data processors, validators, custom scrapers; sketched below)
  • Configuration Management with multi-format support (JSON, YAML, TOML)
  • Environment Variable Overrides for flexible deployment
  • Template Generation for easy plugin development
  • Runtime Configuration management and validation
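
At its core, a plugin system like this can be a registry plus a decorator. The sketch below is illustrative; the actual plugin_manager.py interface may differ:

PLUGIN_REGISTRY = {"processor": [], "validator": []}

def register_plugin(kind):
    """Decorator that records a plugin function under a given plugin type."""
    def wrap(func):
        PLUGIN_REGISTRY[kind].append(func)
        return func
    return wrap

@register_plugin("processor")
def strip_whitespace(item):
    """Example plugin: trim every string field of a scraped record."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}

def run_processors(item):
    """Pipe a record through every registered data processor in order."""
    for plugin in PLUGIN_REGISTRY["processor"]:
        item = plugin(item)
    return item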

📈 Price Analysis & Time Series

  • Statistical analysis (mean, median, std dev, skewness)
  • Trend detection (linear regression, moving averages)
  • Seasonality analysis (autocorrelation, pattern recognition)
  • Anomaly detection (Z-score, IQR, rolling statistics; see the sketch below)
  • Price prediction with confidence intervals
  • Intelligent recommendations based on analysis
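
The Z-score and IQR checks can be combined in a few lines of standard-library Python. In this sketch the 3-sigma and 1.5x IQR thresholds are conventional defaults, not necessarily the project's:

import statistics

def find_price_outliers(prices):
    """Flag prices failing either a 3-sigma z-score or a 1.5x IQR test."""
    if len(prices) < 4:
        return []
    mean = statistics.fmean(prices)
    std = statistics.stdev(prices)
    q1, _, q3 = statistics.quantiles(prices, n=4)
    low = q1 - 1.5 * (q3 - q1)
    high = q3 + 1.5 * (q3 - q1)
    return [
        p for p in prices
        if (std > 0 and abs(p - mean) / std > 3) or p < low or p > high
    ]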

πŸ” Comparative Analysis & Deal Discovery

  • Cross-site price comparison with comprehensive analysis
  • Intelligent product matching using similarity algorithms (sketched below)
  • Brand and model extraction for accurate identification
  • Deal scoring and classification with savings analysis
  • Best deal discovery with ranking and recommendations
  • Price variance analysis with statistical insights
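
Product matching across sites can be approximated with a title-similarity score. A sketch using difflib from the standard library; the threshold and record layout are assumptions rather than the site_comparator.py API:

from difflib import SequenceMatcher

def match_products(items_a, items_b, threshold=0.75):
    """Pair listings from two sites whose titles are similar enough."""
    matches = []
    for a in items_a:
        best, best_score = None, threshold
        for b in items_b:
            score = SequenceMatcher(
                None, a["title"].lower(), b["title"].lower()
            ).ratio()
            if score > best_score:
                best, best_score = b, score
        if best is not None:
            matches.append((a, best, round(best_score, 2)))
    return matches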

📧 Automated Reporting & Alerts

  • Scheduled reports with daily and weekly automation
  • Email notifications for price changes and anomalies
  • Configurable alert thresholds for price drops and increases
  • Statistical anomaly detection using z-score analysis
  • HTML email templates with detailed alert information
  • Background scheduling with automated report generation

🌐 Interactive Web Dashboard

  • Real-time monitoring of scraping jobs
  • Interactive charts and statistics
  • Site analysis and visualization
  • Job queue management
  • Results viewing and export
  • Responsive interface with modern design

📦 Multi-Format Export & Sharing

  • Multiple output formats (JSON, CSV, Excel, ZIP; see the sketch below)
  • Comprehensive data packaging with metadata
  • Export history tracking and management
  • Automatic file cleanup and maintenance
  • Slack integration for automated sharing
  • Batch export capabilities for multiple formats
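
A sketch of batch export to several formats with pandas (installed via requirements/data.txt); the function and file names are illustrative, not export_manager.py's API:

import json
from pathlib import Path

import pandas as pd

def export_results(items, out_dir="data", formats=("json", "csv")):
    """Write the same result set in several formats side by side."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    df = pd.DataFrame(items)
    written = []
    if "json" in formats:
        path = out / "results.json"
        path.write_text(json.dumps(items, indent=2))
        written.append(path)
    if "csv" in formats:
        path = out / "results.csv"
        df.to_csv(path, index=False)
        written.append(path)
    if "excel" in formats:
        path = out / "results.xlsx"
        df.to_excel(path, index=False)  # requires openpyxl
        written.append(path)
    return written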

🛠 Requirements

  • Python 3.8 or later
  • Google Chrome or Firefox (for Selenium)
  • All dependencies are version-pinned in requirements.txt

📦 Installation

Quick Start

# Clone the repository
git clone <repository-url>
cd dynamic_web_scraper

# Create virtual environment
python -m venv .venv
.venv\Scripts\activate  # Windows
# or
source .venv/bin/activate  # Mac/Linux

# Install production dependencies (backward compatible)
pip install -r requirements.txt

# Or install only core dependencies (minimal installation)
pip install -r requirements/base.txt

# Install development dependencies (optional)
pip install -r requirements-dev.txt

# Setup and test the scraper
python setup_and_test.py

Modular Installation (Recommended)

# Core scraping only (minimal installation)
pip install -r requirements/base.txt

# Core + data analysis and visualization
pip install -r requirements/base.txt -r requirements/data.txt

# Core + advanced features (Cloudflare bypass, Flask dashboard)
pip install -r requirements/base.txt -r requirements/advanced.txt

# Full development environment
pip install -r requirements/all.txt

See requirements/README.md for detailed dependency documentation.

Development Setup

# Install all dependencies including development tools
pip install -r requirements-dev.txt

# Run the comprehensive test suite
python tests/run_tests.py --all --coverage

# Format and lint the code
black scraper/
flake8 scraper/

🚀 Quick Start

Option 1: Web Dashboard (Recommended)

# Start the web dashboard
python run_dashboard.py

The dashboard will automatically open in your browser at http://localhost:5000

Option 2: Command Line

# Run the scraper directly
python scraper/main.py

⚙ Configuration

All settings are managed in config.json:

{
  "user_agents": [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..."
  ],
  "proxies": [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080"
  ],
  "use_proxy": true,
  "max_retries": 3,
  "retry_delay": 2,
  "rate_limiting": {
    "requests_per_minute": 60
  }
}
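
A minimal sketch of loading this file with environment variable overrides (the Environment Variable Overrides feature listed earlier). The SCRAPER_ prefix and type handling here are assumptions for the example, not the project's actual config.py:

import json
import os

def load_config(path="config.json"):
    """Load config.json, letting SCRAPER_* environment variables override
    scalar keys, e.g. SCRAPER_MAX_RETRIES=5 overrides "max_retries"."""
    with open(path) as f:
        config = json.load(f)
    for key, value in config.items():
        env_val = os.environ.get("SCRAPER_" + key.upper())
        if env_val is None:
            continue
        if isinstance(value, bool):  # bool before int: bool is an int subclass
            config[key] = env_val.lower() in ("1", "true", "yes")
        elif isinstance(value, int):
            config[key] = int(env_val)
        elif isinstance(value, str):
            config[key] = env_val
    return config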

🎯 Usage

🚀 Automatic Workflow with All Advanced Features

The scraper now runs all of its advanced features automatically as a single workflow:

Step 1: Smart Site Detection

  • Automatically detects site type and anti-bot measures
  • Generates optimal CSS selectors
  • Adapts to different site structures

Step 2: Data Extraction & Enrichment

  • Extracts raw data with intelligent parsing
  • Cleans and normalizes data
  • Adds quality scores and metadata
  • Extracts contact information and categories

Step 3: Advanced Analysis

  • Price Analysis: Statistical analysis, outlier detection, trend analysis
  • Comparative Analysis: Cross-site price comparison, deal scoring
  • Time Series Analysis: Trend detection, seasonality, predictions
  • Data Visualization: Interactive charts and dashboards

Step 4: Automated Reporting

  • Generates comprehensive reports
  • Sends email alerts for anomalies
  • Creates interactive dashboards
  • Exports data in multiple formats

Step 5: Plugin Processing

  • Applies custom data processors
  • Validates data quality
  • Enhances data with external sources

Step 6: Distributed Processing

  • Queues jobs for parallel processing
  • Manages worker pools
  • Handles large-scale operations

📊 What You Get Automatically

  • Raw data in multiple formats
  • Enriched data with quality scores and metadata
  • Price analysis with statistical insights
  • Cross-site comparisons and best deals
  • Interactive visualizations and dashboards
  • Automated reports with alerts and recommendations
  • Multiple export formats for different use cases

Web Dashboard Features

1. Dashboard Overview

  • Real-time statistics (total jobs, success rate, results count)
  • Interactive charts (success rate trends, job status distribution)
  • Recent jobs list with status indicators
  • Active jobs monitoring with live duration tracking

2. Start Scraping

  • URL input with validation
  • Output format selection (CSV, JSON, Excel)
  • Advanced options (proxy rotation, Selenium)
  • Site analysis before scraping
  • Real-time feedback and progress tracking

3. Job Management

  • Job status tracking (pending, running, completed, failed)
  • Results viewing with pagination
  • Error handling and debugging information
  • Export capabilities for scraped data

4. Site Analysis

  • Automatic site type detection
  • E-commerce pattern recognition
  • CSS selector generation
  • Confidence scoring

Command Line Usage

# Basic usage
python scraper/main.py

# You will be prompted for:
# - Target URL
# - Output file path (default: data/all_listings.csv)

πŸ— Project Structure

dynamic_web_scraper/
├── scraper/                          # Core scraping logic
│   ├── main.py                      # Command line entry point
│   ├── Scraper.py                   # Main scraper class with all features
│   ├── config.py                    # Configuration management
│   ├── dashboard/                   # Web dashboard
│   │   ├── app.py                   # Flask application
│   │   └── templates/               # HTML templates
│   ├── analytics/                   # Data analysis and visualization
│   │   ├── data_visualizer.py       # Interactive charts and dashboards
│   │   ├── price_analyzer.py        # Price analysis and statistics
│   │   └── time_series_analyzer.py  # Time series analysis and prediction
│   ├── comparison/                  # Cross-site comparison
│   │   └── site_comparator.py       # Product matching and deal analysis
│   ├── reporting/                   # Automated reporting
│   │   └── automated_reporter.py    # Reports, alerts, and notifications
│   ├── export/                      # Data export and sharing
│   │   └── export_manager.py        # Multi-format export capabilities
│   ├── plugins/                     # Plugin system
│   │   └── plugin_manager.py        # Plugin management and extensibility
│   ├── distributed/                 # Distributed processing
│   │   ├── job_queue.py             # Job queue system
│   │   └── worker_pool.py           # Worker pool management
│   ├── anti_bot/                    # Anti-bot evasion
│   │   └── stealth_manager.py       # Stealth and anti-detection
│   ├── site_detection/              # Intelligent site detection
│   │   ├── site_detector.py         # Site structure detection
│   │   ├── html_analyzer.py         # HTML analysis
│   │   └── css_selector_builder.py  # Selector building
│   ├── css_selectors/               # Dynamic selector system
│   │   ├── css_selector_generator.py # Selector generation
│   │   ├── css_rules.py             # Rule management
│   │   └── dynamic_selector.py      # Site adaptation
│   ├── data_parsers/                # Data processing
│   ├── proxy_manager/               # Proxy handling
│   ├── user_agent_manager/          # User agent management
│   ├── logging_manager/             # Logging system
│   └── exceptions/                  # Custom exceptions
├── tests/                           # Comprehensive test suite
│   ├── core/                        # Core functionality tests
│   ├── analytics/                   # Analytics and visualization tests
│   ├── site_detection/              # Site detection tests
│   ├── utils/                       # Utility function tests
│   ├── integration/                 # Integration tests
│   ├── conftest.py                  # Pytest configuration and fixtures
│   └── run_tests.py                 # Test runner script
├── data/                            # Output data storage
├── logs/                            # Log files
├── config.json                      # Configuration
├── requirements/                    # Organized dependency files
│   ├── base.txt                     # Core dependencies
│   ├── web.txt                      # Scheduling features
│   ├── data.txt                     # Data analysis (optional)
│   ├── advanced.txt                 # Experimental features
│   ├── testing.txt                  # Testing framework
│   ├── dev.txt                      # Development tools
│   ├── all.txt                      # All dependencies
│   └── README.md                    # Dependency documentation
├── requirements.txt                 # Backward compatibility wrapper
├── requirements-dev.txt             # Backward compatibility wrapper
├── pytest.ini                       # Pytest configuration
├── run_dashboard.py                 # Dashboard launcher
└── README.md                        # Documentation

🧪 Testing

The project includes a comprehensive test suite organized by category:

Run All Tests

# Run the complete test suite
python tests/run_tests.py --all

# Run with coverage
python tests/run_tests.py --all --coverage

# Run with HTML report
python tests/run_tests.py --all --html

Run Specific Test Categories

# Core functionality tests
python tests/run_tests.py --category core

# Analytics and visualization tests
python tests/run_tests.py --category analytics

# Site detection tests
python tests/run_tests.py --category site_detection

# Utility function tests
python tests/run_tests.py --category utils

# Integration tests
python tests/run_tests.py --category integration

Quick Tests (Unit Tests Only)

# Run only unit tests (fast)
python tests/run_tests.py --quick

Test Categories Available

Category         Location                  Purpose
core             tests/core/               Basic scraper functionality and integration
analytics        tests/analytics/          Data analysis and visualization
site_detection   tests/site_detection/     Site detection and CSS selector generation
utils            tests/utils/              Utility functions
integration      tests/integration/        Complete workflow testing

Direct Pytest Commands

# Run all tests
pytest tests/

# Run specific category
pytest tests/analytics/

# Run with markers
pytest -m "not slow"
pytest -m integration

📊 Example Output

CSV Output

title,price,image,link,quality_score,category,source
iPhone 13 Pro,$999.99,https://example.com/iphone.jpg,https://example.com/iphone,0.95,electronics,amazon
Samsung Galaxy S21,$899.99,https://example.com/samsung.jpg,https://example.com/samsung,0.92,electronics,ebay

JSON Output with Enrichment

[
  {
    "title": "iPhone 13 Pro",
    "price": 999.99,
    "currency": "USD",
    "image": "https://example.com/iphone.jpg",
    "link": "https://example.com/iphone",
    "quality_score": 0.95,
    "category": "electronics",
    "source": "amazon",
    "extracted_contacts": [],
    "price_analysis": {
      "is_outlier": false,
      "price_percentile": 75,
      "trend": "stable"
    }
  }
]

Interactive Dashboard

  • Real-time charts and visualizations
  • Interactive filtering by source, category, date
  • Price distribution analysis with histograms
  • Trend analysis with moving averages
  • Comparative analysis across sources

🔧 Advanced Features

Intelligent Site Detection

The scraper automatically:

  • Detects site type (e-commerce, blog, news, etc.)
  • Identifies product patterns (shopping cart, prices, add to cart buttons)
  • Generates appropriate CSS selectors
  • Adapts to different site structures
  • Caches site analysis for improved performance

Dynamic CSS Selector Generation

  • Smart selector strategies based on element attributes
  • Fallback mechanisms when primary selectors fail (see the sketch below)
  • Validation and optimization of generated selectors
  • Site-specific caching for improved performance
  • Multiple selector types (ID, class, smart, path-based)
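
The fallback idea boils down to trying selectors from most to least specific and keeping the first that matches. A sketch assuming BeautifulSoup; the selector list is illustrative, not one of the project's generated rule sets:

from bs4 import BeautifulSoup

# Ordered from most to least specific; these names are examples only.
TITLE_SELECTORS = ["[itemprop=name]", "h1.product-title", ".product h2", "h1"]

def select_with_fallback(html, selectors):
    """Return the first selector that matches, plus the element's text."""
    soup = BeautifulSoup(html, "html.parser")
    for css in selectors:
        node = soup.select_one(css)
        if node and node.get_text(strip=True):
            return css, node.get_text(strip=True)
    return None, None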

Anti-Detection Measures

  • User agent rotation from a large pool of realistic browsers
  • Proxy rotation with automatic failover
  • Rate limiting with random delays (sketched below)
  • Request header randomization
  • Browser fingerprint spoofing
  • Human-like behavior simulation
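
Rate limiting with random delays can be as simple as jittering the sleep interval around the configured requests_per_minute. A sketch assuming a requests.Session; this is an illustration, not the project's actual stealth code:

import random
import time

import requests

def polite_get(session: requests.Session, url: str,
               requests_per_minute: int = 60) -> requests.Response:
    """Sleep a jittered interval before each request so traffic looks human."""
    base = 60.0 / requests_per_minute
    time.sleep(random.uniform(0.5 * base, 1.5 * base))
    return session.get(url, timeout=15)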

Robust Error Handling

  • Retry logic with exponential backoff (see the sketch below)
  • Graceful degradation when selectors fail
  • Comprehensive logging for debugging
  • Custom exceptions with helpful error messages
  • Automatic recovery from common failures
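
Retry with exponential backoff, driven by the max_retries and retry_delay settings shown in config.json, can be sketched as follows (a standalone illustration, not the project's exact code):

import logging
import random
import time

def fetch_with_retries(fetch, url, max_retries=3, retry_delay=2):
    """Retry a flaky fetch with exponential backoff plus random jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == max_retries:
                raise  # out of retries; surface the last error
            wait = retry_delay * (2 ** attempt) + random.uniform(0, 1)
            logging.warning("attempt %d for %s failed (%s); retrying in %.1fs",
                            attempt + 1, url, exc, wait)
            time.sleep(wait)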

🚀 Performance & Scalability

Optimizations

  • Asynchronous processing for multiple jobs
  • Database caching for site analysis results
  • Efficient memory usage with streaming data processing
  • Background job processing with queue management
  • Parallel processing with worker pools

Monitoring

  • Real-time job status tracking
  • Performance metrics and statistics
  • Error rate monitoring
  • Resource usage tracking
  • Comprehensive logging and debugging

🔒 Security & Privacy

  • No sensitive data logging (credentials, API keys)
  • Input validation and sanitization
  • Secure configuration management
  • Respect for robots.txt (configurable)
  • Data encryption for sensitive information
  • Access control and authentication

🤝 Contributing

We welcome contributions! Please see our CONTRIBUTING.md for detailed guidelines.

Development Setup

# Install development dependencies
pip install -r requirements-dev.txt

# Run the comprehensive test suite
python tests/run_tests.py --all --coverage

# Format code
black scraper/

# Lint code
flake8 scraper/

# Type checking
mypy scraper/

Code Quality

  • Pre-commit hooks for automatic code formatting
  • Comprehensive testing with pytest
  • Code coverage reporting
  • Static analysis with mypy and flake8
  • Security scanning with bandit

πŸ“ Logging & Troubleshooting

Log Files

  • Main logs: logs/scraper.log
  • Error logs: logs/error_YYYY-MM-DD.log
  • Dashboard logs: Console output
  • Test logs: logs/test_YYYY-MM-DD.log

Common Issues

Import Errors

# Ensure all dependencies are installed
pip install -r requirements.txt

# Or install specific dependencies as needed
pip install -r requirements/base.txt

Selenium Issues

# Install webdriver-manager
pip install webdriver-manager

# The scraper will automatically download drivers

Proxy Issues

# Check proxy configuration in config.json
# Disable proxy rotation if needed by setting "use_proxy": false

Test Issues

# Run tests with verbose output
python tests/run_tests.py --all --verbose

# Check test configuration
python tests/run_tests.py --list

📄 License

This project is licensed under the Apache License 2.0 β€” see the LICENSE file for details.


🎉 What's New in This Version

v0.9 - Enterprise Data Intelligence Platform

  • ✅ Complete Test Organization - Professional test suite with organized structure
  • ✅ Integrated Advanced Features - All features working together seamlessly
  • ✅ Comprehensive Analytics - Data visualization, price analysis, time series
  • ✅ Distributed Processing - Job queues, worker pools, parallel processing
  • ✅ Plugin System - Extensible architecture with custom plugins
  • ✅ Automated Reporting - Scheduled reports, email alerts, notifications
  • ✅ Comparative Analysis - Cross-site price comparison and deal discovery
  • ✅ Multi-Format Export - JSON, CSV, Excel, ZIP with metadata
  • ✅ Interactive Dashboards - Web-based data exploration and visualization
  • ✅ Advanced Anti-Bot Evasion - Multiple stealth profiles and detection avoidance

🎯 Ready to Get Started?

  1. Quick Start: Run python run_dashboard.py for the web interface
  2. Command Line: Use python scraper/main.py for direct scraping
  3. Testing: Run python tests/run_tests.py --all to verify everything works
  4. Development: Install dev dependencies with pip install -r requirements-dev.txt

🚀 Transform your web scraping into a comprehensive data intelligence platform!
