Dynamic Web Scraper is a sophisticated, enterprise-grade Python platform for extracting, analyzing, and visualizing data from dynamic websites. It has evolved from a basic scraper into a comprehensive data intelligence platform with advanced features including intelligent site detection, automated data enrichment, price analysis, comparative analysis, and interactive dashboards.
- Automatic site type detection (e-commerce, blog, news, etc.)
- Dynamic CSS selector generation based on site patterns
- Smart product element detection (titles, prices, images, links)
- Site-specific rule caching for improved performance
- Anti-bot measure detection and adaptive responses
- Automatic data cleaning and normalization
- Price normalization across different currencies and formats (see the sketch after this feature list)
- Contact information extraction from product descriptions
- Category classification using intelligent algorithms
- Quality scoring and outlier detection
- Data validation and integrity checks
- Interactive data visualization with Plotly charts
- Price distribution analysis with statistical insights
- Trend detection with time-series analysis
- Comparative analysis across sources and categories
- Heatmap visualizations for pattern recognition
- Summary dashboards with comprehensive metrics
- Export capabilities for reports and presentations
- Multiple stealth profiles (stealth, mobile, aggressive)
- Browser fingerprint spoofing and header manipulation
- Human-like timing delays and behavior simulation
- Undetected ChromeDriver integration for maximum stealth
- Session persistence and cookie management
- CAPTCHA detection and handling capabilities
- Automatic browser automation fallback for JavaScript-heavy sites
- Job Queue System with priority-based scheduling
- Worker Pool Management for parallel processing
- Thread-safe Operations with persistent storage
- Real-time Monitoring and statistics
- Automatic Retry Logic and error recovery
- Scalable Architecture for enterprise use
- Plugin System with multiple plugin types (data processors, validators, custom scrapers)
- Configuration Management with multi-format support (JSON, YAML, TOML)
- Environment Variable Overrides for flexible deployment
- Template Generation for easy plugin development
- Runtime Configuration management and validation
- Statistical analysis (mean, median, std dev, skewness)
- Trend detection (linear regression, moving averages)
- Seasonality analysis (autocorrelation, pattern recognition)
- Anomaly detection (Z-score, IQR, rolling statistics)
- Price prediction with confidence intervals
- Intelligent recommendations based on analysis
- Cross-site price comparison with comprehensive analysis
- Intelligent product matching using similarity algorithms
- Brand and model extraction for accurate identification
- Deal scoring and classification with savings analysis
- Best deal discovery with ranking and recommendations
- Price variance analysis with statistical insights
- Scheduled reports with daily and weekly automation
- Email notifications for price changes and anomalies
- Configurable alert thresholds for price drops and increases
- Statistical anomaly detection using z-score analysis
- HTML email templates with detailed alert information
- Background scheduling with automated report generation
- Real-time monitoring of scraping jobs
- Interactive charts and statistics
- Site analysis and visualization
- Job queue management
- Results viewing and export
- Responsive interface with modern design
- Multiple output formats (JSON, CSV, Excel, ZIP)
- Comprehensive data packaging with metadata
- Export history tracking and management
- Automatic file cleanup and maintenance
- Slack integration for automated sharing
- Batch export capabilities for multiple formats
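The price normalization feature listed above can be pictured with a small helper like the one below. This is a minimal sketch rather than the scraper's actual parser; the symbol map and function name are assumptions for illustration.

```python
import re

# Hypothetical symbol map; the real data parsers may support more currencies.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}

def normalize_price(raw):
    """Parse strings like '$1,299.99' or '999,99 €' into (amount, currency_code)."""
    raw = raw.strip()
    currency = next((code for sym, code in CURRENCY_SYMBOLS.items() if sym in raw), "USD")
    digits = re.sub(r"[^\d.,]", "", raw)
    if not digits:
        return None
    # Treat a trailing comma group as the decimal separator (European formats).
    if "," in digits and digits.rfind(",") > digits.rfind("."):
        digits = digits.replace(".", "").replace(",", ".")
    else:
        digits = digits.replace(",", "")
    return float(digits), currency

print(normalize_price("$1,299.99"))  # (1299.99, 'USD')
print(normalize_price("999,99 €"))   # (999.99, 'EUR')
```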
- Python 3.8 or later
- Google Chrome or Firefox (for Selenium)
- All dependencies are version-pinned in `requirements.txt`
# Clone the repository
git clone <repository-url>
cd dynamic_web_scraper
# Create virtual environment
python -m venv .venv
.venv\Scripts\activate # Windows
# or
source .venv/bin/activate # Mac/Linux
# Install production dependencies (backward compatible)
pip install -r requirements.txt
# Or install only core dependencies (minimal installation)
pip install -r requirements/base.txt
# Install development dependencies (optional)
pip install -r requirements-dev.txt
# Setup and test the scraper
python setup_and_test.py

# Core scraping only (minimal installation)
pip install -r requirements/base.txt
# Core + data analysis and visualization
pip install -r requirements/base.txt -r requirements/data.txt
# Core + advanced features (Cloudflare bypass, Flask dashboard)
pip install -r requirements/base.txt -r requirements/advanced.txt
# Full development environment
pip install -r requirements/all.txt

See `requirements/README.md` for detailed dependency documentation.
# Install all dependencies including development tools
pip install -r requirements-dev.txt
# Run the comprehensive test suite
python tests/run_tests.py --all --coverage
# Format and lint the code
black scraper/
flake8 scraper/

# Start the web dashboard
python run_dashboard.py

The dashboard will automatically open in your browser at http://localhost:5000.
# Run the scraper directly
python scraper/main.py

All settings are managed in config.json:
{
"user_agents": [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)..."
],
"proxies": [
"http://proxy1.example.com:8080",
"http://proxy2.example.com:8080"
],
"use_proxy": true,
"max_retries": 3,
"retry_delay": 2,
"rate_limiting": {
"requests_per_minute": 60
}
}
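As a rough illustration of how the environment-variable overrides mentioned under configuration management could layer on top of config.json, the loader below applies a couple of hypothetical variables (SCRAPER_USE_PROXY, SCRAPER_MAX_RETRIES); the real scraper may use different names and mechanics.

```python
import json
import os

def load_config(path="config.json"):
    """Load config.json, then apply environment overrides (illustrative names)."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)

    # Hypothetical override variables; the project may expose different ones.
    if "SCRAPER_USE_PROXY" in os.environ:
        config["use_proxy"] = os.environ["SCRAPER_USE_PROXY"].lower() in ("1", "true", "yes")
    if "SCRAPER_MAX_RETRIES" in os.environ:
        config["max_retries"] = int(os.environ["SCRAPER_MAX_RETRIES"])
    return config

if __name__ == "__main__":
    print(load_config())
```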
The scraper now automatically uses all advanced features in a comprehensive workflow:

- Automatically detects site type and anti-bot measures
- Generates optimal CSS selectors
- Adapts to different site structures
- Extracts raw data with intelligent parsing
- Cleans and normalizes data
- Adds quality scores and metadata
- Extracts contact information and categories
- Price Analysis: Statistical analysis, outlier detection, trend analysis (see the sketch after this list)
- Comparative Analysis: Cross-site price comparison, deal scoring
- Time Series Analysis: Trend detection, seasonality, predictions
- Data Visualization: Interactive charts and dashboards
- Generates comprehensive reports
- Sends email alerts for anomalies
- Creates interactive dashboards
- Exports data in multiple formats
- Applies custom data processors
- Validates data quality
- Enhances data with external sources
- Queues jobs for parallel processing
- Manages worker pools
- Handles large-scale operations
- Raw data in multiple formats
- Enriched data with quality scores and metadata
- Price analysis with statistical insights
- Cross-site comparisons and best deals
- Interactive visualizations and dashboards
- Automated reports with alerts and recommendations
- Multiple export formats for different use cases
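The outlier detection used during price analysis can be approximated with the z-score and IQR checks sketched below; the thresholds and function name are illustrative, not the analyzer's exact rules.

```python
import statistics

def find_price_outliers(prices, z_threshold=3.0):
    """Flag prices outside the IQR fences or beyond a z-score threshold (illustrative)."""
    if len(prices) < 4:
        return []
    q1, _, q3 = statistics.quantiles(sorted(prices), n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mean, stdev = statistics.mean(prices), statistics.pstdev(prices)

    outliers = []
    for price in prices:
        z = abs(price - mean) / stdev if stdev else 0.0
        if price < low or price > high or z > z_threshold:
            outliers.append(price)
    return outliers

print(find_price_outliers([899, 925, 949, 975, 999, 1020, 4999]))  # [4999]
```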
- Real-time statistics (total jobs, success rate, results count)
- Interactive charts (success rate trends, job status distribution)
- Recent jobs list with status indicators
- Active jobs monitoring with live duration tracking
- URL input with validation
- Output format selection (CSV, JSON, Excel)
- Advanced options (proxy rotation, Selenium)
- Site analysis before scraping
- Real-time feedback and progress tracking
- Job status tracking (pending, running, completed, failed)
- Results viewing with pagination
- Error handling and debugging information
- Export capabilities for scraped data
- Automatic site type detection
- E-commerce pattern recognition
- CSS selector generation
- Confidence scoring
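As a rough idea of how e-commerce pattern recognition and confidence scoring might work, the sketch below counts price-like strings and cart-related keywords in a page; the heuristics and weights are assumptions, not the detector's actual logic.

```python
import re

def ecommerce_confidence(html):
    """Score 0..1 for how strongly a page looks like an e-commerce listing (illustrative)."""
    prices = len(re.findall(r"[$€£]\s?\d[\d.,]*", html))
    cart_hits = sum(html.lower().count(k) for k in ("add to cart", "checkout", "in stock"))
    product_links = html.lower().count("/product")

    score = 0.0
    score += min(prices, 10) * 0.05         # up to 0.5 from visible prices
    score += min(cart_hits, 5) * 0.06       # up to 0.3 from cart keywords
    score += min(product_links, 10) * 0.02  # up to 0.2 from product-style URLs
    return round(min(score, 1.0), 2)

sample = '<a href="/product/123">iPhone 13 Pro</a> <span>$999.99</span> <button>Add to cart</button>'
print(ecommerce_confidence(sample))  # 0.13
```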
# Basic usage
python scraper/main.py
# You will be prompted for:
# - Target URL
# - Output file path (default: data/all_listings.csv)

dynamic_web_scraper/
├── scraper/                          # Core scraping logic
│   ├── main.py                       # Command line entry point
│   ├── Scraper.py                    # Main scraper class with all features
│   ├── config.py                     # Configuration management
│   ├── dashboard/                    # Web dashboard
│   │   ├── app.py                    # Flask application
│   │   └── templates/                # HTML templates
│   ├── analytics/                    # Data analysis and visualization
│   │   ├── data_visualizer.py        # Interactive charts and dashboards
│   │   ├── price_analyzer.py         # Price analysis and statistics
│   │   └── time_series_analyzer.py   # Time series analysis and prediction
│   ├── comparison/                   # Cross-site comparison
│   │   └── site_comparator.py        # Product matching and deal analysis
│   ├── reporting/                    # Automated reporting
│   │   └── automated_reporter.py     # Reports, alerts, and notifications
│   ├── export/                       # Data export and sharing
│   │   └── export_manager.py         # Multi-format export capabilities
│   ├── plugins/                      # Plugin system
│   │   └── plugin_manager.py         # Plugin management and extensibility
│   ├── distributed/                  # Distributed processing
│   │   ├── job_queue.py              # Job queue system
│   │   └── worker_pool.py            # Worker pool management
│   ├── anti_bot/                     # Anti-bot evasion
│   │   └── stealth_manager.py        # Stealth and anti-detection
│   ├── site_detection/               # Intelligent site detection
│   │   ├── site_detector.py          # Site structure detection
│   │   ├── html_analyzer.py          # HTML analysis
│   │   └── css_selector_builder.py   # Selector building
│   ├── css_selectors/                # Dynamic selector system
│   │   ├── css_selector_generator.py # Selector generation
│   │   ├── css_rules.py              # Rule management
│   │   └── dynamic_selector.py       # Site adaptation
│   ├── data_parsers/                 # Data processing
│   ├── proxy_manager/                # Proxy handling
│   ├── user_agent_manager/           # User agent management
│   ├── logging_manager/              # Logging system
│   └── exceptions/                   # Custom exceptions
├── tests/                            # Comprehensive test suite
│   ├── core/                         # Core functionality tests
│   ├── analytics/                    # Analytics and visualization tests
│   ├── site_detection/               # Site detection tests
│   ├── utils/                        # Utility function tests
│   ├── integration/                  # Integration tests
│   ├── conftest.py                   # Pytest configuration and fixtures
│   └── run_tests.py                  # Test runner script
├── data/                             # Output data storage
├── logs/                             # Log files
├── config.json                       # Configuration
├── requirements/                     # Organized dependency files
│   ├── base.txt                      # Core dependencies
│   ├── web.txt                       # Scheduling features
│   ├── data.txt                      # Data analysis (optional)
│   ├── advanced.txt                  # Experimental features
│   ├── testing.txt                   # Testing framework
│   ├── dev.txt                       # Development tools
│   ├── all.txt                       # All dependencies
│   └── README.md                     # Dependency documentation
├── requirements.txt                  # Backward compatibility wrapper
├── requirements-dev.txt              # Backward compatibility wrapper
├── pytest.ini                        # Pytest configuration
├── run_dashboard.py                  # Dashboard launcher
└── README.md                         # Documentation
The project includes a comprehensive, organized test suite with professional structure:
# Run the complete test suite
python tests/run_tests.py --all
# Run with coverage
python tests/run_tests.py --all --coverage
# Run with HTML report
python tests/run_tests.py --all --html

# Core functionality tests
python tests/run_tests.py --category core
# Analytics and visualization tests
python tests/run_tests.py --category analytics
# Site detection tests
python tests/run_tests.py --category site_detection
# Utility function tests
python tests/run_tests.py --category utils
# Integration tests
python tests/run_tests.py --category integration

# Run only unit tests (fast)
python tests/run_tests.py --quick

| Category | Location | Purpose |
|---|---|---|
| core | tests/core/ | Basic scraper functionality and integration |
| analytics | tests/analytics/ | Data analysis and visualization |
| site_detection | tests/site_detection/ | Site detection and CSS selector generation |
| utils | tests/utils/ | Utility functions |
| integration | tests/integration/ | Complete workflow testing |
# Run all tests
pytest tests/
# Run specific category
pytest tests/analytics/
# Run with markers
pytest -m "not slow"
pytest -m integration

title,price,image,link,quality_score,category,source
iPhone 13 Pro,$999.99,https://example.com/iphone.jpg,https://example.com/iphone,0.95,electronics,amazon
Samsung Galaxy S21,$899.99,https://example.com/samsung.jpg,https://example.com/samsung,0.92,electronics,ebay

[
  {
    "title": "iPhone 13 Pro",
    "price": 999.99,
    "currency": "USD",
    "image": "https://example.com/iphone.jpg",
    "link": "https://example.com/iphone",
    "quality_score": 0.95,
    "category": "electronics",
    "source": "amazon",
    "extracted_contacts": [],
    "price_analysis": {
      "is_outlier": false,
      "price_percentile": 75,
      "trend": "stable"
    }
  }
]

- Real-time charts and visualizations
- Interactive filtering by source, category, date
- Price distribution analysis with histograms
- Trend analysis with moving averages
- Comparative analysis across sources
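Trend analysis with moving averages, as surfaced in the dashboard charts, can be approximated with pandas along these lines; the price history and column names are made up for the example.

```python
import pandas as pd

# Hypothetical price history; the dashboard reads real scrape results instead.
history = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "price": [999, 995, 1001, 989, 979, 975, 969, 972, 960, 955],
})

# A 3-day moving average smooths daily noise before judging the trend.
history["ma_3"] = history["price"].rolling(window=3).mean()

# Compare the latest smoothed value with an earlier one to label the trend.
change = history["ma_3"].iloc[-1] - history["ma_3"].iloc[2]
trend = "falling" if change < 0 else "rising" if change > 0 else "stable"
print(history.tail(3))
print("trend:", trend)  # trend: falling
```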
The scraper automatically:
- Detects site type (e-commerce, blog, news, etc.)
- Identifies product patterns (shopping cart, prices, add to cart buttons)
- Generates appropriate CSS selectors
- Adapts to different site structures
- Caches site analysis for improved performance
- Smart selector strategies based on element attributes
- Fallback mechanisms for when primary selectors fail
- Validation and optimization of generated selectors
- Site-specific caching for improved performance
- Multiple selector types (ID, class, smart, path-based)
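The fallback mechanism can be pictured as walking a ranked list of selectors until one matches, as in this BeautifulSoup-based sketch; the selector list and function name are illustrative, not the generator's actual output.

```python
from bs4 import BeautifulSoup

# Ordered from most specific to most generic; the real selector builder
# generates these per site, so this list is purely illustrative.
TITLE_SELECTORS = ["h1.product-title", "[itemprop='name']", "h1", "title"]

def select_with_fallback(html, selectors):
    """Return elements from the first selector that matches anything."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in selectors:
        found = soup.select(selector)
        if found:
            return selector, found
    return None, []

html = "<html><body><h1>iPhone 13 Pro</h1></body></html>"
selector, elements = select_with_fallback(html, TITLE_SELECTORS)
print(selector, [el.get_text(strip=True) for el in elements])  # h1 ['iPhone 13 Pro']
```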
- User agent rotation from a large pool of realistic browsers
- Proxy rotation with automatic failover
- Rate limiting with random delays
- Request header randomization
- Browser fingerprint spoofing
- Human-like behavior simulation
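A stripped-down version of user-agent rotation with randomized delays might look like the requests-based sketch below; the agent pool and delay bounds are placeholders, and the real scraper additionally rotates proxies and spoofs more headers.

```python
import random
import time

import requests

# Placeholder pool; the scraper draws from a much larger list of realistic agents.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def polite_get(url):
    """Fetch a URL with a rotated user agent and a human-like random delay."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    time.sleep(random.uniform(1.0, 3.0))  # crude rate limiting between requests
    return requests.get(url, headers=headers, timeout=10)

if __name__ == "__main__":
    response = polite_get("https://example.com")
    print(response.status_code, response.headers.get("Content-Type"))
```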
- Retry logic with exponential backoff
- Graceful degradation when selectors fail
- Comprehensive logging for debugging
- Custom exceptions with helpful error messages
- Automatic recovery from common failures
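The retry logic with exponential backoff could be expressed as a small decorator like the one below; the delays, exception handling, and names are illustrative rather than the scraper's actual implementation.

```python
import functools
import random
import time

def retry(max_retries=3, base_delay=2.0):
    """Retry a failing call with exponential backoff plus a little jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_retries:
                        raise
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
                    print(f"Attempt {attempt + 1} failed ({exc!r}); retrying in {delay:.1f}s")
                    time.sleep(delay)
        return wrapper
    return decorator

calls = {"n": 0}

@retry(max_retries=2, base_delay=0.1)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network hiccup")
    return "ok"

print(flaky())  # succeeds on the third attempt
```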
- Asynchronous processing for multiple jobs
- Database caching for site analysis results
- Efficient memory usage with streaming data processing
- Background job processing with queue management
- Parallel processing with worker pools
- Real-time job status tracking
- Performance metrics and statistics
- Error rate monitoring
- Resource usage tracking
- Comprehensive logging and debugging
- No sensitive data logging (credentials, API keys)
- Input validation and sanitization
- Secure configuration management
- Respect for robots.txt (configurable)
- Data encryption for sensitive information
- Access control and authentication
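To illustrate the "no sensitive data logging" point, a logging filter along these lines could redact credential-like fields before records reach the log files; the pattern and filter name are assumptions, not necessarily what the logging manager does.

```python
import logging
import re

class RedactSecretsFilter(logging.Filter):
    """Mask values of credential-like keys in log messages (illustrative)."""
    PATTERN = re.compile(r"(password|api_key|token)=\S+", re.IGNORECASE)

    def filter(self, record):
        record.msg = self.PATTERN.sub(r"\1=***", str(record.msg))
        return True  # keep the record, just with secrets masked

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("scraper")
logger.addFilter(RedactSecretsFilter())

logger.info("login attempt with password=hunter2 api_key=abc123")
# INFO login attempt with password=*** api_key=***
```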
We welcome contributions! Please see our CONTRIBUTING.md for detailed guidelines.
# Install development dependencies
pip install -r requirements-dev.txt
# Run the comprehensive test suite
python tests/run_tests.py --all --coverage
# Format code
black scraper/
# Lint code
flake8 scraper/
# Type checking
mypy scraper/

- Pre-commit hooks for automatic code formatting
- Comprehensive testing with pytest
- Code coverage reporting
- Static analysis with mypy and flake8
- Security scanning with bandit
- Main logs: `logs/scraper.log`
- Error logs: `logs/error_YYYY-MM-DD.log`
- Dashboard logs: Console output
- Test logs: `logs/test_YYYY-MM-DD.log`
# Ensure all dependencies are installed
pip install -r requirements.txt

# Or install specific dependencies as needed
pip install -r requirements/base.txt

# Install webdriver-manager
pip install webdriver-manager

# The scraper will automatically download drivers

# Check proxy configuration in config.json
# Disable proxy rotation if needed

# Run tests with verbose output
python tests/run_tests.py --all --verbose

# Check test configuration
python tests/run_tests.py --list

This project is licensed under the Apache License 2.0; see the LICENSE file for details.
- Complete Test Organization - Professional test suite with organized structure
- Integrated Advanced Features - All features working together seamlessly
- Comprehensive Analytics - Data visualization, price analysis, time series
- Distributed Processing - Job queues, worker pools, parallel processing
- Plugin System - Extensible architecture with custom plugins
- Automated Reporting - Scheduled reports, email alerts, notifications
- Comparative Analysis - Cross-site price comparison and deal discovery
- Multi-Format Export - JSON, CSV, Excel, ZIP with metadata
- Interactive Dashboards - Web-based data exploration and visualization
- Advanced Anti-Bot Evasion - Multiple stealth profiles and detection avoidance
- Quick Start: Run `python run_dashboard.py` for the web interface
- Command Line: Use `python scraper/main.py` for direct scraping
- Testing: Run `python tests/run_tests.py --all` to verify everything works
- Development: Install dev dependencies with `pip install -r requirements-dev.txt`
Transform your web scraping into a comprehensive data intelligence platform!