Unleashing AI-Powered Web Scraping at Scale
Deep Seek Crawler represents the next generation of web scraping, combining asyncio's power with DeepSeek's AI capabilities to transform chaotic web data into structured intelligence. Built for performance, scalability, and precision.
- Smart Pagination: Autonomous detection of result boundaries and page termination
- Duplicate Prevention: Intelligent tracking of seen venues using efficient set operations (see the sketch after this list)
- Polite Crawling: Built-in rate limiting with configurable sleep intervals
- Robust Error Handling: Graceful handling of no-results scenarios
- Asynchronous Architecture: Built on Python's asyncio for maximum performance
- Modular Design: Clean separation of concerns with utility modules
- Session Management: Persistent crawling sessions with automatic cleanup
- CSV Export: Structured data output with comprehensive venue information
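A minimal sketch of the duplicate-prevention and polite-crawling features above (the names `seen_venues`, `is_duplicate`, and `polite_pause` are illustrative, not the project's actual identifiers):

```python
import asyncio

seen_venues: set[str] = set()

def is_duplicate(name: str) -> bool:
    """Set membership is O(1), so skipping already-seen venues stays cheap."""
    if name in seen_venues:
        return True
    seen_venues.add(name)
    return False

async def polite_pause(seconds: float = 2.0) -> None:
    """A configurable sleep between requests keeps the crawler polite."""
    await asyncio.sleep(seconds)
```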
```mermaid
graph TD
    A[Main Crawler] --> B[AsyncWebCrawler]
    B --> C[Page Processor]
    C --> D[LLM Strategy]
    D --> E[Data Exporter]
    B --> F[Browser Config]
    C --> G[Data Utils]
    G --> E
```
- AsyncWebCrawler: High-performance asynchronous crawling engine
- LLM Strategy: AI-powered content extraction and processing
- Browser Configuration: Customizable crawler behavior settings
- Data Utilities: Robust data processing and export functionality
- Efficient Memory Usage: Set-based duplicate detection
- Controlled Crawling: Configurable delay between requests
- Graceful Termination: Smart detection of crawl completion
- Usage Statistics: Built-in LLM strategy usage tracking
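A rough sketch of how these components are wired together with crawl4ai (exact parameter names vary across crawl4ai versions, and the Groq model string and venue schema here are assumptions, not the project's actual configuration):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class Venue(BaseModel):          # assumed shape of an extracted record
    name: str
    address: str
    rating: float

browser_config = BrowserConfig(headless=True)   # customizable crawler behavior

llm_strategy = LLMExtractionStrategy(
    provider="groq/deepseek-r1-distill-llama-70b",  # assumed DeepSeek model on Groq
    api_token="your_api_key",
    schema=Venue.model_json_schema(),
    extraction_type="schema",
    instruction="Extract every venue's name, address, and rating.",
)

async def crawl_one(url: str) -> str:
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url=url,
            config=CrawlerRunConfig(extraction_strategy=llm_strategy),
        )
    llm_strategy.show_usage()        # built-in LLM usage statistics
    return result.extracted_content  # JSON string produced by the strategy
```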
- Clone & Setup:

  ```bash
  git clone https://github.com/oussemabenhassena5/Crawl4DeepSeek.git
  cd Crawl4DeepSeek
  python -m venv venv && source venv/bin/activate
  pip install -r requirements.txt
  ```

- Configure Environment:

  ```bash
  # .env file
  GROQ_API_KEY=your_api_key
  ```

- Launch Crawler:

  ```bash
  python crawler.py
  ```
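At startup the crawler needs to read that key from the environment; a minimal sketch, assuming python-dotenv is among the dependencies:

```python
# config.py (sketch)
import os

from dotenv import load_dotenv

load_dotenv()  # pull variables from the .env file into the process environment
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
if not GROQ_API_KEY:
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file")
```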
```
crawl4deepseek/
├── crawler.py            # Main crawling script
├── config.py             # Configuration settings
├── utils/
│   ├── data_utils.py     # Data processing utilities
│   └── scraper_utils.py  # Crawling utility functions
├── requirements.txt      # Project dependencies
└── .env                  # Environment configuration
```
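As a taste of the export functionality in utils/data_utils.py, writing venues to CSV can be as small as this (the function name and its assumption of uniform dict keys are illustrative):

```python
# utils/data_utils.py (sketch)
import csv

def save_venues_to_csv(venues: list[dict], path: str = "venues.csv") -> None:
    """Write extracted venue records to a CSV file with a header row."""
    if not venues:
        print("No venues to save.")
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=venues[0].keys())
        writer.writeheader()
        writer.writerows(venues)
```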
- Async Processing: Efficient handling of concurrent page fetches
- Smart State Management: Tracking of seen venues and crawl progress
- Configurable Behavior: Easy-to-modify crawler settings
- Comprehensive Logging: Detailed crawl progress and statistics
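The "Configurable Behavior" point typically amounts to plain constants in config.py; the names and values below are illustrative placeholders, not the project's real settings:

```python
# config.py (sketch of tunable crawler settings)
BASE_URL = "https://example.com/venues"        # hypothetical target site
CSS_SELECTOR = ".venue-card"                   # hypothetical result container
MAX_PAGES = 10                                 # pagination upper bound
SLEEP_BETWEEN_REQUESTS = 2                     # seconds between page fetches
REQUIRED_KEYS = ["name", "address", "rating"]  # fields a valid record needs
```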
The crawler follows a systematic approach, tied together in the sketch after this list:
- Initializes configurations and strategies
- Processes pages asynchronously
- Checks for duplicate venues
- Exports structured data
- Provides usage statistics
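Put together, the lifecycle looks roughly like this self-contained sketch (the URL, model string, schema, and field names are assumptions; the real crawler.py's concurrency and helpers may differ):

```python
import asyncio
import csv
import json
import os

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

BASE_URL = "https://example.com/venues?page={page}"  # hypothetical target

async def main() -> None:
    # 1. Initialize configurations and strategies
    strategy = LLMExtractionStrategy(
        provider="groq/deepseek-r1-distill-llama-70b",  # assumed model name
        api_token=os.getenv("GROQ_API_KEY"),
        schema={"type": "object", "properties": {
            "name": {"type": "string"},
            "address": {"type": "string"},
            "rating": {"type": "number"}}},
        extraction_type="schema",
        instruction="Return each venue as JSON with name, address, rating.",
    )
    seen, rows = set(), []
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        page = 1
        while True:
            # 2. Process pages (shown sequentially here for brevity)
            result = await crawler.arun(
                url=BASE_URL.format(page=page),
                config=CrawlerRunConfig(extraction_strategy=strategy),
            )
            venues = json.loads(result.extracted_content or "[]")
            if not result.success or not venues:
                break  # no results: terminate gracefully
            # 3. Check for duplicate venues
            for venue in venues:
                if venue.get("name") not in seen:
                    seen.add(venue.get("name"))
                    rows.append(venue)
            page += 1
            await asyncio.sleep(2)  # polite delay between requests
    # 4. Export structured data
    if rows:
        with open("venues.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
    # 5. Provide usage statistics
    strategy.show_usage()

if __name__ == "__main__":
    asyncio.run(main())
```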
Planned enhancements include:
- Enhanced error recovery mechanisms
- Multi-site crawling support
- Advanced data validation
- Performance optimization for large-scale crawls
Contributions are welcome! Feel free to submit issues and pull requests.
Distributed under the MIT License. See LICENSE for more information.