πŸ•ΈοΈ DeepSeek Crawler

Python 3.12 · Crawl4AI · DeepSeek · MIT License

Unleashing AI-Powered Web Scraping at Scale πŸš€

🎯 What Makes This Special

DeepSeek Crawler represents the next generation of web scraping, combining the power of asyncio with DeepSeek's AI capabilities to transform chaotic web data into structured intelligence. Built for performance, scalability, and precision.

πŸ”₯ Key Features

Intelligence Layer

  • Smart Pagination: Autonomous detection of result boundaries and page termination
  • Duplicate Prevention: Intelligent tracking of seen venues using efficient set operations
  • Polite Crawling: Built-in rate limiting with configurable sleep intervals
  • Robust Error Handling: Graceful handling of no-results scenarios
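The intelligence-layer bullets above can be sketched as a single pagination loop. This is a minimal illustration, not the repository's actual code: `fetch_page` is a hypothetical coroutine that returns a list of venue dicts (empty when results run out), and the field name `"name"` is an assumed dedup key.

```python
import asyncio

async def crawl_pages(fetch_page, max_pages=10, delay=2.0):
    """Sketch of the pagination loop: dedup via a set, polite delays,
    and graceful termination when a page returns no results."""
    seen_names = set()          # duplicate prevention: O(1) membership checks
    results = []
    for page in range(1, max_pages + 1):
        venues = await fetch_page(page)
        if not venues:          # no-results scenario: stop cleanly
            break
        for venue in venues:
            if venue["name"] in seen_names:
                continue        # skip venues we've already collected
            seen_names.add(venue["name"])
            results.append(venue)
        await asyncio.sleep(delay)  # polite crawling: configurable sleep
    return results
```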

Engineering Excellence

  • Asynchronous Architecture: Built on Python's asyncio for maximum performance
  • Modular Design: Clean separation of concerns with utility modules
  • Session Management: Persistent crawling sessions with automatic cleanup
  • CSV Export: Structured data output with comprehensive venue information
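The CSV export step might look like the following sketch, built on the standard library's `csv.DictWriter`. The field names are assumptions for illustration and may differ from the project's actual venue schema.

```python
import csv

def export_venues(venues, path="venues.csv"):
    """Write a list of venue dicts to CSV; unknown keys are ignored
    so partial records don't break the export."""
    fieldnames = ["name", "address", "rating"]  # assumed schema
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(venues)
```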

πŸ—οΈ Architecture

```mermaid
graph TD
    A[Main Crawler] --> B[AsyncWebCrawler]
    B --> C[Page Processor]
    C --> D[LLM Strategy]
    D --> E[Data Exporter]
    B --> F[Browser Config]
    C --> G[Data Utils]
    G --> E
```

πŸ’» Technical Implementation

Core Components

  • AsyncWebCrawler: High-performance asynchronous crawling engine
  • LLM Strategy: AI-powered content extraction and processing
  • Browser Configuration: Customizable crawler behavior settings
  • Data Utilities: Robust data processing and export functionality
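A plain-dataclass stand-in shows the shape of the browser/crawler configuration component. The field names here are illustrative assumptions, not the real contents of `config.py`:

```python
from dataclasses import dataclass

@dataclass
class CrawlSettings:
    """Hypothetical configuration object bundling crawler behavior;
    defaults are examples, not the project's actual values."""
    headless: bool = True        # run the browser without a UI
    request_delay: float = 2.0   # seconds between page fetches
    max_pages: int = 10          # pagination upper bound
    css_selector: str = ".venue-card"  # assumed content selector
```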

Performance Features

  • Efficient Memory Usage: Set-based duplicate detection
  • Controlled Crawling: Configurable delay between requests
  • Graceful Termination: Smart detection of crawl completion
  • Usage Statistics: Built-in LLM strategy usage tracking

πŸš€ Quick Start

  1. Clone & Setup:

     ```bash
     git clone https://github.com/oussemabenhassena5/Crawl4DeepSeek.git
     cd Crawl4DeepSeek
     python -m venv venv && source venv/bin/activate
     pip install -r requirements.txt
     ```

  2. Configure Environment:

     ```bash
     # .env file
     GROQ_API_KEY=your_api_key
     ```

  3. Launch Crawler:

     ```bash
     python crawler.py
     ```
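At startup the crawler needs the key from step 2. A minimal sketch of reading it, assuming the `.env` file has already been loaded into the environment (e.g. via `python-dotenv`):

```python
import os

def get_api_key():
    """Read GROQ_API_KEY from the environment, failing fast with a
    clear message when it is missing."""
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set -- add it to your .env file")
    return key
```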

πŸ“Š Project Structure

```
crawl4deepseek/
├── crawler.py           # Main crawling script
├── config.py            # Configuration settings
├── utils/
│   ├── data_utils.py    # Data processing utilities
│   └── scraper_utils.py # Crawling utility functions
├── requirements.txt     # Project dependencies
└── .env                 # Environment configuration
```

πŸ› οΈ Engineering Highlights

  • Async Processing: Efficient handling of concurrent page fetches
  • Smart State Management: Tracking of seen venues and crawl progress
  • Configurable Behavior: Easy-to-modify crawler settings
  • Comprehensive Logging: Detailed crawl progress and statistics
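Concurrent page fetches with bounded parallelism can be sketched in a few lines of asyncio. `fetch` is a hypothetical coroutine taking a URL; the semaphore caps in-flight requests so the crawler stays polite under concurrency:

```python
import asyncio

async def fetch_all(urls, fetch, max_concurrency=5):
    """Fetch many URLs concurrently, never more than
    max_concurrency at once; results keep the input order."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:         # limit simultaneous fetches
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))
```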

πŸ”„ Development Workflow

The crawler follows a systematic approach:

  1. Initializes configurations and strategies
  2. Processes pages asynchronously
  3. Checks for duplicate venues
  4. Exports structured data
  5. Provides usage statistics

🎯 Future Roadmap

  • Enhanced error recovery mechanisms
  • Multi-site crawling support
  • Advanced data validation
  • Performance optimization for large-scale crawls

🀝 Contributing

Contributions are welcome! Feel free to submit issues and pull requests.

πŸ“œ License

Distributed under the MIT License. See LICENSE for more information.


Built with πŸ’» by Oussema Ben Hassena

Transforming Web Data into Intelligence

LinkedIn
