
Crawl Agent 4AI 🕷️

Advanced asynchronous web scraping with LLM-powered content extraction and Streamlit UI

Introduction

Crawl Agent 4AI is designed to efficiently scrape websites while handling dynamic content and respecting robots.txt. With both LLM-based and structured extraction modes, it caters to different scraping needs. The project uses asynchronous functions for improved performance and leverages Streamlit to provide an interactive UI for initiating and monitoring scrapes.
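The concurrency model mentioned above can be sketched with plain `asyncio`; the `scrape` coroutine below is a hypothetical stand-in for the project's real crawler call, shown only to illustrate how multiple pages are fetched without blocking:

```python
import asyncio

async def scrape(url: str) -> str:
    # Hypothetical stand-in for the real crawler call; a production
    # version would fetch and parse the page here.
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"content of {url}"

async def scrape_all(urls: list[str]) -> list[str]:
    # Launch all scrapes concurrently instead of one after another.
    return await asyncio.gather(*(scrape(u) for u in urls))

results = asyncio.run(scrape_all(["https://example.com/a", "https://example.com/b"]))
print(results)
```

Because the coroutines are scheduled together with `asyncio.gather`, total wall time is bounded by the slowest page rather than the sum of all requests.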

✨ Key Features

  • 🚀 Async Crawling - Fast, non-blocking operations with asyncio
  • 🤖 LLM Extraction - AI-powered content understanding and extraction
  • 🎯 Structured Extraction - Precise CSS-based data selection
  • 🎨 Modern UI - Clean Streamlit interface for easy operation
  • 🛡️ Smart Protection - Automatic robots.txt validation
  • ⚡ Dynamic Content - Handles JavaScript-rendered pages
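The robots.txt validation behind Smart Protection can be illustrated with Python's standard-library `urllib.robotparser`. This is a minimal sketch, not the app's exact code:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check whether `url` may be crawled under the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/public/page"))   # True
print(is_allowed(rules, "https://example.com/private/page"))  # False
```

In practice the rules are fetched from the target site's `/robots.txt` before any crawl is started.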

🚀 Quick Start

```bash
# Install
git clone https://github.com/arben-adm/crawl-agent-4ai.git
cd crawl-agent-4ai
pip install -r requirements.txt
crawl4ai-setup

# Confirm everything is working
crawl4ai-doctor

# Run
streamlit run app/main.py
```

💡 Usage

  1. Start the app with `streamlit run app/main.py`
  2. Enter the target URL and select a mode:
    • LLM Mode: AI-powered content extraction
    • Structured Mode: CSS-based precise extraction
  3. Configure advanced settings if needed:
    • Dynamic content wait time
    • Hidden elements extraction
    • Custom extraction rules
  4. Click "Start Scraping" and view the results in tabs
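Structured mode's CSS-class-based selection can be approximated with the standard-library `html.parser`. The class below is an illustrative simplification (it ignores void elements such as `<br>`), not the app's implementation:

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text inside elements carrying a given class attribute:
    a simplified stand-in for CSS-based structured extraction."""

    def __init__(self, cls: str):
        super().__init__()
        self.cls = cls
        self.depth = 0      # >0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1  # track nesting inside a matched element
        elif self.cls in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data

html = '<div class="title">Hello</div><p class="body">World</p>'
p = ClassTextExtractor("title")
p.feed(html)
print(p.results)  # ['Hello']
```

A real structured-mode run applies selectors like these across the fetched page and returns the matched fields as structured data.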

⚙️ Configuration

| Setting          | Description                 | Default              |
|------------------|-----------------------------|----------------------|
| Dynamic Wait     | Time to wait for JS content | 5 s                  |
| Process Dynamic  | Handle JS-rendered content  | `true`               |
| Extract Hidden   | Include hidden elements     | `true`               |
| LLM Instructions | Custom extraction rules     | `"Extract all text"` |
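These defaults could be modeled as a simple dataclass; the field names here are hypothetical, chosen only to mirror the settings table above:

```python
from dataclasses import dataclass

@dataclass
class ScrapeSettings:
    dynamic_wait: float = 5.0                    # seconds to wait for JS content
    process_dynamic: bool = True                 # handle JS-rendered content
    extract_hidden: bool = True                  # include hidden elements
    llm_instructions: str = "Extract all text"   # custom extraction rules

defaults = ScrapeSettings()
print(defaults)
```

Keeping the defaults in one place like this makes it easy to override individual settings per run, e.g. `ScrapeSettings(dynamic_wait=10.0)`.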

🔧 Troubleshooting

  • Scraping Not Allowed: If the website disallows crawling via robots.txt, the app will show an error. Check the URL or try another site.
  • Errors During Scraping: Ensure you have a stable internet connection and that all dependencies are installed.

🤝 Contributing

  1. Fork the repo
  2. Create feature branch (git checkout -b feature/amazing)
  3. Commit changes (git commit -am 'Add something amazing')
  4. Push branch (git push origin feature/amazing)
  5. Open a Pull Request

📝 License

MIT © Arben Ademi
