An AI-powered web scraper that extracts and processes website content using Crawl4AI, LangChain, HuggingFace Embeddings, FAISS, and GROQ LLMs. It features a simple Gradio UI and lets users download the extracted text and ask questions about the scraped content.
- 🌐 Crawl websites asynchronously with `Crawl4AI`
- 📄 Extract and chunk website text data
- 🧾 Download extracted content as a `.txt` file
- 🤖 Embed content using `HuggingFaceEmbeddings`
- 🔍 Perform semantic search using `FAISS`
- 💬 Answer questions using GROQ LLM (via LangChain)
- 🎛️ Clean and interactive UI using `Gradio`
- Python
- LangChain
- FAISS
- HuggingFace Transformers
- Crawl4AI
- Gradio
- GROQ LLM
- dotenv
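The chunk, embed, and search steps listed above can be illustrated with a small dependency-free sketch. It is not the code in `app.py`: the word-count "embedding" is a toy stand-in for `HuggingFaceEmbeddings`, the brute-force cosine ranking is a stand-in for a FAISS index, and the function names (`chunk_text`, `embed`, `top_k`) and default sizes are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is not fully contained in the previous one.
    stop = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (brute force, not FAISS)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

In a pipeline like the one described here, the chunks returned by `top_k` would be passed to the LLM as context for answering the user's question.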
- Clone the Repo
git clone https://github.com/jasoncobra3/WebScraper_AI_Agent.git
cd WebScraper_AI_Agent
- Create Virtual Environment
python -m venv venv
- Activate the Virtual Environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
- Install Dependencies
pip install -r requirements.txt
- Create a `.env` file in the root folder with:
GROQ_API_KEY=your_groq_api_key_here
- Run the Script in Terminal
python app.py
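As a rough illustration of how the `GROQ_API_KEY` entry in `.env` can reach the application, here is a minimal sketch. The project lists dotenv in its stack; to stay dependency-free, this example uses a small hand-written parser as a stand-in for it, and the function names `read_env_file` and `get_groq_api_key` are hypothetical rather than taken from `app.py`.

```python
import os
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_groq_api_key(path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GROQ_API_KEY") or read_env_file(path).get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; add it to .env or the environment")
    return key
```

Failing early with a clear message when the key is missing tends to be easier to debug than an authentication error raised later by the LLM client.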
├── app.py
├── requirements.txt
├── .env
├── .gitignore
├── README.md
└── Assets/
[Screenshots: 🌐 Scraping Website | 📄 Semantic Search]
Pull requests are welcome. For major changes, please open an issue first to discuss what you’d like to change.