A lightweight, privacy-focused Retrieval-Augmented Generation (RAG) pipeline for crawling, indexing, and querying web content locally. Perfect for grounded AI responses on specific sites (e.g., IANA.org) without external APIs. Runs entirely on your machine using CPU-friendly tools.
- Crawler: Scrapes pages ethically (respects robots.txt), extracts text, limits depth/pages.
- Indexer: Chunks text, embeds with Sentence Transformers, stores in FAISS for fast similarity search.
- Asker: Retrieves relevant chunks, generates answers via local LLM (e.g., Ollama), cites sources, refuses off-topic queries.
- Low-Resource: Optimized for 8GB RAM systems; per-chunk processing avoids crashes.
- Eval-Friendly: Measures latency, handles refusals for assignment demos.
- Python 3.8+
- BeautifulSoup (crawling)
- Sentence Transformers (embeddings)
- FAISS (vector search)
- tqdm, logging (progress)
- Optional: Ollama or Hugging Face Transformers for generation in `asker.py`
Install via pip:

```
pip install beautifulsoup4 requests sentence-transformers faiss-cpu tqdm numpy torch
```

(`pickle` and `logging` are in the standard library on Python 3.8+, so they need no install.)
- For the LLM: install Ollama (`ollama run llama3`) or use Hugging Face Transformers.
- Model: `all-MiniLM-L6-v2` downloads on first run (~90 MB).
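If you wire `asker.py` to Ollama, generation can go through Ollama's local HTTP API (default port 11434, `POST /api/generate`). A minimal sketch, assuming the server is already running via `ollama run llama3`; the function names are illustrative, not part of this repo:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(prompt, model="llama3"):
    """Request body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="llama3"):
    """Send a prompt to the local Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Using the standard library here keeps `asker.py` dependency-free; swapping in `requests` works the same way.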
- Clone the repo: `git clone <repo-url>; cd rag-project`
- Create the `data/` folder: `mkdir data`
- (Optional) Edit `asker.py` to plug in a custom LLM.
Run sequentially:
1. Crawl (e.g., the IANA site, max 5 pages):

   ```
   python crawler.py --start_url https://www.iana.org --max_pages 5 --max_depth 2
   ```

   - Output: `data/pages.json` (raw text per URL).
   - Tips: small limits prevent overload; check the console logs.
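The crawler's two core concerns, robots.txt compliance and text extraction, can be sketched with the standard library alone (the project itself uses `requests` + BeautifulSoup; the function names here are illustrative):

```python
import urllib.robotparser
from html.parser import HTMLParser

def allowed_by_robots(robots_url, page_url, agent="*"):
    """Check robots.txt before fetching a page (stdlib robotparser)."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetches and parses robots.txt
    return rp.can_fetch(agent, page_url)

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Flatten an HTML page to whitespace-joined visible text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

BeautifulSoup's `get_text()` does the extraction step in one call; the class above just makes the skip-scripts logic explicit.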
2. Index (embed and build the FAISS index):

   ```
   python indexer.py --chunk_size 256
   ```

   - Output: `data/index.faiss`, `data/metadata.pkl`.
   - JSON result, e.g. `{"vector_count": 45, "errors": []}`.
   - For low RAM: use a smaller `--chunk_size` or a lighter model, e.g. `--embedding_model sentence-transformers/paraphrase-MiniLM-L3-v2`.
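Indexing boils down to: split each page into overlapping chunks, embed them, and add the vectors to a FAISS index. The chunking part, which `--chunk_size` and `--chunk_overlap` control, can be sketched in pure Python (word-based windows are an assumption; the real script may split differently):

```python
def chunk_text(text, chunk_size=256, chunk_overlap=50):
    """Split text into word windows of chunk_size words, overlapping by chunk_overlap."""
    words = text.split()
    step = max(chunk_size - chunk_overlap, 1)  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the text
    return chunks

# The embed-and-index step then looks roughly like (not run here):
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   vecs = model.encode(chunks)                 # (n, 384) for this model
#   index = faiss.IndexFlatL2(vecs.shape[1])
#   index.add(vecs.astype("float32"))           # FAISS expects float32
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.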
3. Query:

   - Single: `python asker.py --question "What is IANA's role in DNS?" --top_k 3`
   - Interactive: `python asker.py --interactive` (type questions, "exit" to quit).
   - Expected: answer + sources + timings. Off-topic questions get "Not enough information".
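The retrieve-then-answer logic in `asker.py`, including the off-topic refusal, can be sketched with NumPy standing in for FAISS (the `min_score` threshold and function names are illustrative assumptions):

```python
import numpy as np

def retrieve(question_vec, chunk_vecs, top_k=3):
    """Return (indices, scores) of the top_k chunks by cosine similarity."""
    q = question_vec / np.linalg.norm(question_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(scores)[::-1][:top_k]
    return order, scores[order]

def answer(question_vec, chunk_vecs, chunks, top_k=3, min_score=0.3):
    """Refuse when even the best chunk is a poor match; otherwise return the context."""
    idx, scores = retrieve(question_vec, chunk_vecs, top_k)
    if scores[0] < min_score:
        return "Not enough information"
    # In asker.py this context would go into the LLM prompt along with source citations.
    return "\n".join(chunks[i] for i in idx)
```

FAISS performs the same nearest-neighbour search far faster; the refusal is just a threshold on the best retrieval score.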
A full example run:

```
python crawler.py --start_url https://www.iana.org --max_pages 5
python indexer.py
python asker.py --question "What are IANA's main activities?"
```

- Answer: grounded in the DNS, IP, and protocol content from the crawled pages.
- Args:
  - Crawler: `--max_depth 1`, `--max_pages 3` for tiny tests.
  - Indexer: `--chunk_overlap 50`, `--batch_size` (if RAM allows).
  - Asker: `--top_k 5`; integrate a custom LLM prompt.
- Eval: Batch questions in a script; check refusals/latency.
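A small harness for the eval step can batch questions, time each query, and count refusals. The `ask` argument is a stub standing in for a call into `asker.py`:

```python
import time

REFUSAL = "Not enough information"

def evaluate(questions, ask):
    """Run each question through ask(), recording latency and refusals."""
    results = []
    for q in questions:
        t0 = time.perf_counter()
        ans = ask(q)
        latency = time.perf_counter() - t0
        results.append({"question": q, "answer": ans,
                        "refused": ans == REFUSAL, "latency_s": latency})
    refusal_rate = sum(r["refused"] for r in results) / len(results)
    return results, refusal_rate
```

Feeding it a mix of on-topic and off-topic questions gives both the latency numbers and the refusal behaviour an assignment demo needs.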
- Reset: `rm data/*` to start fresh.
- RAM crashes: use `--chunk_size 256`, close other apps, add swap (Windows: Settings > Virtual Memory).
- No answer: crawl deeper (`--max_depth 3`) and make sure the question matches the crawled content.
- Errors: check the logs; install missing libs.
- Slow: embeddings are CPU-bound; 10-30 s for a small crawl is normal.
- Local only: no internet is needed for the LLM (Ollama runs offline).
- Scale: Best for <50 pages; larger needs GPU/batching tweaks.
- Ethics: Crawl public sites respectfully.
MIT. Free for educational/commercial use.
Questions? Open an issue. Built for IR/RAG assignments; extend it with hybrid search!