jeevanshajujohn/rag_crawler

Local RAG Web Query System

A lightweight, privacy-focused Retrieval-Augmented Generation (RAG) pipeline for crawling, indexing, and querying web content locally. Perfect for grounded AI responses on specific sites (e.g., IANA.org) without external APIs. Runs entirely on your machine using CPU-friendly tools.

Features

  • Crawler: Scrapes pages ethically (respects robots.txt), extracts text, limits depth/pages.
  • Indexer: Chunks text, embeds with Sentence Transformers, stores in FAISS for fast similarity search.
  • Asker: Retrieves relevant chunks, generates answers via local LLM (e.g., Ollama), cites sources, refuses off-topic queries.
  • Low-Resource: Optimized for 8GB RAM systems; per-chunk processing avoids crashes.
  • Eval-Friendly: Measures latency, handles refusals for assignment demos.
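The crawler's robots.txt handling (first bullet) can be sketched with the standard library's urllib.robotparser. The rule strings below are illustrative only, not taken from this repo's crawler.py:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch
# https://example.org/robots.txt before requesting any page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the robots.txt rules permit fetching `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed("https://example.org/index.html"))  # True
print(allowed("https://example.org/private/x"))   # False
```

Checking `allowed()` before every request is what makes the crawl "ethical" in the sense the feature list uses.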

Tech Stack

  • Python 3.8+
  • BeautifulSoup (crawling)
  • Sentence Transformers (embeddings)
  • FAISS (vector search)
  • tqdm, logging (progress)
  • Optional: Ollama/HuggingFace for generation in asker.py

Requirements

Install via pip:

pip install beautifulsoup4 requests sentence-transformers faiss-cpu tqdm numpy torch
  • Note: pickle ships with the Python standard library on 3.8+; the third-party pickle5 backport is only for older versions and is not needed here.
  • For LLM: Install Ollama (ollama run llama3) or use HuggingFace Transformers.
  • Model: all-MiniLM-L6-v2 downloads on first run (~90MB).

Setup

  1. Clone repo: git clone <repo-url>; cd rag_crawler
  2. Create data/ folder: mkdir data
  3. (Optional) Edit scripts for custom LLM in asker.py.

Usage

Run sequentially:

  1. Crawl (e.g., IANA site, max 5 pages):

    python crawler.py --start_url https://www.iana.org --max_pages 5 --max_depth 2
    
    • Output: data/pages.json (raw text per URL).
    • Tips: Small limits prevent overload; check console logs.
  2. Index (embed and build FAISS):

    python indexer.py --chunk_size 256
    
    • Output: data/index.faiss, data/metadata.pkl.
    • JSON result: e.g., {"vector_count": 45, "errors": []}.
    • For low RAM: use a smaller --chunk_size, or a lighter model via --embedding_model sentence-transformers/paraphrase-MiniLM-L3-v2.
  3. Query:

    • Single: python asker.py --question "What is IANA's role in DNS?" --top_k 3
    • Interactive: python asker.py --interactive (type questions, "exit" to quit).
    • Expected: Answer + sources + timings. Off-topic: "Not enough information".
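The --chunk_size/--chunk_overlap options in step 2 suggest fixed-size overlapping chunks. A minimal sketch of that idea follows; the real indexer.py may split differently (e.g. by tokens or sentences rather than words):

```python
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 50) -> list:
    """Split `text` into chunks of `chunk_size` words, each sharing
    `overlap` words with the previous chunk so context isn't cut mid-thought."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(600))
parts = chunk_text(sample, chunk_size=256, overlap=50)
print(len(parts))  # 3 chunks for a 600-word document
```

Overlap trades a little index size for better recall when an answer spans a chunk boundary.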

Example Workflow (IANA Demo)

python crawler.py --start_url https://www.iana.org --max_pages 5
python indexer.py
python asker.py --question "What are IANA's main activities?"
  • Answer: grounded in DNS, IP addressing, and protocol parameters from the crawled pages.

Customization

  • Args:
    • Crawler: --max_depth 1, --max_pages 3 for tiny tests.
    • Indexer: --chunk_overlap 50, --batch_size (if RAM allows).
    • Asker: --top_k 5, integrate custom LLM prompt.
  • Eval: Batch questions in a script; check refusals/latency.
  • Reset: rm data/* to start fresh.
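The "Eval" bullet can be scripted as below. Here ask() is a stub standing in for whatever call wraps asker.py (for instance a subprocess invocation); only the refusal string matches the "Not enough information" behavior described under Usage:

```python
import time

REFUSAL = "Not enough information"

def ask(question: str) -> str:
    """Stub for the real asker; replace with a call into asker.py."""
    if "weather" in question:
        return REFUSAL
    return "IANA coordinates DNS, IP addressing, and protocol parameters."

def evaluate(questions):
    """Run each question, recording latency and whether it was refused."""
    results = []
    for q in questions:
        start = time.perf_counter()
        answer = ask(q)
        results.append({
            "question": q,
            "refused": answer.startswith(REFUSAL),
            "latency_s": time.perf_counter() - start,
        })
    return results

report = evaluate(["What is IANA's role in DNS?", "What's the weather today?"])
refusal_rate = sum(r["refused"] for r in report) / len(report)
print(f"refusal rate: {refusal_rate:.0%}")  # 50%
```

Swapping the stub for a real call gives per-question latency and refusal counts for an assignment demo.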

Troubleshooting

  • RAM Crashes: Use --chunk_size 256, close apps, add swap (Windows: Settings > Virtual Memory).
  • No Answer: Deeper crawl (--max_depth 3), ensure question matches content.
  • Errors: Check logs; install missing libs.
  • Slow: CPU-bound; embedding is normally the bottleneck (10-30 s for a small crawl).

Limitations

  • Local only: the LLM runs without internet access (use Ollama offline).
  • Scale: best for <50 pages; larger corpora need GPU support or batching tweaks.
  • Ethics: Crawl public sites respectfully.

License

MIT. Free for educational/commercial use.

Questions? Open an issue. Built for IR/RAG assignments; extend it to hybrid search!
