A lightweight, privacy-focused Retrieval-Augmented Generation (RAG) pipeline for crawling, indexing, and querying web content locally. Perfect for grounded AI responses on specific sites (e.g., IANA.org) without external APIs. Runs entirely on your machine using CPU-friendly tools.
- Crawler: Scrapes pages ethically (respects robots.txt), extracts text, limits depth/pages.
- Indexer: Chunks text, embeds with Sentence Transformers, stores in FAISS for fast similarity search.
- Asker: Retrieves relevant chunks, generates answers via local LLM (e.g., Ollama), cites sources, refuses off-topic queries.
- Low-Resource: Optimized for 8GB RAM systems; per-chunk processing avoids crashes.
- Eval-Friendly: Measures latency, handles refusals for assignment demos.
- Python 3.8+
- BeautifulSoup (crawling)
- Sentence Transformers (embeddings)
- FAISS (vector search)
- tqdm, logging (progress)
- Optional: Ollama or Hugging Face Transformers for generation in `asker.py`
Install via pip:

```
pip install beautifulsoup4 requests sentence-transformers faiss-cpu tqdm numpy torch
```

(`pickle` and `logging` are in the standard library on Python 3.8+, so they need no install.)
- For the LLM: install Ollama (`ollama run llama3`) or use Hugging Face Transformers.
- Model: `all-MiniLM-L6-v2` downloads on first run (~90 MB).
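If you wire `asker.py` to Ollama, generation can go through Ollama's local HTTP API (default port 11434, `POST /api/generate`). A minimal sketch, assuming the server is already running via `ollama run llama3`; the function names are illustrative, not part of this repo:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(prompt, model="llama3"):
    """Request body for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="llama3"):
    """Send a prompt to the local Ollama server and return the generated text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Using the standard library here keeps `asker.py` dependency-free; swapping in `requests` works the same way.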
- Clone the repo: `git clone <repo-url>; cd rag-project`
- Create the `data/` folder: `mkdir data`
- (Optional) Edit `asker.py` to plug in a custom LLM.
Run sequentially:
1. Crawl (e.g., the IANA site, max 5 pages):

   ```
   python crawler.py --start_url https://www.iana.org --max_pages 5 --max_depth 2
   ```

   - Output: `data/pages.json` (raw text per URL).
   - Tips: small limits prevent overload; check the console logs.
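The crawler's two core concerns, robots.txt compliance and text extraction, can be sketched with the standard library alone (the project itself uses `requests` + BeautifulSoup; the function names here are illustrative):

```python
import urllib.robotparser
from html.parser import HTMLParser

def allowed_by_robots(robots_url, page_url, agent="*"):
    """Check robots.txt before fetching a page (stdlib robotparser)."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetches and parses robots.txt
    return rp.can_fetch(agent, page_url)

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Flatten an HTML page to whitespace-joined visible text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

BeautifulSoup's `get_text()` does the extraction step in one call; the class above just makes the skip-scripts logic explicit.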
2. Index (embed and build the FAISS index):

   ```
   python indexer.py --chunk_size 256
   ```

   - Output: `data/index.faiss`, `data/metadata.pkl`.
   - JSON result, e.g. `{"vector_count": 45, "errors": []}`.
   - For low RAM: use a smaller `--chunk_size` or a lighter model, e.g. `--embedding_model sentence-transformers/paraphrase-MiniLM-L3-v2`.
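Indexing boils down to: split each page into overlapping chunks, embed them, and add the vectors to a FAISS index. The chunking part, which `--chunk_size` and `--chunk_overlap` control, can be sketched in pure Python (word-based windows are an assumption; the real script may split differently):

```python
def chunk_text(text, chunk_size=256, chunk_overlap=50):
    """Split text into word windows of chunk_size words, overlapping by chunk_overlap."""
    words = text.split()
    step = max(chunk_size - chunk_overlap, 1)  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the text
    return chunks

# The embed-and-index step then looks roughly like (not run here):
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   vecs = model.encode(chunks)                 # (n, 384) for this model
#   index = faiss.IndexFlatL2(vecs.shape[1])
#   index.add(vecs.astype("float32"))           # FAISS expects float32
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.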
3. Query:

   - Single: `python asker.py --question "What is IANA's role in DNS?" --top_k 3`
   - Interactive: `python asker.py --interactive` (type questions, "exit" to quit).
   - Expected: answer + sources + timings. Off-topic questions get "Not enough information".
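The retrieve-then-answer logic in `asker.py`, including the off-topic refusal, can be sketched with NumPy standing in for FAISS (the `min_score` threshold and function names are illustrative assumptions):

```python
import numpy as np

def retrieve(question_vec, chunk_vecs, top_k=3):
    """Return (indices, scores) of the top_k chunks by cosine similarity."""
    q = question_vec / np.linalg.norm(question_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(scores)[::-1][:top_k]
    return order, scores[order]

def answer(question_vec, chunk_vecs, chunks, top_k=3, min_score=0.3):
    """Refuse when even the best chunk is a poor match; otherwise return the context."""
    idx, scores = retrieve(question_vec, chunk_vecs, top_k)
    if scores[0] < min_score:
        return "Not enough information"
    # In asker.py this context would go into the LLM prompt along with source citations.
    return "\n".join(chunks[i] for i in idx)
```

FAISS performs the same nearest-neighbour search far faster; the refusal is just a threshold on the best retrieval score.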
A full example run:

```
python crawler.py --start_url https://www.iana.org --max_pages 5
python indexer.py
python asker.py --question "What are IANA's main activities?"
```

- Answer: grounded in the DNS, IP, and protocol content from the crawled pages.
- Args:
  - Crawler: `--max_depth 1`, `--max_pages 3` for tiny tests.
  - Indexer: `--chunk_overlap 50`, `--batch_size` (if RAM allows).
  - Asker: `--top_k 5`; integrate a custom LLM prompt.
- Eval: Batch questions in a script; check refusals/latency.
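A small harness for the eval step can batch questions, time each query, and count refusals. The `ask` argument is a stub standing in for a call into `asker.py`:

```python
import time

REFUSAL = "Not enough information"

def evaluate(questions, ask):
    """Run each question through ask(), recording latency and refusals."""
    results = []
    for q in questions:
        t0 = time.perf_counter()
        ans = ask(q)
        latency = time.perf_counter() - t0
        results.append({"question": q, "answer": ans,
                        "refused": ans == REFUSAL, "latency_s": latency})
    refusal_rate = sum(r["refused"] for r in results) / len(results)
    return results, refusal_rate
```

Feeding it a mix of on-topic and off-topic questions gives both the latency numbers and the refusal behaviour an assignment demo needs.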
- Reset: `rm data/*` to start fresh.
- RAM crashes: use `--chunk_size 256`, close other apps, add swap (Windows: Settings > Virtual Memory).
- No answer: crawl deeper (`--max_depth 3`) and make sure the question matches the crawled content.
- Errors: check the logs; install missing libs.
- Slow: embeddings are CPU-bound; 10-30 s for a small crawl is normal.
- Local only: no internet is needed for the LLM (Ollama runs offline).
- Scale: Best for <50 pages; larger needs GPU/batching tweaks.
- Ethics: Crawl public sites respectfully.
MIT. Free for educational/commercial use.
Questions? Open an issue. Built for IR/RAG assignments; extend it with hybrid search!