Skip to content

Latest commit

 

History

History
47 lines (31 loc) · 1.07 KB

File metadata and controls

47 lines (31 loc) · 1.07 KB

ranked

Build a local RAG index from EXA search results. The pipeline targets healthcare businesses and NYC restaurants, crawls site content, chunks it, embeds it, and stores the vectors in a local Chroma DB.

Requirements

  • Python 3.12+
  • uv package manager
  • EXA API key

Setup

Create a virtual environment and install dependencies:

uv venv
uv pip install -e .

Set your EXA API key in .env:

EXA_API_KEY=your_key_here

Optional: add OPENAI_API_KEY to enable entity resolution for directory/aggregator pages and follow-up searches for official business websites.

Run

Using just:

just scrape --output-dir rag_index --collection exa_rag

Or directly:

.venv/bin/python main.py --output-dir rag_index --collection exa_rag

Notes

  • Crawling is limited by --max-pages-per-domain and --max-total-pages.
  • Business targets can be set with --target-healthcare and --target-nyc-restaurants.
  • Raw crawled pages are written to rag_pages.json.
  • The raw crawled pages are written to rag_pages.json for inspection.