RAG Module

A Retrieval-Augmented Generation (RAG) module that extracts structured insights from personal documents such as CVs, resumes, publications, and technical reports.

Key Components

Semantic Intelligence: Uses sentence-transformers/all-mpnet-base-v2 to embed document text for semantic similarity search.
Diverse Retrieval: Implements Maximal Marginal Relevance (MMR) to provide non-redundant, diverse information from complex documents.
Structured Extraction: Heuristic engine that categorizes findings into Technical Areas, Core Skills, Key Concepts, and Multi-word Phrases.
High Performance: Uses FAISS (Facebook AI Similarity Search) for fast local vector search.
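
As an illustration of the Diverse Retrieval component, here is a minimal sketch of MMR selection in plain Python. It is not taken from rag_tool.py: the function names, the `lambda_mult` parameter, and the toy 2-D vectors are assumptions made for this example, and a real pipeline would operate on the embedding vectors returned by the FAISS search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, chunk_vecs, k=3, lambda_mult=0.5):
    """Pick up to k chunk indices, trading relevance against redundancy.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalize chunks similar to ones already selected.
    (Hypothetical helper, not the module's actual API.)
    """
    relevance = [cosine(query_vec, v) for v in chunk_vecs]
    selected = []
    candidates = list(range(len(chunk_vecs)))

    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any chunk already chosen.
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors: chunks 0 and 1 are near-duplicates, chunk 2 covers other content.
query = [1.0, 0.0]
chunks = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]

diverse = mmr_select(query, chunks, k=2, lambda_mult=0.3)
relevance_only = mmr_select(query, chunks, k=2, lambda_mult=1.0)
```

With a low `lambda_mult`, the second pick skips the near-duplicate chunk in favor of the less similar one. With `lambda_mult=1.0`, the selection falls back to plain relevance ranking.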

Quick Start

# 1. Add your documents to data/personal_docs/
# 2. Run automated setup and ingestion
.\scripts\setup_and_run.bat

Folder Structure

rag/
├── src/                      # Core Logic
│   ├── ingest_documents.py   # Document ingestion & FAISS index builder
│   ├── rag_tool.py           # CrewAI-compatible extraction tool
│   ├── inspect_rag.py        # Utility to verify index quality
│   └── logging_config.py     # Centralized logging engine
├── config/                   # Configuration
│   └── config.yaml           # Extraction & Search parameters
├── scripts/                  # Automation
│   └── setup_and_run.bat     # One-click environment & data setup
├── docs/                     # Documentation
│   ├── SETUP.md              # Detailed installation guide
│   └── EXTERNAL_STORAGE.md   # Privacy & External storage guide
├── data/                     # Data Storage (gitignored)
│   ├── personal_docs/        # Input sources
│   └── vector_db/            # Local FAISS index
├── logs/                     # Execution logs
└── requirements.txt          # Dependency manifest

Features

✓ Multi-format Support: PDF, DOCX, TXT, and MD ingestion.
✓ Deterministic Metadata: Tracks source files and document types for every insight.
✓ MMR Search: Reduces repeated information by penalizing retrieved chunks that closely resemble ones already selected.
✓ Performance Optimized: Includes hf_xet for accelerated model downloads.
✓ Local & Private: No external API calls are made for document processing or storage.
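
To make the Multi-format Support and Deterministic Metadata features concrete, the sketch below shows one way a per-chunk metadata record could be derived from a file path. The mapping `SUPPORTED_TYPES`, the function `build_metadata`, and the record fields are hypothetical names chosen for this example. The actual schema used by ingest_documents.py may differ.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a document-type label.
SUPPORTED_TYPES = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".txt": "text",
    ".md": "markdown",
}


def build_metadata(path, chunk_index):
    """Return a metadata record for one chunk, or None for unsupported files."""
    file_path = Path(path)
    doc_type = SUPPORTED_TYPES.get(file_path.suffix.lower())
    if doc_type is None:
        return None
    # The same path and chunk index always produce the same record.
    return {
        "source": file_path.name,
        "doc_type": doc_type,
        "chunk": chunk_index,
    }


record = build_metadata("data/personal_docs/resume.PDF", 0)
skipped = build_metadata("data/personal_docs/notes.rtf", 0)
```

Because the record depends only on the file path and chunk position, re-ingesting the same documents would yield identical metadata, which is one reading of "deterministic" here.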

Requirements

Python 3.8+
uv package manager (recommended for faster dependency installation)
Windows (scripts optimized for PowerShell/CMD)

Documentation

Setup Guide - Getting started from scratch.
External Storage Guide - How to keep your data outside the Git repo.
Retrieval Mechanics - Detailed look at how the extraction works.
