RAG Module

A Retrieval-Augmented Generation (RAG) system that extracts structured insights from personal documents (CVs, resumes, publications, and technical reports) using local embeddings and local vector search.

Key Components

  • Semantic Intelligence: Powered by sentence-transformers/all-mpnet-base-v2, which embeds each text chunk as a 768-dimensional vector for semantic matching.
  • Diverse Retrieval: Implements Maximal Marginal Relevance (MMR), which weighs each chunk's relevance to the query against its similarity to chunks already selected, so results are diverse rather than redundant.
  • Structured Extraction: Heuristic engine that categorizes findings into Technical Areas, Core Skills, Key Concepts, and Multi-word Phrases.
  • High Performance: Uses FAISS (Facebook AI Similarity Search) for fast local vector search.
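To illustrate the Diverse Retrieval component, here is a minimal pure-Python sketch of MMR selection. It is not the code in rag_tool.py: the names mmr_select and lambda_mult, and the use of cosine similarity, are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 3,
    lambda_mult: float = 0.5,  # hypothetical knob: 1.0 = pure relevance, 0.0 = pure diversity
) -> List[int]:
    """Return indices of up to k chunks chosen greedily by MMR."""
    relevance = [cosine(query_vec, v) for v in chunk_vecs]
    selected: List[int] = []
    remaining = list(range(len(chunk_vecs)))

    while remaining and len(selected) < k:

        def score(i: int) -> float:
            # Penalise a candidate by its highest similarity to anything already picked.
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected


if __name__ == "__main__":
    # Chunks 0 and 1 are near-duplicates; chunk 2 covers a different direction.
    chunks = [[1.0, 0.0], [1.0, 0.05], [0.5, 0.8]]
    print(mmr_select([1.0, 0.5], chunks, k=2, lambda_mult=0.5))
    print(mmr_select([1.0, 0.5], chunks, k=2, lambda_mult=1.0))
```

With lambda_mult=0.5 the second pick skips the near-duplicate in favor of the more distinct chunk; with lambda_mult=1.0 the selection reduces to plain relevance ranking.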

Quick Start

# 1. Add your documents to data/personal_docs/
# 2. Run automated setup and ingestion
.\scripts\setup_and_run.bat

Folder Structure

rag/
├── src/                      # Core Logic
│   ├── ingest_documents.py   # Document ingestion & FAISS index builder
│   ├── rag_tool.py           # CrewAI-compatible extraction tool
│   ├── inspect_rag.py        # Utility to verify index quality
│   └── logging_config.py     # Centralized logging engine
├── config/                   # Configuration
│   └── config.yaml           # Extraction & Search parameters
├── scripts/                  # Automation
│   └── setup_and_run.bat     # One-click environment & data setup
├── docs/                     # Documentation
│   ├── SETUP.md              # Detailed installation guide
│   └── EXTERNAL_STORAGE.md   # Privacy & External storage guide
├── data/                     # Data Storage (gitignored)
│   ├── personal_docs/        # Input sources
│   └── vector_db/            # Local FAISS index
├── logs/                     # Execution logs
└── requirements.txt          # Dependency manifest

Features

  • Multi-format Support: PDF, DOCX, TXT, and MD ingestion.
  • Deterministic Metadata: Tracks source files and document types for every insight.
  • MMR Search: Prevents "information loops" by ensuring retrieved chunks are distinct.
  • Performance Optimized: Includes hf_xet for accelerated model downloads.
  • Local & Private: No external API calls are made for document processing or storage.
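As a rough sketch of how multi-format discovery and deterministic metadata could fit together, the snippet below walks a folder, keeps only the four supported extensions, and emits one metadata record per file in sorted order. The function name discover_documents, the record keys source and doc_type, and the type labels are hypothetical and are not taken from ingest_documents.py.

```python
from pathlib import Path
from typing import Dict, List

# Extensions come from the Features list; the label values are illustrative only.
SUPPORTED_TYPES = {".pdf": "pdf", ".docx": "docx", ".txt": "text", ".md": "markdown"}


def discover_documents(root: Path) -> List[Dict[str, str]]:
    """Return one metadata record per supported file under root, in sorted order."""
    records: List[Dict[str, str]] = []
    # Sorting the paths makes the output order reproducible across runs.
    for path in sorted(root.rglob("*")):
        doc_type = SUPPORTED_TYPES.get(path.suffix.lower())
        if path.is_file() and doc_type is not None:
            records.append(
                {"source": path.relative_to(root).as_posix(), "doc_type": doc_type}
            )
    return records


if __name__ == "__main__":
    for record in discover_documents(Path("data/personal_docs")):
        print(record)
```

Attaching a record like this to every chunk derived from a file is one way each extracted insight could be traced back to its source document.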

Requirements

  • Python 3.8+
  • uv package manager (recommended for faster setup)
  • Windows (scripts optimized for PowerShell/CMD)

Documentation