The `scripts` folder contains Python scripts that implement the core logic of the project, particularly around document processing, embedding, and retrieval for a Retrieval-Augmented Generation (RAG) system.
- `__init__.py`: Marks the `scripts` folder as a Python package.
The folder is organized into several subdirectories, each responsible for a specific aspect of the processing pipeline:
- `agents/`: Contains scripts related to AI agents or agentic behavior within the RAG system.
  - `__init__.py`: Marks the `agents` folder as a Python package.
  - For more details, see `scripts/agents/README.md`.
- `api_clients/`: Houses clients for interacting with external APIs, such as OpenAI.
  - `openai/`:
    - `batch_embedder.py`: Implements `BatchEmbedder` for submitting large embedding jobs to OpenAI's asynchronous `/v1/batches` API. It handles JSONL file preparation, batch submission, status polling, result downloading, and parsing.
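
The JSONL preparation step can be sketched roughly as follows. This is a minimal illustration, not the actual `BatchEmbedder` code: the helper name `build_batch_lines`, the `chunk-{i}` ID scheme, and the model name are assumptions made for the example. Each line follows the request shape the Batch API expects for embedding jobs (`custom_id`, `method`, `url`, `body`).

```python
import json


def build_batch_lines(texts, model="text-embedding-3-small"):
    """Return one JSON string per text, shaped as a Batch API embedding request.

    Hypothetical helper: the real BatchEmbedder may build its file differently.
    """
    lines = []
    for i, text in enumerate(texts):
        request = {
            "custom_id": f"chunk-{i}",  # assumed ID scheme, used to match results later
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request))
    return lines


batch_lines = build_batch_lines(["first chunk", "second chunk"])
jsonl_payload = "\n".join(batch_lines)  # contents of the .jsonl file to upload
```

The `custom_id` matters because batch results are not guaranteed to come back in submission order, so each result has to be matched to its chunk by ID.
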
- `chunking/`: Contains modules for splitting documents into smaller, manageable chunks. This is a crucial step for RAG.
  - `__init__.py`: Marks the `chunking` folder as a Python package.
  - `models.py`: Defines `Chunk` and `Doc` dataclasses for representing text chunks and documents.
  - `rules_v3.py`: Defines the `ChunkRule` dataclass and functions (`get_rule`, `get_all_rules`) to load and manage chunking rules from `configs/chunk_rules.yaml`. These rules specify strategy, token limits, and overlap.
  - `chunker_v3.py`: Implements the main `split` function for chunking text based on rules from `rules_v3.py`. It supports various strategies (e.g., "by_paragraph", "by_slide", "by_email_block"), uses `spacy` for sentence splitting in emails, and handles merging small chunks and overlaps.
  - For more details, see `scripts/chunking/README.md`.
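
To make the interplay of token limits, merging, and overlap concrete, here is a simplified sketch of a paragraph-based strategy. It is not the `split` function from `chunker_v3.py`: the function name `split_by_paragraph`, its parameters, and the use of whitespace-separated words as "tokens" are all assumptions for illustration.

```python
def split_by_paragraph(text, max_tokens=50, overlap=0):
    """Greedily pack paragraphs into chunks of at most max_tokens words.

    Small paragraphs are merged into the current chunk until the limit would
    be exceeded. When a chunk is closed, its last `overlap` words are carried
    into the next chunk. A single paragraph longer than max_tokens is kept
    whole, so a chunk can exceed the limit in that case.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(current)
            # Carry the tail of the closed chunk forward as overlap.
            current = current[-overlap:] if overlap else []
        current = current + words
    if current:
        chunks.append(current)
    return [" ".join(chunk) for chunk in chunks]


sample = "a b c\n\nd e\n\nf g h i"
chunks = split_by_paragraph(sample, max_tokens=5, overlap=1)
```

In this example the first two paragraphs fit within the five-word limit and are merged; the third starts a new chunk that begins with the last word of the previous one.
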
- `core/`: Contains core project management and configuration scripts.
  - `__init__.py`: Contains an older or alternative `ProjectManager` class definition.
  - `project_manager.py`: Defines the primary `ProjectManager` class, responsible for managing the RAG project workspace, including paths to configuration (`config.yml`), input/output directories, logs, FAISS indexes, and metadata files. It ensures these directories exist.
  - For more details, see `scripts/core/README.md`.
- `embeddings/`: Contains scripts for generating and managing text embeddings.
  - `__init__.py`: Marks the `embeddings` folder as a Python package.
  - `base.py`: Defines an abstract base class `BaseEmbedder` with an `encode` method.
  - `bge_embedder.py`: Implements `BGEEmbedder`, a concrete embedder using SentenceTransformers (e.g., "BAAI/bge-large-en").
  - `litellm_embedder.py`: Implements `LiteLLMEmbedder` for generating embeddings via LiteLLM-compatible APIs (OpenAI, Ollama, etc.), using HTTP requests.
  - `embedder_registry.py`: Provides the `get_embedder` function to fetch an embedder instance based on project configuration (e.g., "local" or "litellm").
  - `unified_embedder.py`: Implements `UnifiedEmbedder`, a comprehensive class for embedding chunks. It supports:
    - Deduplication of chunks based on content hashes.
    - Batch embedding using local models or OpenAI's async batch API (via `BatchEmbedder`).
    - Grouping chunks by document type.
    - Storing embeddings in FAISS indexes and metadata in JSONL files, organized by document type.
    - Loading chunks from TSV files.
  - For more details, see `scripts/embeddings/README.md`.
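
Deduplication by content hash can be sketched as below. This is an illustrative version only: the function names, the choice of SHA-256, and the whitespace stripping before hashing are assumptions, and `UnifiedEmbedder` may normalize or hash chunks differently.

```python
import hashlib


def content_hash(text):
    """Hash chunk text after trimming surrounding whitespace (assumed normalization)."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def deduplicate(texts):
    """Keep the first occurrence of each distinct text, preserving input order."""
    seen = set()
    unique = []
    for text in texts:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


unique_texts = deduplicate(["alpha", "beta", "alpha ", "gamma"])
```

Removing duplicates before embedding avoids paying for the same vector twice and keeps repeated chunks from crowding retrieval results.
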
- `index/`: Intended for scripts related to creating, managing, and querying an index of document embeddings.
  - `__init__.py`: Marks the `index` folder as a Python package.
  - For more details, see `scripts/index/README.md`.
- `ingestion/`: Contains modules for loading and parsing various document formats.
  - `__init__.py`: Initializes a `LOADER_REGISTRY` mapping file extensions (e.g., ".pdf", ".docx", ".txt", ".xlsx") to their corresponding loader functions or classes.
  - `models.py`: Defines the `RawDoc` dataclass (for content before chunking), the `AbstractIngestor` base class, and the `UnsupportedFileError` exception.
  - `manager.py`: Defines `IngestionManager`, which orchestrates document ingestion. It recursively searches a path, uses `LOADER_REGISTRY` to find the appropriate loader for each file, and returns a list of `RawDoc` objects.
  - `csv.py`: `load_csv` function to load CSV content as a single string.
  - `docx_loader.py`: `load_docx` function to extract text from `.docx` files, including from tables, using `python-docx`.
  - `email_loader.py`: `load_eml` function to parse `.eml` files and extract plain text content.
  - `pdf.py`: `load_pdf` function to extract text from `.pdf` files using `pdfplumber`, handling encrypted or corrupted files.
  - `pptx.py`: `PptxIngestor` class (subclass of `AbstractIngestor`) to extract text from `.pptx` slides and presenter notes using `python-pptx`.
  - `xlsx.py`: `XlsxIngestor` class (subclass of `AbstractIngestor`) to extract data from `.xlsx` files, grouping rows from each sheet into text chunks using `openpyxl`.
  - For more details, see `scripts/ingestion/README.md`.
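
The extension-based dispatch can be pictured with a small sketch. The names `LOADER_REGISTRY` and `UnsupportedFileError` come from the description above, but their definitions here, along with `load_txt` and `find_loader`, are stand-ins written for this example and are not the project's code.

```python
from pathlib import Path


class UnsupportedFileError(Exception):
    """Raised when no loader is registered for a file extension (stand-in definition)."""


def load_txt(path):
    """Hypothetical plain-text loader."""
    return Path(path).read_text(encoding="utf-8")


# Maps a lowercase file extension to the callable that loads it.
LOADER_REGISTRY = {".txt": load_txt}


def find_loader(path):
    """Look up the loader for a path by its extension, case-insensitively."""
    suffix = Path(path).suffix.lower()
    try:
        return LOADER_REGISTRY[suffix]
    except KeyError:
        raise UnsupportedFileError(f"No loader registered for '{suffix}'") from None
```

With this shape, supporting a new format only requires registering one more entry; the code that walks the directory tree does not need to change.
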
- `prompting/`: Intended for scripts related to constructing prompts for the language model in the RAG system.
  - `__init__.py`: Marks the `prompting` folder as a Python package.
- `retrieval/`: Contains scripts for retrieving relevant chunks from the index based on a query.
  - `__init__.py`: Marks the `retrieval` folder as a Python package.
  - `base.py`: Defines the `BaseRetriever` abstract class and `FaissRetriever` for searching in a FAISS index and its associated metadata. It uses a shared embedder to encode queries. The embedder is imported from `scripts.api_clients.embedder`; this path may need checking, since `get_embedder` is in `scripts.embeddings.embedder_registry`.
  - `retrieval_manager.py`: Implements `RetrievalManager`, which loads retrievers (currently `FaissRetriever`) for different document types and applies retrieval strategies (defined in `scripts.retrieval.strategies`) such as "late_fusion".
  - For more details, see `scripts/retrieval/README_retrieval.md`.
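
As a rough picture of what a late-fusion strategy does, the sketch below queries each document type separately and then merges the per-type hits into one ranked list. It is not the implementation in `scripts.retrieval.strategies`: the function signature, the `(chunk_id, score)` hit format, and the assumption that scores are directly comparable across indexes (higher is better) are all made up for this example.

```python
def late_fusion(results_by_type, top_k=3):
    """Merge per-document-type hits into one list, ranked by score.

    results_by_type maps a document type to a list of (chunk_id, score) pairs.
    Assumes scores from different indexes are comparable and higher is better.
    """
    merged = []
    for doc_type, hits in results_by_type.items():
        for chunk_id, score in hits:
            merged.append((doc_type, chunk_id, score))
    merged.sort(key=lambda hit: hit[2], reverse=True)
    return merged[:top_k]


per_type_hits = {
    "pdf": [("p1", 0.9), ("p2", 0.4)],
    "email": [("e1", 0.7)],
}
top_hits = late_fusion(per_type_hits, top_k=2)
```

If the indexes return distances rather than similarities, or use different metrics, scores would need to be normalized before merging.
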
- `utils/`: Contains utility scripts and helper functions used across the project.
  - `__init__.py`: Marks the `utils` folder as a Python package.
  - `chunk_utils.py`: Provides `deduplicate_chunks` (based on content hashes) and `load_chunks` (from TSV files).
  - `config_loader.py`: Defines `ConfigLoader` for loading and accessing YAML configuration files with support for dot notation.
  - `create_demo_pptx.py`: A script to generate a demo `.pptx` file for testing.
  - `email_utils.py`: `clean_email_text` function to remove quoted lines, reply blocks, and signatures from email text.
  - `logger.py`: Implements `LoggerManager` for creating configured `logging.Logger` instances with console/file output, JSON/text formatting, and optional color. Also includes `JsonLogFormatter`.
  - `msg2email.py`: `msg_to_eml` function to convert Outlook `.msg` files to `.eml` format using `extract_msg`.
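
Dot-notation access to a nested configuration can be illustrated with a short sketch. The helper `get_by_path` and the sample keys are hypothetical; `ConfigLoader` may expose a different interface, and the YAML parsing step is omitted here by starting from an already-loaded dictionary.

```python
def get_by_path(config, dotted_key, default=None):
    """Walk a nested dict using a key such as 'embedding.batch.size'.

    Returns `default` if any segment of the path is missing.
    """
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Stand-in for a parsed YAML file; the keys are invented for the example.
config = {"embedding": {"provider": "local", "batch": {"size": 32}}}
batch_size = get_by_path(config, "embedding.batch.size")
```
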
These scripts work together to form a pipeline: documents are ingested, chunked, converted to embeddings, indexed, and then retrieved to augment prompts for a language model.