hagaybar · Jun 24, 2025
diff --git a/scripts/README.md b/scripts/README.md
@@ -1,20 +1,87 @@
 # Scripts Folder
 
-The `scripts` folder contains Python scripts that implement the core logic of the project, particularly around document processing for a Retrieval Augmented Generation (RAG) system.
+The `scripts` folder contains Python scripts that implement the core logic of the project, particularly document processing, embedding, and retrieval for a Retrieval-Augmented Generation (RAG) system.
 
-- `__init__.py`: This file is empty and is used to mark the `scripts` folder as a Python package.
+- `__init__.py`: Marks the `scripts` folder as a Python package.
 
 The folder is organized into several subdirectories, each responsible for a specific aspect of the processing pipeline:
 
-- **`agents/`**: Likely contains scripts related to AI agents or agentic behavior within the RAG system. Contains a `README.md`.
-- **`chunking/`**: Contains modules for splitting documents into smaller, manageable chunks. This includes `chunker_v2.py` and `chunker_v3.py` for chunking logic, `models.py` for data structures, and `rules.py` and `rules_v3.py` for defining chunking rules.
-- **`core/`**: Contains core project management scripts, including `project_manager.py`.
-- **`embeddings/`**: Contains scripts for generating embeddings for text chunks, including `ChunkEmbedder.py`. Also contains a `README.md`.
-- **`index/`**: Likely holds scripts for creating and managing an index of the embeddings, which is used for efficient retrieval. Contains a `README.md`.
-- **`ingestion/`**: Contains modules for loading and parsing various document formats. This includes loaders for CSV (`csv.py`), DOCX (`docx_loader.py`), email (`email_loader.py`), PDF (`pdf.py`), PowerPoint (`pptx.py`), and XLSX (`xlsx.py`). It also includes `manager.py` to orchestrate the ingestion process and `models.py` for data structures related to ingestion.
-- **`prompting/`**: Intended for scripts related to constructing prompts for the language model in the RAG system. (Currently appears to contain only an `__init__.py`)
-- **`retrieval/`**: Likely contains scripts for retrieving relevant chunks from the index based on a query. (Currently appears to contain only an `__init__.py`)
-- **`tools/`**: Contains utility scripts and tools, such as `create_demo_pptx.py`.
-- **`utils/`**: Contains utility scripts, such as `config_loader.py` for loading configurations, `email_utils.py` for email processing, `logger.py` for logging, and `msg2email.py` for converting MSG files to EML format.
-
-These scripts work together to form a pipeline: documents are ingested, chunked, converted to embeddings, indexed, and then retrieved to augment prompts for a language model. The `app/cli.py` often serves as an interface to these scripts.
+- **`agents/`**:
+    - Contains scripts related to AI agents or agentic behavior within the RAG system.
+    - `__init__.py`: Marks the `agents` folder as a Python package.
+    - For more details, see `scripts/agents/README.md`.
+
+- **`api_clients/`**:
+    - Houses clients for interacting with external APIs, such as OpenAI.
+    - **`openai/`**:
+        - `batch_embedder.py`: Implements `BatchEmbedder` for submitting large embedding jobs to OpenAI's asynchronous `/v1/batches` API. It handles JSONL file preparation, batch submission, status polling, result downloading, and parsing.
+
+- **`chunking/`**:
+    - Contains modules for splitting documents into smaller chunks, the units that are later embedded and retrieved.
+    - `__init__.py`: Marks the `chunking` folder as a Python package.
+    - `models.py`: Defines `Chunk` and `Doc` dataclasses for representing text chunks and documents.
+    - `rules_v3.py`: Defines `ChunkRule` dataclass and functions (`get_rule`, `get_all_rules`) to load and manage chunking rules from `configs/chunk_rules.yaml`. These rules specify strategy, token limits, and overlap.
+    - `chunker_v3.py`: Implements the main `split` function for chunking text based on rules from `rules_v3.py`. It supports various strategies (e.g., "by_paragraph", "by_slide", "by_email_block"), uses `spacy` for sentence splitting in emails, and handles merging small chunks and overlaps.
+    - For more details, see `scripts/chunking/README.md`.
+
+- **`core/`**:
+    - Contains core project management and configuration scripts.
+    - `__init__.py`: Contains an older or alternative `ProjectManager` class definition.
+    - `project_manager.py`: Defines the primary `ProjectManager` class, responsible for managing the RAG project workspace, including paths to configuration (`config.yml`), input/output directories, logs, FAISS indexes, and metadata files. It ensures these directories exist.
+    - For more details, see `scripts/core/README.md`.
+
+- **`embeddings/`**:
+    - Contains scripts for generating and managing text embeddings.
+    - `__init__.py`: Marks the `embeddings` folder as a Python package.
+    - `base.py`: Defines an abstract base class `BaseEmbedder` with an `encode` method.
+    - `bge_embedder.py`: Implements `BGEEmbedder`, a concrete embedder using SentenceTransformers (e.g., "BAAI/bge-large-en").
+    - `litellm_embedder.py`: Implements `LiteLLMEmbedder` for generating embeddings via LiteLLM-compatible APIs (OpenAI, Ollama, etc.), using HTTP requests.
+    - `embedder_registry.py`: Provides the `get_embedder` function, which returns an embedder instance based on the project configuration (e.g., "local" or "litellm").
+    - `unified_embedder.py`: Implements `UnifiedEmbedder`, a comprehensive class for embedding chunks. It supports:
+        - Deduplication of chunks based on content hashes.
+        - Batch embedding using local models or OpenAI's async batch API (via `BatchEmbedder`).
+        - Grouping chunks by document type.
+        - Storing embeddings in FAISS indexes and metadata in JSONL files, organized by document type.
+        - Loading chunks from TSV files.
+    - For more details, see `scripts/embeddings/README.md`.
+
+- **`index/`**:
+    - Intended for scripts related to creating, managing, and querying an index of document embeddings.
+    - `__init__.py`: Marks the `index` folder as a Python package.
+    - For more details, see `scripts/index/README.md`.
+
+- **`ingestion/`**:
+    - Contains modules for loading and parsing various document formats.
+    - `__init__.py`: Initializes a `LOADER_REGISTRY` mapping file extensions (e.g., ".pdf", ".docx", ".txt", ".xlsx") to their corresponding loader functions or classes.
+    - `models.py`: Defines `RawDoc` dataclass (for content before chunking), `AbstractIngestor` base class, and `UnsupportedFileError` exception.
+    - `manager.py`: Defines `IngestionManager` which orchestrates document ingestion. It recursively searches a path, uses `LOADER_REGISTRY` to find the appropriate loader for each file, and returns a list of `RawDoc` objects.
+    - `csv.py`: Provides `load_csv`, which loads CSV content as a single string.
+    - `docx_loader.py`: Provides `load_docx`, which extracts text from `.docx` files, including table contents, using `python-docx`.
+    - `email_loader.py`: Provides `load_eml`, which parses `.eml` files and extracts their plain-text content.
+    - `pdf.py`: Provides `load_pdf`, which extracts text from `.pdf` files using `pdfplumber` and handles encrypted or corrupted files.
+    - `pptx.py`: Provides `PptxIngestor` (a subclass of `AbstractIngestor`), which extracts text from `.pptx` slides and presenter notes using `python-pptx`.
+    - `xlsx.py`: Provides `XlsxIngestor` (a subclass of `AbstractIngestor`), which extracts data from `.xlsx` files using `openpyxl`, grouping rows from each sheet into text chunks.
+    - For more details, see `scripts/ingestion/README.md`.
+
+- **`prompting/`**:
+    - Intended for scripts related to constructing prompts for the language model in the RAG system.
+    - `__init__.py`: Marks the `prompting` folder as a Python package.
+
+- **`retrieval/`**:
+    - Contains scripts for retrieving relevant chunks from the index based on a query.
+    - `__init__.py`: Marks the `retrieval` folder as a Python package.
+    - `base.py`: Defines the `BaseRetriever` abstract class and `FaissRetriever`, which searches a FAISS index and its associated metadata. Queries are encoded with a shared embedder; the import path currently used, `scripts.api_clients.embedder`, should be verified, since `get_embedder` is defined in `scripts.embeddings.embedder_registry`.
+    - `retrieval_manager.py`: Implements `RetrievalManager`, which loads retrievers (currently `FaissRetriever`) for different document types and applies retrieval strategies (defined in `scripts.retrieval.strategies`) such as "late_fusion".
+    - For more details, see `scripts/retrieval/README_retrieval.md`.
+
+- **`utils/`**:
+    - Contains utility scripts and helper functions used across the project.
+    - `__init__.py`: Marks the `utils` folder as a Python package.
+    - `chunk_utils.py`: Provides `deduplicate_chunks` (based on content hashes) and `load_chunks` (from TSV files).
+    - `config_loader.py`: Defines `ConfigLoader` for loading and accessing YAML configuration files with support for dot notation.
+    - `create_demo_pptx.py`: A script to generate a demo `.pptx` file for testing.
+    - `email_utils.py`: Provides `clean_email_text`, which removes quoted lines, reply blocks, and signatures from email text.
+    - `logger.py`: Implements `LoggerManager` for creating configured `logging.Logger` instances with console/file output, JSON/text formatting, and optional color. Also includes `JsonLogFormatter`.
+    - `msg2email.py`: Provides `msg_to_eml`, which converts Outlook `.msg` files to `.eml` format using `extract_msg`.
+
+These scripts work together to form a pipeline: documents are ingested, chunked, converted to embeddings, indexed, and then retrieved to augment prompts for a language model.
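
The first stages of the pipeline described in the README (registry-based ingestion, paragraph chunking, and hash-based deduplication) can be sketched roughly as follows. This is an illustration of the pattern only, not the repository's code. The `RawDoc` fields, `load_txt`, `ingest`, `split_by_paragraph`, and `deduplicate` are hypothetical names; whitespace-separated words stand in for real token counting; and SHA-256 is an assumed choice of content hash. Embedding, indexing, and retrieval are left out because they depend on external libraries.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class RawDoc:
    """Hypothetical stand-in for the README's `RawDoc`; the real fields may differ."""
    content: str
    source: str


def load_txt(path: Path) -> str:
    """Hypothetical loader: return the file's text unchanged."""
    return path.read_text(encoding="utf-8")


# Extension -> loader mapping, in the spirit of the README's LOADER_REGISTRY.
LOADER_REGISTRY: Dict[str, Callable[[Path], str]] = {".txt": load_txt}


def ingest(root: Path) -> List[RawDoc]:
    """Walk `root` recursively and load every file with a registered extension."""
    docs: List[RawDoc] = []
    for path in sorted(root.rglob("*")):
        loader = LOADER_REGISTRY.get(path.suffix.lower())
        if path.is_file() and loader is not None:
            docs.append(RawDoc(content=loader(path), source=str(path)))
    return docs


def split_by_paragraph(text: str, max_tokens: int = 50) -> List[str]:
    """Greedily pack paragraphs into chunks of at most `max_tokens` words.

    A single paragraph longer than `max_tokens` becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def deduplicate(chunks: List[str]) -> List[str]:
    """Drop chunks whose content hash was already seen, keeping first occurrences."""
    seen = set()
    unique: List[str] = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

Keeping the loaders behind a dictionary keyed by file extension means a new format only needs one new entry, which matches how the README describes `IngestionManager` looking up loaders.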