Commit f5f8a8c

Merge pull request #76 from hagaybar/main
add readme files
2 parents: 183086e + 10b7e47

13 files changed (+386, −343 lines)


README.md

Lines changed: 76 additions & 67 deletions
````diff
@@ -1,107 +1,116 @@
-# RAG‑Pipeline Prototype
+# Multi-Source RAG Platform
 
-A **local‑first Retrieval‑Augmented‑Generation platform** for turning mixed organisational content (PDFs, office docs, presentations, emails, images, websites, etc.) into an interactive knowledge base.
-Built for librarians, researchers and other knowledge‑workers who need reliable answers from their in‑house documentation without sending private data to external services.
+This project is a sophisticated, local-first Retrieval-Augmented Generation (RAG) platform designed to transform a diverse range of organizational content—including PDFs, Office documents, emails, and images—into an interactive and searchable knowledge base. It is built for anyone who needs to derive reliable answers from in-house documentation without compromising data privacy.
 
 <p align="center">
-  <img src="https://raw.githubusercontent.com/your‑org/rag‑pipeline/main/docs/architecture_simplified.png" width="600" alt="High-level architecture"/>
+  <img src="docs/architecture.png" width="700" alt="High-level architecture of the RAG platform"/>
 </p>
 
 ---
 
-## ✨ Key Features (v0.1)
+## ✨ Key Features
 
-* **Drag‑and‑drop ingestion** for TXT, PDF, PPTX, DOCX, XLSX/CSV, images, email files and public URLs (up to **1 GB** total).
-* **Source‑aware chunking** – pluggable rule‑sets per file type.
-* **Hybrid embeddings** – choose local HuggingFace models *or* OpenAI API with one click.
-* **FAISS vector search** with late‑fusion across source types.
-* **Streamlit UI** – create projects, upload data, run queries and view logs without touching the CLI.
-* **Answer generation** with citations using your preferred chat‑LLM endpoint.
-* 100 % offline data storage – raw files, vectors and logs stay on your machine.
-
-See the full [Roadmap](docs/rag_prototype_roadmap.md) for detailed design and future milestones.
+* **Multi-Source Ingestion**: Supports a wide variety of file formats, including PDF, DOCX, PPTX, CSV, and TXT. *Please note that XLSX and EML ingestion are still under development.*
+* **Configurable Chunking**: Employs a rule-based chunking system that allows for different strategies (e.g., by paragraph, by slide) to be applied to different document types, ensuring optimal data segmentation.
+* **Flexible Embedding Models**: Easily switch between local, open-source embedding models (via `sentence-transformers`) and powerful API-based models like OpenAI's.
+* **Multi-Modal Retrieval**: Capable of retrieving both text and image-based information. The system can generate textual descriptions for images, making visual content fully searchable.
+* **Advanced Retrieval Strategies**: Uses a late-fusion approach to combine results from multiple sources, ensuring comprehensive and relevant context for every query.
+* **Streamlit UI**: An intuitive user interface for creating and managing projects, uploading documents, and editing configurations.
+* **Command-Line Interface**: A powerful CLI for interacting with the platform, allowing you to ingest documents, generate embeddings, and ask questions directly from your terminal.
+* **Local-First and Secure**: All your data, including raw files, indexes, and logs, is stored locally on your machine, ensuring complete privacy and control.
 
 ---
 
-## 🚀 Quick Start
-
-```bash
-# 1. Clone & install
-git clone https://github.com/your‑org/rag‑pipeline.git
-cd rag‑pipeline
-poetry install   # or: pip install -r requirements.txt
+## 🚀 Getting Started
 
-# 2. Launch Streamlit UI
-poetry run streamlit run app/ui_streamlit.py   # default browser opens
+### Prerequisites
 
-# 3. Create a new project in the UI, upload some PDFs, and ask a question!
-```
+* Python 3.10 or higher
+* Poetry for dependency management
+* An API key for your chosen LLM and embedding providers (e.g., `OPENAI_API_KEY`)
 
-> **Tip:** Prefer local embeddings? Select **bge‑large‑en** under *Settings → Embeddings* before indexing.
+### Installation
 
----
+1. **Clone the repository:**
+   ```bash
+   git clone <repository-url>
+   cd <repository-name>
+   ```
 
-## 💻 UI Usage
+2. **Install the dependencies using Poetry:**
+   ```bash
+   poetry install
+   ```
 
-The Streamlit UI provides a user-friendly interface for managing RAG-GP projects.
+### Usage
 
-### Creating a New Project
+The platform can be operated through the Streamlit UI or the command-line interface.
 
-1. Navigate to the "Projects" section in the sidebar.
-2. Fill out the "Create New Project" form:
-   * **Project Name:** A unique name for your project.
-   * **Project Description:** An optional description of your project.
-   * **Language:** The primary language of your documents.
-   * **Enable Image Enrichment:** Check this box to enable image analysis features.
-   * **Embedding Model:** Select the embedding model to use for your project.
-3. Click the "Create Project" button.
+#### Streamlit UI
 
-### Managing a Project
+To launch the user interface, run the following command:
+```bash
+poetry run streamlit run scripts/ui/ui_project_manager.py
+```
+The UI allows you to:
+- Create new projects.
+- Upload documents.
+- View and edit project configurations.
 
-Once you have created a project, you can manage it from the "Projects" section.
+#### Command-Line Interface
 
-* **Select a Project:** Choose a project from the dropdown menu to view its details.
-* **Configuration Editor:** The `config.yml` file for the selected project is displayed in a text editor. You can make changes to the configuration and save them by clicking the "Save Config" button.
-* **Upload Raw Data:** You can upload raw data files (e.g., .pdf, .docx, .txt) to your project using the file uploader. The files will be saved to the appropriate subdirectory under `data/projects/<project_name>/input/raw/`.
-* **Raw File Repository:** The "Raw File Repository" section displays a list of all the raw data files in your project, grouped by file type.
+The CLI provides a powerful way to interact with the platform. Here is a typical workflow:
 
----
+1. **Ingest and Chunk Documents:**
+   ```bash
+   python -m app.cli ingest /path/to/your/project --chunk
+   ```
 
-## 🗂️ Folder Structure (excerpt)
+2. **Generate Embeddings:**
+   ```bash
+   python -m app.cli embed /path/to/your/project
+   ```
 
-```text
-rag‑pipeline/
-├── app/      # CLI & Streamlit entry‑points
-├── scripts/  # Core library (ingestion, chunking, embeddings…)
-├── configs/  # YAML templates for datasets & tasks
-├── data/     # Your local datasets, chunks & indexes (git‑ignored)
-├── docs/     # Technical docs & design diagrams
-└── tests/    # Pytest suite
-```
+3. **Ask a Question:**
+   ```bash
+   python -m app.cli ask /path/to/your/project "Your question here"
+   ```
 
-For a full tree and design conventions, check the [Codebase Structure](docs/rag_prototype_roadmap.md#10--repository--codebase-structure).
+For more detailed information on the available commands and their options, please refer to the `app/README.md` file.
 
 ---
 
-## 🛠️ Requirements
+## Core Concepts
+
+The platform is built around a modular pipeline that processes your data in several stages:
 
-* **Python 3.12**
-* **Tesseract OCR** (optional, for image ingestion) – install via your OS package manager.
-* For OpenAI or other API models: set `OPENAI_API_KEY` or relevant environment variables.
+1. **Ingestion**: The first step is to ingest your raw documents. The platform provides a suite of loaders that can handle a wide variety of file formats.
+2. **Chunking**: Once ingested, the documents are split into smaller, more manageable chunks. This process is highly configurable and can be tailored to the specific characteristics of each document type.
+3. **Enrichment**: The platform includes an `ImageInsightAgent` that can analyze images and generate textual descriptions for them. This makes visual content searchable and adds another layer of context to your knowledge base.
+4. **Embedding**: The text and image chunks are then converted into numerical representations (embeddings) using a chosen embedding model.
+5. **Indexing**: The embeddings are stored in a local FAISS index, which allows for efficient similarity searches.
+6. **Retrieval**: When you ask a question, the platform uses a late-fusion retrieval strategy to find the most relevant text and image chunks from the index.
+7. **Generation**: The retrieved context is then used to construct a detailed prompt, which is sent to a large language model to generate a final answer.
 
 ---
 
-## 🤝 Contributing
+## 🗂️ Project Structure
+
+The project is organized into the following key directories:
 
-Pull requests are welcome! Please read **CONTRIBUTING.md** (to be added) and open an issue before starting major work.
+- **`app/`**: Contains the command-line interface for the platform. See `app/README.md`.
+- **`assets/`**: A place for static assets. See `assets/README.md`.
+- **`configs/`**: Home to the `chunk_rules.yaml` file, which defines the chunking strategies for different document types. See `configs/README.md`.
+- **`docs/`**: Contains project-related documentation, including architecture diagrams and planning documents. See `docs/README.md`.
+- **`scripts/`**: The heart of the platform, containing the core logic for ingestion, chunking, embedding, retrieval, and more. See `scripts/README.md` for a high-level overview.
+- **`tests/`**: Contains the test suite for the project.
 
 ---
 
-## 📜 License
+## 🤝 Contributing
 
-This project is released under the MIT License © 2025 — see [LICENSE](LICENSE) for details.
+Contributions are welcome! Please feel free to open an issue or submit a pull request.
 
----
+## 📜 License
 
-> *Built with ❤️ and lots of coffee by the Library Innovation Lab.*
-> *“Organisational knowledge belongs at your fingertips.”*
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
````
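The "late-fusion" retrieval mentioned in the new README's Key Features and Core Concepts sections can be sketched in a few lines. This is a minimal illustration of the idea only, not the project's actual implementation: the function name, the per-source min-max score normalization, and the sample chunk ids are all assumptions.

```python
# Illustrative sketch of late-fusion retrieval: ranked results from
# separate per-source indexes (e.g. a text index and an image index)
# are merged into a single context list. Names are hypothetical.

def late_fusion(result_sets, top_k=5):
    """Merge per-source lists of (chunk_id, score) pairs.

    Scores are min-max normalized within each source so that no single
    index dominates, then the union is re-ranked globally.
    """
    merged = []
    for results in result_sets:
        if not results:
            continue
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero for single-hit lists
        merged.extend((cid, (s - lo) / span) for cid, s in results)
    # Python's sort is stable, so earlier sources win ties
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:top_k]

text_hits = [("text-12", 0.82), ("text-4", 0.55)]
image_hits = [("img-3", 0.40), ("img-9", 0.31)]
context = late_fusion([text_hits, image_hits], top_k=3)
# context: [("text-12", 1.0), ("img-3", 1.0), ("text-4", 0.0)]
```

Normalizing within each source before merging is one common way to fuse heterogeneous similarity scores; the real `RetrievalManager` may well use a different scheme.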

app/README.md

Lines changed: 11 additions & 0 deletions
````diff
@@ -84,6 +84,17 @@ The `cli.py` script exposes the following commands:
    * `scripts.agents.image_insight_agent.ImageInsightAgent`: For enriching chunks with image summaries.
    * `scripts.core.project_manager.ProjectManager`: For project context.
 
+7. **`index-images`**
+   * **Description**: Index enriched image summaries into a dedicated FAISS index (`image_index.faiss`) and metadata file (`image_metadata.jsonl`).
+   * **Usage**: `python -m app.cli index-images <project_path> [--doc_type <doc_type>]`
+   * **Arguments**:
+     * `project_path`: (Required) Path to the RAG project directory.
+   * **Options**:
+     * `--doc_type <doc_type>`: (Optional) The document type to read the enriched chunks from (default: "pptx").
+   * **Modules Used**:
+     * `scripts.embeddings.image_indexer.ImageIndexer`: For indexing the image chunks.
+     * `scripts.core.project_manager.ProjectManager`: For project context.
+
 ## Integration with the Project
 
 The `app` folder serves as the user-facing layer of the project. It orchestrates calls to various managers and utilities within the `scripts` directory (e.g., `IngestionManager`, `UnifiedEmbedder`, `RetrievalManager`, `ProjectManager`). This separation allows for a clean distinction between the CLI definition and the underlying implementation of core functionalities.
````
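The `image_metadata.jsonl` file that `index-images` writes alongside `image_index.faiss` follows the common JSON Lines "sidecar" pattern, where line *i* describes vector *i* in the index. A minimal sketch of that pattern, with field names that are illustrative assumptions rather than the project's actual schema:

```python
import json

# Sketch of the JSONL sidecar pattern: one JSON object per line, where
# the i-th line describes the i-th vector in the FAISS index.
# Field names below are hypothetical, not the project's real schema.
records = [
    {"chunk_id": "pptx-0007", "doc_type": "pptx", "summary": "Bar chart of 2024 loans"},
    {"chunk_id": "pptx-0012", "doc_type": "pptx", "summary": "Library floor plan"},
]

with open("image_metadata.jsonl", "w", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

# A search hit at FAISS row i resolves back to its metadata by position:
with open("image_metadata.jsonl", encoding="utf-8") as fh:
    metadata = [json.loads(line) for line in fh]

hit_row = 1  # pretend the index returned row 1
print(metadata[hit_row]["chunk_id"])  # → pptx-0012
```

Because FAISS itself stores only vectors, some positional (or id-mapped) lookup like this is needed to turn search hits back into citable chunks.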

configs/README.md

Lines changed: 19 additions & 11 deletions
````diff
@@ -1,14 +1,22 @@
-# Configs Folder
+# Configurations
 
-The `configs` folder is used to store configuration files for the project.
+This directory stores configuration files that control the behavior of various components of the RAG platform.
 
-- `__init__.py`: This file is empty and is used to mark the `configs` folder as a Python package.
-- `chunk_rules.yaml`: This YAML file defines the rules for chunking different types of documents.
-  - For each document type (e.g., `email`, `docx`, `pdf`), it specifies:
-    - `split_strategy`: The method to use for splitting the document (e.g., `split_on_blank_lines`, `split_on_headings`).
-    - `min_chunk_size`: The minimum desired size for each chunk.
-    - `notes`: Additional information or considerations for processing that document type.
-  - For `pptx` (PowerPoint presentations), it also includes `token_bounds` and `overlap` parameters.
-- `chunk_rules_old.yaml`: This YAML file contains older or deprecated chunking rules.
+## `chunk_rules.yaml`
 
-The `chunk_rules.yaml` file is crucial for the document processing pipeline, as it allows for customized chunking behavior based on file type, ensuring that the content is divided into meaningful segments for further processing (e.g., embedding and retrieval).
+This is the central configuration file for the chunking process. It defines a set of rules that determine how different types of documents are split into smaller chunks. The `scripts/chunking/rules_v3.py` module is responsible for loading and parsing this file.
+
+### Structure
+
+The file is a YAML dictionary where each key is a `doc_type` (e.g., `docx`, `pdf`, `eml`) and the value is a rule object with the following keys:
+
+- **`strategy`**: The name of the strategy to use for splitting the document. This corresponds to one of the strategies implemented in `scripts/chunking/chunker_v3.py` (e.g., `by_paragraph`, `by_slide`, `split_on_rows`).
+- **`min_tokens`**: The minimum number of tokens that a chunk should have.
+- **`max_tokens`**: The maximum number of tokens that a chunk can have.
+- **`overlap`**: The number of tokens to overlap between consecutive chunks.
+
+A `default` rule is also defined, which is used as a fallback for any `doc_type` that does not have a specific rule.
+
+### Purpose
+
+By externalizing the chunking rules into a configuration file, we can easily modify the chunking behavior for different document types without having to change the code. This makes the platform more flexible and easier to maintain. For example, we can define a `by_paragraph` strategy for text-heavy documents like DOCX and PDF, and a `by_slide` strategy for presentations.
````
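Putting the documented structure together, a `chunk_rules.yaml` fragment might look like the following. The keys and strategy names come from the README above; the specific token values are illustrative assumptions, not the project's actual settings:

```yaml
# Hypothetical values; keys follow the structure documented above.
default:            # fallback for any doc_type without its own rule
  strategy: by_paragraph
  min_tokens: 50
  max_tokens: 400
  overlap: 30

pptx:
  strategy: by_slide
  min_tokens: 20
  max_tokens: 300
  overlap: 0

csv:
  strategy: split_on_rows
  min_tokens: 10
  max_tokens: 200
  overlap: 0
```

With this shape, a loader such as `rules_v3.py` can resolve a rule with a simple dictionary fallback (e.g. `rules.get(doc_type, rules["default"])`).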

docs/README.md

Lines changed: 11 additions & 14 deletions
````diff
@@ -1,18 +1,15 @@
-# Docs Folder
+# Documentation
 
-The `docs` folder contains documentation files for the project.
+This directory contains various documents related to the project's design, planning, and architecture.
 
-- `chunk_rules.md`: This file provides a detailed explanation of the chunking strategies used for different document types. It outlines:
-  - Rules by document type (e.g., email, docx, pdf), including split strategy, minimum chunk size, and specific notes.
-  - Definitions for various split strategies (e.g., `split_on_blank_lines`, `split_on_headings`).
-  - Guidelines for including headers and footers.
-  - Special processing notes for certain file types.
-  - Considerations for chunk sizes.
-  This document is a reference for understanding how content is segmented before further processing in the RAG system.
+## Contents
 
-- `ingest.md`: This file describes the available data loaders for ingesting content into the system. It currently details:
-  - Email (`.eml`) loader: Explains how `.eml` files are processed, the function used (`scripts.ingestion.email_loader.load_eml`), what it returns (text content and metadata), and provides a usage example.
-  - DOCX (`.docx`) loader: Explains how `.docx` files are processed, the function used (`scripts.ingestion.docx_loader.load_docx`), what it returns (text content and metadata), how it handles various elements like tables and whitespace, and provides a usage example.
-  This document serves as a guide for developers on how to use the ingestion scripts and what to expect from them.
+- **`architecture.md`**: An overview of the high-level architecture of the RAG platform.
+- **`chunk_rules.md`**: Detailed documentation on the chunking strategies and rules used for different document types.
+- **`ingest.md`**: A guide to the data loaders available for ingesting various file formats.
+- **`roadmap.txt`**: The product roadmap, outlining future features and development plans.
+- **`Second_month_plan.md`**: A document outlining the plan for the second month of development.
+- **`Second_month_plan.pdf`**: A PDF version of the second-month plan.
+- **`new_images_index_plan.docx`**: A DOCX file detailing the plan for implementing the new image indexing functionality.
 
-The `docs` folder is essential for project maintainability and onboarding new developers, providing clear explanations of key components and processes.
+This folder is essential for understanding the project's history, current state, and future direction.
````
