Commit f5f8a8c

Merge pull request #76 from hagaybar/main
add readme files
2 parents: 183086e + 10b7e47

13 files changed (+386, −343 lines)


README.md

Lines changed: 76 additions & 67 deletions
````diff
@@ -1,107 +1,116 @@
-# RAG‑Pipeline Prototype
+# Multi-Source RAG Platform
 
-A **local‑first Retrieval‑Augmented‑Generation platform** for turning mixed organisational content (PDFs, office docs, presentations, emails, images, websites, etc.) into an interactive knowledge base.
-Built for librarians, researchers and other knowledge‑workers who need reliable answers from their in‑house documentation without sending private data to external services.
+This project is a sophisticated, local-first Retrieval-Augmented Generation (RAG) platform designed to transform a diverse range of organizational content—including PDFs, Office documents, emails, and images—into an interactive and searchable knowledge base. It is built for anyone who needs to derive reliable answers from in-house documentation without compromising data privacy.
 
 <p align="center">
-  <img src="https://raw.githubusercontent.com/your‑org/rag‑pipeline/main/docs/architecture_simplified.png" width="600" alt="High-level architecture"/>
+  <img src="docs/architecture.png" width="700" alt="High-level architecture of the RAG platform"/>
 </p>
 
 ---
 
-## ✨ Key Features (v0.1)
+## ✨ Key Features
 
-* **Drag‑and‑drop ingestion** for TXT, PDF, PPTX, DOCX, XLSX/CSV, images, email files and public URLs (up to **1 GB** total).
-* **Source‑aware chunking** – pluggable rule‑sets per file type.
-* **Hybrid embeddings** – choose local HuggingFace models *or* OpenAI API with one click.
-* **FAISS vector search** with late‑fusion across source types.
-* **Streamlit UI** – create projects, upload data, run queries and view logs without touching the CLI.
-* **Answer generation** with citations using your preferred chat‑LLM endpoint.
-* 100 % offline data storage – raw files, vectors and logs stay on your machine.
-
-See the full [Roadmap](docs/rag_prototype_roadmap.md) for detailed design and future milestones.
+* **Multi-Source Ingestion**: Supports a wide variety of file formats, including PDF, DOCX, PPTX, CSV, and TXT. *Please note that XLSX and EML ingestion are still under development.*
+* **Configurable Chunking**: Employs a rule-based chunking system that allows for different strategies (e.g., by paragraph, by slide) to be applied to different document types, ensuring optimal data segmentation.
+* **Flexible Embedding Models**: Easily switch between local, open-source embedding models (via `sentence-transformers`) and powerful API-based models like OpenAI's.
+* **Multi-Modal Retrieval**: Capable of retrieving both text and image-based information. The system can generate textual descriptions for images, making visual content fully searchable.
+* **Advanced Retrieval Strategies**: Uses a late-fusion approach to combine results from multiple sources, ensuring comprehensive and relevant context for every query.
+* **Streamlit UI**: An intuitive user interface for creating and managing projects, uploading documents, and editing configurations.
+* **Command-Line Interface**: A powerful CLI for interacting with the platform, allowing you to ingest documents, generate embeddings, and ask questions directly from your terminal.
+* **Local-First and Secure**: All your data, including raw files, indexes, and logs, is stored locally on your machine, ensuring complete privacy and control.
 
 ---
 
-## 🚀 Quick Start
-
-```bash
-# 1. Clone & install
-git clone https://github.com/your‑org/rag‑pipeline.git
-cd rag‑pipeline
-poetry install   # or: pip install -r requirements.txt
+## 🚀 Getting Started
 
-# 2. Launch Streamlit UI
-poetry run streamlit run app/ui_streamlit.py   # default browser opens
+### Prerequisites
 
-# 3. Create a new project in the UI, upload some PDFs, and ask a question!
-```
+* Python 3.10 or higher
+* Poetry for dependency management
+* An API key for your chosen LLM and embedding providers (e.g., `OPENAI_API_KEY`)
 
-> **Tip:** Prefer local embeddings? Select **bge‑large‑en** under *Settings → Embeddings* before indexing.
+### Installation
 
----
+1. **Clone the repository:**
+   ```bash
+   git clone <repository-url>
+   cd <repository-name>
+   ```
 
-## 💻 UI Usage
+2. **Install the dependencies using Poetry:**
+   ```bash
+   poetry install
+   ```
 
-The Streamlit UI provides a user-friendly interface for managing RAG-GP projects.
+### Usage
 
-### Creating a New Project
+The platform can be operated through the Streamlit UI or the command-line interface.
 
-1. Navigate to the "Projects" section in the sidebar.
-2. Fill out the "Create New Project" form:
-   * **Project Name:** A unique name for your project.
-   * **Project Description:** An optional description of your project.
-   * **Language:** The primary language of your documents.
-   * **Enable Image Enrichment:** Check this box to enable image analysis features.
-   * **Embedding Model:** Select the embedding model to use for your project.
-3. Click the "Create Project" button.
+#### Streamlit UI
 
-### Managing a Project
+To launch the user interface, run the following command:
+```bash
+poetry run streamlit run scripts/ui/ui_project_manager.py
+```
+The UI allows you to:
+- Create new projects.
+- Upload documents.
+- View and edit project configurations.
 
-Once you have created a project, you can manage it from the "Projects" section.
+#### Command-Line Interface
 
-* **Select a Project:** Choose a project from the dropdown menu to view its details.
-* **Configuration Editor:** The `config.yml` file for the selected project is displayed in a text editor. You can make changes to the configuration and save them by clicking the "Save Config" button.
-* **Upload Raw Data:** You can upload raw data files (e.g., .pdf, .docx, .txt) to your project using the file uploader. The files will be saved to the appropriate subdirectory under `data/projects/<project_name>/input/raw/`.
-* **Raw File Repository:** The "Raw File Repository" section displays a list of all the raw data files in your project, grouped by file type.
+The CLI provides a powerful way to interact with the platform. Here is a typical workflow:
 
----
+1. **Ingest and Chunk Documents:**
+   ```bash
+   python -m app.cli ingest /path/to/your/project --chunk
+   ```
 
-## 🗂️ Folder Structure (excerpt)
+2. **Generate Embeddings:**
+   ```bash
+   python -m app.cli embed /path/to/your/project
+   ```
 
-```text
-rag‑pipeline/
-├── app/      # CLI & Streamlit entry‑points
-├── scripts/  # Core library (ingestion, chunking, embeddings…)
-├── configs/  # YAML templates for datasets & tasks
-├── data/     # Your local datasets, chunks & indexes (git‑ignored)
-├── docs/     # Technical docs & design diagrams
-└── tests/    # Pytest suite
-```
+3. **Ask a Question:**
+   ```bash
+   python -m app.cli ask /path/to/your/project "Your question here"
+   ```
 
-For a full tree and design conventions, check the [Codebase Structure](docs/rag_prototype_roadmap.md#10--repository--codebase-structure).
+For more detailed information on the available commands and their options, please refer to the `app/README.md` file.
 
 ---
 
-## 🛠️ Requirements
+## Core Concepts
+
+The platform is built around a modular pipeline that processes your data in several stages:
 
-* **Python 3.12**
-* **Tesseract OCR** (optional, for image ingestion) – install via your OS package manager.
-* For OpenAI or other API models: set `OPENAI_API_KEY` or relevant environment variables.
+1. **Ingestion**: The first step is to ingest your raw documents. The platform provides a suite of loaders that can handle a wide variety of file formats.
+2. **Chunking**: Once ingested, the documents are split into smaller, more manageable chunks. This process is highly configurable and can be tailored to the specific characteristics of each document type.
+3. **Enrichment**: The platform includes an `ImageInsightAgent` that can analyze images and generate textual descriptions for them. This makes visual content searchable and adds another layer of context to your knowledge base.
+4. **Embedding**: The text and image chunks are then converted into numerical representations (embeddings) using a chosen embedding model.
+5. **Indexing**: The embeddings are stored in a local FAISS index, which allows for efficient similarity searches.
+6. **Retrieval**: When you ask a question, the platform uses a late-fusion retrieval strategy to find the most relevant text and image chunks from the index.
+7. **Generation**: The retrieved context is then used to construct a detailed prompt, which is sent to a large language model to generate a final answer.
 
 ---
 
-## 🤝 Contributing
+## 🗂️ Project Structure
+
+The project is organized into the following key directories:
 
-Pull requests are welcome! Please read **CONTRIBUTING.md** (to be added) and open an issue before starting major work.
+- **`app/`**: Contains the command-line interface for the platform. See `app/README.md`.
+- **`assets/`**: A place for static assets. See `assets/README.md`.
+- **`configs/`**: Home to the `chunk_rules.yaml` file, which defines the chunking strategies for different document types. See `configs/README.md`.
+- **`docs/`**: Contains project-related documentation, including architecture diagrams and planning documents. See `docs/README.md`.
+- **`scripts/`**: The heart of the platform, containing the core logic for ingestion, chunking, embedding, retrieval, and more. See `scripts/README.md` for a high-level overview.
+- **`tests/`**: Contains the test suite for the project.
 
 ---
 
-## 📜 License
+## 🤝 Contributing
 
-This project is released under the MIT License © 2025 — see [LICENSE](LICENSE) for details.
+Contributions are welcome! Please feel free to open an issue or submit a pull request.
 
----
+## 📜 License
 
-> *Built with ❤️ and lots of coffee by the Library Innovation Lab.*
-> *“Organisational knowledge belongs at your fingertips.”*
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
````
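The "late-fusion" retrieval mentioned in the new README's Key Features and Core Concepts sections can be sketched in a few lines. This is a minimal illustration of the idea only, not the project's actual implementation: the function name, the per-source min-max score normalization, and the sample chunk ids are all assumptions.

```python
# Illustrative sketch of late-fusion retrieval: ranked results from
# separate per-source indexes (e.g. a text index and an image index)
# are merged into a single context list. Names are hypothetical.

def late_fusion(result_sets, top_k=5):
    """Merge per-source lists of (chunk_id, score) pairs.

    Scores are min-max normalized within each source so that no single
    index dominates, then the union is re-ranked globally.
    """
    merged = []
    for results in result_sets:
        if not results:
            continue
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero for single-hit lists
        merged.extend((cid, (s - lo) / span) for cid, s in results)
    # Python's sort is stable, so earlier sources win ties
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:top_k]

text_hits = [("text-12", 0.82), ("text-4", 0.55)]
image_hits = [("img-3", 0.40), ("img-9", 0.31)]
context = late_fusion([text_hits, image_hits], top_k=3)
# context: [("text-12", 1.0), ("img-3", 1.0), ("text-4", 0.0)]
```

Normalizing within each source before merging is one common way to fuse heterogeneous similarity scores; the real `RetrievalManager` may well use a different scheme.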

app/README.md

Lines changed: 11 additions & 0 deletions
````diff
@@ -84,6 +84,17 @@ The `cli.py` script exposes the following commands:
    * `scripts.agents.image_insight_agent.ImageInsightAgent`: For enriching chunks with image summaries.
    * `scripts.core.project_manager.ProjectManager`: For project context.
 
+7. **`index-images`**
+   * **Description**: Index enriched image summaries into a dedicated FAISS index (`image_index.faiss`) and metadata file (`image_metadata.jsonl`).
+   * **Usage**: `python -m app.cli index-images <project_path> [--doc_type <doc_type>]`
+   * **Arguments**:
+     * `project_path`: (Required) Path to the RAG project directory.
+   * **Options**:
+     * `--doc_type <doc_type>`: (Optional) The document type to read the enriched chunks from (default: "pptx").
+   * **Modules Used**:
+     * `scripts.embeddings.image_indexer.ImageIndexer`: For indexing the image chunks.
+     * `scripts.core.project_manager.ProjectManager`: For project context.
+
 ## Integration with the Project
 
 The `app` folder serves as the user-facing layer of the project. It orchestrates calls to various managers and utilities within the `scripts` directory (e.g., `IngestionManager`, `UnifiedEmbedder`, `RetrievalManager`, `ProjectManager`). This separation allows for a clean distinction between the CLI definition and the underlying implementation of core functionalities.
````
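The `image_metadata.jsonl` file that `index-images` writes alongside `image_index.faiss` follows the common JSON Lines "sidecar" pattern, where line *i* describes vector *i* in the index. A minimal sketch of that pattern, with field names that are illustrative assumptions rather than the project's actual schema:

```python
import json

# Sketch of the JSONL sidecar pattern: one JSON object per line, where
# the i-th line describes the i-th vector in the FAISS index.
# Field names below are hypothetical, not the project's real schema.
records = [
    {"chunk_id": "pptx-0007", "doc_type": "pptx", "summary": "Bar chart of 2024 loans"},
    {"chunk_id": "pptx-0012", "doc_type": "pptx", "summary": "Library floor plan"},
]

with open("image_metadata.jsonl", "w", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

# A search hit at FAISS row i resolves back to its metadata by position:
with open("image_metadata.jsonl", encoding="utf-8") as fh:
    metadata = [json.loads(line) for line in fh]

hit_row = 1  # pretend the index returned row 1
print(metadata[hit_row]["chunk_id"])  # → pptx-0012
```

Because FAISS itself stores only vectors, some positional (or id-mapped) lookup like this is needed to turn search hits back into citable chunks.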

configs/README.md

Lines changed: 19 additions & 11 deletions
````diff
@@ -1,14 +1,22 @@
-# Configs Folder
+# Configurations
 
-The `configs` folder is used to store configuration files for the project.
+This directory stores configuration files that control the behavior of various components of the RAG platform.
 
-- `__init__.py`: This file is empty and is used to mark the `configs` folder as a Python package.
-- `chunk_rules.yaml`: This YAML file defines the rules for chunking different types of documents.
-  - For each document type (e.g., `email`, `docx`, `pdf`), it specifies:
-    - `split_strategy`: The method to use for splitting the document (e.g., `split_on_blank_lines`, `split_on_headings`).
-    - `min_chunk_size`: The minimum desired size for each chunk.
-    - `notes`: Additional information or considerations for processing that document type.
-  - For `pptx` (PowerPoint presentations), it also includes `token_bounds` and `overlap` parameters.
-- `chunk_rules_old.yaml`: This YAML file contains older or deprecated chunking rules.
+## `chunk_rules.yaml`
 
-The `chunk_rules.yaml` file is crucial for the document processing pipeline, as it allows for customized chunking behavior based on file type, ensuring that the content is divided into meaningful segments for further processing (e.g., embedding and retrieval).
+This is the central configuration file for the chunking process. It defines a set of rules that determine how different types of documents are split into smaller chunks. The `scripts/chunking/rules_v3.py` module is responsible for loading and parsing this file.
+
+### Structure
+
+The file is a YAML dictionary where each key is a `doc_type` (e.g., `docx`, `pdf`, `eml`) and the value is a rule object with the following keys:
+
+- **`strategy`**: The name of the strategy to use for splitting the document. This corresponds to one of the strategies implemented in `scripts/chunking/chunker_v3.py` (e.g., `by_paragraph`, `by_slide`, `split_on_rows`).
+- **`min_tokens`**: The minimum number of tokens that a chunk should have.
+- **`max_tokens`**: The maximum number of tokens that a chunk can have.
+- **`overlap`**: The number of tokens to overlap between consecutive chunks.
+
+A `default` rule is also defined, which is used as a fallback for any `doc_type` that does not have a specific rule.
+
+### Purpose
+
+By externalizing the chunking rules into a configuration file, we can easily modify the chunking behavior for different document types without having to change the code. This makes the platform more flexible and easier to maintain. For example, we can define a `by_paragraph` strategy for text-heavy documents like DOCX and PDF, and a `by_slide` strategy for presentations.
````
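Putting the documented structure together, a `chunk_rules.yaml` fragment might look like the following. The keys and strategy names come from the README above; the specific token values are illustrative assumptions, not the project's actual settings:

```yaml
# Hypothetical values; keys follow the structure documented above.
default:            # fallback for any doc_type without its own rule
  strategy: by_paragraph
  min_tokens: 50
  max_tokens: 400
  overlap: 30

pptx:
  strategy: by_slide
  min_tokens: 20
  max_tokens: 300
  overlap: 0

csv:
  strategy: split_on_rows
  min_tokens: 10
  max_tokens: 200
  overlap: 0
```

With this shape, a loader such as `rules_v3.py` can resolve a rule with a simple dictionary fallback (e.g. `rules.get(doc_type, rules["default"])`).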

docs/README.md

Lines changed: 11 additions & 14 deletions
````diff
@@ -1,18 +1,15 @@
-# Docs Folder
+# Documentation
 
-The `docs` folder contains documentation files for the project.
+This directory contains various documents related to the project's design, planning, and architecture.
 
-- `chunk_rules.md`: This file provides a detailed explanation of the chunking strategies used for different document types. It outlines:
-  - Rules by document type (e.g., email, docx, pdf), including split strategy, minimum chunk size, and specific notes.
-  - Definitions for various split strategies (e.g., `split_on_blank_lines`, `split_on_headings`).
-  - Guidelines for including headers and footers.
-  - Special processing notes for certain file types.
-  - Considerations for chunk sizes.
-  This document is a reference for understanding how content is segmented before further processing in the RAG system.
+## Contents
 
-- `ingest.md`: This file describes the available data loaders for ingesting content into the system. It currently details:
-  - Email (`.eml`) loader: Explains how `.eml` files are processed, the function used (`scripts.ingestion.email_loader.load_eml`), what it returns (text content and metadata), and provides a usage example.
-  - DOCX (`.docx`) loader: Explains how `.docx` files are processed, the function used (`scripts.ingestion.docx_loader.load_docx`), what it returns (text content and metadata), how it handles various elements like tables and whitespace, and provides a usage example.
-  This document serves as a guide for developers on how to use the ingestion scripts and what to expect from them.
+- **`architecture.md`**: An overview of the high-level architecture of the RAG platform.
+- **`chunk_rules.md`**: Detailed documentation on the chunking strategies and rules used for different document types.
+- **`ingest.md`**: A guide to the data loaders available for ingesting various file formats.
+- **`roadmap.txt`**: The product roadmap, outlining future features and development plans.
+- **`Second_month_plan.md`**: A document outlining the plan for the second month of development.
+- **`Second_month_plan.pdf`**: A PDF version of the second-month plan.
+- **`new_images_index_plan.docx`**: A DOCX file detailing the plan for implementing the new image indexing functionality.
 
-The `docs` folder is essential for project maintainability and onboarding new developers, providing clear explanations of key components and processes.
+This folder is essential for understanding the project's history, current state, and future direction.
````
