|
1 | | -# RAG‑Pipeline Prototype |
| 1 | +# Multi-Source RAG Platform |
2 | 2 |
|
3 | | -A **local‑first Retrieval‑Augmented‑Generation platform** for turning mixed organisational content (PDFs, office docs, presentations, emails, images, websites, etc.) into an interactive knowledge base. |
4 | | -Built for librarians, researchers and other knowledge‑workers who need reliable answers from their in‑house documentation without sending private data to external services. |
| 3 | +This project is a local-first Retrieval-Augmented Generation (RAG) platform that turns organizational content (PDFs, Office documents, emails, and images) into an interactive, searchable knowledge base. It is built for anyone who needs reliable answers from in-house documentation without compromising data privacy. |
5 | 4 |
|
6 | 5 | <p align="center"> |
7 | | - <img src="https://raw.githubusercontent.com/your‑org/rag‑pipeline/main/docs/architecture_simplified.png" width="600" alt="High‑level architecture"/> |
| 6 | + <img src="docs/architecture.png" width="700" alt="High-level architecture of the RAG platform"/> |
8 | 7 | </p> |
9 | 8 |
|
10 | 9 | --- |
11 | 10 |
|
12 | | -## ✨ Key Features (v0.1) |
| 11 | +## ✨ Key Features |
13 | 12 |
|
14 | | -* **Drag‑and‑drop ingestion** for TXT, PDF, PPTX, DOCX, XLSX/CSV, images, email files and public URLs (up to **1 GB** total). |
15 | | -* **Source‑aware chunking** – pluggable rule‑sets per file type. |
16 | | -* **Hybrid embeddings** – choose local HuggingFace models *or* OpenAI API with one click. |
17 | | -* **FAISS vector search** with late‑fusion across source types. |
18 | | -* **Streamlit UI** – create projects, upload data, run queries and view logs without touching the CLI. |
19 | | -* **Answer generation** with citations using your preferred chat‑LLM endpoint. |
20 | | -* 100 % offline data storage – raw files, vectors and logs stay on your machine. |
21 | | - |
22 | | -See the full [Roadmap](docs/rag_prototype_roadmap.md) for detailed design and future milestones. |
| 13 | +* **Multi-Source Ingestion**: Supports a wide variety of file formats, including PDF, DOCX, PPTX, CSV, and TXT. *Please note that XLSX and EML ingestion are still under development.* |
| 14 | +* **Configurable Chunking**: Employs a rule-based chunking system that applies a different strategy (e.g., by paragraph, by slide) to each document type. |
| 15 | +* **Flexible Embedding Models**: Easily switch between local, open-source embedding models (via `sentence-transformers`) and powerful API-based models like OpenAI's. |
| 16 | +* **Multi-Modal Retrieval**: Capable of retrieving both text and image-based information. The system can generate textual descriptions for images, making visual content fully searchable. |
| 17 | +* **Advanced Retrieval Strategies**: Uses a late-fusion approach to combine the results retrieved from multiple sources. |
| 18 | +* **Streamlit UI**: An intuitive user interface for creating and managing projects, uploading documents, and editing configurations. |
| 19 | +* **Command-Line Interface**: A powerful CLI for interacting with the platform, allowing you to ingest documents, generate embeddings, and ask questions directly from your terminal. |
| 20 | +* **Local-First and Secure**: All your data, including raw files, indexes, and logs, is stored locally on your machine. |
23 | 21 |
|
24 | 22 | --- |
25 | 23 |
|
26 | | -## 🚀 Quick Start |
27 | | - |
28 | | -```bash |
29 | | -# 1. Clone & install |
30 | | -git clone https://github.com/your‑org/rag‑pipeline.git |
31 | | -cd rag‑pipeline |
32 | | -poetry install # or: pip install -r requirements.txt |
| 24 | +## 🚀 Getting Started |
33 | 25 |
|
34 | | -# 2. Launch Streamlit UI |
35 | | -poetry run streamlit run app/ui_streamlit.py # default browser opens |
| 26 | +### Prerequisites |
36 | 27 |
|
37 | | -# 3. Create a new project in the UI, upload some PDFs, and ask a question! |
38 | | -``` |
| 28 | +* Python 3.10 or higher |
| 29 | +* Poetry for dependency management |
| 30 | +* An API key for your chosen LLM and embedding providers (e.g., `OPENAI_API_KEY`) |
39 | 31 |
|
40 | | -> **Tip:** Prefer local embeddings? Select **bge‑large‑en** under *Settings → Embeddings* before indexing. |
| 32 | +### Installation |
41 | 33 |
|
42 | | ---- |
| 34 | +1. **Clone the repository:** |
| 35 | + ```bash |
| 36 | + git clone <repository-url> |
| 37 | + cd <repository-name> |
| 38 | + ``` |
43 | 39 |
|
44 | | -## 💻 UI Usage |
| 40 | +2. **Install the dependencies using Poetry:** |
| 41 | + ```bash |
| 42 | + poetry install |
| 43 | + ``` |
45 | 44 |
|
46 | | -The Streamlit UI provides a user-friendly interface for managing RAG-GP projects. |
| 45 | +### Usage |
47 | 46 |
|
48 | | -### Creating a New Project |
| 47 | +The platform can be operated through the Streamlit UI or the command-line interface. |
49 | 48 |
|
50 | | -1. Navigate to the "Projects" section in the sidebar. |
51 | | -2. Fill out the "Create New Project" form: |
52 | | - * **Project Name:** A unique name for your project. |
53 | | - * **Project Description:** An optional description of your project. |
54 | | - * **Language:** The primary language of your documents. |
55 | | - * **Enable Image Enrichment:** Check this box to enable image analysis features. |
56 | | - * **Embedding Model:** Select the embedding model to use for your project. |
57 | | -3. Click the "Create Project" button. |
| 49 | +#### Streamlit UI |
58 | 50 |
|
59 | | -### Managing a Project |
| 51 | +To launch the user interface, run the following command: |
| 52 | +```bash |
| 53 | +poetry run streamlit run scripts/ui/ui_project_manager.py |
| 54 | +``` |
| 55 | +The UI allows you to: |
| 56 | +- Create new projects. |
| 57 | +- Upload documents. |
| 58 | +- View and edit project configurations. |
60 | 59 |
|
61 | | -Once you have created a project, you can manage it from the "Projects" section. |
| 60 | +#### Command-Line Interface |
62 | 61 |
|
63 | | -* **Select a Project:** Choose a project from the dropdown menu to view its details. |
64 | | -* **Configuration Editor:** The `config.yml` file for the selected project is displayed in a text editor. You can make changes to the configuration and save them by clicking the "Save Config" button. |
65 | | -* **Upload Raw Data:** You can upload raw data files (e.g., .pdf, .docx, .txt) to your project using the file uploader. The files will be saved to the appropriate subdirectory under `data/projects/<project_name>/input/raw/`. |
66 | | -* **Raw File Repository:** The "Raw File Repository" section displays a list of all the raw data files in your project, grouped by file type. |
| 62 | +The CLI provides a powerful way to interact with the platform. Here is a typical workflow: |
67 | 63 |
|
68 | | ---- |
| 64 | +1. **Ingest and Chunk Documents:** |
| 65 | + ```bash |
| 66 | + python -m app.cli ingest /path/to/your/project --chunk |
| 67 | + ``` |
69 | 68 |
|
70 | | -## 🗂️ Folder Structure (excerpt) |
| 69 | +2. **Generate Embeddings:** |
| 70 | + ```bash |
| 71 | + python -m app.cli embed /path/to/your/project |
| 72 | + ``` |
71 | 73 |
|
72 | | -```text |
73 | | -rag‑pipeline/ |
74 | | -├── app/ # CLI & Streamlit entry‑points |
75 | | -├── scripts/ # Core library (ingestion, chunking, embeddings…) |
76 | | -├── configs/ # YAML templates for datasets & tasks |
77 | | -├── data/ # Your local datasets, chunks & indexes (git‑ignored) |
78 | | -├── docs/ # Technical docs & design diagrams |
79 | | -└── tests/ # Pytest suite |
80 | | -``` |
| 74 | +3. **Ask a Question:** |
| 75 | + ```bash |
| 76 | + python -m app.cli ask /path/to/your/project "Your question here" |
| 77 | + ``` |
81 | 78 |
|
82 | | -For a full tree and design conventions, check the [Codebase Structure](docs/rag_prototype_roadmap.md#10 repository--codebase-structure). |
| 79 | +For more detailed information on the available commands and their options, please refer to the `app/README.md` file. |
83 | 80 |
|
84 | 81 | --- |
85 | 82 |
|
86 | | -## 🛠️ Requirements |
| 83 | +## Core Concepts |
| 84 | + |
| 85 | +The platform is built around a modular pipeline that processes your data in several stages: |
87 | 86 |
|
88 | | -* **Python 3.12** |
89 | | -* **Tesseract OCR** (optional, for image ingestion) – install via your OS package manager. |
90 | | -* For OpenAI or other API models: set `OPENAI_API_KEY` or relevant environment variables. |
| 87 | +1. **Ingestion**: The first step is to ingest your raw documents. The platform provides a suite of loaders that can handle a wide variety of file formats. |
| 88 | +2. **Chunking**: Once ingested, the documents are split into smaller, more manageable chunks. This process is highly configurable and can be tailored to the specific characteristics of each document type. |
| 89 | +3. **Enrichment**: The platform includes an `ImageInsightAgent` that can analyze images and generate textual descriptions for them. This makes visual content searchable and adds another layer of context to your knowledge base. |
| 90 | +4. **Embedding**: The text and image chunks are then converted into numerical representations (embeddings) using a chosen embedding model. |
| 91 | +5. **Indexing**: The embeddings are stored in a local FAISS index, which allows for efficient similarity searches. |
| 92 | +6. **Retrieval**: When you ask a question, the platform uses a late-fusion retrieval strategy to find the most relevant text and image chunks from the index. |
| 93 | +7. **Generation**: The retrieved context is then used to construct a detailed prompt, which is sent to a large language model to generate a final answer. |
91 | 94 |
|
92 | 95 | --- |
93 | 96 |
|
94 | | -## 🤝 Contributing |
| 97 | +## 🗂️ Project Structure |
| 98 | + |
| 99 | +The project is organized into the following key directories: |
95 | 100 |
|
96 | | -Pull requests are welcome! Please read **CONTRIBUTING.md** (to be added) and open an issue before starting major work. |
| 101 | +- **`app/`**: Contains the command-line interface for the platform. See `app/README.md`. |
| 102 | +- **`assets/`**: A place for static assets. See `assets/README.md`. |
| 103 | +- **`configs/`**: Home to the `chunk_rules.yaml` file, which defines the chunking strategies for different document types. See `configs/README.md`. |
| 104 | +- **`docs/`**: Contains project-related documentation, including architecture diagrams and planning documents. See `docs/README.md`. |
| 105 | +- **`scripts/`**: The heart of the platform, containing the core logic for ingestion, chunking, embedding, retrieval, and more. See `scripts/README.md` for a high-level overview. |
| 106 | +- **`tests/`**: Contains the test suite for the project. |
97 | 107 |
|
98 | 108 | --- |
99 | 109 |
|
100 | | -## 📜 License |
| 110 | +## 🤝 Contributing |
101 | 111 |
|
102 | | -This project is released under the MIT License © 2025 — see [LICENSE](LICENSE) for details. |
| 112 | +Contributions are welcome! Please feel free to open an issue or submit a pull request. |
103 | 113 |
|
104 | | ---- |
| 114 | +## 📜 License |
105 | 115 |
|
106 | | -> *Built with ❤️ and lots of coffee by the Library Innovation Lab.* |
107 | | -> *“Organisational knowledge belongs at your fingertips.”* |
| 116 | +This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details. |
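The new README describes chunking as rule-based, with a different strategy per document type ("by paragraph", "by slide"). The diff does not show the contents of `chunk_rules.yaml` or the project's chunker, so the following is only a minimal sketch of the idea. The names `CHUNK_RULES`, `chunk_document`, and the `---slide---` marker are hypothetical and do not come from the repository.

```python
from typing import Callable


def split_by_paragraph(text: str) -> list[str]:
    """Split on blank lines and drop empty pieces."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def split_by_slide(text: str, marker: str = "---slide---") -> list[str]:
    """Split on a slide marker (hypothetical; a real loader would supply slide boundaries)."""
    return [part.strip() for part in text.split(marker) if part.strip()]


# Hypothetical rule table: document type -> splitting strategy.
# In the project this mapping would presumably be loaded from configs/chunk_rules.yaml.
CHUNK_RULES: dict[str, Callable[[str], list[str]]] = {
    "txt": split_by_paragraph,
    "docx": split_by_paragraph,
    "pptx": split_by_slide,
}


def chunk_document(text: str, doc_type: str) -> list[str]:
    """Pick the strategy for doc_type, falling back to paragraph splitting."""
    splitter = CHUNK_RULES.get(doc_type, split_by_paragraph)
    return splitter(text)
```

Keeping the rules in a lookup table means a new document type only needs a new entry, not a change to the chunking call site.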
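The README's "late-fusion retrieval" means each source type is searched separately and the ranked results are merged afterwards. The diff does not say how the project merges them, so this sketch swaps in reciprocal rank fusion, one common late-fusion scheme, purely as an illustration. The function name `late_fusion`, the chunk IDs, and the constant `k = 60` are assumptions, not the project's API.

```python
from collections import defaultdict


def late_fusion(
    ranked_lists: dict[str, list[str]],
    k: int = 60,
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Merge per-source rankings with reciprocal rank fusion (assumed scheme).

    ranked_lists maps a source type (e.g. "pdf") to chunk IDs ordered
    best-first, as returned by that source's own index search.
    """
    scores: defaultdict[str, float] = defaultdict(float)
    for chunk_ids in ranked_lists.values():
        for rank, chunk_id in enumerate(chunk_ids, start=1):
            # Higher-ranked chunks contribute more; k damps the gap between ranks.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, best first; ties keep their insertion order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical per-source results for one query.
fused = late_fusion({"pdf": ["pdf-1", "pdf-2"], "pptx": ["slide-1"]}, top_n=3)
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Because the fusion works on ranks rather than raw similarity scores, the per-source indexes do not need comparable score scales.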