# Kubeflow Docling Spreadsheets Conversion Pipeline for RAG

This document explains the **Kubeflow Docling Spreadsheets Conversion Pipeline**, a Kubeflow pipeline that processes spreadsheets in several formats (`.csv`, `.xlsx`, `.xls`, `.xlsm`) with Docling to extract text and generate embeddings for Retrieval-Augmented Generation (RAG) applications. The pipeline runs on both GPU and CPU-only nodes.

## Pipeline Overview

The pipeline transforms spreadsheet files into searchable vector embeddings through the following stages:

```mermaid
graph TD
    A[Register Vector DB] --> B[Import spreadsheet files]
    B --> C[Create spreadsheet splits]
    C --> D[Convert all spreadsheets to CSV]
    D --> E[Conversion using Docling]
    E --> F[Text Chunking]
    F --> G[Generate Embeddings]
    G --> H[Store chunks with embeddings in Vector Database]
    H --> I[Ready for RAG Queries]
```

## Pipeline Components

### 1. **Vector Database Registration** (`register_vector_db`)

- **Purpose**: Sets up the vector database with the proper configuration.
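
As an illustration of what this step sets up, here is a minimal sketch assuming the `llama-stack-client` Python API; the base URL, `provider_id`, and embedding dimension are assumptions for the example, not values taken from the pipeline code:

```python
# Hedged sketch of vector DB registration via llama-stack-client;
# the actual logic lives in the register_vector_db pipeline component.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # service_url parameter

client.vector_dbs.register(
    vector_db_id="my-spreadsheets-db",  # vector_db_id parameter
    embedding_model="ibm-granite/granite-embedding-125m-english",
    embedding_dimension=768,  # assumed dimension for this Granite model
    provider_id="milvus",     # assumed provider; depends on your deployment
)
```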

### 2. **Spreadsheet Import** (`import_spreadsheet_files`)

- **Purpose**: Downloads spreadsheet files from remote URLs.
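
The logic is roughly equivalent to the sketch below; `download_spreadsheets` is a hypothetical helper written for illustration, not a function from the pipeline code, and it assumes `base_url` ends with a trailing slash:

```python
# Hedged sketch of the import step: fetch each named file from base_url.
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlretrieve

def download_spreadsheets(base_url: str, filenames: str,
                          dest: str = "/tmp/spreadsheets") -> None:
    """Download each comma-separated filename from base_url into dest."""
    out_dir = Path(dest)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name in (n.strip() for n in filenames.split(",")):
        urlretrieve(urljoin(base_url, name), out_dir / name)

download_spreadsheets("https://example.com/data/", "q1.xlsx,q2.csv")
```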

### 3. **Spreadsheet Splitting** (`create_spreadsheet_splits`)

- **Purpose**: Distributes spreadsheet files across multiple parallel workers for faster processing.
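
Conceptually, the split step divides the file list into `num_workers` slices so each worker converts its own subset; a minimal sketch (`create_splits` is a hypothetical name):

```python
# Hedged sketch of split creation: round-robin the files across workers.
def create_splits(filenames: list[str], num_workers: int) -> list[list[str]]:
    return [filenames[i::num_workers] for i in range(num_workers)]

splits = create_splits(["a.csv", "b.xlsx", "c.xls", "d.xlsm"], num_workers=2)
# -> [['a.csv', 'c.xls'], ['b.xlsx', 'd.xlsm']]
```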

### 4. **Spreadsheet Conversion and Embedding Generation** (`docling_convert_and_ingest_spreadsheets`)

- **Purpose**: Main processing component that extracts data from spreadsheet rows, chunks the text, and generates vector embeddings.
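
A minimal sketch of the convert-and-chunk portion, assuming the `docling` Python API (`DocumentConverter` and `HybridChunker`); embedding generation and vector DB ingestion are omitted, and the file name is illustrative:

```python
# Hedged sketch: convert one spreadsheet with Docling, then chunk the text.
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

converter = DocumentConverter()
result = converter.convert("/tmp/spreadsheets/q1.csv")  # a downloaded file
doc = result.document

chunker = HybridChunker(max_tokens=512)  # max_tokens pipeline parameter
for chunk in chunker.chunk(doc):
    print(chunk.text[:80])  # each chunk is embedded and stored downstream
```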

## Supported Spreadsheet Formats

- `.csv`
- `.xlsx`
- `.xls`
- `.xlsm`

## 🔄 RAG Query Flow

1. **User Query** → Embedding Model → Query Vector
2. **Vector Search** → Vector Database → Similar Data Chunks
3. **Context Assembly** → Row Content + Source Metadata
4. **LLM Generation** → Final Answer with Context from Spreadsheet

The pipeline enables rich RAG applications that can answer questions about spreadsheet content by leveraging the structured data extracted from the source files.
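
Steps 1 and 2 can be exercised directly against the vector store once ingestion has run; a minimal sketch assuming the `llama-stack-client` `vector_io` API (the database ID and query text are illustrative):

```python
# Hedged sketch of a retrieval query against the populated vector DB.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")  # service_url

response = client.vector_io.query(
    vector_db_id="my-spreadsheets-db",
    query="What was the Q1 revenue?",  # embedded with embed_model_id server-side
)
for chunk in response.chunks:
    print(chunk.metadata, chunk.content)  # row content plus source metadata
```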
## 🚀 Getting Started

### Prerequisites

- [Data Science Project in OpenShift AI with a configured Workbench](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/getting_started)
- [Configuring a pipeline server](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/latest/html/working_with_data_science_pipelines/managing-data-science-pipelines_ds-pipelines#configuring-a-pipeline-server_ds-pipelines)
- A LlamaStack service with a vector database backend deployed (follow our [official deployment documentation](https://github.com/opendatahub-io/rag/blob/main/DEPLOYMENT.md))
- GPU-enabled nodes are highly recommended for faster processing.
- CPU-only nodes also work, but processing will take longer.

### Pipeline Parameters

- `base_url`: URL where spreadsheet files are hosted
- `spreadsheet_filenames`: Comma-separated list of spreadsheet files to process
- `num_workers`: Number of parallel workers (default: 1)
- `vector_db_id`: ID of the vector database to store embeddings
- `service_url`: URL of the LlamaStack service
- `embed_model_id`: Embedding model to use (default: `ibm-granite/granite-embedding-125m-english`)
- `max_tokens`: Maximum tokens per chunk (default: 512)
- `use_gpu`: Whether to use GPU for processing (default: true)
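
For orientation, these parameters plausibly map onto a KFP v2 pipeline signature along the following lines; the authoritative definition is `docling_convert_pipeline()` in `docling_spreadsheets_convert_pipeline.py`, and the defaults shown for `vector_db_id` and `service_url` are placeholders:

```python
# Hedged sketch of the pipeline signature implied by the parameter list above.
from kfp import dsl

@dsl.pipeline(name="docling-spreadsheets-convert")
def docling_convert_pipeline(
    base_url: str,                        # where spreadsheet files are hosted
    spreadsheet_filenames: str,           # comma-separated list of files
    num_workers: int = 1,                 # parallel conversion workers
    vector_db_id: str = "my-db",          # placeholder default
    service_url: str = "http://llamastack:8321",  # placeholder default
    embed_model_id: str = "ibm-granite/granite-embedding-125m-english",
    max_tokens: int = 512,                # maximum tokens per chunk
    use_gpu: bool = True,                 # schedule conversion on GPU nodes
):
    # component wiring omitted in this sketch
    ...
```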

### Creating the Pipeline for Running on a GPU Node

```bash
# Install dependencies for the pipeline
cd demos/kfp/docling/spreadsheets-conversion
pip3 install -r requirements.txt

# Compile the Kubeflow pipeline for GPU execution, or use the existing compiled pipeline
# (set use_gpu = True in docling_convert_pipeline() in docling_spreadsheets_convert_pipeline.py)
python3 docling_spreadsheets_convert_pipeline.py
```

### Creating the Pipeline for Running on CPU Only

```bash
# Install dependencies for the pipeline
cd demos/kfp/docling/spreadsheets-conversion
pip3 install -r requirements.txt

# Compile the Kubeflow pipeline for CPU-only execution, or use the existing compiled pipeline
# (set use_gpu = False in docling_convert_pipeline() in docling_spreadsheets_convert_pipeline.py)
python3 docling_spreadsheets_convert_pipeline.py
```

### Import the Kubeflow Pipeline to OpenShift AI

- Import the compiled YAML into the pipeline server of your Data Science project in OpenShift AI
- [Running a data science pipeline generated from Python code](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/openshift_ai_tutorial_-_fraud_detection_example/implementing-pipelines#running-a-pipeline-generated-from-python-code)
- Configure the pipeline parameters as needed

### Query the RAG Agent in your Workbench within a Data Science Project on OpenShift AI

1. Open your Workbench
2. Clone the rag repo and use the `main` branch
   - Use this link `https://github.com/opendatahub-io/rag.git` for cloning the repo
   - [Collaborating on Jupyter notebooks by using Git](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/working_with_connected_applications/using_basic_workbenches#collaborating-on-jupyter-notebooks-by-using-git_connected-apps)
3. Install the dependencies for the Jupyter Notebook with the RAG agent

   ```bash
   cd demos/kfp/docling/spreadsheets-conversion/rag-agent
   pip3 install -r requirements.txt
   ```

4. Follow the instructions in the corresponding RAG Jupyter Notebook `spreadsheets_rag_agent.ipynb` in `demos/kfp/docling/spreadsheets-conversion/rag-agent` to query the content ingested by the pipeline.
