# Kubeflow Docling Spreadsheets Conversion Pipeline for RAG

This document explains the **Kubeflow Docling Spreadsheets Conversion Pipeline**, a Kubeflow pipeline that uses Docling to process spreadsheets in formats such as `.csv`, `.xlsx`, `.xls`, and `.xlsm`, extract their text, and generate embeddings for Retrieval-Augmented Generation (RAG) applications. The pipeline supports execution on both GPU and CPU-only nodes.

## Pipeline Overview

The pipeline transforms spreadsheet files into searchable vector embeddings through the following stages:

```mermaid
graph TD
    A[Register Vector DB] --> B[Import spreadsheet files]
    B --> C[Create spreadsheet splits]
    C --> D[Convert all spreadsheets to CSV]
    D --> E[Conversion using Docling]
    E --> F[Text Chunking]
    F --> G[Generate Embeddings]
    G --> H[Store chunks with embeddings in Vector Database]
    H --> I[Ready for RAG Queries]
```

## Pipeline Components

### 1. **Vector Database Registration** (`register_vector_db`)

- **Purpose**: Sets up the vector database with the proper configuration (see the sketch below).

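The component's actual code lives in the pipeline source; conceptually, though, registration boils down to a call like the following against the LlamaStack client. This is a minimal sketch, and the service URL, vector DB ID, provider ID, and embedding dimension shown here are illustrative assumptions, not values taken from the pipeline.

```python
# Minimal sketch of vector DB registration via the LlamaStack client.
# URL, IDs, and dimension are illustrative placeholders.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://llamastack-service:8321")

client.vector_dbs.register(
    vector_db_id="spreadsheets-rag",  # matches the `vector_db_id` pipeline parameter
    embedding_model="ibm-granite/granite-embedding-125m-english",
    embedding_dimension=768,          # dimension of the granite embedding model
    provider_id="milvus",             # whichever vector DB backend your deployment uses
)
```
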
### 2. **Spreadsheets Import** (`import_spreadsheet_files`)

- **Purpose**: Downloads spreadsheet files from remote URLs.

### 3. **Spreadsheets Splitting** (`create_spreadsheet_splits`)

- **Purpose**: Distributes spreadsheet files across multiple parallel workers for faster processing.

### 4. **Spreadsheet Conversion and Embedding Generation** (`docling_convert_and_ingest_spreadsheets`)

- **Purpose**: Main processing component that extracts data from spreadsheet rows, chunks the text, and generates vector embeddings (see the sketch below).

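The real component also handles CSV conversion and stores the results through LlamaStack; the sketch below only illustrates the general shape of the convert → chunk → embed step using Docling's `DocumentConverter`, its `HybridChunker`, and `sentence-transformers`. The input path is made up, and the model name and token limit are taken from the pipeline parameter defaults.

```python
# Rough sketch of the convert -> chunk -> embed step. Not the component's
# actual code; the input path is a placeholder.
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from sentence_transformers import SentenceTransformer

converter = DocumentConverter()
chunker = HybridChunker(max_tokens=512)  # mirrors the `max_tokens` parameter
embedder = SentenceTransformer("ibm-granite/granite-embedding-125m-english")

# Convert one spreadsheet, split it into token-bounded chunks, embed each chunk.
result = converter.convert("data/sales.csv")
chunks = [chunk.text for chunk in chunker.chunk(dl_doc=result.document)]
embeddings = embedder.encode(chunks)

print(f"{len(chunks)} chunks, embedding dimension {embeddings.shape[1]}")
```
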
## Supported Spreadsheet Formats

- `.csv`
- `.xlsx`
- `.xls`
- `.xlsm`

## 🔄 RAG Query Flow

1. **User Query** → Embedding Model → Query Vector
2. **Vector Search** → Vector Database → Similar Data Chunks
3. **Context Assembly** → Row Content + Source Metadata
4. **LLM Generation** → Final Answer with Context from Spreadsheet

The pipeline enables rich RAG applications that can answer questions about spreadsheet content by leveraging the structured data extracted from spreadsheet files. A sketch of the retrieval step is shown below.

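As a hedged illustration of steps 1–2 (LlamaStack embeds the query and searches the vector DB for you), a retrieval call might look like this. The URL and vector DB ID are placeholders, and the exact client API surface can differ between `llama-stack-client` releases.

```python
# Illustrative retrieval call against LlamaStack; URL and IDs are placeholders.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://llamastack-service:8321")

response = client.vector_io.query(
    vector_db_id="spreadsheets-rag",
    query="What was the total revenue in Q3?",
)

# Each returned chunk carries the row content plus source metadata.
for chunk, score in zip(response.chunks, response.scores):
    print(f"{score:.3f}  {chunk.metadata}  {str(chunk.content)[:80]}")
```
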
## 🚀 Getting Started

### Prerequisites

- [Data Science Project in OpenShift AI with a configured Workbench](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/getting_started)
  - [Configuring a pipeline server](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/latest/html/working_with_data_science_pipelines/managing-data-science-pipelines_ds-pipelines#configuring-a-pipeline-server_ds-pipelines)
  - A LlamaStack service with a vector database backend deployed (follow our [official deployment documentation](https://github.com/opendatahub-io/rag/blob/main/DEPLOYMENT.md))
- GPU-enabled nodes are highly recommended for faster processing. You can still use CPU-only nodes, but processing will take longer.

**Pipeline Parameters** (an example run submission using these parameters is sketched below)

- `base_url`: URL where spreadsheet files are hosted
- `spreadsheet_filenames`: Comma-separated list of spreadsheet files to process
- `num_workers`: Number of parallel workers (default: 1)
- `vector_db_id`: ID of the vector database to store embeddings
- `service_url`: URL of the LlamaStack service
- `embed_model_id`: Embedding model to use (default: `ibm-granite/granite-embedding-125m-english`)
- `max_tokens`: Maximum tokens per chunk (default: 512)
- `use_gpu`: Whether to use GPU for processing (default: true)

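If you prefer to submit runs programmatically rather than through the OpenShift AI UI, the parameters map onto run arguments roughly as below. This is a sketch under assumptions: the pipeline-server route, token, compiled YAML filename, and spreadsheet filenames are placeholders.

```python
# Illustrative programmatic run submission with the parameters above.
# Route, token, YAML filename, and argument values are placeholders.
import kfp

client = kfp.Client(
    host="https://ds-pipeline-dspa-<project>.apps.<cluster>/",  # your pipeline server route
    existing_token="<openshift-token>",
)

client.create_run_from_pipeline_package(
    "docling_spreadsheets_convert_pipeline.yaml",
    arguments={
        "base_url": "https://example.com/spreadsheets",
        "spreadsheet_filenames": "sales.xlsx,inventory.csv",
        "num_workers": 1,
        "vector_db_id": "spreadsheets-rag",
        "service_url": "http://llamastack-service:8321",
        "embed_model_id": "ibm-granite/granite-embedding-125m-english",
        "max_tokens": 512,
        "use_gpu": True,
    },
)
```
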
### Creating the Pipeline for running on a GPU node

```
# Install dependencies for the pipeline
cd demos/kfp/docling/spreadsheets-conversion
pip3 install -r requirements.txt

# Compile the Kubeflow pipeline for GPU execution, or use the existing compiled pipeline:
# set use_gpu = True in docling_convert_pipeline() in docling_spreadsheets_convert_pipeline.py
python3 docling_spreadsheets_convert_pipeline.py
```

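For reference, a `use_gpu` flag typically translates into KFP v2 accelerator settings on the conversion task. The snippet below is a self-contained sketch of that pattern, not the actual pipeline code; the component and pipeline names in it are made up.

```python
# Sketch of toggling GPU resources on a task at compile time (KFP v2).
# Names are illustrative, not taken from docling_spreadsheets_convert_pipeline.py.
from kfp import dsl, compiler

USE_GPU = True  # compile-time flag, analogous to use_gpu in the pipeline source

@dsl.component(base_image="python:3.11")
def convert_step(split: str) -> str:
    return f"converted {split}"

@dsl.pipeline(name="gpu-toggle-sketch")
def sketch_pipeline():
    task = convert_step(split="split-0")
    if USE_GPU:
        task.set_accelerator_type("nvidia.com/gpu")  # schedule on a GPU node
        task.set_accelerator_limit(1)                # request one GPU

compiler.Compiler().compile(sketch_pipeline, "sketch_pipeline.yaml")
```
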
### Creating the Pipeline for running on CPU only

```
# Install dependencies for the pipeline
cd demos/kfp/docling/spreadsheets-conversion
pip3 install -r requirements.txt

# Compile the Kubeflow pipeline for CPU-only execution, or use the existing compiled pipeline:
# set use_gpu = False in docling_convert_pipeline() in docling_spreadsheets_convert_pipeline.py
python3 docling_spreadsheets_convert_pipeline.py
```

### Import the Kubeflow pipeline into OpenShift AI

- Import the compiled YAML into the pipeline server of your Data Science project in OpenShift AI
  - [Running a data science pipeline generated from Python code](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/openshift_ai_tutorial_-_fraud_detection_example/implementing-pipelines#running-a-pipeline-generated-from-python-code)
- Configure the pipeline parameters as needed

### Query the RAG Agent in your Workbench within a Data Science project on OpenShift AI

1. Open your Workbench.
2. Clone the rag repo and use the main branch.
   - Use this link `https://github.com/opendatahub-io/rag.git` for cloning the repo
   - [Collaborating on Jupyter notebooks by using Git](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/working_with_connected_applications/using_basic_workbenches#collaborating-on-jupyter-notebooks-by-using-git_connected-apps)
3. Install the dependencies for the Jupyter Notebook with the RAG Agent:

```
cd demos/kfp/docling/spreadsheets-conversion/rag-agent
pip3 install -r requirements.txt
```

4. Follow the instructions in the corresponding RAG Jupyter Notebook `spreadsheets_rag_agent.ipynb` in `demos/kfp/docling/spreadsheets-conversion/rag-agent` to query the content ingested by the pipeline; a minimal sketch of the query pattern is shown below.
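
The notebook remains the authoritative reference for the agent setup. The fragment below only sketches the general pattern of querying a LlamaStack RAG agent from Python, assuming a recent `llama-stack-client` release (the `Agent` helper's API has changed between versions); the model ID, service URL, and vector DB ID are placeholders.

```python
# Sketch of querying a LlamaStack RAG agent; see spreadsheets_rag_agent.ipynb
# for the real setup. Model ID, URL, and vector DB ID are placeholders, and
# the Agent helper's API may differ between llama-stack-client releases.
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent

client = LlamaStackClient(base_url="http://llamastack-service:8321")

agent = Agent(
    client,
    model="meta-llama/Llama-3.1-8B-Instruct",
    instructions="Answer using the spreadsheet knowledge base.",
    tools=[{
        "name": "builtin::rag/knowledge_search",
        "args": {"vector_db_ids": ["spreadsheets-rag"]},
    }],
)

session_id = agent.create_session("spreadsheets-session")
turn = agent.create_turn(
    session_id=session_id,
    messages=[{"role": "user", "content": "Which region had the highest sales?"}],
    stream=False,
)
print(turn.output_message.content)
```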