# Kubeflow Docling ASR Conversion Pipeline for RAG
This document explains the **Kubeflow Docling ASR (Automatic Speech Recognition) Conversion Pipeline**, a Kubeflow pipeline that transcribes audio files with Docling's ASR support to extract transcripts and generate embeddings for Retrieval-Augmented Generation (RAG) applications. The pipeline supports execution on both GPU and CPU-only nodes.
## Pipeline Overview
The pipeline transforms audio files into searchable vector embeddings through the following stages:
```mermaid
graph TD
A[Register Vector DB] --> B[Import audio files]
B --> C[Create audio splits]
C --> D[Install FFmpeg Dependency]
D --> E[Convert audio files to WAV format that Whisper Turbo ASR model can process]
E --> F[Conversion using Docling ASR via Whisper Turbo]
F --> G[Text Chunking]
G --> H[Generate Embeddings using Sentence Transformer powered by Embedding Model]
H --> I[Store in Vector Database]
I --> J[Ready for RAG Queries]
```

### 3. **Create Audio Splits**

- **Purpose**: Distributes audio files across multiple parallel workers for faster processing.
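
As a toy illustration (hypothetical code, not the actual component), a round-robin split distributes the input list evenly across workers:

```python
# Hypothetical illustration of a round-robin split across workers;
# the real component receives these values as pipeline parameters.
filenames = ["a.wav", "b.mp3", "c.m4a", "d.flac", "e.ogg"]
num_workers = 2

# Worker i takes every num_workers-th file, starting at offset i.
splits = [filenames[i::num_workers] for i in range(num_workers)]
print(splits)  # [['a.wav', 'c.m4a', 'e.ogg'], ['b.mp3', 'd.flac']]
```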
### 4. **ASR Conversion and Embedding Generation** (`docling_convert_and_ingest_audio`)
- **Purpose**: Main processing component that transcribes audio, chunks the text, and generates vector embeddings (see the sketch below).
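
For a sense of what the transcription step looks like, here is a minimal sketch using Docling's ASR pipeline with Whisper Turbo. It assumes a recent Docling release exposing `asr_model_specs`, `AsrPipeline`, and `AudioFormatOption`; the file name is a placeholder, and the chunking and embedding the real component also performs are omitted.

```python
# Minimal sketch: transcribe one audio file with Docling's ASR pipeline
# (Whisper Turbo). Assumes a recent Docling release; chunking and
# embedding, which the real component also performs, are omitted.
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline

pipeline_options = AsrPipelineOptions()
pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("meeting.wav")  # placeholder file name
print(result.document.export_to_markdown())  # transcript as Markdown
```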
## Supported Audio Formats
- `.wav`
- `.m4a`
- `.mp3`
- `.flac`
- `.ogg`
- `.aac`
In fact, the Whisper model works exceptionally well with **WAV files**; it is the ideal format to use.
## Why WAV is the Best Choice
- **Uncompressed Data**: WAV files contain raw, uncompressed audio data (PCM), which is exactly what the Whisper model needs to analyze the sound waves and perform transcription.
- **Standardization**: You can easily save a WAV file with the precise specifications that Whisper was trained on: a **16kHz sample rate** and a **single mono channel**. This consistency leads to the highest accuracy.
- **No Decoding Needed**: When the model receives a properly formatted WAV file, it can process the audio directly without needing any external tools like FFmpeg to decode it first.
In short, providing Whisper with a 16kHz mono WAV file gives it exactly the type of data it was designed to read, which ensures the most reliable and accurate results.
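
For reference, this normalization is easy to do with the `ffmpeg` CLI that the pipeline installs. Below is a minimal sketch (file names are placeholders), not the pipeline's exact code:

```python
# Minimal sketch: normalize any supported input to the 16kHz mono WAV
# that Whisper expects, by shelling out to the ffmpeg CLI.
import subprocess

def to_whisper_wav(src: str, dst: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",  # overwrite the output file if it exists
            "-i", src,       # input in any format ffmpeg can decode
            "-ar", "16000",  # resample to a 16kHz sample rate
            "-ac", "1",      # downmix to a single mono channel
            dst,
        ],
        check=True,          # raise CalledProcessError if ffmpeg fails
    )

to_whisper_wav("episode.m4a", "episode.wav")  # placeholder file names
```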
## 🔄 RAG Query Flow
1. **User Query** → Embedding Model → Query Vector
2. **Vector Search** → Vector Database → Similar Transcript Chunks
3. **LLM Generation** → Final Answer with Context from Audio
The pipeline enables rich RAG applications that can answer questions about spoken content by leveraging the structured transcripts extracted from audio files.
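
As a sketch of what such a query can look like in code, the following assumes the `llama-stack-client` Python package and a reachable LlamaStack service; the URL and `vector_db_id` are placeholders that correspond to the pipeline parameters listed below:

```python
# Minimal sketch: retrieve transcript chunks ingested by the pipeline
# via LlamaStack's RAG tool. URL and vector_db_id are placeholders.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://llamastack-service:8321")

result = client.tool_runtime.rag_tool.query(
    vector_db_ids=["my-audio-vector-db"],
    content="What topics were discussed in the recorded meeting?",
)

# The retrieved chunks can then be passed as context to an LLM prompt.
print(result.content)
```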
## 🚀 Getting Started
### Prerequisites
- [Data Science Project in OpenShift AI with a configured Workbench](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/getting_started)
- [Configuring a pipeline server](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/latest/html/working_with_data_science_pipelines/managing-data-science-pipelines_ds-pipelines#configuring-a-pipeline-server_ds-pipelines)
- A LlamaStack service with a vector database backend deployed (follow our [official deployment documentation](https://github.com/opendatahub-io/rag/blob/main/DEPLOYMENT.md))
- `ffmpeg` dependency (installed automatically by the pipeline components)
- GPU-enabled nodes are highly recommended for faster processing.
- CPU-only nodes also work, but the pipeline will take longer to run
**Pipeline Parameters**

- `base_url`: URL where the audio files are hosted
- `audio_filenames`: Comma-separated list of audio files to process
- `num_workers`: Number of parallel workers (default: 1)
- `vector_db_id`: ID of the vector database to store embeddings
- `service_url`: URL of the LlamaStack service
- `embed_model_id`: Embedding model to use (default: `ibm-granite/granite-embedding-125m-english`)
- `max_tokens`: Maximum tokens per chunk (default: 512)
- `use_gpu`: Whether to use GPU for processing (default: `true`)

### Creating the Pipeline for running on GPU

```
# Install dependencies for pipeline
cd demos/kfp/docling/asr-conversion
pip3 install -r requirements.txt

# Compile the Kubeflow pipeline for running on GPU or use existing pipeline
# set use_gpu = True in docling_convert_pipeline() in docling_asr_convert_pipeline.py
python3 docling_asr_convert_pipeline.py
```
### Creating the Pipeline for running on CPU only
```
# Install dependencies for pipeline
cd demos/kfp/docling/asr-conversion
pip3 install -r requirements.txt
# Compile the Kubeflow pipeline for running on CPU only or use existing pipeline
# set use_gpu = False in docling_convert_pipeline() in docling_asr_convert_pipeline.py
python3 docling_asr_convert_pipeline.py
```
### Import Kubeflow pipeline to OpenShift AI
- Import the compiled YAML into the pipeline server in your Data Science project in OpenShift AI
- [Running a data science pipeline generated from Python code](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/openshift_ai_tutorial_-_fraud_detection_example/implementing-pipelines#running-a-pipeline-generated-from-python-code)
- Configure the pipeline parameters as needed
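
Alternatively, the same parameters can be supplied programmatically when starting a run. Below is a minimal sketch assuming the `kfp` SDK, with a placeholder endpoint and example values:

```python
# Minimal sketch: submit a run of the compiled pipeline with the kfp SDK.
# The host and all argument values are placeholders; authentication
# setup for your cluster is omitted.
import kfp

client = kfp.Client(host="https://<your-pipeline-endpoint>")

client.create_run_from_pipeline_package(
    "docling_asr_convert_pipeline.yaml",
    arguments={
        "base_url": "https://example.com/audio",
        "audio_filenames": "meeting1.wav,interview2.mp3",
        "num_workers": 1,
        "vector_db_id": "my-audio-vector-db",
        "service_url": "http://llamastack-service:8321",
        "embed_model_id": "ibm-granite/granite-embedding-125m-english",
        "max_tokens": 512,
        "use_gpu": True,
    },
)
```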
### Query RAG Agent in your Workbench within a Data Science project on OpenShift AI
1. Open your Workbench
2. Clone the rag repo and use the main branch
- Use this link `https://github.com/opendatahub-io/rag.git` for cloning the repo
- [Collaborating on Jupyter notebooks by using Git](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/working_with_connected_applications/using_basic_workbenches#collaborating-on-jupyter-notebooks-by-using-git_connected-apps)
3. Install dependencies for the Jupyter Notebook with the RAG Agent
```
cd demos/kfp/docling/asr-conversion/rag-agent
pip3 install -r requirements.txt
```
4. Follow the instructions in the corresponding RAG Jupyter Notebook `asr_rag_agent.ipynb` to query the content ingested by the pipeline.