# Kubeflow Docling ASR Conversion Pipeline for RAG
This document explains the **Kubeflow Docling ASR (Automatic Speech Recognition) Conversion Pipeline** - a Kubeflow pipeline that processes audio files using Automatic Speech Recognition (ASR) with Docling to extract transcripts and generate embeddings for Retrieval-Augmented Generation (RAG) applications. The pipeline supports execution on both GPU and CPU-only nodes.
## Pipeline Overview
The pipeline transforms audio files into searchable vector embeddings through the following stages:
```mermaid
graph TD
    A[Register Vector DB] --> B[Import audio files]
    B --> C[Create audio splits]
    C --> D[Install FFmpeg Dependency]
    D --> E[Convert audio files to WAV format that Whisper Turbo ASR model can process]
    E --> F[Conversion using Docling ASR via Whisper Turbo]
    F --> G[Text Chunking]
    G --> H[Generate Embeddings using Sentence Transformer powered by Embedding Model]
```
### 3. **Create Audio Splits**

**Purpose**: Distributes audio files across multiple parallel workers for faster processing.
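The splitting step amounts to partitioning the file list across workers. A minimal sketch (`create_audio_splits` is an illustrative name, not the actual component signature):

```python
def create_audio_splits(filenames, num_workers):
    """Partition a list of audio filenames round-robin across workers.

    Worker i receives every num_workers-th file starting at offset i,
    so the splits differ in size by at most one file.
    """
    return [filenames[i::num_workers] for i in range(num_workers)]
```

For example, `create_audio_splits(["a.wav", "b.mp3", "c.flac"], 2)` yields `[["a.wav", "c.flac"], ["b.mp3"]]`, and each split is then processed by its own pipeline worker in parallel.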
### 4. **ASR Conversion and Embedding Generation** (`docling_convert_and_ingest_audio`)
**Purpose**: Main processing component that transcribes audio, chunks the text, and generates vector embeddings.
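The chunking stage of this component can be sketched as follows. Note this is a simplification: whitespace-separated words stand in for the embedding model's real tokenizer, and `max_tokens` mirrors the pipeline parameter of the same name:

```python
def chunk_transcript(text, max_tokens=512):
    """Split a transcript into chunks of at most max_tokens words.

    Whitespace tokenization is a stand-in for the model tokenizer; the
    pipeline's max_tokens parameter (default 512) caps each chunk so it
    fits the embedding model's context window.
    """
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]
```

Each resulting chunk is then embedded and stored in the vector database alongside its source metadata.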
## Supported Audio Formats
- `.wav`
- `.m4a`
- `.mp3`
- `.flac`
- `.ogg`
- `.aac`
The Whisper model works exceptionally well with **WAV files**; it is the ideal format to use.
## Why WAV is the Best Choice
- **Uncompressed Data**: WAV files contain raw, uncompressed audio data (PCM), which is exactly what the Whisper model needs to analyze the sound waves and perform transcription.
- **Standardization**: You can easily save a WAV file with the precise specifications that Whisper was trained on: a **16kHz sample rate** and a **single mono channel**. This consistency leads to the highest accuracy.
- **No Decoding Needed**: When the model receives a properly formatted WAV file, it can process the audio directly, without needing external tools like FFmpeg to decode it first.
In short, providing Whisper with a 16kHz mono WAV file gives it exactly the type of data it was designed to read, which ensures the most reliable and accurate results.
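The conversion the pipeline performs can be expressed as one FFmpeg invocation. A sketch that builds the command with the 16kHz mono PCM settings described above (the function name and file paths are illustrative):

```python
import subprocess

def ffmpeg_to_whisper_wav(src: str, dst: str) -> list:
    """Build an FFmpeg command that converts any supported audio file
    to 16 kHz, mono, 16-bit PCM WAV -- the input Whisper expects."""
    return [
        "ffmpeg", "-y",       # overwrite the output file if it exists
        "-i", src,            # input file (.mp3, .m4a, .flac, .ogg, .aac, ...)
        "-ar", "16000",       # resample to a 16 kHz sample rate
        "-ac", "1",           # downmix to a single mono channel
        "-c:a", "pcm_s16le",  # uncompressed 16-bit PCM audio
        dst,
    ]

# To execute (requires ffmpeg on PATH):
# subprocess.run(ffmpeg_to_whisper_wav("talk.mp3", "talk.wav"), check=True)
```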
## 🔄 RAG Query Flow
1. **User Query** → Embedding Model → Query Vector
2. **Vector Search** → Vector Database → Similar Transcript Chunks
3. **LLM Generation** → Final Answer with Context from Audio
The pipeline enables rich RAG applications that can answer questions about spoken content by leveraging the structured transcripts extracted from audio files.
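The vector-search step of this flow reduces to ranking stored chunk embeddings by cosine similarity against the query embedding. A stdlib-only sketch (the real pipeline delegates this to the LlamaStack vector database backend):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Indices of the k stored transcript chunks most similar to the query."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: cosine(query_vec, chunk_vecs[i]),
                   reverse=True)
    return order[:k]
```

The chunks at the returned indices are then passed to the LLM as context for answer generation.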
## 🚀 Getting Started
### Prerequisites
- [Data Science Project in OpenShift AI with a configured Workbench](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/getting_started)
- [Configuring a pipeline server](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/latest/html/working_with_data_science_pipelines/managing-data-science-pipelines_ds-pipelines#configuring-a-pipeline-server_ds-pipelines)
- A LlamaStack service with a vector database backend deployed (follow our [official deployment documentation](https://github.com/opendatahub-io/rag/blob/main/DEPLOYMENT.md))
- `ffmpeg` dependency (note: this is installed automatically by the pipeline components)
- GPU-enabled nodes are highly recommended for faster processing.
- You can still use CPU-only nodes, but the pipeline will take longer to execute.
**Pipeline Parameters**
- `base_url`: URL where audio files are hosted
- `audio_filenames`: Comma-separated list of audio files to process
- `num_workers`: Number of parallel workers (default: 1)
- `vector_db_id`: ID of the vector database to store embeddings
- `service_url`: URL of the LlamaStack service
- `embed_model_id`: Embedding model to use (default: `ibm-granite/granite-embedding-125m-english`)
- `max_tokens`: Maximum tokens per chunk (default: 512)
- `use_gpu`: Whether to use GPU for processing (default: true)
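Assuming the pipeline has been compiled to YAML, a run with these parameters could also be submitted from Python with the KFP v2 SDK. All URLs and file names below are placeholders:

```python
# Parameter values for one run; every URL and filename here is a placeholder.
arguments = {
    "base_url": "https://example.com/audio",          # where the audio files are hosted
    "audio_filenames": "interview.mp3,lecture.wav",   # comma-separated list
    "num_workers": 1,
    "vector_db_id": "asr-vector-db",
    "service_url": "http://llamastack-service:8321",  # LlamaStack service URL
    "embed_model_id": "ibm-granite/granite-embedding-125m-english",
    "max_tokens": 512,
    "use_gpu": True,
}

# Submitting the compiled YAML (requires `pip install kfp` and network access
# to your pipeline server route):
#
#   from kfp import Client
#   client = Client(host="https://<your-pipeline-route>")
#   client.create_run_from_pipeline_package("asr_pipeline.yaml", arguments=arguments)
```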
- Import the compiled YAML into the pipeline server in your Data Science project in OpenShift AI
- [Running a data science pipeline generated from Python code](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/openshift_ai_tutorial_-_fraud_detection_example/implementing-pipelines#running-a-pipeline-generated-from-python-code)
- Configure the pipeline parameters as needed
### Query RAG Agent in your Workbench within a Data Science project on OpenShift AI
1. Open your Workbench
2. Clone the rag repo and use the main branch
- Use this link `https://github.com/opendatahub-io/rag.git` for cloning the repo
- [Collaborating on Jupyter notebooks by using Git](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/working_with_connected_applications/using_basic_workbenches#collaborating-on-jupyter-notebooks-by-using-git_connected-apps)
3. Install dependencies for Jupyter Notebook with RAG Agent
```
cd demos/kfp/docling/asr-conversion/rag-agent
pip3 install -r requirements.txt
```
4. Follow the instructions in the corresponding RAG Jupyter Notebook `asr_rag_agent.ipynb` to query the content ingested by the pipeline.