Commit 0070d3c (parent b1235de)

docs: improved the README with more clarity

1 file changed: 40 additions, 79 deletions
# Kubeflow Docling ASR Conversion Pipeline for RAG

This document explains the **Kubeflow Docling ASR (Automatic Speech Recognition) Conversion Pipeline**, a Kubeflow pipeline that processes audio files using Automatic Speech Recognition (ASR) with Docling to extract transcripts and generate embeddings for Retrieval-Augmented Generation (RAG) applications. The pipeline supports execution on both GPU and CPU-only nodes.

## Pipeline Overview

The pipeline transforms audio files into searchable vector embeddings through the following stages:

```mermaid
graph TD
A[Register Milvus Vector DB] --> B[Import audio files from AWS S3 bucket storage]
B --> C[Create audio splits based on input file format for parallel processing]
C --> D[Install FFmpeg for converting audio files to WAV format]
D --> E[Convert audio files via FFmpeg to the WAV format the Whisper Turbo ASR model can process]
E --> F[Convert WAV files using Docling ASR via Whisper Turbo]
F --> G[Chunk each created Docling Document and extract raw text chunks]
G --> H[Generate embeddings for the raw text chunks using a Sentence Transformer with the embedding model]
H --> I[Insert chunks with text content, embeddings and metadata into Milvus DB]
I --> J[Ready for RAG Queries]
```

## Pipeline Components

### 1. **Vector Database Registration** (`register_vector_db`)

- **Purpose**: Sets up the vector database with the proper configuration.

### 2. **Audio Import** (`import_audio_files`)

- **Purpose**: Downloads audio files from remote URLs.

### 3. **Audio Splitting** (`create_audio_splits`)

- **Purpose**: Distributes audio files across multiple parallel workers for faster processing.

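Conceptually, the split step behaves like the round-robin sketch below. This is an illustrative stand-in, not the component's actual code; the function name merely mirrors the component name.

```python
# Illustrative sketch (hypothetical helper, not the pipeline's real code):
# distribute a list of audio files across num_workers splits so each
# parallel worker receives a near-equal share.
def create_audio_splits(audio_filenames: list[str], num_workers: int) -> list[list[str]]:
    """Round-robin the files across the requested number of workers."""
    splits = [audio_filenames[i::num_workers] for i in range(num_workers)]
    # Drop empty splits when there are more workers than files.
    return [s for s in splits if s]

# Example: five files shared between two workers
print(create_audio_splits(["a.mp3", "b.wav", "c.m4a", "d.flac", "e.ogg"], 2))
# → [['a.mp3', 'c.m4a', 'e.ogg'], ['b.wav', 'd.flac']]
```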
### 4. **ASR Conversion and Embedding Generation** (`docling_convert_and_ingest_audio`)

- **Purpose**: Main processing component that transcribes audio, chunks the text, and generates vector embeddings.

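The chunking stage can be pictured with this minimal sketch. It assumes plain whitespace tokenization, whereas the pipeline chunks Docling documents with a real tokenizer governed by the `max_tokens` parameter.

```python
# Minimal chunking sketch, assuming whitespace tokenization (hypothetical
# helper; the pipeline itself chunks Docling documents with a real tokenizer).
def chunk_text(text: str, max_tokens: int = 512) -> list[str]:
    """Split a transcript into chunks of at most max_tokens tokens each."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

print(chunk_text("one two three four five six seven", max_tokens=3))
# → ['one two three', 'four five six', 'seven']
```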
## Supported Audio Formats

- `.wav`
- `.m4a`
- `.mp3`
- `.flac`
- `.ogg`
- `.aac`

In fact, the Whisper model works exceptionally well with **WAV files**; it is the ideal format to use.

## Why WAV is the Best Choice

- **Uncompressed Data**: WAV files contain raw, uncompressed audio data (PCM), which is exactly what the Whisper model needs to analyze the sound waves and perform transcription.

- **Standardization**: You can easily save a WAV file with the precise specifications that Whisper was trained on: a **16kHz sample rate** and a **single mono channel**. This consistency leads to the highest accuracy.

- **No Decoding Needed**: When the model receives a properly formatted WAV file, it can process the audio directly, without needing external tools like FFmpeg to decode it first.

In short, providing Whisper with a 16kHz mono WAV file gives it exactly the type of data it was designed to read, which ensures the most reliable and accurate results.
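As a concrete illustration, converting an arbitrary input into that 16kHz mono PCM layout comes down to an FFmpeg call like the one this sketch constructs. The filenames are placeholders, and the sketch only builds the command; the pipeline installs and invokes FFmpeg for you.

```python
# Illustrative sketch: build the FFmpeg command that converts an input file
# into the 16 kHz, mono, 16-bit PCM WAV layout Whisper expects.
# (Hypothetical helper and placeholder filenames; command construction only.)
def wav_convert_cmd(src: str, dst: str) -> list[str]:
    return [
        "ffmpeg",
        "-i", src,            # input file in any supported format
        "-ar", "16000",       # resample to a 16 kHz sample rate
        "-ac", "1",           # downmix to a single mono channel
        "-c:a", "pcm_s16le",  # encode as raw 16-bit PCM audio
        dst,
    ]

print(" ".join(wav_convert_cmd("talk.mp3", "talk.wav")))
# → ffmpeg -i talk.mp3 -ar 16000 -ac 1 -c:a pcm_s16le talk.wav
```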

## 🔄 RAG Query Flow

1. **User Query** → Embedding Model → Query Vector
2. **Vector Search** → Vector Database → Similar Transcript Chunks
3. **Context Assembly** → Markdown Transcript Content + Timestamps
4. **LLM Generation** → Final Answer with Context from Audio

The pipeline enables rich RAG applications that can answer questions about spoken content by leveraging the structured transcripts extracted from audio files.
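At its core, step 2 is a nearest-neighbour search over the stored embeddings. The toy sketch below shows the idea with hand-made 2-d vectors standing in for real model outputs; in the actual pipeline this search is delegated to the vector database.

```python
import math

# Toy sketch of the vector-search step (the real pipeline delegates this
# to the vector database; vectors here are hand-made stand-ins).
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunks: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k transcript chunks most similar to the query vector."""
    return sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]), reverse=True)[:k]

store = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [0.7, 0.7]}
print(top_k([1.0, 0.1], store, k=2))  # → ['chunk-a', 'chunk-c']
```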

## 🚀 Getting Started

### Prerequisites

- [Data Science Project in OpenShift AI with a configured Workbench](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/getting_started)
- [Configuring a pipeline server](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/latest/html/working_with_data_science_pipelines/managing-data-science-pipelines_ds-pipelines#configuring-a-pipeline-server_ds-pipelines)
- A LlamaStack service with a vector database backend deployed (follow our [official deployment documentation](https://github.com/opendatahub-io/rag/blob/main/DEPLOYMENT.md))
- `ffmpeg` dependency (note: this is installed automatically by the pipeline components)
- GPU-enabled nodes are highly recommended for faster processing
  - You can still use CPU-only nodes, but the pipeline will take longer to run

**Pipeline Parameters**

- `base_url`: URL where audio files are hosted
- `audio_filenames`: Comma-separated list of audio files to process
- `num_workers`: Number of parallel workers (default: 1)
- `vector_db_id`: ID of the vector database to store embeddings
- `service_url`: URL of the LlamaStack service
- `embed_model_id`: Embedding model to use (default: `ibm-granite/granite-embedding-125m-english`)
- `max_tokens`: Maximum tokens per chunk (default: 512)
- `use_gpu`: Whether to use GPU for processing (default: true)

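Put together, a run configuration might look like the sketch below. All values are placeholders except the documented defaults; the parameter names mirror the list above.

```python
# Hypothetical run configuration for the pipeline (placeholder URLs and
# filenames; defaults taken from the parameter list above).
params = {
    "base_url": "https://example.com/audio",         # placeholder host
    "audio_filenames": "interview.mp3,keynote.wav",  # comma-separated list
    "num_workers": 1,
    "vector_db_id": "my-vector-db",
    "service_url": "http://llamastack.example:8321", # placeholder service URL
    "embed_model_id": "ibm-granite/granite-embedding-125m-english",
    "max_tokens": 512,
    "use_gpu": True,
}

# The comma-separated filenames expand to one entry per audio file:
print([name.strip() for name in params["audio_filenames"].split(",")])
# → ['interview.mp3', 'keynote.wav']
```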
### Creating the Pipeline for running on GPU node

```
# Install dependencies for pipeline
cd demos/kfp/docling/asr-conversion
pip3 install -r requirements.txt
# …
python3 docling_asr_convert_pipeline.py
```


### Creating the Pipeline for running on CPU only

```
# Install dependencies for pipeline
cd demos/kfp/docling/asr-conversion
pip3 install -r requirements.txt
# …
python3 docling_asr_convert_pipeline.py
```

### Import Kubeflow pipeline to OpenShift AI

- Import the compiled YAML into the pipeline server in your Data Science project in OpenShift AI
- [Running a data science pipeline generated from Python code](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/openshift_ai_tutorial_-_fraud_detection_example/implementing-pipelines#running-a-pipeline-generated-from-python-code)
- Configure the pipeline parameters as needed

### Query RAG Agent in your Workbench within a Data Science project on OpenShift AI

1. Open your Workbench
2. Clone the rag repo and use the main branch
   - Use this link `https://github.com/opendatahub-io/rag.git` for cloning the repo
   - [Collaborating on Jupyter notebooks by using Git](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/working_with_connected_applications/using_basic_workbenches#collaborating-on-jupyter-notebooks-by-using-git_connected-apps)
3. Install dependencies for the Jupyter Notebook with the RAG Agent
```
cd demos/kfp/docling/asr-conversion/rag-agent
pip3 install -r requirements.txt
```
4. Follow the instructions in the corresponding RAG Jupyter Notebook `asr_rag_agent.ipynb` to query the content ingested by the pipeline.
