
Commit 38256fa

Merge pull request #27 from r3v5/docling-asr-pipeline
feat: create KFP ASR (Automatic Speech Recognition) conversion demo pipeline using Docling, FFmpeg, Whisper Turbo and LLamaStack
2 parents 2ed17c2 + e344aef commit 38256fa

File tree

6 files changed: +3732 −0 lines changed
Lines changed: 164 additions & 0 deletions
# Kubeflow Docling ASR Conversion Pipeline for RAG

This document explains the **Kubeflow Docling ASR (Automatic Speech Recognition) Conversion Pipeline**, a Kubeflow pipeline that uses Docling to transcribe audio files and generate embeddings for Retrieval-Augmented Generation (RAG) applications. The pipeline runs on both GPU and CPU-only nodes.
## Pipeline Overview

The pipeline transforms audio files into searchable vector embeddings through the following stages:

```mermaid
graph TD
    A[Register Milvus vector DB] --> B[Import audio files from AWS S3 bucket storage]
    B --> C[Split audio files for parallel processing]
    C --> D[Install FFmpeg for audio conversion]
    D --> E[Convert audio files via FFmpeg to the WAV format that the Whisper Turbo ASR model can process]
    E --> F[Convert WAV files to Docling documents using Docling ASR with Whisper Turbo]
    F --> G[Chunk each Docling document and extract raw text chunks]
    G --> H[Generate embeddings from text chunks using a Sentence Transformer embedding model]
    H --> I[Insert chunks with text content, embeddings, and metadata into Milvus]
    I --> J[Ready for RAG queries]
```
## Pipeline Components

### 1. **Vector Database Registration** (`register_vector_db`)
- **Purpose**: Sets up the vector database with the proper configuration.

### 2. **Audio Import** (`import_audio_files`)
- **Purpose**: Downloads audio files from remote URLs.

### 3. **Audio Splitting** (`create_audio_splits`)
- **Purpose**: Distributes audio files across multiple parallel workers for faster processing.

### 4. **ASR Conversion and Embedding Generation** (`docling_convert_and_ingest_audio`)
- **Purpose**: Main processing component that transcribes audio, chunks the text, and generates vector embeddings.
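To make the flow concrete, here is a minimal sketch of how these components might be wired together in KFP v2. It assumes the four components above are defined as `@dsl.component` functions in `docling_asr_convert_pipeline.py`; the parameter names here are illustrative assumptions, not the actual signatures.

```python
# Hypothetical wiring sketch (KFP v2); component signatures are assumptions.
from kfp import dsl


@dsl.pipeline(name="docling-asr-conversion")
def docling_asr_convert_pipeline(
    base_url: str,
    audio_filenames: str,
    num_workers: int = 1,
    vector_db_id: str = "asr-vector-db",
    service_url: str = "http://llamastack:8321",
    embed_model_id: str = "ibm-granite/granite-embedding-125m-english",
    max_tokens: int = 512,
):
    # 1. Register the Milvus vector DB with the LlamaStack service.
    register = register_vector_db(
        service_url=service_url,
        vector_db_id=vector_db_id,
        embed_model_id=embed_model_id,
    )

    # 2. Download the audio files so downstream steps can read them.
    audio = import_audio_files(base_url=base_url, audio_filenames=audio_filenames)

    # 3. Split the file list into num_workers groups for parallel transcription.
    splits = create_audio_splits(audio_dir=audio.output, num_workers=num_workers)

    # 4. Fan out: each worker converts to WAV, transcribes, chunks, embeds,
    #    and inserts its share of the files into Milvus.
    with dsl.ParallelFor(splits.output) as split:
        docling_convert_and_ingest_audio(
            audio_dir=audio.output,
            split=split,
            service_url=service_url,
            vector_db_id=vector_db_id,
            embed_model_id=embed_model_id,
            max_tokens=max_tokens,
        ).after(register)
```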
## Supported Audio Formats

- `.wav`
- `.m4a`
- `.mp3`
- `.flac`
- `.ogg`
- `.aac`

In practice, the Whisper model works exceptionally well with **WAV files**; it is the ideal format to use.
## Why WAV is the Best Choice

- **Uncompressed data**: WAV files contain raw, uncompressed audio data (PCM), which is exactly what the Whisper model needs to analyze the sound waves and perform transcription.

- **Standardization**: You can easily save a WAV file with the precise specifications Whisper was trained on: a **16 kHz sample rate** and a **single mono channel**. This consistency leads to the highest accuracy.

- **No decoding needed**: When the model receives a properly formatted WAV file, it can process the audio directly, without needing external tools like FFmpeg to decode it first.

In short, providing Whisper with a 16 kHz mono WAV file gives it exactly the type of data it was designed to read, which ensures the most reliable and accurate results.
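For reference, the re-encoding step the pipeline performs can be reproduced by hand. A minimal sketch, assuming a hypothetical `input.mp3`:

```python
import subprocess

# Re-encode the input to 16 kHz, mono, 16-bit PCM WAV, which is the
# layout Whisper expects (file names here are placeholders).
subprocess.run(
    ["ffmpeg", "-i", "input.mp3",
     "-ar", "16000",        # 16 kHz sample rate
     "-ac", "1",            # single mono channel
     "-c:a", "pcm_s16le",   # uncompressed 16-bit PCM
     "output.wav"],
    check=True,
)
```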
## 🔄 RAG Query Flow

1. **User Query** → Embedding Model → Query Vector
2. **Vector Search** → Vector Database → Similar Transcript Chunks
3. **Context Assembly** → Markdown Transcript Content + Timestamps
4. **LLM Generation** → Final Answer with Context from Audio

The pipeline enables rich RAG applications that can answer questions about spoken content by leveraging the structured transcripts extracted from audio files.
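As an illustration, steps 1 and 2 of this flow might look like the sketch below using the LlamaStack Python client. The service URL and vector DB ID are placeholders, and the exact client call may differ between LlamaStack versions; the Jupyter notebook referenced at the end of this document is the authoritative example.

```python
from llama_stack_client import LlamaStackClient

# Placeholder URL; point this at your deployed LlamaStack service.
client = LlamaStackClient(base_url="http://localhost:8321")

# The RAG tool embeds the query and retrieves similar transcript chunks
# (with their metadata, e.g. timestamps) from the vector database.
result = client.tool_runtime.rag_tool.query(
    content="What topics were discussed in the meeting?",
    vector_db_ids=["asr-vector-db"],  # placeholder vector DB ID
)
print(result.content)
```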
## 🚀 Getting Started

### Prerequisites

- [Data Science Project in OpenShift AI with a configured Workbench](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/getting_started)
- [Configuring a pipeline server](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/latest/html/working_with_data_science_pipelines/managing-data-science-pipelines_ds-pipelines#configuring-a-pipeline-server_ds-pipelines)
- A LlamaStack service with a vector database backend deployed (follow our [official deployment documentation](https://github.com/opendatahub-io/rag/blob/main/DEPLOYMENT.md))
- `ffmpeg` dependency (installed automatically by the pipeline components)
- GPU-enabled nodes are highly recommended for faster processing. CPU-only nodes work as well, but the pipeline takes considerably longer to run.
**Pipeline Parameters**

- `base_url`: URL where the audio files are hosted
- `audio_filenames`: Comma-separated list of audio files to process
- `num_workers`: Number of parallel workers (default: `1`)
- `vector_db_id`: ID of the vector database in which to store embeddings
- `service_url`: URL of the LlamaStack service
- `embed_model_id`: Embedding model to use (default: `ibm-granite/granite-embedding-125m-english`)
- `max_tokens`: Maximum tokens per chunk (default: `512`)
- `use_gpu`: Whether to use a GPU for processing (default: `true`)
- `clean_vector_db`: Whether to clear the vector database before ingestion (default: `false`)
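For example, a run over two hypothetical files hosted at `https://example.com/audio` might use values like these (all of them illustrative):

```
base_url: https://example.com/audio
audio_filenames: interview.mp3,team-meeting.m4a
num_workers: 2
vector_db_id: asr-vector-db
service_url: http://llamastack:8321
embed_model_id: ibm-granite/granite-embedding-125m-english
max_tokens: 512
use_gpu: true
clean_vector_db: false
```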
### Creating the Pipeline for Running on a GPU Node

```
# Install dependencies for the pipeline
cd demos/kfp/docling/asr-conversion
pip3 install -r requirements.txt

# Compile the Kubeflow pipeline for GPU execution, or use the existing compiled pipeline
# (set use_gpu = True in docling_convert_pipeline() in docling_asr_convert_pipeline.py)
python3 docling_asr_convert_pipeline.py
```
### Creating the Pipeline for Running on CPU Only

```
# Install dependencies for the pipeline
cd demos/kfp/docling/asr-conversion
pip3 install -r requirements.txt

# Compile the Kubeflow pipeline for CPU-only execution, or use the existing compiled pipeline
# (set use_gpu = False in docling_convert_pipeline() in docling_asr_convert_pipeline.py)
python3 docling_asr_convert_pipeline.py
```
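In both cases, running the script writes a compiled YAML pipeline definition that you can import in the next step. Presumably it calls the standard KFP compiler along these lines (a sketch; the output file name is an assumption):

```python
from kfp import compiler

# Compile the @dsl.pipeline-decorated function into a portable YAML
# definition for the OpenShift AI pipeline server (file name assumed).
compiler.Compiler().compile(
    pipeline_func=docling_asr_convert_pipeline,
    package_path="docling_asr_convert_pipeline.yaml",
)
```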
### Import the Kubeflow Pipeline to OpenShift AI

- Import the compiled YAML into the pipeline server of your Data Science project in OpenShift AI
- [Running a data science pipeline generated from Python code](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/openshift_ai_tutorial_-_fraud_detection_example/implementing-pipelines#running-a-pipeline-generated-from-python-code)
- Configure the pipeline parameters as needed
### Query the RAG Agent in Your Workbench Within a Data Science Project on OpenShift AI

1. Open your Workbench.
2. Clone the rag repo and use the `main` branch:
   - Use this link `https://github.com/opendatahub-io/rag.git` for cloning the repo
   - [Collaborating on Jupyter notebooks by using Git](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/working_with_connected_applications/using_basic_workbenches#collaborating-on-jupyter-notebooks-by-using-git_connected-apps)
3. Install the dependencies for the Jupyter notebook with the RAG agent:
   ```
   cd demos/kfp/docling/asr-conversion/rag-agent
   pip3 install -r requirements.txt
   ```
4. Follow the instructions in the RAG Jupyter notebook `asr_rag_agent.ipynb` to query the content ingested by the pipeline.
