
Commit b1235de

refactor: enhanced README, renamed variable in pipeline code for better clarity and modified prompt in notebook
1 parent 9d7fd22 commit b1235de

File tree: 4 files changed (+192, -145 lines)
Lines changed: 89 additions & 15 deletions
@@ -1,92 +1,150 @@
# Kubeflow Docling ASR Conversion Pipeline for RAG

This document explains the **Kubeflow Docling ASR (Automatic Speech Recognition) Conversion Pipeline** - a Kubeflow pipeline that transcribes audio files with Docling ASR and generates embeddings for Retrieval-Augmented Generation (RAG) applications. The pipeline supports execution on both GPU and CPU-only nodes.

## Pipeline Overview

The pipeline transforms audio files into searchable vector embeddings through the following stages:

```mermaid
graph TD
    A[Register Vector DB] --> B[Import audio files]
    B --> C[Create audio splits]
    C --> D[Install FFmpeg Dependency]
    D --> E[Convert audio files to WAV format for the Whisper Turbo ASR model]
    E --> F[Conversion using Docling ASR via Whisper Turbo]
    F --> G[Text Chunking]
    G --> H[Generate Embeddings using a SentenceTransformer embedding model]
    H --> I[Store in Vector Database]
    I --> J[Ready for RAG Queries]
```

## Pipeline Components

### 1. **Vector Database Registration** (`register_vector_db`)
- **Purpose**: Sets up the vector database in the LlamaStack service with the proper configuration (a registration sketch follows).
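
Registration typically amounts to a single call against the LlamaStack API. A minimal sketch, assuming the `llama-stack-client` package; the `provider_id`, embedding dimension, and URL below are illustrative assumptions, not values taken from this repo:

```python
# Hypothetical registration sketch -- not the component's actual code.
# The vector_db_id, embedding model, and service URL mirror the pipeline
# parameters documented below; provider_id and dimension are assumptions.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://llamastack-service:8321")
client.vector_dbs.register(
    vector_db_id="asr-rag-db",
    embedding_model="ibm-granite/granite-embedding-125m-english",
    embedding_dimension=768,  # must match the embedding model's output size
    provider_id="milvus",     # whichever vector_io provider is deployed
)
```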

### 2. **Audio Import** (`import_audio_files`)
- **Purpose**: Downloads audio files from remote URLs.

### 3. **Audio Splitting** (`create_audio_splits`)
- **Purpose**: Distributes audio files across multiple parallel workers for faster processing (a minimal split helper is sketched below).
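
The split logic itself can be as simple as striding over the file list. A tiny illustration, with a hypothetical helper name rather than the component's actual code:

```python
# Illustrative only: round-robin assignment of audio files to workers.
def create_splits(audio_files: list[str], num_workers: int) -> list[list[str]]:
    """Give each parallel worker a similar share of the files."""
    return [audio_files[i::num_workers] for i in range(num_workers)]

print(create_splits(["a.wav", "b.mp3", "c.flac"], num_workers=2))
# [['a.wav', 'c.flac'], ['b.mp3']]
```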

### 4. **ASR Conversion and Embedding Generation** (`docling_convert_and_ingest_audio`)
- **Purpose**: Main processing component that transcribes audio, chunks the text, and generates vector embeddings.

## Supported Audio Formats
- `.wav`
- `.m4a`
- `.mp3`
- `.flac`
- `.ogg`
- `.aac`

The Whisper model works exceptionally well with **WAV files**; it is the ideal format to use.

## Why WAV is the Best Choice
- **Uncompressed Data**: WAV files contain raw, uncompressed audio data (PCM), which is exactly what the Whisper model needs to analyze the sound waves and perform transcription.

- **Standardization**: You can easily save a WAV file with the precise specifications that Whisper was trained on: a **16kHz sample rate** and a **single mono channel**. This consistency leads to the highest accuracy.

- **No Decoding Needed**: When the model receives a properly formatted WAV file, it can process the audio directly, without needing an external tool like FFmpeg to decode it first.

In short, providing Whisper with a 16kHz mono WAV file gives it exactly the type of data it was designed to read, which ensures the most reliable and accurate results. This is why the pipeline converts the other supported formats to WAV before transcription; a conversion of this kind is sketched below.
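
For reference, this is roughly what such a conversion looks like with the FFmpeg CLI that the pipeline installs; the helper below is a sketch, not the pipeline's own code:

```python
# Sketch: transcode any supported input to the 16 kHz mono WAV layout
# that Whisper expects, by shelling out to the ffmpeg CLI.
import pathlib
import subprocess

def to_whisper_wav(src: pathlib.Path) -> pathlib.Path:
    """Transcode `src` to a 16 kHz, single-channel PCM WAV file."""
    dst = src.with_name(src.stem + "_16k.wav")  # avoid clobbering .wav inputs
    subprocess.run(
        [
            "ffmpeg", "-y",    # overwrite the output file if it exists
            "-i", str(src),    # input in any format ffmpeg can decode
            "-ar", "16000",    # resample to a 16 kHz sample rate
            "-ac", "1",        # downmix to a single mono channel
            str(dst),
        ],
        check=True,
    )
    return dst
```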

## 🔄 RAG Query Flow
1. **User Query** → Embedding Model → Query Vector
2. **Vector Search** → Vector Database → Similar Transcript Chunks
3. **Context Assembly** → Markdown Transcript Content + Timestamps
4. **LLM Generation** → Final Answer with Context from Audio

The pipeline enables rich RAG applications that can answer questions about spoken content by leveraging the structured transcripts extracted from audio files.
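
Step 1 mirrors ingestion: the question is embedded with the same model that embedded the transcript chunks. A sketch assuming `sentence-transformers` (the vector search in step 2 depends on the deployed backend and is only indicated in a comment):

```python
# Query-side embedding (step 1 above). The encode() call matches how the
# pipeline embeds transcript chunks; the search itself is backend-specific.
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("ibm-granite/granite-embedding-125m-english")

def embed_query(question: str) -> list[float]:
    """Embed a user question with the same model used at ingestion time."""
    return embedding_model.encode([question], normalize_embeddings=True).tolist()[0]

query_vector = embed_query("What topics does the recording cover?")
# query_vector is then sent to the vector database to retrieve the most
# similar transcript chunks (step 2).
```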

## 🚀 Getting Started

### Prerequisites
- [Data Science Project in OpenShift AI with a configured Workbench](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/getting_started)
- [Configuring a pipeline server](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/latest/html/working_with_data_science_pipelines/managing-data-science-pipelines_ds-pipelines#configuring-a-pipeline-server_ds-pipelines)
- A LlamaStack service with a vector database backend deployed (follow our [official deployment documentation](https://github.com/opendatahub-io/rag/blob/main/DEPLOYMENT.md))
- `ffmpeg` dependency (installed automatically by the pipeline components)
- GPU-enabled nodes are highly recommended for faster processing.
- You can still use CPU-only nodes, but the pipeline will take longer to execute.

**Pipeline Parameters**
- `base_url`: URL where the audio files are hosted
- `audio_filenames`: Comma-separated list of audio files to process
- `num_workers`: Number of parallel workers (default: 1)
- `vector_db_id`: ID of the vector database to store embeddings in
- `service_url`: URL of the LlamaStack service
- `embed_model_id`: Embedding model to use (default: `ibm-granite/granite-embedding-125m-english`)
- `max_tokens`: Maximum tokens per chunk (default: 512)
- `use_gpu`: Whether to use GPU for processing (default: true)

### Creating the Pipeline for running on a GPU node
```
# Install dependencies for pipeline
cd demos/kfp/docling/asr-conversion
pip3 install -r requirements.txt

# Compile the Kubeflow pipeline for running on a GPU node or use the existing compiled pipeline
# set use_gpu = True in docling_convert_pipeline() in docling_asr_convert_pipeline.py
python3 docling_asr_convert_pipeline.py
```

### Creating the Pipeline for running on CPU only
```
# Install dependencies for pipeline
cd demos/kfp/docling/asr-conversion
pip3 install -r requirements.txt

# Compile the Kubeflow pipeline for running on CPU only or use the existing compiled pipeline
# set use_gpu = False in docling_convert_pipeline() in docling_asr_convert_pipeline.py
python3 docling_asr_convert_pipeline.py
```
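
Running `python3 docling_asr_convert_pipeline.py` compiles the pipeline to YAML. A sketch of the compile step it likely performs, using the standard KFP v2 compiler API; that the script is structured exactly this way is an assumption:

```python
# Sketch: compile the Kubeflow pipeline to YAML for import into OpenShift AI.
# The function and module names come from this README; the compiler call is
# the standard KFP v2 API.
from kfp import compiler
from docling_asr_convert_pipeline import docling_convert_pipeline

compiler.Compiler().compile(
    pipeline_func=docling_convert_pipeline,
    package_path="docling_asr_convert_pipeline_compiled.yaml",
)
```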

### Import the Kubeflow pipeline to OpenShift AI
- Import the compiled YAML into the pipeline server of your Data Science project in OpenShift AI
- [Running a data science pipeline generated from Python code](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/openshift_ai_tutorial_-_fraud_detection_example/implementing-pipelines#running-a-pipeline-generated-from-python-code)
- Configure the pipeline parameters as needed

117-
182+
183+
118184
### Query RAG Agent in your Workbench within a Data Science project on OpenShift AI
185+
119186
1. Open your Workbench
120187

188+
189+
121190
3. Clone the rag repo and use main branch
191+
122192
- Use this link `https://github.com/opendatahub-io/rag.git` for cloning the repo
193+
123194
- [Collaborating on Jupyter notebooks by using Git](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/working_with_connected_applications/using_basic_workbenches#collaborating-on-jupyter-notebooks-by-using-git_connected-apps)
124-
195+
125196
4. Install dependencies for Jupyter Notebook with RAG Agent
197+
126198
```
127199
cd demos/kfp/docling/asr-conversion/rag-agent
128200
pip3 install -r requirements.txt
129201
```
130202

131-
4. Follow the instructions in the corresponding RAG Jupyter Notebook `asr_rag_agent.ipynb` to query the content ingested by the pipeline.
203+
204+
205+
4. Follow the instructions in the corresponding RAG Jupyter Notebook `asr_rag_agent.ipynb` to query the content ingested by the pipeline.

demos/kfp/docling/asr-conversion/docling_asr_convert_pipeline.py

Lines changed: 2 additions & 2 deletions
@@ -306,7 +306,7 @@ def cleanup_temp_files(temp_files_to_cleanup: List[pathlib.Path]) -> None:
     # Return a Docling DocumentConverter configured for ASR with whisper_turbo model.
     def get_asr_converter() -> DocumentConverter:
         """Create a DocumentConverter configured for ASR with whisper_turbo model."""
-        whisper_turbo_llm = InlineAsrNativeWhisperOptions(
+        whisper_turbo_asr_model = InlineAsrNativeWhisperOptions(
             repo_id="turbo",
             inference_framework=InferenceAsrFramework.WHISPER,
             verbose=True,
@@ -318,7 +318,7 @@ def get_asr_converter() -> DocumentConverter:
         )

         pipeline_options = AsrPipelineOptions()
-        pipeline_options.asr_options = whisper_turbo_llm
+        pipeline_options.asr_options = whisper_turbo_asr_model

         converter = DocumentConverter(
             format_options={
demos/kfp/docling/asr-conversion/docling_asr_convert_pipeline_compiled.yaml

Lines changed: 28 additions & 26 deletions
@@ -463,21 +463,22 @@ deploymentSpec:
@@ -668,21 +669,22 @@ deploymentSpec:

Both hunks make the same change inside the component source that the compiled YAML embeds as an escaped string (the source appears twice in the file): `whisper_turbo_llm` is renamed to `whisper_turbo_asr_model` where it is defined and where it is assigned to `pipeline_options.asr_options`, matching the Python file above. The remaining churn is only the escaped lines re-wrapping around the longer name. The hunk context also carries the embedding helper functions, reconstructed below.
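
For readability, here are the embedding helpers carried in those hunks, un-escaped; whitespace is approximate, and the import lines are inferred from the names used rather than taken from the hunks:

```python
from typing import Tuple

from docling.chunking import HybridChunker            # import path assumed
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

# ---- Embedding helper functions (reconstructed from the YAML above) ----
def setup_chunker_and_embedder(
    embed_model_id: str, max_tokens: int
) -> Tuple[SentenceTransformer, HybridChunker]:
    tokenizer = AutoTokenizer.from_pretrained(embed_model_id)
    embedding_model = SentenceTransformer(embed_model_id)
    chunker = HybridChunker(
        tokenizer=tokenizer, max_tokens=max_tokens, merge_peers=True
    )
    return embedding_model, chunker

def embed_text(text: str, embedding_model: SentenceTransformer) -> list[float]:
    return embedding_model.encode([text], normalize_embeddings=True).tolist()[0]
```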
