# Kubeflow Docling ASR Conversion Pipeline for RAG

This document explains the **Kubeflow Docling ASR Conversion Pipeline**, a Kubeflow pipeline that transcribes audio files with Docling's Automatic Speech Recognition (ASR) support and generates embeddings for Retrieval-Augmented Generation (RAG) applications. The pipeline runs on both GPU and CPU-only nodes.

## Pipeline Overview
The pipeline transforms audio files into searchable vector embeddings through the following stages:

```mermaid
graph TD
    A[Register Milvus vector DB] --> B[Import audio files from AWS S3 bucket storage]
    B --> C[Split audio files into batches for parallel processing]
    C --> D[Install FFmpeg for audio conversion]
    D --> E[Convert audio files to WAV format that the Whisper Turbo ASR model can process]
    E --> F[Convert WAV files to Docling documents using Docling ASR via Whisper Turbo]
    F --> G[Chunk each Docling document and extract raw text chunks]
    G --> H[Generate embeddings for the text chunks with a Sentence Transformers embedding model]
    H --> I[Insert chunks with text content, embeddings, and metadata into Milvus]
    I --> J[Ready for RAG queries]
```

## Pipeline Components

### 1. **Vector Database Registration** (`register_vector_db`)
- **Purpose**: Sets up the vector database with the proper configuration.

### 2. **Audio Import** (`import_audio_files`)
- **Purpose**: Downloads audio files from remote URLs.

### 3. **Audio Splitting** (`create_audio_splits`)
- **Purpose**: Distributes audio files across multiple parallel workers for faster processing.

### 4. **ASR Conversion and Embedding Generation** (`docling_convert_and_ingest_audio`)
- **Purpose**: Main processing component that transcribes audio, chunks the text, and generates vector embeddings.
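The splitting step can be pictured as a round-robin partition of the input file list. This is a minimal sketch, not the component's actual code; `split_batches` is a hypothetical helper:

```python
def split_batches(filenames, num_workers):
    """Distribute files across workers round-robin, dropping empty splits."""
    splits = [filenames[i::num_workers] for i in range(num_workers)]
    return [s for s in splits if s]

# Example: five audio files shared between two workers
print(split_batches(["a.mp3", "b.wav", "c.m4a", "d.ogg", "e.flac"], 2))
# → [['a.mp3', 'c.m4a', 'e.flac'], ['b.wav', 'd.ogg']]
```

Each resulting batch is then handed to one instance of the conversion component, which is what makes the ASR stage parallelizable.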

## Supported Audio Formats

- `.wav`
- `.m4a`
- `.mp3`
- `.flac`
- `.ogg`
- `.aac`

Whisper works best with **WAV files**, so the pipeline converts all other formats to WAV before transcription.

## Why WAV is the Best Choice

- **Uncompressed Data**: WAV files contain raw, uncompressed audio data (PCM), which is exactly what the Whisper model needs to analyze the sound waves and perform transcription.

- **Standardization**: You can easily save a WAV file with the precise specifications that Whisper was trained on: a **16kHz sample rate** and a **single mono channel**. This consistency leads to the highest accuracy.

- **No Decoding Needed**: When the model receives a properly formatted WAV file, it can process the audio directly without needing an external tool such as FFmpeg to decode it first.

In short, a 16kHz mono WAV file gives Whisper exactly the kind of data it was designed to read, which ensures the most reliable and accurate results.
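The conversion the pipeline performs corresponds to an FFmpeg invocation like the one assembled below. This is a sketch under the assumptions stated in the comments; the exact flags used by the pipeline component may differ:

```python
def ffmpeg_wav_command(src: str, dst: str) -> list:
    """Build an FFmpeg argv that resamples any input to 16 kHz mono PCM WAV."""
    return [
        "ffmpeg",
        "-i", src,               # input in any supported format (.mp3, .m4a, ...)
        "-ar", "16000",          # 16 kHz sample rate, matching Whisper's training data
        "-ac", "1",              # single mono channel
        "-acodec", "pcm_s16le",  # uncompressed 16-bit little-endian PCM
        dst,
    ]

print(" ".join(ffmpeg_wav_command("talk.mp3", "talk.wav")))
```

Running the printed command on a machine with `ffmpeg` installed produces a WAV file in the exact shape Whisper expects.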

## 🔄 RAG Query Flow
1. **User Query** → Embedding Model → Query Vector
2. **Vector Search** → Vector Database → Similar Transcript Chunks
3. **Context Assembly** → Markdown Transcript Content + Timestamps
4. **LLM Generation** → Final Answer with Context from Audio

The pipeline enables rich RAG applications that can answer questions about spoken content by leveraging the structured transcripts extracted from audio files.
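Step 2 above boils down to nearest-neighbor search over embedding vectors. A minimal illustration with toy 3-dimensional vectors standing in for real model embeddings (the chunk texts and vectors here are invented for the example):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "index": transcript chunk -> embedding (real vectors come from the embedding model)
index = {
    "speaker introduces Milvus": [0.9, 0.1, 0.0],
    "discussion of WAV audio":   [0.1, 0.9, 0.1],
    "closing remarks":           [0.2, 0.2, 0.9],
}

query_vector = [0.85, 0.15, 0.05]  # embedding of the user query
best = max(index, key=lambda chunk: cosine(query_vector, index[chunk]))
print(best)  # → speaker introduces Milvus
```

Milvus performs the same similarity ranking at scale, with approximate-nearest-neighbor indexes instead of a brute-force loop.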

## 🚀 Getting Started
### Prerequisites

- [Data Science Project in OpenShift AI with a configured Workbench](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/getting_started)
- [Configuring a pipeline server](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/latest/html/working_with_data_science_pipelines/managing-data-science-pipelines_ds-pipelines#configuring-a-pipeline-server_ds-pipelines)
- A LlamaStack service with a vector database backend deployed (follow our [official deployment documentation](https://github.com/opendatahub-io/rag/blob/main/DEPLOYMENT.md))
- The `ffmpeg` dependency (installed automatically by the pipeline components)
- GPU-enabled nodes are highly recommended for faster processing; CPU-only nodes also work, but the pipeline takes considerably longer to execute

**Pipeline Parameters**
- `base_url`: URL where the audio files are hosted
- `audio_filenames`: Comma-separated list of audio files to process
- `num_workers`: Number of parallel workers (default: `1`)
- `vector_db_id`: ID of the vector database in which to store the embeddings
- `service_url`: URL of the LlamaStack service
- `embed_model_id`: Embedding model to use (default: `ibm-granite/granite-embedding-125m-english`)
- `max_tokens`: Maximum tokens per chunk (default: `512`)
- `use_gpu`: Whether to use a GPU for processing (default: `true`)
- `clean_vector_db`: Whether to clear the vector database before ingesting (default: `false`)
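The effect of `max_tokens` can be illustrated with a naive whitespace-token chunker. This is only a sketch: the pipeline's real chunking is done by Docling against the embedding model's tokenizer, not by splitting on spaces:

```python
def chunk_text(text, max_tokens):
    """Greedily pack whitespace-separated tokens into chunks of at most max_tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

transcript = "one two three four five six seven"
print(chunk_text(transcript, 3))
# → ['one two three', 'four five six', 'seven']
```

Smaller chunks give more precise retrieval hits; larger chunks carry more surrounding context into the LLM prompt.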

### Creating the pipeline for running on a GPU node

```
# Install dependencies for the pipeline
cd demos/kfp/docling/asr-conversion
pip3 install -r requirements.txt

# Compile the Kubeflow pipeline for GPU execution (or use the existing compiled pipeline):
# set use_gpu = True in docling_convert_pipeline() in docling_asr_convert_pipeline.py
python3 docling_asr_convert_pipeline.py
```

### Creating the pipeline for running on CPU only
```
# Install dependencies for the pipeline
cd demos/kfp/docling/asr-conversion
pip3 install -r requirements.txt

# Compile the Kubeflow pipeline for CPU-only execution (or use the existing compiled pipeline):
# set use_gpu = False in docling_convert_pipeline() in docling_asr_convert_pipeline.py
python3 docling_asr_convert_pipeline.py
```

### Import the Kubeflow pipeline to OpenShift AI
- Import the compiled YAML into the pipeline server of your Data Science project in OpenShift AI
- [Running a data science pipeline generated from Python code](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/openshift_ai_tutorial_-_fraud_detection_example/implementing-pipelines#running-a-pipeline-generated-from-python-code)
- Configure the pipeline parameters as needed

### Query the RAG agent in your Workbench within a Data Science project on OpenShift AI
1. Open your Workbench
2. Clone the `rag` repository and use the `main` branch
   - Use this link `https://github.com/opendatahub-io/rag.git` for cloning the repo
   - [Collaborating on Jupyter notebooks by using Git](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/working_with_connected_applications/using_basic_workbenches#collaborating-on-jupyter-notebooks-by-using-git_connected-apps)
3. Install the dependencies for the Jupyter notebook with the RAG agent
```
cd demos/kfp/docling/asr-conversion/rag-agent
pip3 install -r requirements.txt
```
4. Follow the instructions in the corresponding RAG Jupyter Notebook `asr_rag_agent.ipynb` to query the content ingested by the pipeline.