Commit 0070d3c (parent b1235de)

docs: improved the README with more clarity

1 file changed: 40 additions, 79 deletions
# Kubeflow Docling ASR Conversion Pipeline for RAG

This document explains the **Kubeflow Docling ASR (Automatic Speech Recognition) Conversion Pipeline**, a Kubeflow pipeline that processes audio files using Automatic Speech Recognition (ASR) with Docling to extract transcripts and generate embeddings for Retrieval-Augmented Generation (RAG) applications. The pipeline supports execution on both GPU and CPU-only nodes.

## Pipeline Overview

The pipeline transforms audio files into searchable vector embeddings through the following stages:

```mermaid
graph TD
A[Register Milvus Vector DB] --> B[Import audio files from AWS S3 bucket storage]
B --> C[Create audio splits based on input file format for parallel processing]
C --> D[Install FFmpeg for converting audio files to WAV format]
D --> E[Convert audio files via FFmpeg to the WAV format the Whisper Turbo ASR model can process]
E --> F[Convert WAV files using Docling ASR via Whisper Turbo]
F --> G[Chunk each created Docling Document and extract raw text chunks]
G --> H[Generate embeddings for the raw text chunks using a Sentence Transformer with the embedding model]
H --> I[Insert chunks with text content, embeddings and metadata into Milvus DB]
I --> J[Ready for RAG Queries]
```

## Pipeline Components

### 1. **Vector Database Registration** (`register_vector_db`)

- **Purpose**: Sets up the vector database with the proper configuration.

### 2. **Audio Import** (`import_audio_files`)

- **Purpose**: Downloads audio files from remote URLs.

### 3. **Audio Splitting** (`create_audio_splits`)

- **Purpose**: Distributes audio files across multiple parallel workers for faster processing.

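Conceptually, the split step behaves like the round-robin sketch below. This is an illustrative stand-in, not the component's actual code; the function name merely mirrors the component name.

```python
# Illustrative sketch (hypothetical helper, not the pipeline's real code):
# distribute a list of audio files across num_workers splits so each
# parallel worker receives a near-equal share.
def create_audio_splits(audio_filenames: list[str], num_workers: int) -> list[list[str]]:
    """Round-robin the files across the requested number of workers."""
    splits = [audio_filenames[i::num_workers] for i in range(num_workers)]
    # Drop empty splits when there are more workers than files.
    return [s for s in splits if s]

# Example: five files shared between two workers
print(create_audio_splits(["a.mp3", "b.wav", "c.m4a", "d.flac", "e.ogg"], 2))
# → [['a.mp3', 'c.m4a', 'e.ogg'], ['b.wav', 'd.flac']]
```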
### 4. **ASR Conversion and Embedding Generation** (`docling_convert_and_ingest_audio`)

- **Purpose**: Main processing component that transcribes audio, chunks the text, and generates vector embeddings.

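The chunking stage can be pictured with this minimal sketch. It assumes plain whitespace tokenization, whereas the pipeline chunks Docling documents with a real tokenizer governed by the `max_tokens` parameter.

```python
# Minimal chunking sketch, assuming whitespace tokenization (hypothetical
# helper; the pipeline itself chunks Docling documents with a real tokenizer).
def chunk_text(text: str, max_tokens: int = 512) -> list[str]:
    """Split a transcript into chunks of at most max_tokens tokens each."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

print(chunk_text("one two three four five six seven", max_tokens=3))
# → ['one two three', 'four five six', 'seven']
```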
## Supported Audio Formats

- `.wav`
- `.m4a`
- `.mp3`
- `.flac`
- `.ogg`
- `.aac`

In fact, the Whisper model works exceptionally well with **WAV files**; it is the ideal format to use.

## Why WAV is the Best Choice

- **Uncompressed Data**: WAV files contain raw, uncompressed audio data (PCM), which is exactly what the Whisper model needs to analyze the sound waves and perform transcription.

- **Standardization**: You can easily save a WAV file with the precise specifications that Whisper was trained on: a **16kHz sample rate** and a **single mono channel**. This consistency leads to the highest accuracy.

- **No Decoding Needed**: When the model receives a properly formatted WAV file, it can process the audio directly, without needing external tools like FFmpeg to decode it first.

In short, providing Whisper with a 16kHz mono WAV file gives it exactly the type of data it was designed to read, which ensures the most reliable and accurate results.
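As a concrete illustration, converting an arbitrary input into that 16kHz mono PCM layout comes down to an FFmpeg call like the one this sketch constructs. The filenames are placeholders, and the sketch only builds the command; the pipeline installs and invokes FFmpeg for you.

```python
# Illustrative sketch: build the FFmpeg command that converts an input file
# into the 16 kHz, mono, 16-bit PCM WAV layout Whisper expects.
# (Hypothetical helper and placeholder filenames; command construction only.)
def wav_convert_cmd(src: str, dst: str) -> list[str]:
    return [
        "ffmpeg",
        "-i", src,            # input file in any supported format
        "-ar", "16000",       # resample to a 16 kHz sample rate
        "-ac", "1",           # downmix to a single mono channel
        "-c:a", "pcm_s16le",  # encode as raw 16-bit PCM audio
        dst,
    ]

print(" ".join(wav_convert_cmd("talk.mp3", "talk.wav")))
# → ffmpeg -i talk.mp3 -ar 16000 -ac 1 -c:a pcm_s16le talk.wav
```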

## 🔄 RAG Query Flow

1. **User Query** → Embedding Model → Query Vector
2. **Vector Search** → Vector Database → Similar Transcript Chunks
3. **Context Assembly** → Markdown Transcript Content + Timestamps
4. **LLM Generation** → Final Answer with Context from Audio

The pipeline enables rich RAG applications that can answer questions about spoken content by leveraging the structured transcripts extracted from audio files.
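At its core, step 2 is a nearest-neighbour search over the stored embeddings. The toy sketch below shows the idea with hand-made 2-d vectors standing in for real model outputs; in the actual pipeline this search is delegated to the vector database.

```python
import math

# Toy sketch of the vector-search step (the real pipeline delegates this
# to the vector database; vectors here are hand-made stand-ins).
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunks: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k transcript chunks most similar to the query vector."""
    return sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]), reverse=True)[:k]

store = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [0.7, 0.7]}
print(top_k([1.0, 0.1], store, k=2))  # → ['chunk-a', 'chunk-c']
```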

## 🚀 Getting Started

### Prerequisites

- [Data Science Project in OpenShift AI with a configured Workbench](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/getting_started)
- [Configuring a pipeline server](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/latest/html/working_with_data_science_pipelines/managing-data-science-pipelines_ds-pipelines#configuring-a-pipeline-server_ds-pipelines)
- A LlamaStack service with a vector database backend deployed (follow our [official deployment documentation](https://github.com/opendatahub-io/rag/blob/main/DEPLOYMENT.md))
- `ffmpeg` dependency (note: this is installed automatically by the pipeline components)
- GPU-enabled nodes are highly recommended for faster processing
  - You can still use CPU-only nodes, but the pipeline will take longer to run

**Pipeline Parameters**

- `base_url`: URL where audio files are hosted
- `audio_filenames`: Comma-separated list of audio files to process
- `num_workers`: Number of parallel workers (default: 1)
- `vector_db_id`: ID of the vector database to store embeddings
- `service_url`: URL of the LlamaStack service
- `embed_model_id`: Embedding model to use (default: `ibm-granite/granite-embedding-125m-english`)
- `max_tokens`: Maximum tokens per chunk (default: 512)
- `use_gpu`: Whether to use GPU for processing (default: true)

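Put together, a run configuration might look like the sketch below. All values are placeholders except the documented defaults; the parameter names mirror the list above.

```python
# Hypothetical run configuration for the pipeline (placeholder URLs and
# filenames; defaults taken from the parameter list above).
params = {
    "base_url": "https://example.com/audio",         # placeholder host
    "audio_filenames": "interview.mp3,keynote.wav",  # comma-separated list
    "num_workers": 1,
    "vector_db_id": "my-vector-db",
    "service_url": "http://llamastack.example:8321", # placeholder service URL
    "embed_model_id": "ibm-granite/granite-embedding-125m-english",
    "max_tokens": 512,
    "use_gpu": True,
}

# The comma-separated filenames expand to one entry per audio file:
print([name.strip() for name in params["audio_filenames"].split(",")])
# → ['interview.mp3', 'keynote.wav']
```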
### Creating the Pipeline for running on GPU node

```
# Install dependencies for pipeline
cd demos/kfp/docling/asr-conversion
pip3 install -r requirements.txt
# …
python3 docling_asr_convert_pipeline.py
```


### Creating the Pipeline for running on CPU only

```
# Install dependencies for pipeline
cd demos/kfp/docling/asr-conversion
pip3 install -r requirements.txt
# …
python3 docling_asr_convert_pipeline.py
```

### Import Kubeflow pipeline to OpenShift AI

- Import the compiled YAML into the pipeline server in your Data Science project in OpenShift AI
- [Running a data science pipeline generated from Python code](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/openshift_ai_tutorial_-_fraud_detection_example/implementing-pipelines#running-a-pipeline-generated-from-python-code)
- Configure the pipeline parameters as needed

### Query RAG Agent in your Workbench within a Data Science project on OpenShift AI

1. Open your Workbench
2. Clone the rag repo and use the main branch
   - Use this link `https://github.com/opendatahub-io/rag.git` for cloning the repo
   - [Collaborating on Jupyter notebooks by using Git](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service/1/html/working_with_connected_applications/using_basic_workbenches#collaborating-on-jupyter-notebooks-by-using-git_connected-apps)
3. Install dependencies for the Jupyter Notebook with the RAG Agent
```
cd demos/kfp/docling/asr-conversion/rag-agent
pip3 install -r requirements.txt
```
4. Follow the instructions in the corresponding RAG Jupyter Notebook `asr_rag_agent.ipynb` to query the content ingested by the pipeline.
