Commit 39ebc79

edit for pub
1 parent e4a0447 commit 39ebc79

File tree

2 files changed: +87 -79 lines changed


articles/ai-services/content-understanding/concepts/retrieval-augmented-generation.md

Lines changed: 40 additions & 36 deletions
@@ -12,80 +12,84 @@ ms.date: 04/23/2025
# Retrieval-augmented generation with Content Understanding

- Retrieval-Augmented Generation (RAG) expands the potential of Large Language Models (LLMs) by grounding their responses in external knowledge sources, ensuring both accuracy and contextual relevance. One of the central challenges in RAG lies in efficiently extracting and preparing multimodal content—such as documents, images, audio, and video—so it can be accurately retrieved and utilized to enhance the LLM's responses.
+ Retrieval-augmented generation (**RAG**) enhances the capabilities of Large Language Models (LLMs) by integrating data from external knowledge sources. Integrating diverse, current information improves the precision and contextual relevance of LLM outputs. A key challenge for **RAG** is the efficient extraction and processing of multimodal content—such as documents, images, audio, and video—to ensure accurate retrieval and effective grounding of LLM responses.

- Azure AI Content Understanding addresses these challenges by providing sophisticated content extraction capabilities across all modalities. The service seamlessly integrates advanced natural language processing, computer vision, and speech recognition into a unified framework simplifying the challenges of managing separate extraction pipelines. This approach ensures high-quality data handling for documents, images, audio, and video, enhancing precision and depth in information retrieval. Such an approach is beneficial for `RAG` applications, where the quality and relevance of responses depend heavily on context and interrelationships.
+ Azure AI Content Understanding addresses these challenges by offering advanced content extraction capabilities across diverse modalities. The service integrates natural language processing, computer vision, and speech recognition into a unified framework, eliminating the complexity of managing separate extraction pipelines and workflows. This unified approach ensures high-quality data handling for documents, images, audio, and video, enhancing both precision and depth in information retrieval. It's especially beneficial for **RAG** applications, where the accuracy and contextual relevance of responses depend on a deep understanding of context and interrelationships.
:::image type="content" source="../media/concepts/rag-architecture-1.png" alt-text="Screenshot of Azure Content Understanding service architecture.":::

## Multimodal data and RAG

- In traditional content processing, simple text extraction sufficed for many content processing use cases. Today's enterprise environments house a vast array of complex and diverse information across multiple formats—documents with intricate layouts, images rich in visual insights, audio recordings of important discussions, and videos that seamlessly integrate these elements. For retrieval-augmented generation (RAG) systems to deliver truly comprehensive outputs, all such content must be accurately processed and made accessible to generative AI models. This method guarantees that users receive pertinent answers to their queries, no matter the original format of the information. The RAG system ensures seamless retrieval and relevance in various scenarios. It can handle a comprehensive table from a financial report. It also supports technical diagrams extracted from manuals. Additionally, it draws insights from recorded conference calls. Finally, the system effectively manages explanations delivered through training videos.
+ In traditional content processing, simple text extraction sufficed for many use cases. Modern enterprise environments encompass a vast array of complex information across diverse formats:
+
+ * **Documents** featuring intricate layouts.
+ * **Images** rich with visual details and insights.
+ * **Audio** recordings capturing pivotal conversations.
+ * **Videos** that seamlessly integrate multiple data types.
+
+ For **RAG** systems to deliver truly comprehensive outputs, all content must be accurately processed and made accessible to generative AI models. This processing guarantees that users receive relevant answers to their queries, regardless of the original format of the information. The **RAG** system maintains retrieval relevance across diverse scenarios: it processes detailed tables from financial reports, interprets technical diagrams from manuals, extracts insights from recorded conference calls, and manages explanations presented in training videos.

## Effective RAG using Content Understanding

- Chunking is key to effective RAG with multimodal content. It breaks large content into smaller, manageable parts for better processing and retrieval. However, different types of data present unique challenges:
+ Context-aware chunking is key to optimizing **RAG** with multimodal content. It breaks large content or datasets into smaller, manageable parts for better processing and retrieval. However, different types of data present unique challenges:

* **Documents**. Layout and meaning must be preserved to avoid losing context.
* **Images**. Visual elements need accurate interpretation while maintaining their relationships.
* **Audio**. Speaker identification and time order are important to keep the narrative clear.
* **Video**. Scene boundaries and synchronization between modes must stay intact.

- Semantic chunking prioritizes meaning and relationships over arbitrary splits. Chunking ensures retrieved chunks are relevant to queries, allowing for more accurate and coherent responses. Azure AI Content Understanding is built for multimodal RAG, processing various content types (documents, images, audio, video) while preserving context and meaning. Chunking also improves query relevance and downstream operations, making it ideal for enterprise use cases requiring deep content analysis and understanding.
+ Semantic chunking prioritizes preserving meaning and relationships rather than making arbitrary splits. By ensuring that retrieved chunks align closely with queries, chunking enables more precise and coherent responses. Azure AI Content Understanding is designed for multimodal RAG applications and processes diverse content types (documents, images, audio, and video) while preserving context and meaning. Chunking also improves query relevance and downstream processing, making the service an ideal solution for enterprise use cases that demand in-depth content analysis and understanding.
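The difference between arbitrary and semantic splits can be sketched in a few lines. This is an illustrative sketch, not part of the Content Understanding service: it splits Markdown (such as the service's structured output) on heading boundaries so each chunk keeps a coherent topic, falling back to paragraph boundaries only for oversized sections.

```python
import re

def semantic_chunks(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split Markdown on heading boundaries instead of arbitrary offsets,
    so each chunk keeps its heading attached to the text it introduces."""
    # Split before every heading line (e.g. "## Section"); the lookahead
    # keeps the heading with the section that follows it.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in filter(None, sections):
        if len(section) <= max_chars:
            chunks.append(section.strip())
        else:
            # Fall back to paragraph boundaries for oversized sections.
            buf = ""
            for para in section.split("\n\n"):
                if buf and len(buf) + len(para) > max_chars:
                    chunks.append(buf.strip())
                    buf = ""
                buf += para + "\n\n"
            if buf.strip():
                chunks.append(buf.strip())
    return chunks

doc = "# Report\nIntro text.\n\n## Financials\nA table...\n\n## Outlook\nNotes."
print(semantic_chunks(doc))
```

A fixed-size splitter could cut the financials table in half; splitting on structure keeps each topic retrievable as a unit.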
## Content Understanding RAG capabilities

- Azure AI Content Understanding addresses the core challenges of multimodal RAG—complex data ingestion, data representation, and query optimization—by providing a solution that enhances the accuracy and relevance of retrieval and generation processes:
+ Azure AI Content Understanding addresses the core challenges of multimodal **RAG**—complex data ingestion, data representation, and query optimization—by providing a solution that enhances the accuracy and relevance of retrieval and generation processes:

- * **Simplified Multimodal Ingestion:** Content Understanding streamlines the processing of diverse content types—documents, images, audio, and video—into a unified workflow. Preserving structural integrity and contextual relationships eliminates the complexities of handling multimodal data, ensuring consistent representation across all modalities.
+ * **Simplified multimodal ingestion:** Content Understanding streamlines the processing of various content types—documents, images, audio, and video—into a cohesive, unified workflow. Preserving structural integrity and contextual relationships eliminates the complexities of handling multimodal data, ensuring consistent representation across all modalities.

- * **Enhanced Data Representation:** Content Understanding transforms unstructured data into structured, context-rich formats such as Markdown and JSON. This transformation ensures smooth integration with embedding models, vector databases, and generative AI systems. Maintaining semantic depth, hierarchical structure, and cross-modal linkages, addresses issues like semantic fragmentation and enables more accurate information retrieval.
+ * **Enhanced data representation:** Content Understanding transforms unstructured data into structured, context-rich formats like Markdown and JSON. This structured transformation facilitates seamless integration with embedding models, vector databases, and generative AI systems. By maintaining semantic depth, hierarchical structure, and cross-modal linkages, it addresses issues like semantic fragmentation and enables more accurate information retrieval.

- * **Customizable Field Extraction:** Users can define custom fields to generate targeted metadata, such as summaries, visual descriptions, or sentiment analysis, enriching knowledge bases with domain-specific insights. These enhancements complement standard content extraction and vector representations, improving retrieval precision and enabling more contextually relevant responses.
+ * **Customizable field extraction:** Users can define custom fields to generate targeted metadata, such as summaries, visual descriptions, or sentiment analysis. These tailored outputs enrich knowledge bases with domain-specific insights, complementing standard content extraction and vector representations. The result is improved retrieval accuracy and more contextually relevant responses.

- * **Optimized Query Performance:** Content Understanding mitigates modality bias and context fragmentation by providing structured, enriched data that supports sophisticated relevance ranking across modalities. This mitigation ensures that the most appropriate information is surfaced for user queries, improving the coherence and accuracy of generated responses.
+ * **Optimized query performance:** Content Understanding mitigates modality bias and context fragmentation by providing structured, enriched data that supports advanced relevance ranking across modalities. This approach ensures that user queries surface the most relevant information, enhancing the coherence and precision of generated responses.

:::image type="content" source="../media/concepts/rag-architecture-2.png" alt-text="Screenshot of Content Understanding RAG architecture overview, process, and workflow with Azure AI Search and Azure OpenAI.":::
- ## RAG implementation
+ ## RAG implementation pattern

- A high level summary of the `RAG` implementation pattern looks like this:
+ An overview of the **RAG** implementation pattern is as follows:

- 1. Transform unstructured multimodal data into structured representation using Content Understanding.
- 2. Embed structured output using embedding models.
- 3. Store embedded vectors in database or search index.
- 4. Use Generative AI chat models to query and generate responses from retrieval systems.
- Here's an overview of the implementation process. It begins with data extraction using Azure AI Content Understanding. This approach serves as the foundation for transforming raw multimodal data into structured, searchable formats. These formats are optimized for RAG workflows.
- ### 1. Content Extraction: The Foundation for RAG with Content Understanding
- Content extraction is ideal for transforming raw multimodal data into structured, searchable formats:
+ 1. [Extract content](#content-extraction). Convert unstructured multimodal data into a structured representation.
+ 1. [Generate embeddings](../../openai/how-to/embeddings.md). Apply embedding models to represent the structured data as vectors.
+ 1. [Create a unified search index](#create-a-unified-search-index). Store the embedded vectors in a database or search index for efficient retrieval.
+ 1. [Utilize Azure OpenAI models](#utilize-azure-openai-models). Use generative AI chat models to query the retrieval systems and generate responses.
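The four steps can be sketched end to end in a toy, self-contained form. Every function here is a stand-in, not a service API: the letter-frequency "embedding" and the in-memory list take the place of Azure OpenAI embeddings and Azure AI Search, purely to show the data flow.

```python
import math

def analyze(raw: str) -> list[str]:
    # Step 1 stand-in: real extraction would call Content Understanding
    # and return structured Markdown chunks per modality.
    return [p for p in raw.split("\n\n") if p]

def embed(text: str) -> list[float]:
    # Step 2 stand-in: a toy letter-frequency vector in place of a
    # deployed embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def run_rag_pipeline(files: list[str], query: str) -> str:
    chunks = [c for f in files for c in analyze(f)]   # 1. extract content
    index = [(c, embed(c)) for c in chunks]           # 2-3. embed and index
    qvec = embed(query)                               # 4. retrieve best match
    best_chunk, _ = max(index, key=lambda item: cosine(item[1], qvec))
    # A real system would now pass best_chunk as grounding context
    # to a generative chat model.
    return best_chunk

doc = "Quarterly revenue grew 12 percent.\n\nThe training video covers onboarding."
print(run_rag_pipeline([doc], "revenue growth"))
```

The sections below expand each stand-in into its actual Azure service counterpart.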
- * **Documents:** Captures hierarchical structures, such as headers, paragraphs, tables, and page elements, preserving the logical organization of training materials.
- * **Images**: Transforms visual information into searchable text by verbalizing diagrams and charts, extracting embedded text elements, and converting graphical data into structured formats. Technical illustrations are analyzed to identify components and their relationships.
- * **Audio:** Produces speaker-aware transcriptions that accurately capture spoken content while automatically detecting and processing multiple languages.
- * **Video:** The system segments video content into meaningful units using scene detection and key frame extraction. It generates descriptive summaries for the footage. The system also transcribes spoken content and identifies key topics. Lastly, it analyzes sentiment indicators throughout the video. Transcribes spoken content and provides scene descriptions while addressing context window limitations in generative AI models.
- While content extraction provides a strong foundation for indexing and retrieval, it may not fully address domain-specific needs or provide deeper contextual insights.
- ### 2. Field Extraction: Enhancing Knowledge Bases for Better Retrieval
+ ### Content extraction
+
+ The RAG implementation process starts with data extraction using Azure AI Content Understanding. This step establishes the groundwork for converting raw multimodal data into structured, searchable formats tailored for RAG workflows:
+
+ * **Documents:** Captures hierarchical structures, like headers, paragraphs, tables, and page elements, so the logical organization of training materials remains intact.
+ * **Images:** Transforms visual data into searchable text by verbalizing diagrams and charts, extracting embedded text, and converting graphical data into structured formats. Technical illustrations are analyzed to identify components and relationships.
+ * **Audio:** Generates speaker-aware transcriptions that accurately capture spoken content across multiple languages through automatic detection and processing.
+ * **Video:** Segments video content into meaningful units using scene detection and key frame extraction. It creates descriptive summaries, transcribes spoken dialogue, identifies key topics, and analyzes sentiment indicators throughout the footage. Scene descriptions are provided while addressing context limitations inherent to generative AI models.

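As a concrete illustration of starting an extraction job, the sketch below builds (but deliberately does not send) the REST request for a document analysis call. The endpoint path, `api-version`, analyzer ID, and auth header are assumptions based on the preview REST surface; verify every one of them against the current Content Understanding reference before use.

```python
import json
import urllib.request

# Assumed preview values; confirm against the current REST reference.
API_VERSION = "2024-12-01-preview"

def build_analyze_request(endpoint: str, key: str, analyzer: str,
                          file_url: str) -> urllib.request.Request:
    """Build, without sending, the POST that starts an async analysis."""
    url = (f"{endpoint}/contentunderstanding/analyzers/"
           f"{analyzer}:analyze?api-version={API_VERSION}")
    return urllib.request.Request(
        url,
        data=json.dumps({"url": file_url}).encode("utf-8"),
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,  # assumed auth header
            "Content-Type": "application/json",
        },
    )

req = build_analyze_request(
    "https://<your-resource>.cognitiveservices.azure.com", "<your-key>",
    "prebuilt-documentAnalyzer", "https://example.com/annual-report.pdf")
print(req.full_url)
```

The operation is asynchronous: the service replies with a polling URL that eventually yields the Markdown and structured fields described above.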
- Field extraction complements content extraction by generating targeted metadata that enriches the knowledge base and improves retrieval precision:
- * **Document:** Extract key topics/fields to provide concise overviews of lengthy materials.
- * **Image:** Converts visual information into searchable text by verbalizing diagrams, extracting embedded text, and identifying graphical components.
- * **Audio:** Extract key topics or sentiment analysis from conversations and to provide added context for queries.
- * **Video:** Generate scene-level summaries, identify key topics, or analyze brand presence and product associations within video footage.
- Organizations can create a contextually rich knowledge base optimized for indexing, retrieval, and RAG scenarios by combining content extraction with field extraction. This method ensures more accurate and meaningful responses to user queries.
- Learn more about [content extraction](./capabilities.md#content-extraction) and [field extraction](./capabilities.md#field-extraction) capabilities.
+ #### Field extraction
+
+ While content extraction provides a strong foundation for indexing and retrieval, it may not fully address specialized domain-specific requirements or deliver deeper contextual insights. Field extraction complements content extraction by producing targeted metadata that enriches the knowledge base and improves retrieval accuracy:
+
+ * **Document:** Summarizes key topics or fields to provide clear overviews of extensive materials.
+ * **Image:** Transforms visual content into searchable text by interpreting diagrams, extracting embedded text, and recognizing graphical elements.
+ * **Audio:** Analyzes conversations to extract key topics, assess sentiment, and offer more context for inquiries.
+ * **Video:** Creates scene-level summaries, identifies main topics, and examines brand presence or product associations in video content.
+
+ Integrating content extraction with field extraction enables organizations to develop a knowledge base that is context-rich and optimized for indexing, retrieval, and RAG scenarios. This approach yields more precise and relevant responses to user inquiries. To learn more, *see* [content extraction](./capabilities.md#content-extraction) and [field extraction](./capabilities.md#field-extraction) capabilities.
#### Code Sample: Analyzer and Schema Configuration

- The following code sample shows an analyzer and schema creation for various modalities in a multimodal RAG scenario.
+ The following code samples show analyzer and schema creation for various modalities in a multimodal RAG scenario.

---

@@ -454,9 +458,9 @@ After data is extracted using Azure AI Content Understanding, the next steps inv

> [!div class="nextstepaction"]
> [View full code sample for RAG on GitHub.](https://github.com/Azure-Samples/azure-ai-search-with-content-understanding-python#samples)

- ## 3. Create a Unified Search Index
+ ### Create a unified search index

- Once multimodal content is processed through Azure AI Content Understanding, the next crucial step is to build a robust search framework that capitalizes on this enriched structured data. You can use Azure OpenAI's embedding models to embed markdown and JSON outputs. By indexing these embeddings with [Azure AI Search](https://docs.azure.cn/en-us/search/tutorial-rag-build-solution-index-schema), you can create an integrated knowledge repository. This repository effortlessly bridges various content modalities.
+ After Azure AI Content Understanding processes multimodal content, the next essential step is to develop a powerful search framework that effectively uses the enriched structured data. You can use [Azure OpenAI's embedding models](../../openai/how-to/embeddings.md) to embed Markdown and JSON outputs. By indexing these embeddings with [Azure AI Search](https://docs.azure.cn/en-us/search/tutorial-rag-build-solution-index-schema), you can create an integrated knowledge repository that bridges content modalities.

Azure AI Search provides advanced search strategies to maximize the value of multimodal content:
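To make the unified-index idea concrete, here is a hedged sketch of what a minimal consolidated index definition might contain. The field names, 1536-dimension size, and vector profile name are illustrative assumptions, not the article's actual sample schema or the service's required contract.

```python
# Illustrative only: field names, dimensions, and the profile name
# are assumptions for the sketch.
unified_index = {
    "name": "multimodal-rag-index",
    "fields": [
        # One key per extracted chunk, regardless of source modality.
        {"name": "chunk_id", "type": "Edm.String", "key": True},
        # "document" | "image" | "audio" | "video", for filtering and ranking.
        {"name": "modality", "type": "Edm.String", "filterable": True},
        # Markdown emitted by Content Understanding, searchable as text.
        {"name": "content_text", "type": "Edm.String", "searchable": True},
        # Embedding of content_text; 1536 dims is a common embedding size.
        {"name": "content_vector", "type": "Collection(Edm.Single)",
         "dimensions": 1536, "vectorSearchProfile": "default-profile"},
    ],
}

# Because every modality shares one index, a single vector query can
# return a report table, a diagram description, and a video summary together.
field_names = [f["name"] for f in unified_index["fields"]]
print(field_names)
```

Keeping all modalities in one index is what lets a single query rank text, image, audio, and video evidence against each other instead of searching four silos.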
@@ -507,7 +511,7 @@ The following JSON code sample shows a minimal consolidated index that support v

```
---

- ## 4. Utilize Azure OpenAI Models
+ ### Utilize Azure OpenAI models

Once your content is extracted and indexed, integrate [Azure OpenAI's embedding and chat models](../../openai/concepts/models.md) to create an interactive question-answering system:
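As a sketch of the question-answering step, the following shows one common way to ground a chat model: inject the retrieved chunks into the system message and instruct the model to cite them. The prompt wording and function name are illustrative; the resulting messages list is the shape you would pass to a chat completions call along with your deployed model name.

```python
def build_grounded_messages(query: str, retrieved_chunks: list[str]) -> list[dict]:
    """Assemble chat messages that ground the model in retrieved content."""
    sources = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return [
        {"role": "system",
         "content": ("Answer using only the numbered sources below and cite "
                     "them as [n]. If the sources lack the answer, say so.\n\n"
                     f"Sources:\n{sources}")},
        {"role": "user", "content": query},
    ]

messages = build_grounded_messages(
    "What did revenue do last quarter?",
    ["Quarterly revenue grew 12 percent.",
     "The training video covers onboarding."])
for m in messages:
    print(m["role"])
```

Constraining the model to the numbered sources is what keeps answers grounded in the indexed multimodal content rather than in the model's parametric knowledge.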