# Multimodal retrieval-augmented generation with Content Understanding
Retrieval-augmented generation (**RAG**) is a method that enhances the capabilities of large language models (**LLMs**) by integrating data from external knowledge sources. Incorporating diverse and current information refines the precision and contextual relevance of the outputs an **LLM** generates. A key challenge for **RAG** is the efficient extraction and processing of multimodal content, such as documents, images, audio, and video, to ensure accurate retrieval and effective grounding of **LLM** responses.
Azure AI Content Understanding addresses these challenges by providing sophisticated extraction capabilities across all content modalities. The service seamlessly integrates advanced natural language processing, computer vision, and speech recognition into a unified framework. This integration preserves semantic integrity and contextual relationships that traditional extraction methods often lose. A unified approach eliminates the need to manage separate workflows and models for different content types, streamlining implementation while ensuring optimal representation for retrieval and generation.
## Build a multimodal RAG solution with Content Understanding
Imagine a corporate training program with a collection of documents, images, audio recordings, and videos covering topics such as compliance, safety, and technical skills. The goal is to create a system that retrieves relevant information from these multimodal sources based on user queries, enabling employees to access precise and contextually rich answers.
### Implementation

A high-level summary of the **RAG** implementation pattern looks like this:

1. Transform unstructured multimodal data into structured representations using Content Understanding.
1. Embed the structured output using embedding models.
1. Store the embedded vectors in a database or search index.
1. Use generative AI chat models to query and generate responses from retrieval systems.

:::image type="content" source="../media/concepts/rag-architecture-2.png" alt-text="Screenshot of Content Understanding **RAG** architecture overview, process, and workflow with Azure AI Search and Azure OpenAI.":::
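The four numbered steps can be sketched as a self-contained data flow. Everything in this sketch is a stand-in: `analyze`, `embed`, and the in-memory index are hypothetical placeholders for calls to Content Understanding, an embedding model (for example, one available through Azure OpenAI), and Azure AI Search, respectively.

```python
import math

def analyze(raw: str) -> dict:
    # Stand-in for Content Understanding: real calls return markdown
    # content plus extracted fields for the input asset.
    return {"content": raw.strip(), "fields": {"summary": raw.strip()[:40]}}

def embed(text: str) -> list[float]:
    # Stand-in for an embedding model: a toy character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# 1. Transform unstructured data into structured representations.
docs = ["Fire safety training: evacuate via the nearest exit.",
        "Compliance policy: report incidents within 24 hours."]
structured = [analyze(d) for d in docs]

# 2-3. Embed the structured output and store the vectors in an index.
index = [{"content": s["content"], "vector": embed(s["content"])}
         for s in structured]

# 4. Retrieve the best match for a query; a chat model would then
# ground its answer in the retrieved content.
query = "How do I report a compliance incident?"
qvec = embed(query)
best = max(index, key=lambda e: cosine(e["vector"], qvec))
print(best["content"])
```

Swapping the stand-ins for real service calls changes only the three helper functions; the overall flow of transform, embed, store, and retrieve stays the same.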
Here's an overview of the implementation process, beginning with data extraction using Azure AI Content Understanding as the foundation for transforming raw multimodal data into structured, searchable formats optimized for **RAG** workflows:

Content extraction forms the foundation of effective **RAG** systems by transforming raw multimodal data into structured, searchable formats optimized for retrieval. The implementation varies by content type:

* **Document:** Extract hierarchical structures, such as headers, paragraphs, tables, and page elements, preserving the logical organization of training materials.
* **Image:** Transform visual data into searchable text by verbalizing diagrams and charts, extracting embedded text, and converting graphical data into structured formats. Technical illustrations are analyzed to identify components and relationships.
* **Audio:** Generate speaker-aware transcriptions that accurately capture spoken content while automatically detecting and processing multiple languages.
* **Video:** Segment video into meaningful units, transcribe spoken content, and provide scene descriptions while addressing context window limitations in generative AI models.

While content extraction provides a strong foundation for indexing and retrieval, it may not fully address domain-specific needs or provide deeper contextual insights. Learn more about [content extraction](capabilities.md).
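Before embedding, the markdown that content extraction produces is typically split into retrieval-sized chunks. A minimal sketch of heading-aligned chunking follows; the function name and size threshold are illustrative choices, not part of the service.

```python
import re

def chunk_markdown(markdown: str, max_chars: int = 800) -> list[str]:
    """Split extracted markdown into heading-aligned chunks for embedding.

    Splitting on headings keeps each chunk aligned with the logical
    structure that content extraction preserved.
    """
    # Split before every markdown heading line (zero-width lookahead).
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in filter(None, (s.strip() for s in sections)):
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Further split oversized sections on paragraph boundaries.
        buf = ""
        for para in section.split("\n\n"):
            if buf and len(buf) + len(para) + 2 > max_chars:
                chunks.append(buf)
                buf = para
            else:
                buf = f"{buf}\n\n{para}" if buf else para
        if buf:
            chunks.append(buf)
    return chunks

doc = "# Safety\n\nWear protective gear.\n\n## Evacuation\n\nUse the nearest exit."
print(chunk_markdown(doc))
```

Each resulting chunk carries its own heading, so a retrieved passage keeps enough context to be meaningful on its own.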

Field extraction complements content extraction by generating targeted metadata for each content type:

* **Document:** Extract key topics and fields to provide concise overviews of lengthy materials.
* **Image:** Convert visual information into searchable text by verbalizing diagrams, extracting embedded text, and identifying graphical components.
* **Audio:** Extract key topics or sentiment analysis from conversations to provide added context for queries.
* **Video:** Generate scene-level summaries, identify key topics, or analyze brand presence and product associations within video footage.
Combining content extraction with field extraction enables organizations to create a contextually rich knowledge base optimized for indexing, retrieval, and **RAG** scenarios, ensuring more accurate and meaningful responses to user queries.
Learn more about [field extraction](capabilities.md#field-extraction).
#### Analyzer and schema configuration
The following code sample shows analyzer and schema creation for various modalities in a multimodal **RAG** scenario.
---
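The shape of an analyzer-creation request can be sketched as follows. This is a minimal sketch only: the endpoint path, API version, `baseAnalyzerId` value, and header name are assumptions about the preview REST API, so check the service reference before using them.

```python
# Build a creation payload for a video analyzer with a custom field schema.
# NOTE: the endpoint path, API version, and "baseAnalyzerId" below are
# assumed preview values; verify them against the service reference.
ENDPOINT = "https://<your-resource>.services.ai.azure.com"  # hypothetical
API_VERSION = "2024-12-01-preview"                          # assumed version
ANALYZER_ID = "training-video-analyzer"                     # example name

payload = {
    "description": "Extract RAG-oriented fields from training videos",
    "baseAnalyzerId": "prebuilt-videoAnalyzer",  # assumed prebuilt base
    "fieldSchema": {
        "fields": {
            "description": {
                "type": "string",
                "description": "Scene-level summary of the segment.",
            },
            "keyTopics": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Key topics covered in the segment.",
            },
        }
    },
}

# The analyzer would then be created with a PUT request, for example:
# requests.put(f"{ENDPOINT}/contentunderstanding/analyzers/{ANALYZER_ID}"
#              f"?api-version={API_VERSION}",
#              headers={"Ocp-Apim-Subscription-Key": "<key>"}, json=payload)
print(sorted(payload["fieldSchema"]["fields"]))
```

The same payload shape, with a different base analyzer and field set, applies to document, image, and audio analyzers.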
The following code sample showcases the results of content and field extraction. This excerpt is from a document analysis result:

```json
"words": [
    {
        ....
    },
],
"lines": [
    {
        ...
    },
]
}
],
```
An excerpt from a video analysis result:

````json
"height": 960,
"markdown": "# Shot 0:0.0 => 0:1.800\n\n## Transcript\n\n```\n\nWEBVTT\n\n0:0.80 --> 0:10.560\n<v Speaker>When I was planning my trip...",
"fields": {
    "description": {
        "type": "string",
        "valueString": "The video begins with a view from a glass floor, showing a person's feet in white sneakers standing on it. The scene captures a downward view of a structure, possibly a tower, with a grid pattern on the floor and a clear view of the ground below. The lighting is bright, suggesting a sunny day, and the colors are dominated by the orange of the structure and the gray of the floor."
````
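Results shaped like the excerpts above can be flattened into plain index documents before embedding. A minimal sketch follows, assuming a result with `markdown` and typed `fields` entries as in the sample; the helper name and the trimmed input are illustrative.

```python
import json

# A trimmed result shaped like the video excerpt above.
result = json.loads("""
{
  "markdown": "# Shot 0:0.0 => 0:1.800",
  "fields": {
    "description": {
      "type": "string",
      "valueString": "A view from a glass floor."
    }
  }
}
""")

def flatten_fields(fields: dict) -> dict:
    # Collapse typed field objects into plain values for indexing.
    out = {}
    for name, field in fields.items():
        # Typed values live under keys like "valueString", "valueArray", ...
        value_key = next((k for k in field if k.startswith("value")), None)
        out[name] = field[value_key] if value_key else None
    return out

index_doc = {"content": result["markdown"], **flatten_fields(result["fields"])}
print(index_doc)
```

The flattened document pairs the extracted markdown with the field values, which is the form most easily embedded and pushed to a search index.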
The following is a sample consolidated index that supports vector and hybrid search across multimodal content: