
Commit a11bef4

RAG LLN NLP and Layout model
1 parent 93da14f commit a11bef4

File tree

1 file changed (+36 −1 lines)


articles/ai-services/document-intelligence/concept-retrieval-augumented-generation.md

Lines changed: 36 additions & 1 deletion
@@ -33,7 +33,7 @@ Text data chunking strategies play a key role in optimizing the RAG response and
* **Fixed-size chunking**. Most chunking strategies used in RAG today are based on fixed-size text segments known as chunks. Fixed-size chunking is quick, easy, and effective with text that doesn't have a strong semantic structure, such as logs and data. However, it isn't recommended for text that requires semantic understanding and precise context. The fixed-size window can sever words, sentences, or paragraphs, impeding comprehension and disrupting the flow of information.

-* **Semantic chunking**. This method divides the text into chunks based on semantic understanding. Division boundaries focus on sentence subject and require significant, algorithmically complex computation. However, it has the distinct advantage of maintaining semantic consistency within each chunk. It's useful for text summarization, sentiment analysis, and document classification tasks. For example, if you're looking for a specific section in a document, you can use semantic chunking to divide the document into smaller chunks based on the section headers, helping you find the section you're looking for quickly and easily. An effective semantic chunking strategy yields the following benefits:
+* **Semantic chunking**. This method divides the text into chunks based on semantic understanding. Division boundaries focus on sentence subject and require significant, algorithmically complex computation. However, it has the distinct advantage of maintaining semantic consistency within each chunk. It's useful for text summarization, sentiment analysis, and document classification tasks.
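The fixed-size strategy described above is simple enough to sketch in a few lines. This is an illustrative, dependency-free example (the function name and window sizes are hypothetical, not from any SDK):

```python
def fixed_size_chunks(text: str, chunk_size: int = 18, overlap: int = 4) -> list[str]:
    """Slide a fixed-size window over the text; consecutive chunks share `overlap` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = fixed_size_chunks("The quick brown fox jumps over the lazy dog")
# The window severs words mid-token (e.g. a chunk ending in "fo"),
# which is exactly the comprehension risk noted above.
```

The overlap between consecutive windows is the usual mitigation for severed context, at the cost of some duplicated text in the index.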

## Semantic chunking with Document Intelligence layout model

@@ -107,6 +107,41 @@ You can follow the [Document Intelligence studio quickstart](quickstarts/try-doc

* The chat with your data solution accelerator [code sample](https://github.com/Azure-Samples/chat-with-your-data-solution-accelerator) demonstrates an end-to-end baseline RAG pattern. It uses Azure AI Search as a retriever and Azure AI Document Intelligence for document loading and semantic chunking.
## Use case

If you're looking for a specific section in a document, you can use semantic chunking to divide the document into smaller chunks based on the section headers, helping you find the section you're looking for quickly and easily:

```python
# Using SDK targeting 2023-10-31-preview
# pip install azure-ai-documentintelligence==1.0.0b1

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
from langchain.document_loaders.doc_intelligence import DocumentIntelligenceLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

endpoint = "https://<my-custom-subdomain>.cognitiveservices.azure.com/"
credential = AzureKeyCredential("<api_key>")

document_intelligence_client = DocumentIntelligenceClient(endpoint, credential)

# Initiate Azure AI Document Intelligence to load the document and split it into chunks.
# Keyword argument names may differ across langchain versions; check your installed version.
loader = DocumentIntelligenceLoader(
    file_path="<your file path>",
    client=document_intelligence_client,
)
docs = loader.load()

# Combine the loaded pages into a single markdown string before header-based splitting
docs_string = "".join(doc.page_content for doc in docs)

# Alternative: character-based splitting instead of header-based
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Split on markdown headers so each chunk corresponds to a document section
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
splits = text_splitter.split_text(docs_string)
splits
```

Each element of `splits` carries the header metadata of the section it came from, so retrieval can target a specific section directly.
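Conceptually, `MarkdownHeaderTextSplitter` groups content under the most recent header seen while scanning the markdown. Here is a dependency-free sketch of that idea, not the langchain implementation (all names are illustrative):

```python
def split_on_headers(markdown: str) -> list[dict]:
    """Group markdown lines under the most recent header line."""
    chunks: list[dict] = []
    current: dict = {"header": None, "content": []}
    for line in markdown.splitlines():
        if line.lstrip().startswith("#"):
            # A new header starts a new chunk; flush the previous one
            if current["header"] is not None or current["content"]:
                chunks.append(current)
            current = {"header": line.strip(), "content": []}
        elif line.strip():
            current["content"].append(line)
    chunks.append(current)
    return chunks

doc = "# Intro\nWelcome.\n## Details\nBody text.\nMore text.\n"
sections = split_on_headers(doc)
# Each chunk keeps its header, so "find the Details section"
# maps directly to one chunk.
```

Because each chunk retains the header it belongs to, a retriever can match a query against section titles as well as body text.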
## Next steps

* Learn more about [Azure AI Document Intelligence](overview.md).
