# Chunking large documents for vector search solutions in Azure AI Search
16
16
17
-
This article describes several approaches for chunking large documents so that you can generate embeddings for vector search. Chunking is only required if source documents are too large for the maximum input size imposed by models.
17
+
Partitioning large documents into smaller chunks can help you stay under the maximum token input limits of embedding models. For example, the maximum length of input text for the [Azure OpenAI](/azure/ai-services/openai/how-to/embeddings) embedding models is 8,191 tokens. Given that each token is around four characters of text for common OpenAI models, this maximum limit is equivalent to around 6,000 words of text. If you're using these models to generate embeddings, it's critical that the input text stays under the limit. Partitioning your content into chunks ensures that your data can be processed by the embedding models used to populate vector stores and text-to-vector query conversions.
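As a rough illustration of the arithmetic above, the following sketch estimates whether a piece of text might exceed the input limit. The four-characters-per-token ratio is the heuristic from the paragraph above, not an exact count, and the names `estimate_tokens` and `needs_chunking` are hypothetical. For an exact count, use the tokenizer that matches your embedding model.

```python
MAX_INPUT_TOKENS = 8191  # limit quoted above for Azure OpenAI embedding models
CHARS_PER_TOKEN = 4      # rough heuristic quoted above; an assumption, not an exact ratio


def estimate_tokens(text: str) -> int:
    """Estimate the token count as the character count divided by 4, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def needs_chunking(text: str, max_tokens: int = MAX_INPUT_TOKENS) -> bool:
    """Return True when the estimated token count exceeds the limit."""
    return estimate_tokens(text) > max_tokens


if __name__ == "__main__":
    sample = "Barcelona is a city in Spain. " * 2000
    print(estimate_tokens(sample), needs_chunking(sample))
```

Because the estimate is approximate, leave some headroom below the limit rather than chunking right at the boundary.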
18
18
19
-
> [!NOTE]
20
-
> This article applies to the generally available version of [vector search](vector-search-overview.md), which assumes your application code calls an external library that performs data chunking. A new feature called [integrated vectorization](vector-search-integrated-vectorization.md), currently in preview, offers embedded data chunking. Integrated vectorization takes a dependency on indexers, skillsets, and the Text Split skill.
21
-
22
-
## Why is chunking important?
23
-
24
-
The models used to generate embedding vectors have maximum limits on the text fragments provided as input. For example, the maximum length of input text for the [Azure OpenAI](/azure/ai-services/openai/how-to/embeddings) embedding models is 8,191 tokens. Given that each token is around 4 characters of text for common OpenAI models, this maximum limit is equivalent to around 6000 words of text. If you're using these models to generate embeddings, it's critical that the input text stays under the limit. Partitioning your content into chunks ensures that your data can be processed by the Large Language Models (LLM) used for indexing and queries.
25
-
26
-
## How chunking fits into the workflow
27
-
28
-
Because there isn't a native chunking capability in either Azure AI Search or Azure OpenAI, if you have large documents, you must insert a chunking step into indexing and query workflows that breaks up large text. Some libraries that provide chunking include:
19
+
This article describes several approaches for data chunking. Chunking is only required if source documents are too large for the maximum input size imposed by models.
Both libraries support common chunking techniques for fixed size, variable size, or a combination. You can also specify an overlap percentage that duplicates a small amount of content in each chunk for context preservation.
21
+
> [!NOTE]
22
+
> If you're using the generally available version of [vector search](vector-search-overview.md), data chunking and embedding require external code, such as a library or a custom skill. A new feature called [integrated vectorization](vector-search-integrated-vectorization.md), currently in preview, offers internal data chunking and embedding. Integrated vectorization takes a dependency on indexers, skillsets, the Text Split skill, and the AzureOpenAiEmbedding skill (or a custom skill). If you can't use the preview features, the examples in this article provide an alternative path forward.
34
23
35
-
###Common chunking techniques
24
+
## Common chunking techniques
36
25
37
26
Here are some common chunking techniques, starting with the most widely used method:
38
27
@@ -56,75 +45,150 @@ When it comes to chunking data, think about these factors:
56
45
57
46
+ Large language models (LLMs) have performance guidelines for chunk size. You need to set a chunk size that works best for all of the models you're using. For instance, if you use models for summarization and embeddings, choose an optimal chunk size that works for both.
58
47
59
-
## Simple example of how to create chunks with sentences
48
+
### How chunking fits into the workflow
49
+
50
+
If you have large documents, you must insert a chunking step into indexing and query workflows that breaks up large text. When using [integrated vectorization (preview)](vector-search-integrated-vectorization.md), a default chunking strategy using the [text split skill](./cognitive-search-skill-textsplit.md) is applied. You can also apply a custom chunking strategy using a [custom skill](cognitive-search-custom-skill-web-api.md). Some libraries that provide chunking include:
60
51
61
-
This section uses an example to demonstrate the logic of creating chunks out of sentences. For this example, assume the following:
Most libraries provide common chunking techniques for fixed size, variable size, or a combination. You can also specify an overlap that duplicates a small amount of content in each chunk for context preservation.
66
56
67
-
### Sample input
57
+
## Chunking examples
68
58
69
-
`"Barcelona is a city in Spain. It is close to the sea and /n the mountains. /n You can both ski in winter and swim in summer."`
59
+
The following examples demonstrate how chunking strategies are applied to [NASA's Earth at Night e-book](https://github.com/Azure-Samples/azure-search-sample-data/blob/main/nasa-e-book/earth_at_night_508.pdf):
70
60
71
-
+ Sentence 1 contains 6 words: `"Barcelona is a city in Spain."`
72
-
+ Sentence 2 contains 9 words: `"It is close to the sea /n and the mountains. /n"`
73
-
+ Sentence 3 contains 10 words: `"You can both ski in winter and swim in summer."`
### Approach 1: Sentence chunking with "no overlap"
66
+
### Text Split skill (preview)
76
67
77
-
Given a maximum number of tokens, iterate through the sentences and concatenate sentences until the maximum token length is reached. If a sentence is bigger than the maximum number of chunks, truncate to a maximum number of tokens, and put the rest in the next chunk.
68
+
This section documents the built-in data chunking using a skills-driven approach and [Text Split skill parameters](cognitive-search-skill-textsplit.md#skill-parameters).
78
69
79
-
> [!NOTE]
80
-
> The examples ignore the newline `/n` character because it's not a token, but if the package or library detects new lines, then you'd see those line breaks here.
70
+
Set `textSplitMode` to break up content into smaller chunks:
81
71
82
-
**Example: maximum tokens = 10**
72
+
+ `pages` (default). Chunks are made up of multiple sentences.
73
+
+ `sentences`. Chunks are made up of single sentences. What constitutes a "sentence" is language-dependent. In English, standard sentence-ending punctuation such as `.` or `!` is used. The language is controlled by the `defaultLanguageCode` parameter.
83
74
84
-
```
85
-
Barcelona is a city in Spain.
86
-
It is close to the sea /n and the mountains. /n
87
-
You can both ski in winter and swim in summer.
88
-
```
75
+
The `pages` mode adds extra parameters:
89
76
90
-
**Example: maximum tokens = 16**
77
+
+ `maximumPageLength` defines the maximum number of characters <sup>1</sup> in each chunk. The text splitter avoids breaking up sentences, so the actual character count depends on the content.
78
+
+ `pageOverlapLength` defines how many characters from the end of the previous page are included at the start of the next page. If set, this must be less than half the maximum page length.
79
+
+ `maximumPagesToTake` defines how many pages or chunks to take from a document. The default value is 0, which means taking all pages or chunks from the document.
91
80
92
-
```
93
-
Barcelona is a city in Spain. It is close to the sea /n and the mountains. /n
94
-
You can both ski in winter and swim in summer.
95
-
```
81
+
<sup>1</sup> Characters don't align to the definition of a [token](/azure/ai-services/openai/concepts/prompt-engineering#space-efficiency). The number of tokens measured by the LLM might differ from the character size measured by the Text Split skill.
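To make the interaction of `maximumPageLength`, `pageOverlapLength`, and `maximumPagesToTake` more concrete, here's a minimal Python sketch of character-based chunking with overlap. It isn't the Text Split skill's implementation: the function name `split_pages` and its snake_case arguments are hypothetical, and unlike the skill, this sketch cuts at fixed character offsets and makes no attempt to preserve sentence boundaries.

```python
def split_pages(text, maximum_page_length=2000, page_overlap_length=500,
                maximum_pages_to_take=0):
    """Split text into fixed-size character chunks that share an overlap.

    Hypothetical helper that only mimics the parameter semantics described
    above. It does not avoid breaking sentences the way the skill does.
    """
    if maximum_page_length <= 0:
        raise ValueError("maximum_page_length must be positive")
    # The overlap must be less than half the maximum page length.
    if page_overlap_length < 0 or page_overlap_length * 2 >= maximum_page_length:
        raise ValueError("page_overlap_length must be less than half of maximum_page_length")

    step = maximum_page_length - page_overlap_length
    pages = []
    start = 0
    while start < len(text):
        pages.append(text[start:start + maximum_page_length])
        # A value of 0 means take every page.
        if maximum_pages_to_take and len(pages) >= maximum_pages_to_take:
            break
        # Stop once the current page reaches the end of the text.
        if start + maximum_page_length >= len(text):
            break
        start += step
    return pages


if __name__ == "__main__":
    for page in split_pages("abcdefghij", maximum_page_length=4, page_overlap_length=1):
        print(page)
```

In this sketch, each page after the first starts `page_overlap_length` characters before the previous page ended, so a larger overlap produces more chunks for the same document.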
82
+
83
+
The following table shows how the choice of parameters affects the total chunk count from the Earth at Night e-book:
96
84
97
-
**Example: maximum tokens = 6**
85
+
|`textSplitMode`|`maximumPageLength`|`pageOverlapLength`| Total Chunk Count |
Using a `textSplitMode` of `pages` results in a majority of chunks having total character counts close to `maximumPageLength`. Chunk character count varies due to differences in where sentence boundaries fall inside the chunk. Chunk token length varies due to differences in the contents of the chunk.
96
+
97
+
The following histograms show how the distribution of chunk character length compares to chunk token length for [gpt-35-turbo](/azure/ai-services/openai/how-to/chatgpt) when using a `textSplitMode` of `pages`, a `maximumPageLength` of 2000, and a `pageOverlapLength` of 500 on the Earth at Night e-book:
98
+
99
+
:::image type="content" source="./media/vector-search-how-to-chunk-documents/maximumpagelength-2000-pageoverlap-500-characters.png" alt-text="Histogram of chunk character count for maximumPageLength 2000 and pageOverlapLength 500.":::
100
+
101
+
:::image type="content" source="./media/vector-search-how-to-chunk-documents/maximumpagelength-2000-pageoverlap-500-tokens.png" alt-text="Histogram of chunk token count for maximumPageLength 2000 and pageOverlapLength 500.":::
102
+
103
+
Using a `textSplitMode` of `sentences` results in a large number of chunks consisting of individual sentences. These chunks are significantly smaller than those produced by `pages`, and the token count of the chunks more closely matches the character counts.
104
+
105
+
The following histograms show how the distribution of chunk character length compares to chunk token length for [gpt-35-turbo](/azure/ai-services/openai/how-to/chatgpt) when using a `textSplitMode` of `sentences` on the Earth at Night e-book:
106
+
107
+
:::image type="content" source="./media/vector-search-how-to-chunk-documents/sentences-characters.png" alt-text="Histogram of chunk character count for sentences.":::
108
+
109
+
:::image type="content" source="./media/vector-search-how-to-chunk-documents/sentences-tokens.png" alt-text="Histogram of chunk token count for sentences.":::
110
+
111
+
The optimal choice of parameters depends on how the chunks will be used. For most applications, it's recommended to start with the following default parameters:
LangChain provides document loaders and text splitters. This example shows you how to load a PDF, get token counts, and set up a text splitter. Getting token counts helps you make an informed decision on chunk sizing.
120
+
121
+
```python
122
+
from langchain_community.document_loaders import PyPDFLoader
### Approach 2: Sentence chunking with "10% overlap"
159
+
Output indicates that no pages have zero tokens, the average token length per page is 189 tokens, and the maximum token count of any page is 1,583.
108
160
109
-
Follow the same logic with no overlap approach, except that you create an overlap between chunks according to certain ratio.
110
-
A 10% overlap on maximum tokens of 10 is one token.
161
+
Knowing the average and maximum token size gives you insight into setting chunk size. Although you could use the standard recommendation of 2,000 characters with a 500-character overlap, in this case it makes sense to go lower given the token counts of the sample document. In fact, setting an overlap value that's too large can result in no overlap appearing at all.
111
162
112
-
**Example: maximum tokens = 10**
163
+
```python
164
+
from langchain.text_splitter import RecursiveCharacterTextSplitter
165
+
# Split the loaded pages into overlapping text chunks
113
166
114
-
```
115
-
Barcelona is a city in Spain.
116
-
Spain. It is close to the sea /n and the mountains. /n
117
-
mountains. /n You can both ski in winter and swim in summer.
167
+
text_splitter = RecursiveCharacterTextSplitter(
168
+
chunk_size=1000,
169
+
chunk_overlap=200,
170
+
length_function=len,
171
+
is_separator_regex=False
172
+
)
173
+
174
+
chunks = text_splitter.split_documents(pages)
175
+
176
+
print(chunks[20])
177
+
print(chunks[21])
118
178
```
119
179
120
-
## Try it out: Chunking and vector embedding generation sample
180
+
Output for two consecutive chunks shows the text from the first chunk overlapping onto the second chunk. Output is lightly edited for readability.
121
181
122
-
A [fixed-sized chunking and embedding generation sample](https://github.com/Azure-Samples/azure-search-power-skills/blob/main/Vector/EmbeddingGenerator/README.md) demonstrates both chunking and vector embedding generation using [Azure OpenAI](/azure/ai-services/openai/) embedding models. This sample uses an [Azure AI Search custom skill](cognitive-search-custom-skill-web-api.md) in the [Power Skills repo](https://github.com/Azure-Samples/azure-search-power-skills/tree/main#readme) to wrap the chunking step.
182
+
`'x Earth at NightForeword\nNASA’s Earth at Night explores the brilliance of our planet when it is in darkness. \n It is a compilation of stories depicting the interactions between science and \nwonder, and I am pleased to share this visually stunning and captivating exploration of \nour home planet.\nFrom space, our Earth looks tranquil. The blue ethereal vastness of the oceans \nharmoniously shares the space with verdant green land—an undercurrent of gentle-ness and solitude. But spending time gazing at the images presented in this book, our home planet at night instantly reveals a different reality. Beautiful, filled with glow-ing communities, natural wonders, and striking illumination, our world is bustling with activity and life.**\nDarkness is not void of illumination. It is the contrast, the area between light and'** metadata={'source': './data/earth_at_night_508.pdf', 'page': 9}`
183
+
184
+
`'**Darkness is not void of illumination. It is the contrast, the area between light and **\ndark, that is often the most illustrative. Darkness reminds me of where I came from and where I am now—from a small town in the mountains, to the unique vantage point of the Nation’s capital. Darkness is where dreamers and learners of all ages peer into the universe and think of questions about themselves and their space in the cosmos. Light is where they work, where they gather, and take time together.\nNASA’s spacefaring satellites have compiled an unprecedented record of our \nEarth, and its luminescence in darkness, to captivate and spark curiosity. These missions see the contrast between dark and light through the lenses of scientific instruments. Our home planet is full of complex and dynamic cycles and processes. These soaring observers show us new ways to discern the nuances of light created by natural and human-made sources, such as auroras, wildfires, cities, phytoplankton, and volcanoes.' metadata={'source': './data/earth_at_night_508.pdf', 'page': 9}`
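To check how much text two consecutive chunks share, a small helper like the following could compare the end of one chunk with the start of the next. This is a minimal sketch: `shared_overlap` is a hypothetical name, and it works on plain strings, such as the `page_content` of each LangChain chunk. Splitters can trim whitespace at chunk boundaries, so the measured overlap might be slightly shorter than the configured `chunk_overlap`.

```python
def shared_overlap(previous: str, following: str) -> str:
    """Return the longest suffix of `previous` that is also a prefix of `following`."""
    max_len = min(len(previous), len(following))
    # Try the longest candidate first and shrink until a match is found.
    for size in range(max_len, 0, -1):
        candidate = following[:size]
        if previous.endswith(candidate):
            return candidate
    return ""


if __name__ == "__main__":
    first = "It is the contrast, the area between light and"
    second = "light and dark, that is often the most illustrative."
    print(shared_overlap(first, second))  # prints: light and
```

With the chunks from the preceding example, you could call `shared_overlap(chunks[20].page_content, chunks[21].page_content)` and compare the length of the result with the configured overlap.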
123
185
124
-
This sample is built on LangChain, Azure OpenAI, and Azure AI Search.
186
+
### Custom skill
187
+
188
+
A [fixed-sized chunking and embedding generation sample](https://github.com/Azure-Samples/azure-search-power-skills/blob/main/Vector/EmbeddingGenerator/README.md) demonstrates both chunking and vector embedding generation using [Azure OpenAI](/azure/ai-services/openai/) embedding models. This sample uses an [Azure AI Search custom skill](cognitive-search-custom-skill-web-api.md) in the [Power Skills repo](https://github.com/Azure-Samples/azure-search-power-skills/tree/main#readme) to wrap the chunking step.
125
189
126
190
## See also
127
191
128
192
+ [Understanding embeddings in Azure OpenAI Service](/azure/ai-services/openai/concepts/understand-embeddings)
129
-
+[Learn how to generate embeddings](/azure/ai-services/openai/how-to/embeddings?tabs=console)
130
-
+[Tutorial: Explore Azure OpenAI Service embeddings and document search](/azure/ai-services/openai/tutorials/embeddings?tabs=command-line)
193
+
+ [Learn how to generate embeddings](/azure/ai-services/openai/how-to/embeddings)
194
+
+ [Tutorial: Explore Azure OpenAI Service embeddings and document search](/azure/ai-services/openai/tutorials/embeddings)