Commit 3f13249

Merge pull request #263985 from mattgotteiner/matt/chunking-doc

Chunking doc update

2 parents 95b6ad6 + ec8a78e

File tree: 5 files changed (+125, -61 lines)

articles/search/vector-search-how-to-chunk-documents.md

Lines changed: 125 additions & 61 deletions
@@ -9,30 +9,19 @@ ms.service: cognitive-search
ms.custom:
  - ignite-2023
ms.topic: conceptual
ms.date: 01/29/2024
---

# Chunking large documents for vector search solutions in Azure AI Search

Partitioning large documents into smaller chunks helps you stay under the maximum token input limits of embedding models. For example, the maximum length of input text for the [Azure OpenAI](/azure/ai-services/openai/how-to/embeddings) embedding models is 8,191 tokens. Given that each token is around four characters of text for common OpenAI models, this maximum limit is equivalent to around 6,000 words of text. If you're using these models to generate embeddings, it's critical that the input text stays under the limit. Partitioning your content into chunks ensures that your data can be processed by the embedding models used to populate vector stores and text-to-vector query conversions.
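
Before chunking, you can verify whether a piece of text even needs it by counting tokens. The following sketch uses the [tiktoken](https://github.com/openai/tiktoken) library and assumes the 8,191-token limit and `cl100k_base` encoding of the `text-embedding-ada-002` model; `fits_in_one_chunk` is an illustrative helper, not part of any SDK.

```python
import tiktoken

# cl100k_base is the encoding used by recent Azure OpenAI embedding models
encoding = tiktoken.get_encoding("cl100k_base")

# assumed limit for text-embedding-ada-002; check your model's documentation
MAX_INPUT_TOKENS = 8191

def fits_in_one_chunk(text: str) -> bool:
    """Return True if the text can be embedded without chunking."""
    return len(encoding.encode(text)) <= MAX_INPUT_TOKENS
```
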
This article describes several approaches for data chunking. Chunking is only required if source documents are too large for the maximum input size imposed by models.
> [!NOTE]
> If you're using the generally available version of [vector search](vector-search-overview.md), data chunking and embedding requires external code, such as a library or a custom skill. A new feature called [integrated vectorization](vector-search-integrated-vectorization.md), currently in preview, offers internal data chunking and embedding. Integrated vectorization takes a dependency on indexers, skillsets, the Text Split skill, and the AzureOpenAIEmbedding skill (or a custom skill). If you can't use the preview features, the examples in this article provide an alternative path forward.

## Common chunking techniques

Here are some common chunking techniques, starting with the most widely used method:

@@ -56,75 +45,150 @@ When it comes to chunking data, think about these factors:
+ Large language models (LLMs) have performance guidelines for chunk size. You need to set a chunk size that works best for all of the models you're using. For instance, if you use models for summarization and embeddings, choose an optimal chunk size that works for both.

### How chunking fits into the workflow

If you have large documents, you must insert a chunking step into indexing and query workflows that breaks up large text. When you use [integrated vectorization (preview)](vector-search-integrated-vectorization.md), a default chunking strategy using the [Text Split skill](./cognitive-search-skill-textsplit.md) is applied. You can also apply a custom chunking strategy using a [custom skill](cognitive-search-custom-skill-web-api.md). Some libraries that provide chunking include:

+ [LangChain](https://python.langchain.com/en/latest/index.html)
+ [Semantic Kernel](https://github.com/microsoft/semantic-kernel)

Most libraries provide common chunking techniques for fixed size, variable size, or a combination. You can also specify an overlap that duplicates a small amount of content in each chunk for context preservation.
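
To illustrate the basic idea behind fixed-size chunking with overlap, here's a minimal character-based sketch. The `fixed_size_chunks` helper is hypothetical; real libraries also try to respect word and sentence boundaries.

```python
def fixed_size_chunks(text: str, chunk_size: int = 2000, overlap: int = 500) -> list[str]:
    """Split text into fixed-size chunks, repeating `overlap` characters
    from the end of each chunk at the start of the next one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```
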

## Chunking examples

The following examples demonstrate how chunking strategies are applied to [NASA's Earth at Night e-book](https://github.com/Azure-Samples/azure-search-sample-data/blob/main/nasa-e-book/earth_at_night_508.pdf):

+ [Text Split skill (preview)](cognitive-search-skill-textsplit.md)
+ [LangChain](https://python.langchain.com/en/latest/index.html)
+ [Semantic Kernel](https://github.com/microsoft/semantic-kernel)
+ [Custom skill](cognitive-search-custom-skill-scale.md)

### Text Split skill (preview)

This section documents the built-in data chunking using a skills-driven approach and [Text Split skill parameters](cognitive-search-skill-textsplit.md#skill-parameters).

Set `textSplitMode` to break up content into smaller chunks:

+ `pages` (default). Chunks are made up of multiple sentences.
+ `sentences`. Chunks are made up of single sentences. What constitutes a "sentence" is language dependent. In English, standard sentence-ending punctuation such as `.` or `!` is used. The language is controlled by the `defaultLanguageCode` parameter.

Setting `textSplitMode` to `pages` enables extra parameters:

+ `maximumPageLength` defines the maximum number of characters <sup>1</sup> in each chunk. The text splitter avoids breaking up sentences, so the actual character count depends on the content.
+ `pageOverlapLength` defines how many characters from the end of the previous page are included at the start of the next page. If set, this value must be less than half the maximum page length.
+ `maximumPagesToTake` defines how many pages or chunks to take from a document. The default value is 0, which means taking all pages or chunks from the document.

<sup>1</sup> Characters don't align to the definition of a [token](/azure/ai-services/openai/concepts/prompt-engineering#space-efficiency). The number of tokens measured by the LLM might be different than the character size measured by the Text Split skill.
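
As a reference point, the following sketch shows how these parameters might be set in a Text Split skill definition, expressed here as a Python dict that serializes to the skillset JSON. The input and output mappings are illustrative and depend on your enrichment pipeline.

```python
# Hypothetical Text Split skill definition; adapt inputs/outputs to your skillset.
text_split_skill = {
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "textSplitMode": "pages",
    "maximumPageLength": 2000,
    "pageOverlapLength": 500,
    "defaultLanguageCode": "en",
    "inputs": [
        {"name": "text", "source": "/document/content"}
    ],
    "outputs": [
        {"name": "textItems", "targetName": "pages"}
    ]
}
```
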
The following table shows how the choice of parameters affects the total chunk count from the Earth at Night e-book:

| `textSplitMode` | `maximumPageLength` | `pageOverlapLength` | Total chunk count |
|-----------------|---------------------|---------------------|-------------------|
| `pages` | 1000 | 0 | 172 |
| `pages` | 1000 | 200 | 216 |
| `pages` | 2000 | 0 | 85 |
| `pages` | 2000 | 500 | 113 |
| `pages` | 5000 | 0 | 34 |
| `pages` | 5000 | 500 | 38 |
| `sentences` | N/A | N/A | 13,361 |

Using a `textSplitMode` of `pages` results in a majority of chunks having total character counts close to `maximumPageLength`. Chunk character count varies due to differences in where sentence boundaries fall inside the chunk. Chunk token length varies due to differences in the contents of the chunk.

The following histograms show how the distribution of chunk character length compares to chunk token length for [gpt-35-turbo](/azure/ai-services/openai/how-to/chatgpt) when using a `textSplitMode` of `pages`, a `maximumPageLength` of 2000, and a `pageOverlapLength` of 500 on the Earth at Night e-book:

:::image type="content" source="./media/vector-search-how-to-chunk-documents/maximumpagelength-2000-pageoverlap-500-characters.png" alt-text="Histogram of chunk character count for maximumPageLength 2000 and pageOverlapLength 500.":::

:::image type="content" source="./media/vector-search-how-to-chunk-documents/maximumpagelength-2000-pageoverlap-500-tokens.png" alt-text="Histogram of chunk token count for maximumPageLength 2000 and pageOverlapLength 500.":::

Using a `textSplitMode` of `sentences` results in a large number of chunks consisting of individual sentences. These chunks are significantly smaller than those produced by `pages`, and the token count of the chunks more closely matches the character counts.

The following histograms show how the distribution of chunk character length compares to chunk token length for [gpt-35-turbo](/azure/ai-services/openai/how-to/chatgpt) when using a `textSplitMode` of `sentences` on the Earth at Night e-book:

:::image type="content" source="./media/vector-search-how-to-chunk-documents/sentences-characters.png" alt-text="Histogram of chunk character count for sentences.":::

:::image type="content" source="./media/vector-search-how-to-chunk-documents/sentences-tokens.png" alt-text="Histogram of chunk token count for sentences.":::

The optimal choice of parameters depends on how the chunks are used. For most applications, it's recommended to start with the following default parameters:

| `textSplitMode` | `maximumPageLength` | `pageOverlapLength` |
|-----------------|---------------------|---------------------|
| `pages` | 2000 | 500 |

### LangChain

LangChain provides document loaders and text splitters. This example shows you how to load a PDF, get token counts, and set up a text splitter. Getting token counts helps you make an informed decision on chunk sizing.

```python
from langchain_community.document_loaders import PyPDFLoader

# load the PDF; each page becomes one document
loader = PyPDFLoader("./data/earth_at_night_508.pdf")
pages = loader.load()

print(len(pages))
```
Output indicates 200 documents or pages in the PDF.

To get an estimated token count for these pages, use tiktoken.

```python
import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo and text-embedding-ada-002
tokenizer = tiktoken.get_encoding('cl100k_base')

def tiktoken_len(text):
    """Return the number of tokens in the text."""
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)

# collect the token count of every page
token_counts = []
for page in pages:
    token_counts.append(tiktoken_len(page.page_content))

min_token_count = min(token_counts)
avg_token_count = int(sum(token_counts) / len(token_counts))
max_token_count = max(token_counts)

# print token counts
print(f"Min: {min_token_count}")
print(f"Avg: {avg_token_count}")
print(f"Max: {max_token_count}")
```

Output indicates that no pages have zero tokens, the average token length per page is 189 tokens, and the maximum token count of any page is 1,583.

Knowing the average and maximum token size gives you insight into setting chunk size. Although you could use the standard recommendation of 2,000 characters with a 500-character overlap, in this case it makes sense to go lower given the token counts of the sample document. In fact, setting an overlap value that's too large can result in no overlap appearing at all.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split the documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False
)

chunks = text_splitter.split_documents(pages)

print(chunks[20])
print(chunks[21])
```

Output for two consecutive chunks shows the text from the first chunk overlapping onto the second chunk. Output is lightly edited for readability.

`'x Earth at NightForeword\nNASA’s Earth at Night explores the brilliance of our planet when it is in darkness. \n It is a compilation of stories depicting the interactions between science and \nwonder, and I am pleased to share this visually stunning and captivating exploration of \nour home planet.\nFrom space, our Earth looks tranquil. The blue ethereal vastness of the oceans \nharmoniously shares the space with verdant green land—an undercurrent of gentle-ness and solitude. But spending time gazing at the images presented in this book, our home planet at night instantly reveals a different reality. Beautiful, filled with glow-ing communities, natural wonders, and striking illumination, our world is bustling with activity and life.**\nDarkness is not void of illumination. It is the contrast, the area between light and'** metadata={'source': './data/earth_at_night_508.pdf', 'page': 9}`

`'**Darkness is not void of illumination. It is the contrast, the area between light and **\ndark, that is often the most illustrative. Darkness reminds me of where I came from and where I am now—from a small town in the mountains, to the unique vantage point of the Nation’s capital. Darkness is where dreamers and learners of all ages peer into the universe and think of questions about themselves and their space in the cosmos. Light is where they work, where they gather, and take time together.\nNASA’s spacefaring satellites have compiled an unprecedented record of our \nEarth, and its luminescence in darkness, to captivate and spark curiosity. These missions see the contrast between dark and light through the lenses of scientific instruments. Our home planet is full of complex and dynamic cycles and processes. These soaring observers show us new ways to discern the nuances of light created by natural and human-made sources, such as auroras, wildfires, cities, phytoplankton, and volcanoes.' metadata={'source': './data/earth_at_night_508.pdf', 'page': 9}`

### Custom skill

A [fixed-sized chunking and embedding generation sample](https://github.com/Azure-Samples/azure-search-power-skills/blob/main/Vector/EmbeddingGenerator/README.md) demonstrates both chunking and vector embedding generation using [Azure OpenAI](/azure/ai-services/openai/) embedding models. This sample uses an [Azure AI Search custom skill](cognitive-search-custom-skill-web-api.md) in the [Power Skills repo](https://github.com/Azure-Samples/azure-search-power-skills/tree/main#readme) to wrap the chunking step.
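
If you build your own chunking skill, it needs to implement the custom skill Web API contract: Azure AI Search posts a batch of records in a `values` array and expects enriched records back under the same `recordId` values. The following minimal sketch shows that shape; `handle_chunking_request`, the `chunks` output name, and the naive splitter are illustrative assumptions, not the implementation of the sample linked above.

```python
def handle_chunking_request(body: dict) -> dict:
    """Process a custom skill request body and return the response body.
    In production this runs behind an HTTP endpoint such as an Azure Function."""
    results = []
    for record in body.get("values", []):
        text = record.get("data", {}).get("text", "")
        # naive fixed-size chunking with overlap; substitute your preferred strategy
        chunks = [text[i:i + 2000] for i in range(0, len(text), 1500)]
        results.append({
            "recordId": record["recordId"],
            "data": {"chunks": chunks},
            "errors": None,
            "warnings": None
        })
    return {"values": results}
```
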

## See also

+ [Understanding embeddings in Azure OpenAI Service](/azure/ai-services/openai/concepts/understand-embeddings)
+ [Learn how to generate embeddings](/azure/ai-services/openai/how-to/embeddings)
+ [Tutorial: Explore Azure OpenAI Service embeddings and document search](/azure/ai-services/openai/tutorials/embeddings)