Commit 3f13249

Merge pull request #263985 from mattgotteiner/matt/chunking-doc

Chunking doc update

2 parents 95b6ad6 + ec8a78e

File tree: 5 files changed (+125, -61 lines)

articles/search/vector-search-how-to-chunk-documents.md

Lines changed: 125 additions & 61 deletions
@@ -9,30 +9,19 @@ ms.service: cognitive-search
ms.custom:
  - ignite-2023
ms.topic: conceptual
ms.date: 01/29/2024
---

# Chunking large documents for vector search solutions in Azure AI Search

Partitioning large documents into smaller chunks helps you stay under the maximum token input limits of embedding models. For example, the maximum length of input text for the [Azure OpenAI](/azure/ai-services/openai/how-to/embeddings) embedding models is 8,191 tokens. Given that each token is around four characters of text for common OpenAI models, this maximum limit is equivalent to around 6,000 words of text. If you're using these models to generate embeddings, it's critical that the input text stays under the limit. Partitioning your content into chunks ensures that your data can be processed by the embedding models used to populate vector stores and text-to-vector query conversions.
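
Before chunking, you can verify whether a piece of text even needs it by counting tokens. The following sketch uses the [tiktoken](https://github.com/openai/tiktoken) library and assumes the 8,191-token limit and `cl100k_base` encoding of the `text-embedding-ada-002` model; `fits_in_one_chunk` is an illustrative helper, not part of any SDK.

```python
import tiktoken

# cl100k_base is the encoding used by recent Azure OpenAI embedding models
encoding = tiktoken.get_encoding("cl100k_base")

# assumed limit for text-embedding-ada-002; check your model's documentation
MAX_INPUT_TOKENS = 8191

def fits_in_one_chunk(text: str) -> bool:
    """Return True if the text can be embedded without chunking."""
    return len(encoding.encode(text)) <= MAX_INPUT_TOKENS
```
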
This article describes several approaches for data chunking. Chunking is only required if source documents are too large for the maximum input size imposed by models.
> [!NOTE]
> If you're using the generally available version of [vector search](vector-search-overview.md), data chunking and embedding requires external code, such as a library or a custom skill. A new feature called [integrated vectorization](vector-search-integrated-vectorization.md), currently in preview, offers internal data chunking and embedding. Integrated vectorization takes a dependency on indexers, skillsets, the Text Split skill, and the AzureOpenAIEmbedding skill (or a custom skill). If you can't use the preview features, the examples in this article provide an alternative path forward.

## Common chunking techniques

Here are some common chunking techniques, starting with the most widely used method:

@@ -56,75 +45,150 @@ When it comes to chunking data, think about these factors:
+ Large language models (LLMs) have performance guidelines for chunk size. You need to set a chunk size that works best for all of the models you're using. For instance, if you use models for summarization and embeddings, choose an optimal chunk size that works for both.

### How chunking fits into the workflow

If you have large documents, you must insert a chunking step into indexing and query workflows that breaks up large text. When you use [integrated vectorization (preview)](vector-search-integrated-vectorization.md), a default chunking strategy using the [Text Split skill](./cognitive-search-skill-textsplit.md) is applied. You can also apply a custom chunking strategy using a [custom skill](cognitive-search-custom-skill-web-api.md). Some libraries that provide chunking include:

+ [LangChain](https://python.langchain.com/en/latest/index.html)
+ [Semantic Kernel](https://github.com/microsoft/semantic-kernel)

Most libraries provide common chunking techniques for fixed size, variable size, or a combination. You can also specify an overlap that duplicates a small amount of content in each chunk for context preservation.
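
To illustrate the basic idea behind fixed-size chunking with overlap, here's a minimal character-based sketch. The `fixed_size_chunks` helper is hypothetical; real libraries also try to respect word and sentence boundaries.

```python
def fixed_size_chunks(text: str, chunk_size: int = 2000, overlap: int = 500) -> list[str]:
    """Split text into fixed-size chunks, repeating `overlap` characters
    from the end of each chunk at the start of the next one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```
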

## Chunking examples

The following examples demonstrate how chunking strategies are applied to [NASA's Earth at Night e-book](https://github.com/Azure-Samples/azure-search-sample-data/blob/main/nasa-e-book/earth_at_night_508.pdf):

+ [Text Split skill (preview)](cognitive-search-skill-textsplit.md)
+ [LangChain](https://python.langchain.com/en/latest/index.html)
+ [Semantic Kernel](https://github.com/microsoft/semantic-kernel)
+ [Custom skill](cognitive-search-custom-skill-scale.md)

### Text Split skill (preview)

This section documents the built-in data chunking using a skills-driven approach and [Text Split skill parameters](cognitive-search-skill-textsplit.md#skill-parameters).

Set `textSplitMode` to break up content into smaller chunks:

+ `pages` (default). Chunks are made up of multiple sentences.
+ `sentences`. Chunks are made up of single sentences. What constitutes a "sentence" is language dependent. In English, standard sentence-ending punctuation such as `.` or `!` is used. The language is controlled by the `defaultLanguageCode` parameter.

Setting `textSplitMode` to `pages` enables extra parameters:

+ `maximumPageLength` defines the maximum number of characters <sup>1</sup> in each chunk. The text splitter avoids breaking up sentences, so the actual character count depends on the content.
+ `pageOverlapLength` defines how many characters from the end of the previous page are included at the start of the next page. If set, this value must be less than half the maximum page length.
+ `maximumPagesToTake` defines how many pages or chunks to take from a document. The default value is 0, which means taking all pages or chunks from the document.

<sup>1</sup> Characters don't align to the definition of a [token](/azure/ai-services/openai/concepts/prompt-engineering#space-efficiency). The number of tokens measured by the LLM might be different than the character size measured by the Text Split skill.
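
As a reference point, the following sketch shows how these parameters might be set in a Text Split skill definition, expressed here as a Python dict that serializes to the skillset JSON. The input and output mappings are illustrative and depend on your enrichment pipeline.

```python
# Hypothetical Text Split skill definition; adapt inputs/outputs to your skillset.
text_split_skill = {
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "textSplitMode": "pages",
    "maximumPageLength": 2000,
    "pageOverlapLength": 500,
    "defaultLanguageCode": "en",
    "inputs": [
        {"name": "text", "source": "/document/content"}
    ],
    "outputs": [
        {"name": "textItems", "targetName": "pages"}
    ]
}
```
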
The following table shows how the choice of parameters affects the total chunk count from the Earth at Night e-book:

| `textSplitMode` | `maximumPageLength` | `pageOverlapLength` | Total chunk count |
|-----------------|---------------------|---------------------|-------------------|
| `pages` | 1000 | 0 | 172 |
| `pages` | 1000 | 200 | 216 |
| `pages` | 2000 | 0 | 85 |
| `pages` | 2000 | 500 | 113 |
| `pages` | 5000 | 0 | 34 |
| `pages` | 5000 | 500 | 38 |
| `sentences` | N/A | N/A | 13,361 |

Using a `textSplitMode` of `pages` results in a majority of chunks having total character counts close to `maximumPageLength`. Chunk character count varies due to differences in where sentence boundaries fall inside the chunk. Chunk token length varies due to differences in the contents of the chunk.

The following histograms show how the distribution of chunk character length compares to chunk token length for [gpt-35-turbo](/azure/ai-services/openai/how-to/chatgpt) when using a `textSplitMode` of `pages`, a `maximumPageLength` of 2000, and a `pageOverlapLength` of 500 on the Earth at Night e-book:

:::image type="content" source="./media/vector-search-how-to-chunk-documents/maximumpagelength-2000-pageoverlap-500-characters.png" alt-text="Histogram of chunk character count for maximumPageLength 2000 and pageOverlapLength 500.":::

:::image type="content" source="./media/vector-search-how-to-chunk-documents/maximumpagelength-2000-pageoverlap-500-tokens.png" alt-text="Histogram of chunk token count for maximumPageLength 2000 and pageOverlapLength 500.":::

Using a `textSplitMode` of `sentences` results in a large number of chunks consisting of individual sentences. These chunks are significantly smaller than those produced by `pages`, and the token count of the chunks more closely matches the character counts.

The following histograms show how the distribution of chunk character length compares to chunk token length for [gpt-35-turbo](/azure/ai-services/openai/how-to/chatgpt) when using a `textSplitMode` of `sentences` on the Earth at Night e-book:

:::image type="content" source="./media/vector-search-how-to-chunk-documents/sentences-characters.png" alt-text="Histogram of chunk character count for sentences.":::

:::image type="content" source="./media/vector-search-how-to-chunk-documents/sentences-tokens.png" alt-text="Histogram of chunk token count for sentences.":::

The optimal choice of parameters depends on how the chunks are used. For most applications, it's recommended to start with the following default parameters:

| `textSplitMode` | `maximumPageLength` | `pageOverlapLength` |
|-----------------|---------------------|---------------------|
| `pages` | 2000 | 500 |

### LangChain

LangChain provides document loaders and text splitters. This example shows you how to load a PDF, get token counts, and set up a text splitter. Getting token counts helps you make an informed decision on chunk sizing.

```python
from langchain_community.document_loaders import PyPDFLoader

# load the PDF; each page becomes one document
loader = PyPDFLoader("./data/earth_at_night_508.pdf")
pages = loader.load()

print(len(pages))
```
Output indicates 200 documents or pages in the PDF.

To get an estimated token count for these pages, use tiktoken.

```python
import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo and text-embedding-ada-002
tokenizer = tiktoken.get_encoding('cl100k_base')

def tiktoken_len(text):
    """Return the number of tokens in the text."""
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)

# collect the token count of every page
token_counts = []
for page in pages:
    token_counts.append(tiktoken_len(page.page_content))

min_token_count = min(token_counts)
avg_token_count = int(sum(token_counts) / len(token_counts))
max_token_count = max(token_counts)

# print token counts
print(f"Min: {min_token_count}")
print(f"Avg: {avg_token_count}")
print(f"Max: {max_token_count}")
```

Output indicates that no pages have zero tokens, the average token length per page is 189 tokens, and the maximum token count of any page is 1,583.

Knowing the average and maximum token size gives you insight into setting chunk size. Although you could use the standard recommendation of 2,000 characters with a 500-character overlap, in this case it makes sense to go lower given the token counts of the sample document. In fact, setting an overlap value that's too large can result in no overlap appearing at all.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split the documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False
)

chunks = text_splitter.split_documents(pages)

print(chunks[20])
print(chunks[21])
```

Output for two consecutive chunks shows the text from the first chunk overlapping onto the second chunk. Output is lightly edited for readability.

`'x Earth at NightForeword\nNASA’s Earth at Night explores the brilliance of our planet when it is in darkness. \n It is a compilation of stories depicting the interactions between science and \nwonder, and I am pleased to share this visually stunning and captivating exploration of \nour home planet.\nFrom space, our Earth looks tranquil. The blue ethereal vastness of the oceans \nharmoniously shares the space with verdant green land—an undercurrent of gentle-ness and solitude. But spending time gazing at the images presented in this book, our home planet at night instantly reveals a different reality. Beautiful, filled with glow-ing communities, natural wonders, and striking illumination, our world is bustling with activity and life.**\nDarkness is not void of illumination. It is the contrast, the area between light and'** metadata={'source': './data/earth_at_night_508.pdf', 'page': 9}`

`'**Darkness is not void of illumination. It is the contrast, the area between light and **\ndark, that is often the most illustrative. Darkness reminds me of where I came from and where I am now—from a small town in the mountains, to the unique vantage point of the Nation’s capital. Darkness is where dreamers and learners of all ages peer into the universe and think of questions about themselves and their space in the cosmos. Light is where they work, where they gather, and take time together.\nNASA’s spacefaring satellites have compiled an unprecedented record of our \nEarth, and its luminescence in darkness, to captivate and spark curiosity. These missions see the contrast between dark and light through the lenses of scientific instruments. Our home planet is full of complex and dynamic cycles and processes. These soaring observers show us new ways to discern the nuances of light created by natural and human-made sources, such as auroras, wildfires, cities, phytoplankton, and volcanoes.' metadata={'source': './data/earth_at_night_508.pdf', 'page': 9}`

### Custom skill

A [fixed-sized chunking and embedding generation sample](https://github.com/Azure-Samples/azure-search-power-skills/blob/main/Vector/EmbeddingGenerator/README.md) demonstrates both chunking and vector embedding generation using [Azure OpenAI](/azure/ai-services/openai/) embedding models. This sample uses an [Azure AI Search custom skill](cognitive-search-custom-skill-web-api.md) in the [Power Skills repo](https://github.com/Azure-Samples/azure-search-power-skills/tree/main#readme) to wrap the chunking step.
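
If you build your own chunking skill, it needs to implement the custom skill Web API contract: Azure AI Search posts a batch of records in a `values` array and expects enriched records back under the same `recordId` values. The following minimal sketch shows that shape; `handle_chunking_request`, the `chunks` output name, and the naive splitter are illustrative assumptions, not the implementation of the sample linked above.

```python
def handle_chunking_request(body: dict) -> dict:
    """Process a custom skill request body and return the response body.
    In production this runs behind an HTTP endpoint such as an Azure Function."""
    results = []
    for record in body.get("values", []):
        text = record.get("data", {}).get("text", "")
        # naive fixed-size chunking with overlap; substitute your preferred strategy
        chunks = [text[i:i + 2000] for i in range(0, len(text), 1500)]
        results.append({
            "recordId": record["recordId"],
            "data": {"chunks": chunks},
            "errors": None,
            "warnings": None
        })
    return {"values": results}
```
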

## See also

+ [Understanding embeddings in Azure OpenAI Service](/azure/ai-services/openai/concepts/understand-embeddings)
+ [Learn how to generate embeddings](/azure/ai-services/openai/how-to/embeddings)
+ [Tutorial: Explore Azure OpenAI Service embeddings and document search](/azure/ai-services/openai/tutorials/embeddings)