docs/customization.md
5 additions & 1 deletion
@@ -50,9 +50,11 @@ If you followed the instructions in [the multimodal guide](multimodal.md) to ena
50
50
there are several differences in the chat approach:
51
51
52
52
1. **Query rewriting**: Unchanged.
53
-
2. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a Base64 encoding.
53
+
2. **Search**: For this step, it calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a Base64 encoding.
54
54
3. **Answering**: When it combines the search results and user question, it includes the Base64-encoded images and sends both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.
55
55
56
+
The settings can be customized to disable calculating the image vector embeddings or to disable sending image inputs to the LLM, if desired.
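
As a rough illustration of the image step in the Search and Answering stages above, the sketch below encodes downloaded image bytes as Base64 so they can travel alongside text in a request to a multimodal model. The function name `image_bytes_to_data_uri` and the data-URI wrapping are assumptions made for this example; they are not taken from the repository's code.

```python
import base64


def image_bytes_to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a Base64 data URI.

    Hypothetical helper: the data-URI format is an assumption of this
    sketch, chosen because it is a common way to inline images.
    """
    # b64encode returns bytes; decode to ASCII text for embedding in JSON.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Usage: in practice the bytes would come from a blob download.
print(image_bytes_to_data_uri(b"abc"))
```

Disabling image inputs to the LLM, as the settings allow, would simply mean skipping this encoding step and sending text only.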
57
+
56
58
#### Ask approach
57
59
58
60
The ask tab uses the approach programmed in [retrievethenread.py](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/retrievethenread.py).
@@ -70,6 +72,8 @@ there are several differences in the ask approach:
70
72
1. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a Base64 encoding.
71
73
2. **Answering**: When it combines the search results and user question, it includes the Base64-encoded images and sends both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.
72
74
75
+
The settings can be customized to disable calculating the image vector embeddings or to disable sending image inputs to the LLM, if desired.
76
+
73
77
#### Making settings overrides permanent
74
78
75
79
The UI provides a "Developer Settings" menu for customizing the approaches, like disabling semantic ranker or using vector search.
docs/textsplitter.md
29 additions & 13 deletions
@@ -17,7 +17,7 @@ The `SentenceTextSplitter` is designed to:
17
17
1. Produce semantically coherent chunks that align with sentence boundaries.
18
18
2. Respect a maximum token count per chunk (hard limit of 500 tokens) plus a soft character length guideline (default 1,000 characters with a 20% overflow tolerance for merges / normalization). Size limit does not apply to figure blocks (chunks containing a `<figure>` may exceed the token limit; figures are never split).
19
19
3. Keep structural figure placeholders (`<figure>...</figure>`) atomic: never split internally and always attach them to preceding accumulated text if any exists.
20
-
4. Repair mid‑sentence page breaks when safe via merge or fragment shift heuristics while enforcing token + soft character budgets.
20
+
4. Repair mid‑sentence page breaks when possible, while enforcing token + soft character budgets.
21
21
5. Avoid empty outputs or unclosed figure tags.
22
22
6. Perform a light normalization pass (trim only minimal leading/trailing whitespace that would cause small overflows; do not modify figure chunks).
23
23
@@ -28,9 +28,9 @@ The splitter includes these components:
28
28
* Recursive subdivision of oversized individual spans using a boundary preference order:
29
29
1. Sentence-ending punctuation near the midpoint (scan within the central third of the span).
30
30
2. If no sentence boundary is found, a word break (space / punctuation from a configured list) near the midpoint to avoid mid‑word cuts.
31
-
3. If neither boundary type is found, a symmetric 10% overlap midpoint split (duplicated region appears at the end of the first part and the start of the second) preserves continuity.
31
+
3. If neither boundary type is found, we use a simpler midpoint split with 10% overlap (duplicated region appears at the end of the first part and the start of the second) to preserve continuity.
32
32
* Figure handling is front‑loaded: figure blocks are extracted first and treated as atomic before any span splitting or recursion on plain text.
33
-
* Cross‑page merge of text when all safety checks pass (prior chunk ends mid‑sentence, next chunk starts lowercase, not a heading, no early figure) and combined size fits both token and soft char budgets; otherwise a fragment shift may move the trailing unfinished clause forward.
33
+
* Cross‑page merge of text chunks when combined size fits within the allowed chunk size; otherwise a trailing sentence segment may be shifted forward to the next chunk.
34
34
* A lightweight semantic overlap duplication pass (10% of max section length) that appends a trimmed prefix of the next chunk onto the end of the previous chunk (the next chunk itself is left unchanged). This is always attempted for adjacent non‑figure chunks on the same page and conditionally across a page boundary when the next chunk appears to be a direct lowercase continuation (and not a heading or figure). Figures are never overlapped/duplicated.
35
35
* A safe concatenation rule inserts a space between merged page fragments only when both adjoining characters are alphanumeric and no existing whitespace or HTML tag boundary (`>`) already separates them.
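
The boundary preference order for recursive subdivision listed above can be sketched roughly as follows. This is a simplified illustration, not the splitter's actual code: it measures size in characters rather than tokens, the names `find_near_midpoint` and `split_span` are hypothetical, the punctuation sets are placeholders for the configured lists, and placing the 10% overlap symmetrically around the midpoint is an assumption.

```python
from typing import List, Optional

SENTENCE_ENDINGS = frozenset(".!?")   # placeholder for the configured list
WORD_BREAKS = frozenset(" \t\n,;:")   # placeholder for the configured list


def find_near_midpoint(text: str, chars: frozenset) -> Optional[int]:
    """Index of the character from `chars` closest to the midpoint,
    scanning only the central third of the text; None if there is none."""
    n = len(text)
    mid = n // 2
    lo = n // 3
    hi = min(n - n // 3, n - 1)  # never cut at the very last character
    best = None
    for i in range(lo, hi):
        if text[i] in chars and (best is None or abs(i - mid) < abs(best - mid)):
            best = i
    return best


def split_span(text: str, max_chars: int, overlap_ratio: float = 0.10) -> List[str]:
    """Recursively split `text` until every piece fits `max_chars`."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    if len(text) <= max_chars:
        return [text]
    # Preference 1: sentence boundary; preference 2: word break.
    cut = find_near_midpoint(text, SENTENCE_ENDINGS)
    if cut is None:
        cut = find_near_midpoint(text, WORD_BREAKS)
    if cut is not None:
        first, second = text[: cut + 1], text[cut + 1 :]
    else:
        # Preference 3: midpoint split with a duplicated region shared by
        # the end of the first part and the start of the second.
        mid = len(text) // 2
        half = int(len(text) * overlap_ratio) // 2
        first, second = text[: mid + half], text[mid - half :]
    return split_span(first, max_chars, overlap_ratio) + split_span(
        second, max_chars, overlap_ratio
    )
```

For instance, `split_span("hello world", 8)` cuts at the space, while a 30-character string with no punctuation or spaces falls through to the overlapping midpoint split.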
36
36
@@ -100,15 +100,31 @@ flowchart TD
100
100
101
101
## Cross-page boundary repair
102
102
103
-
To mitigate artificial breaks introduced by page segmentation, the algorithm attempts a merge between the trailing chunk of the previous page and the first chunk of the current page when ALL of the following hold:
103
+
Page boundaries frequently slice a sentence in half, due to the way PDFs and other document formats handle text layout. The repair phase tries to re‑stitch such sentences so that downstream retrieval does not see an artificial break.
104
104
105
-
* Prior chunk does not end with a recognized sentence terminator.
106
-
* New chunk begins with a lowercase letter (suggesting continuation), and does not resemble a heading or list.
107
-
* Combined size (after tentative merge) fits the token cap (500) and soft char budget (<= 1.2 * 1000 chars after normalization); if not, a fragment shift may relocate the unfinished trailing clause to the next chunk instead.
108
-
* The first part of the new chunk is not an immediate `<figure>`.
109
-
* Safe concatenation inserts a single space only if the last character of the prior chunk and the first character of the next are both alphanumeric and there is no existing whitespace boundary (nor a closing `>` tag at the join). Otherwise the texts are directly concatenated.
105
+
There are two strategies, attempted in order:
110
106
111
-
If merging would exceed limits, a secondary strategy attempts a "fragment shift": locate the last sentence-ending punctuation in the previous chunk, treat everything after it as a trailing fragment, and prepend as much of that fragment to the next chunk as budgets allow (splitting or trimming further if necessary). Residual fragment pieces are inserted as separate chunks if still non-empty.
107
+
1. Full merge (ideal path)
108
+
2. Trailing sentence fragment carry‑forward
109
+
110
+
### 1. Full merge
111
+
112
+
We first try to simply glue the last chunk of Page N to the first chunk of Page N+1. This is only allowed when ALL of these hold:
113
+
114
+
* Previous chunk does not already end in sentence‑terminating punctuation.
115
+
* First new chunk starts with a lowercase letter (heuristic for continuation), is not detected as a heading / list, and does not begin with a `<figure>`.
116
+
* The concatenated text fits BOTH the token limit (500) AND the soft length budget (<= 1.2 × 1000 = 1,200 chars after normalization).
117
+
118
+
If all pass, the two chunks are merged into one, with a single space injected between them if necessary.
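
The full-merge conditions above might be checked along these lines. This is a hedged sketch under several assumptions: `try_full_merge` and `safe_concat` are hypothetical names, the default token counter is a crude whitespace split standing in for a real tokenizer, normalization is skipped, and the lowercase-start test is relied on to exclude headings, list markers, and `<figure>` openings (none of which begin with a lowercase letter).

```python
from typing import Callable, Optional

MAX_TOKENS = 500
SOFT_MAX_CHARS = int(1.2 * 1000)  # 1,200 characters
SENTENCE_ENDINGS = (".", "!", "?")


def safe_concat(prev: str, nxt: str) -> str:
    """Join two fragments, adding one space only when both adjoining
    characters are alphanumeric. Whitespace or a closing '>' at the join
    is never alphanumeric, so those cases concatenate directly."""
    if prev and nxt and prev[-1].isalnum() and nxt[0].isalnum():
        return prev + " " + nxt
    return prev + nxt


def try_full_merge(
    prev: str,
    nxt: str,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> Optional[str]:
    """Return the merged chunk, or None if any condition fails."""
    # Previous chunk must not already end a sentence.
    if prev.rstrip().endswith(SENTENCE_ENDINGS):
        return None
    # Next chunk must look like a lowercase continuation.
    head = nxt.lstrip()
    if not head or not head[0].islower():
        return None
    merged = safe_concat(prev, nxt)
    # Both budgets must hold for the merged text.
    if count_tokens(merged) > MAX_TOKENS or len(merged) > SOFT_MAX_CHARS:
        return None
    return merged


print(try_full_merge("The quick brown", "fox jumps."))
```

Returning `None` rather than raising leaves the caller free to fall back to the second strategy.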
119
+
120
+
### 2. Trailing sentence fragment carry‑forward
121
+
122
+
If a full merge would violate limits, we do a more surgical repair: pull only the dangling sentence fragment from the end of the previous chunk and move it forward so it reunites with its continuation at the start of the next page.
123
+
124
+
Key differences from semantic overlap:
125
+
126
+
* Carry‑forward MOVES text (no duplication except any recursive split overlap that may occur later). Semantic overlap DUPLICATES a small preview from the next chunk.
127
+
* Carry‑forward only activates across a page boundary when a full merge is too large. Semantic overlap is routine and size‑capped.
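
To make the move-versus-duplicate distinction above concrete, here is a minimal sketch of both operations. The function names `carry_forward` and `semantic_overlap` are hypothetical, budgets and recursive re-splitting are ignored, sizes are in characters, and trimming the preview back to a word boundary is an assumption about what "trimmed prefix" means.

```python
from typing import Tuple


def carry_forward(prev: str, nxt: str) -> Tuple[str, str]:
    """MOVE the dangling fragment after the last sentence terminator in
    `prev` to the front of `nxt`. No text is duplicated."""
    cut = max(prev.rfind(c) for c in ".!?")
    if cut == -1:
        return prev, nxt  # no complete sentence to keep; leave unchanged
    fragment = prev[cut + 1 :].strip()
    if not fragment:
        return prev, nxt  # nothing dangling
    return prev[: cut + 1], fragment + " " + nxt.lstrip()


def semantic_overlap(
    prev: str, nxt: str, max_section_chars: int = 1000, ratio: float = 0.10
) -> Tuple[str, str]:
    """DUPLICATE a short preview of `nxt` onto the end of `prev`;
    `nxt` itself is returned unchanged."""
    limit = int(max_section_chars * ratio)
    preview = nxt[:limit]
    # Assumed trimming rule: avoid ending the preview mid-word.
    if limit < len(nxt) and not nxt[limit].isspace() and " " in preview:
        preview = preview[: preview.rfind(" ")]
    preview = preview.strip()
    if not preview:
        return prev, nxt
    return f"{prev} {preview}", nxt


print(carry_forward("Done here. This clause", "continues on."))
print(semantic_overlap("First chunk.", "Second chunk text.", max_section_chars=100))
```

After `carry_forward`, the total text is unchanged and only the chunk boundary has shifted; after `semantic_overlap`, the preview appears in both chunks.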
112
128
113
129
## Chunk normalization
114
130
@@ -255,7 +271,7 @@ Follow-up sentence.
255
271
256
272
Mid-sentence boundary satisfied merge conditions; remainder forms a second chunk.
257
273
258
-
### Example 5: Fragment shift when merge too large
274
+
### Example 5: Trailing sentence fragment carry‑forward when merge too large
259
275
260
276
⬅️ **Page A:**
261
277
@@ -266,7 +282,7 @@ Intro sentence finishes here. This clause is long but near the limit and the fol
266
282
⬅️ **Page B:**
267
283
268
284
```text
269
-
so a fragment shift moves this trailing portion forward. Remaining context continues here.
285
+
so the trailing fragment carry‑forward moves this trailing portion forward. Remaining context continues here.
270
286
```
271
287
272
288
➡️ **Output:**
@@ -276,7 +292,7 @@ Chunk 0:
276
292
Intro sentence finishes here.
277
293
278
294
Chunk 1:
279
-
This clause is long but near the limit and the following portion would push it over so a fragment shift moves this trailing portion forward. Remaining context continues here.
295
+
This clause is long but near the limit and the following portion would push it over so the trailing fragment carry‑forward moves this trailing portion forward. Remaining context continues here.