Commit c838ad5

Rephrase fragment shift to be more grokkable
1 parent d52de82 commit c838ad5

File tree

4 files changed: +43 −23 lines


app/backend/prepdocslib/page.py

Lines changed: 1 addition & 1 deletion

@@ -34,7 +34,7 @@ class Page:
 @dataclass
 class Chunk:
     """Semantic chunk emitted by the splitter (may originate wholly within one page
-    or be the result of a cross-page merge / fragment shift).
+    or be the result of a cross-page merge / trailing fragment carry-forward).

     Attributes:
         page_num (int): Logical source page number (0-indexed) for the originating
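For orientation, the dataclass this hunk touches can be sketched roughly as below. Only `page_num` and the docstring are confirmed by the visible diff context; the `text` field is an assumed illustration of the remaining attributes, not the repo's exact definition.

```python
# Rough sketch of the Chunk dataclass from prepdocslib/page.py, reconstructed
# from the diff context above. The `text` field is an ASSUMPTION for
# illustration; only page_num appears in the visible context.
from dataclasses import dataclass


@dataclass
class Chunk:
    """Semantic chunk emitted by the splitter (may originate wholly within one page
    or be the result of a cross-page merge / trailing fragment carry-forward)."""

    page_num: int  # logical source page number (0-indexed)
    text: str      # assumed: the chunk's text content
```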

docs/customization.md

Lines changed: 5 additions & 1 deletion

@@ -50,9 +50,11 @@ If you followed the instructions in [the multimodal guide](multimodal.md) to ena
 there are several differences in the chat approach:

 1. **Query rewriting**: Unchanged.
-2. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
+2. **Search**: For this step, it calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
 3. **Answering**: When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.

+The settings can be customized to disable calculating the image vector embeddings or to disable sending image inputs to the LLM, if desired.
+
 #### Ask approach

 The ask tab uses the approach programmed in [retrievethenread.py](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/retrievethenread.py).

@@ -70,6 +72,8 @@ there are several differences in the ask approach:
 1. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
 2. **Answering**: When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.

+The settings can be customized to disable calculating the image vector embeddings or to disable sending image inputs to the LLM, if desired.
+
 #### Making settings overrides permanent

 The UI provides a "Developer Settings" menu for customizing the approaches, like disabling semantic ranker or using vector search.
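The "converts it to a base 64 encoding" step in the Search description above can be sketched with the standard library. This is an illustrative stand-in, not the app's actual code: the helper name and the data-URL message shape are assumptions (following the common OpenAI-style `image_url` format).

```python
# Illustrative sketch (NOT the app's code): turn downloaded image bytes into
# a base 64 data URL suitable for a multimodal chat message.
import base64


def image_bytes_to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL for inclusion in an LLM message."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```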

docs/textsplitter.md

Lines changed: 29 additions & 13 deletions

@@ -17,7 +17,7 @@ The `SentenceTextSplitter` is designed to:
 1. Produce semantically coherent chunks that align with sentence boundaries.
 2. Respect a maximum token count per chunk (hard limit of 500 tokens) plus a soft character length guideline (default 1,000 characters with a 20% overflow tolerance for merges / normalization). Size limit does not apply to figure blocks (chunks containing a `<figure>` may exceed the token limit; figures are never split).
 3. Keep structural figure placeholders (`<figure>...</figure>`) atomic: never split internally and always attach them to preceding accumulated text if any exists.
-4. Repair mid‑sentence page breaks when safe via merge or fragment shift heuristics while enforcing token + soft character budgets.
+4. Repair mid‑sentence page breaks when possible, while enforcing token + soft character budgets.
 5. Avoid empty outputs or unclosed figure tags.
 6. Perform a light normalization pass (trim only minimal leading/trailing whitespace that would cause small overflows; do not modify figure chunks).
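The budgets in goal 2 can be expressed as a small predicate. The sketch below is a hypothetical stand-in for the splitter's size check, not the repo's actual `fits()` logic, and the whitespace token count is an assumed substitute for the real tokenizer.

```python
# Hypothetical sketch of the size budgets described above (not the actual
# SentenceTextSplitter code). Whitespace splitting stands in for the real
# model tokenizer.
MAX_TOKENS_PER_SECTION = 500  # hard token cap per chunk
MAX_SECTION_LENGTH = 1000     # soft character guideline
OVERFLOW_TOLERANCE = 1.2      # 20% overflow allowed for merges / normalization


def approx_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer."""
    return len(text.split())


def fits(text: str) -> bool:
    """True when a candidate chunk respects both the token cap and the soft char budget."""
    return (
        approx_tokens(text) <= MAX_TOKENS_PER_SECTION
        and len(text) <= MAX_SECTION_LENGTH * OVERFLOW_TOLERANCE
    )
```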

@@ -28,9 +28,9 @@ The splitter includes these components:
 * Recursive subdivision of oversized individual spans using a boundary preference order:
   1. Sentence-ending punctuation near the midpoint (scan within the central third of the span).
   2. If no sentence boundary is found, a word break (space / punctuation from a configured list) near the midpoint to avoid mid‑word cuts.
-  3. If neither boundary type is found, a symmetric 10% overlap midpoint split (duplicated region appears at the end of the first part and the start of the second) preserves continuity.
+  3. If neither boundary type is found, we use a simpler midpoint split with 10% overlap (duplicated region appears at the end of the first part and the start of the second) to preserve continuity.
 * Figure handling is front‑loaded: figure blocks are extracted first and treated as atomic before any span splitting or recursion on plain text.
-* Cross‑page merge of text when all safety checks pass (prior chunk ends mid‑sentence, next chunk starts lowercase, not a heading, no early figure) and combined size fits both token and soft char budgets; otherwise a fragment shift may move the trailing unfinished clause forward.
+* Cross‑page merge of text chunks when combined size fits within the allowed chunk size; otherwise a trailing sentence segment may be shifted forward to the next chunk.
 * A lightweight semantic overlap duplication pass (10% of max section length) that appends a trimmed prefix of the next chunk onto the end of the previous chunk (the next chunk itself is left unchanged). This is always attempted for adjacent non‑figure chunks on the same page and conditionally across a page boundary when the next chunk appears to be a direct lowercase continuation (and not a heading or figure). Figures are never overlapped/duplicated.
 * A safe concatenation rule inserts a space between merged page fragments only when both adjoining characters are alphanumeric and no existing whitespace or HTML tag boundary (`>`) already separates them.
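The fallback midpoint split with 10% overlap (step 3 in the boundary preference order above) can be sketched as follows. The helper name is hypothetical; this is an illustration of the described technique, not the repo's implementation.

```python
# Illustrative sketch (hypothetical helper, not the repo's code) of the
# fallback midpoint split with a symmetric 10% overlap: the duplicated
# region ends the first part and starts the second, preserving continuity.
def overlap_midpoint_split(text: str, overlap_fraction: float = 0.10) -> tuple[str, str]:
    mid = len(text) // 2
    half_overlap = max(1, int(len(text) * overlap_fraction) // 2)
    first = text[: mid + half_overlap]   # duplicated region at the end of part 1...
    second = text[mid - half_overlap:]   # ...reappears at the start of part 2
    return first, second
```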

@@ -100,15 +100,31 @@ flowchart TD

 ## Cross-page boundary repair

-To mitigate artificial breaks introduced by page segmentation, the algorithm attempts a merge between the trailing chunk of the previous page and the first chunk of the current page when ALL of the following hold:
+Page boundaries frequently slice a sentence in half, due to the way PDFs and other document formats handle text layout. The repair phase tries to re‑stitch this so downstream retrieval does not see an artificial break.

-* Prior chunk does not end with a recognized sentence terminator.
-* New chunk begins with a lowercase letter (suggesting continuation), and does not resemble a heading or list.
-* Combined size (after tentative merge) fits the token cap (500) and soft char budget (<= 1.2 * 1000 chars after normalization); if not, a fragment shift may relocate the unfinished trailing clause to the next chunk instead.
-* The first part of the new chunk is not an immediate `<figure>`.
-* Safe concatenation inserts a single space only if the last character of the prior chunk and the first character of the next are both alphanumeric and there is no existing whitespace boundary (nor a closing `>` tag at the join). Otherwise the texts are directly concatenated.
+There are two strategies, attempted in order:

-If merging would exceed limits, a secondary strategy attempts a "fragment shift": locate the last sentence-ending punctuation in the previous chunk, treat everything after it as a trailing fragment, and prepend as much of that fragment to the next chunk as budgets allow (splitting or trimming further if necessary). Residual fragment pieces are inserted as separate chunks if still non-empty.
+1. Full merge (ideal path)
+2. Trailing sentence fragment carry‑forward
+
+### 1. Full merge
+
+We first try to simply glue the last chunk of Page N to the first chunk of Page N+1. This is only allowed when ALL of these hold:
+
+* Previous chunk does not already end in sentence‑terminating punctuation.
+* First new chunk starts with a lowercase letter (heuristic for continuation), is not detected as a heading / list, and does not begin with a `<figure>`.
+* The concatenated text fits BOTH: token limit (500) AND soft length budget (<= 1.2 × 1000 chars after normalization).
+
+If all pass, the two chunks are merged into one, with an injected whitespace between them if necessary.
+
+### 2. Trailing sentence fragment carry‑forward
+
+If a full merge would violate limits, we do a more surgical repair: pull only the dangling sentence fragment from the end of the previous chunk and move it forward so it reunites with its continuation at the start of the next page.
+
+Key differences from semantic overlap:
+
+* Carry‑forward MOVES text (no duplication except any recursive split overlap that may occur later). Semantic overlap DUPLICATES a small preview from the next chunk.
+* Carry‑forward only activates across a page boundary when a full merge is too large. Semantic overlap is routine and size‑capped.

 ## Chunk normalization

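The two repair strategies and the safe concatenation rule can be sketched together as one decision function. Everything below is a simplified illustration under stated assumptions: the helper names (`repair`, `safe_concat`, `fits`) are hypothetical, `fits` uses only a character budget as a stand-in for the real token + char checks, and the real splitter additionally trims or splits an oversized carried fragment, which this sketch omits.

```python
# Hypothetical sketch of the cross-page repair decision (not the repo's
# actual SentenceTextSplitter code).
SENTENCE_ENDINGS = ".!?"


def fits(text: str, max_chars: int = 1200) -> bool:
    # Stand-in for the real token-limit + soft-char-budget check.
    return len(text) <= max_chars


def safe_concat(prev: str, nxt: str) -> str:
    """Insert a single space only when both adjoining characters are
    alphanumeric; otherwise concatenate directly (covers existing
    whitespace and a closing '>' tag at the join)."""
    if prev and nxt and prev[-1].isalnum() and nxt[0].isalnum():
        return prev + " " + nxt
    return prev + nxt


def repair(prev: str, nxt: str) -> list[str]:
    # Strategy 1: full merge when prev ends mid-sentence and nxt looks like
    # a lowercase continuation that is not a figure.
    if prev and prev[-1] not in SENTENCE_ENDINGS and nxt[:1].islower() and not nxt.startswith("<figure"):
        merged = safe_concat(prev, nxt)
        if fits(merged):
            return [merged]
        # Strategy 2: carry forward only the dangling fragment after the last
        # sentence terminator. (The real splitter also trims/splits the
        # fragment to budgets; omitted here for brevity.)
        last_end = max(prev.rfind(c) for c in SENTENCE_ENDINGS)
        if last_end != -1:
            retained = prev[: last_end + 1]
            fragment = prev[last_end + 1:].strip()
            return [retained, safe_concat(fragment, nxt)]
    return [prev, nxt]
```

For example, a short dangling clause merges fully, while an oversized merge falls back to moving only the fragment after the last full stop.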
@@ -255,7 +271,7 @@ Follow-up sentence.

 Mid-sentence boundary satisfied merge conditions; remainder forms a second chunk.

-### Example 5: Fragment shift when merge too large
+### Example 5: Trailing sentence fragment carry‑forward when merge too large

 ⬅️ **Page A:**

@@ -266,7 +282,7 @@ Intro sentence finishes here. This clause is long but near the limit and the fol
 ⬅️ **Page B:**

 ```text
-so a fragment shift moves this trailing portion forward. Remaining context continues here.
+so the trailing fragment carry‑forward moves this trailing portion forward. Remaining context continues here.
 ```

 ➡️ **Output:**
@@ -276,7 +292,7 @@ Chunk 0:
 Intro sentence finishes here.

 Chunk 1:
-This clause is long but near the limit and the following portion would push it over so a fragment shift moves this trailing portion forward. Remaining context continues here.
+This clause is long but near the limit and the following portion would push it over so the trailing fragment carry‑forward moves this trailing portion forward. Remaining context continues here.
 ```

 💬 **Explanation:**

tests/test_prepdocslib_textsplitter.py

Lines changed: 8 additions & 8 deletions

@@ -319,7 +319,7 @@ def test_recursive_split_uses_sentence_boundary():


 def test_cross_page_merge_fragment_shift_no_sentence_end():
-    """Cross-page merge failing due to size triggers fragment shift when previous chunk has no sentence ending."""
+    """Cross-page merge failing due to size triggers trailing fragment carry-forward when previous chunk has no sentence ending."""
     splitter = SentenceTextSplitter(max_tokens_per_section=40)
     splitter.max_section_length = 120
     # Previous page produces one chunk without punctuation so last_end = -1 (fragment_start=0).
@@ -330,13 +330,13 @@ def test_cross_page_merge_fragment_shift_no_sentence_end():
     # Ensure we did NOT merge whole (would have uppercase W then lowercase c joined) but did shift some fragment
     joined_texts = "||".join(c.text for c in chunks)
     assert "word word" in joined_texts  # previous content still present somewhere
-    # Because fragment shift moves everything (no sentence end) previous chunk should become None and its content redistributed
+    # Because trailing fragment carry-forward moves everything (no sentence end) previous chunk should become None and its content redistributed
     # Expect at least one chunk starting with a moved fragment part 'word'
     assert any(c.text.startswith("word") for c in chunks)


 def test_cross_page_merge_fragment_shift_with_sentence_end_and_shortening():
-    """Cross-page merge fragment shift path where a fragment contains an internal sentence boundary allowing shortening."""
+    """Cross-page merge trailing fragment carry-forward path where a fragment contains an internal sentence boundary allowing shortening."""
     splitter = SentenceTextSplitter(max_tokens_per_section=60)
     splitter.max_section_length = 120
     # Previous chunk ends mid-sentence but contains an earlier full stop to anchor retained portion
@@ -347,15 +347,15 @@ def test_cross_page_merge_fragment_shift_with_sentence_end_and_shortening():
     )
     # Force previous page to emit as single chunk
     page1 = Page(page_num=0, offset=0, text=prev_text)
-    # Next page begins lowercase continuation so merge attempted then fragment shift triggered
+    # Next page begins lowercase continuation so merge attempted then trailing fragment carry-forward triggered
     page2 = Page(page_num=1, offset=0, text="continuation that keeps going with additional trailing words.")
     chunks = list(splitter.split_pages([page1, page2]))
     # We expect retained intro sentence as its own (ends with '.') and a following chunk starting with moved fragment
     retained_present = any(c.text.strip().startswith("Intro sentence.") for c in chunks)
     moved_fragment_present = any(
         c.text.strip().startswith("Second part") or c.text.strip().startswith("Second part that") for c in chunks
     )
-    assert retained_present, "Retained portion after fragment shift missing"
+    assert retained_present, "Retained portion after trailing fragment carry-forward missing"
     assert moved_fragment_present, "Moved (shortened) fragment not found in any chunk"

@@ -491,7 +491,7 @@ def test_recursive_split_overlap_fallback_when_no_word_breaks():


 def test_fragment_shift_token_limit_fits_false():
-    """Trigger fragment shift where fits() fails solely due to token limit (not char length) and trimming loop runs."""
+    """Trigger trailing fragment carry-forward where fits() fails solely due to token limit (not char length) and trimming loop runs."""
     # Configure large char allowance so only token constraint matters.
     splitter = SentenceTextSplitter(max_tokens_per_section=50)
     splitter.max_section_length = 5000  # very high to avoid char-based fits() failure
@@ -565,10 +565,10 @@ def test_normalization_trims_trailing_space_overflow():


 def test_cross_page_fragment_shortening_path():
-    """Exercise fragment shift after a complete sentence; ensures part of trailing fragment is moved and retained sentence stays."""
+    """Exercise trailing fragment carry-forward after a complete sentence; ensures part of trailing fragment is moved and retained sentence stays."""
     splitter = SentenceTextSplitter(max_tokens_per_section=55)
     splitter.max_section_length = 120
-    # Previous ends with two sentences, second incomplete to encourage fragment shift
+    # Previous ends with two sentences, second incomplete to encourage trailing fragment carry-forward
     prev = "Complete sentence one. Asecondpartthatshouldbeshortened because it is long"
    page1 = Page(page_num=0, offset=0, text=prev)
     nxt = "continues here with lowercase start."  # triggers merge attempt (lowercase)
