Commit c838ad5

Rephrase fragment shift to be more grokkable
1 parent d52de82 commit c838ad5

File tree

4 files changed: +43 −23 lines


app/backend/prepdocslib/page.py

Lines changed: 1 addition & 1 deletion

@@ -34,7 +34,7 @@ class Page:
 @dataclass
 class Chunk:
     """Semantic chunk emitted by the splitter (may originate wholly within one page
-    or be the result of a cross-page merge / fragment shift).
+    or be the result of a cross-page merge / trailing fragment carry-forward).

     Attributes:
         page_num (int): Logical source page number (0-indexed) for the originating
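For orientation, the dataclass this hunk touches can be sketched roughly as below. Only `page_num` and the docstring are confirmed by the visible diff context; the `text` field is an assumed illustration of the remaining attributes, not the repo's exact definition.

```python
# Rough sketch of the Chunk dataclass from prepdocslib/page.py, reconstructed
# from the diff context above. The `text` field is an ASSUMPTION for
# illustration; only page_num appears in the visible context.
from dataclasses import dataclass


@dataclass
class Chunk:
    """Semantic chunk emitted by the splitter (may originate wholly within one page
    or be the result of a cross-page merge / trailing fragment carry-forward)."""

    page_num: int  # logical source page number (0-indexed)
    text: str      # assumed: the chunk's text content
```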

docs/customization.md

Lines changed: 5 additions & 1 deletion

@@ -50,9 +50,11 @@ If you followed the instructions in [the multimodal guide](multimodal.md) to ena
 there are several differences in the chat approach:

 1. **Query rewriting**: Unchanged.
-2. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
+2. **Search**: For this step, it calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
 3. **Answering**: When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.

+The settings can be customized to disable calculating the image vector embeddings or to disable sending image inputs to the LLM, if desired.
+
 #### Ask approach

 The ask tab uses the approach programmed in [retrievethenread.py](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/retrievethenread.py).

@@ -70,6 +72,8 @@ there are several differences in the ask approach:
 1. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
 2. **Answering**: When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.

+The settings can be customized to disable calculating the image vector embeddings or to disable sending image inputs to the LLM, if desired.
+
 #### Making settings overrides permanent

 The UI provides a "Developer Settings" menu for customizing the approaches, like disabling semantic ranker or using vector search.
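The "converts it to a base 64 encoding" step in the Search description above can be sketched with the standard library. This is an illustrative stand-in, not the app's actual code: the helper name and the data-URL message shape are assumptions (following the common OpenAI-style `image_url` format).

```python
# Illustrative sketch (NOT the app's code): turn downloaded image bytes into
# a base 64 data URL suitable for a multimodal chat message.
import base64


def image_bytes_to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL for inclusion in an LLM message."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```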

docs/textsplitter.md

Lines changed: 29 additions & 13 deletions

@@ -17,7 +17,7 @@ The `SentenceTextSplitter` is designed to:
 1. Produce semantically coherent chunks that align with sentence boundaries.
 2. Respect a maximum token count per chunk (hard limit of 500 tokens) plus a soft character length guideline (default 1,000 characters with a 20% overflow tolerance for merges / normalization). Size limit does not apply to figure blocks (chunks containing a `<figure>` may exceed the token limit; figures are never split).
 3. Keep structural figure placeholders (`<figure>...</figure>`) atomic: never split internally and always attach them to preceding accumulated text if any exists.
-4. Repair mid‑sentence page breaks when safe via merge or fragment shift heuristics while enforcing token + soft character budgets.
+4. Repair mid‑sentence page breaks when possible, while enforcing token + soft character budgets.
 5. Avoid empty outputs or unclosed figure tags.
 6. Perform a light normalization pass (trim only minimal leading/trailing whitespace that would cause small overflows; do not modify figure chunks).
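The budgets in goal 2 can be expressed as a small predicate. The sketch below is a hypothetical stand-in for the splitter's size check, not the repo's actual `fits()` logic, and the whitespace token count is an assumed substitute for the real tokenizer.

```python
# Hypothetical sketch of the size budgets described above (not the actual
# SentenceTextSplitter code). Whitespace splitting stands in for the real
# model tokenizer.
MAX_TOKENS_PER_SECTION = 500  # hard token cap per chunk
MAX_SECTION_LENGTH = 1000     # soft character guideline
OVERFLOW_TOLERANCE = 1.2      # 20% overflow allowed for merges / normalization


def approx_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer."""
    return len(text.split())


def fits(text: str) -> bool:
    """True when a candidate chunk respects both the token cap and the soft char budget."""
    return (
        approx_tokens(text) <= MAX_TOKENS_PER_SECTION
        and len(text) <= MAX_SECTION_LENGTH * OVERFLOW_TOLERANCE
    )
```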

@@ -28,9 +28,9 @@ The splitter includes these components:
 * Recursive subdivision of oversized individual spans using a boundary preference order:
   1. Sentence-ending punctuation near the midpoint (scan within the central third of the span).
   2. If no sentence boundary is found, a word break (space / punctuation from a configured list) near the midpoint to avoid mid‑word cuts.
-  3. If neither boundary type is found, a symmetric 10% overlap midpoint split (duplicated region appears at the end of the first part and the start of the second) preserves continuity.
+  3. If neither boundary type is found, we use a simpler midpoint split with 10% overlap (duplicated region appears at the end of the first part and the start of the second) to preserve continuity.
 * Figure handling is front‑loaded: figure blocks are extracted first and treated as atomic before any span splitting or recursion on plain text.
-* Cross‑page merge of text when all safety checks pass (prior chunk ends mid‑sentence, next chunk starts lowercase, not a heading, no early figure) and combined size fits both token and soft char budgets; otherwise a fragment shift may move the trailing unfinished clause forward.
+* Cross‑page merge of text chunks when combined size fits within the allowed chunk size; otherwise a trailing sentence segment may be shifted forward to the next chunk.
 * A lightweight semantic overlap duplication pass (10% of max section length) that appends a trimmed prefix of the next chunk onto the end of the previous chunk (the next chunk itself is left unchanged). This is always attempted for adjacent non‑figure chunks on the same page and conditionally across a page boundary when the next chunk appears to be a direct lowercase continuation (and not a heading or figure). Figures are never overlapped/duplicated.
 * A safe concatenation rule inserts a space between merged page fragments only when both adjoining characters are alphanumeric and no existing whitespace or HTML tag boundary (`>`) already separates them.
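The fallback midpoint split with 10% overlap (step 3 in the boundary preference order above) can be sketched as follows. The helper name is hypothetical; this is an illustration of the described technique, not the repo's implementation.

```python
# Illustrative sketch (hypothetical helper, not the repo's code) of the
# fallback midpoint split with a symmetric 10% overlap: the duplicated
# region ends the first part and starts the second, preserving continuity.
def overlap_midpoint_split(text: str, overlap_fraction: float = 0.10) -> tuple[str, str]:
    mid = len(text) // 2
    half_overlap = max(1, int(len(text) * overlap_fraction) // 2)
    first = text[: mid + half_overlap]   # duplicated region at the end of part 1...
    second = text[mid - half_overlap:]   # ...reappears at the start of part 2
    return first, second
```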

@@ -100,15 +100,31 @@ flowchart TD

 ## Cross-page boundary repair

-To mitigate artificial breaks introduced by page segmentation, the algorithm attempts a merge between the trailing chunk of the previous page and the first chunk of the current page when ALL of the following hold:
+Page boundaries frequently slice a sentence in half, due to the way PDFs and other document formats handle text layout. The repair phase tries to re‑stitch this so downstream retrieval does not see an artificial break.

-* Prior chunk does not end with a recognized sentence terminator.
-* New chunk begins with a lowercase letter (suggesting continuation), and does not resemble a heading or list.
-* Combined size (after tentative merge) fits the token cap (500) and soft char budget (<= 1.2 * 1000 chars after normalization); if not, a fragment shift may relocate the unfinished trailing clause to the next chunk instead.
-* The first part of the new chunk is not an immediate `<figure>`.
-* Safe concatenation inserts a single space only if the last character of the prior chunk and the first character of the next are both alphanumeric and there is no existing whitespace boundary (nor a closing `>` tag at the join). Otherwise the texts are directly concatenated.
+There are two strategies, attempted in order:

-If merging would exceed limits, a secondary strategy attempts a "fragment shift": locate the last sentence-ending punctuation in the previous chunk, treat everything after it as a trailing fragment, and prepend as much of that fragment to the next chunk as budgets allow (splitting or trimming further if necessary). Residual fragment pieces are inserted as separate chunks if still non-empty.
+1. Full merge (ideal path)
+2. Trailing sentence fragment carry‑forward
+
+### 1. Full merge
+
+We first try to simply glue the last chunk of Page N to the first chunk of Page N+1. This is only allowed when ALL of these hold:
+
+* Previous chunk does not already end in sentence‑terminating punctuation.
+* First new chunk starts with a lowercase letter (heuristic for continuation), is not detected as a heading / list, and does not begin with a `<figure>`.
+* The concatenated text fits BOTH: token limit (500) AND soft length budget (<= 1.2 × 1000 chars after normalization).
+
+If all pass, the two chunks are merged into one, with an injected whitespace between them if necessary.
+
+### 2. Trailing sentence fragment carry‑forward
+
+If a full merge would violate limits, we do a more surgical repair: pull only the dangling sentence fragment from the end of the previous chunk and move it forward so it reunites with its continuation at the start of the next page.
+
+Key differences from semantic overlap:
+
+* Carry‑forward MOVES text (no duplication except any recursive split overlap that may occur later). Semantic overlap DUPLICATES a small preview from the next chunk.
+* Carry‑forward only activates across a page boundary when a full merge is too large. Semantic overlap is routine and size‑capped.

 ## Chunk normalization

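The two repair strategies and the safe concatenation rule can be sketched together as one decision function. Everything below is a simplified illustration under stated assumptions: the helper names (`repair`, `safe_concat`, `fits`) are hypothetical, `fits` uses only a character budget as a stand-in for the real token + char checks, and the real splitter additionally trims or splits an oversized carried fragment, which this sketch omits.

```python
# Hypothetical sketch of the cross-page repair decision (not the repo's
# actual SentenceTextSplitter code).
SENTENCE_ENDINGS = ".!?"


def fits(text: str, max_chars: int = 1200) -> bool:
    # Stand-in for the real token-limit + soft-char-budget check.
    return len(text) <= max_chars


def safe_concat(prev: str, nxt: str) -> str:
    """Insert a single space only when both adjoining characters are
    alphanumeric; otherwise concatenate directly (covers existing
    whitespace and a closing '>' tag at the join)."""
    if prev and nxt and prev[-1].isalnum() and nxt[0].isalnum():
        return prev + " " + nxt
    return prev + nxt


def repair(prev: str, nxt: str) -> list[str]:
    # Strategy 1: full merge when prev ends mid-sentence and nxt looks like
    # a lowercase continuation that is not a figure.
    if prev and prev[-1] not in SENTENCE_ENDINGS and nxt[:1].islower() and not nxt.startswith("<figure"):
        merged = safe_concat(prev, nxt)
        if fits(merged):
            return [merged]
        # Strategy 2: carry forward only the dangling fragment after the last
        # sentence terminator. (The real splitter also trims/splits the
        # fragment to budgets; omitted here for brevity.)
        last_end = max(prev.rfind(c) for c in SENTENCE_ENDINGS)
        if last_end != -1:
            retained = prev[: last_end + 1]
            fragment = prev[last_end + 1:].strip()
            return [retained, safe_concat(fragment, nxt)]
    return [prev, nxt]
```

For example, a short dangling clause merges fully, while an oversized merge falls back to moving only the fragment after the last full stop.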
@@ -255,7 +271,7 @@ Follow-up sentence.

 Mid-sentence boundary satisfied merge conditions; remainder forms a second chunk.

-### Example 5: Fragment shift when merge too large
+### Example 5: Trailing sentence fragment carry‑forward when merge too large

 ⬅️ **Page A:**

@@ -266,7 +282,7 @@ Intro sentence finishes here. This clause is long but near the limit and the fol
 ⬅️ **Page B:**

 ```text
-so a fragment shift moves this trailing portion forward. Remaining context continues here.
+so the trailing fragment carry‑forward moves this trailing portion forward. Remaining context continues here.
 ```

 ➡️ **Output:**
@@ -276,7 +292,7 @@ Chunk 0:
 Intro sentence finishes here.

 Chunk 1:
-This clause is long but near the limit and the following portion would push it over so a fragment shift moves this trailing portion forward. Remaining context continues here.
+This clause is long but near the limit and the following portion would push it over so the trailing fragment carry‑forward moves this trailing portion forward. Remaining context continues here.
 ```

 💬 **Explanation:**

tests/test_prepdocslib_textsplitter.py

Lines changed: 8 additions & 8 deletions

@@ -319,7 +319,7 @@ def test_recursive_split_uses_sentence_boundary():


 def test_cross_page_merge_fragment_shift_no_sentence_end():
-    """Cross-page merge failing due to size triggers fragment shift when previous chunk has no sentence ending."""
+    """Cross-page merge failing due to size triggers trailing fragment carry-forward when previous chunk has no sentence ending."""
     splitter = SentenceTextSplitter(max_tokens_per_section=40)
     splitter.max_section_length = 120
     # Previous page produces one chunk without punctuation so last_end = -1 (fragment_start=0).
@@ -330,13 +330,13 @@ def test_cross_page_merge_fragment_shift_no_sentence_end():
     # Ensure we did NOT merge whole (would have uppercase W then lowercase c joined) but did shift some fragment
     joined_texts = "||".join(c.text for c in chunks)
     assert "word word" in joined_texts  # previous content still present somewhere
-    # Because fragment shift moves everything (no sentence end) previous chunk should become None and its content redistributed
+    # Because trailing fragment carry-forward moves everything (no sentence end) previous chunk should become None and its content redistributed
     # Expect at least one chunk starting with a moved fragment part 'word'
     assert any(c.text.startswith("word") for c in chunks)


 def test_cross_page_merge_fragment_shift_with_sentence_end_and_shortening():
-    """Cross-page merge fragment shift path where a fragment contains an internal sentence boundary allowing shortening."""
+    """Cross-page merge trailing fragment carry-forward path where a fragment contains an internal sentence boundary allowing shortening."""
     splitter = SentenceTextSplitter(max_tokens_per_section=60)
     splitter.max_section_length = 120
     # Previous chunk ends mid-sentence but contains an earlier full stop to anchor retained portion
@@ -347,15 +347,15 @@ def test_cross_page_merge_fragment_shift_with_sentence_end_and_shortening():
     )
     # Force previous page to emit as single chunk
     page1 = Page(page_num=0, offset=0, text=prev_text)
-    # Next page begins lowercase continuation so merge attempted then fragment shift triggered
+    # Next page begins lowercase continuation so merge attempted then trailing fragment carry-forward triggered
     page2 = Page(page_num=1, offset=0, text="continuation that keeps going with additional trailing words.")
     chunks = list(splitter.split_pages([page1, page2]))
     # We expect retained intro sentence as its own (ends with '.') and a following chunk starting with moved fragment
     retained_present = any(c.text.strip().startswith("Intro sentence.") for c in chunks)
     moved_fragment_present = any(
         c.text.strip().startswith("Second part") or c.text.strip().startswith("Second part that") for c in chunks
     )
-    assert retained_present, "Retained portion after fragment shift missing"
+    assert retained_present, "Retained portion after trailing fragment carry-forward missing"
     assert moved_fragment_present, "Moved (shortened) fragment not found in any chunk"

@@ -491,7 +491,7 @@ def test_recursive_split_overlap_fallback_when_no_word_breaks():


 def test_fragment_shift_token_limit_fits_false():
-    """Trigger fragment shift where fits() fails solely due to token limit (not char length) and trimming loop runs."""
+    """Trigger trailing fragment carry-forward where fits() fails solely due to token limit (not char length) and trimming loop runs."""
     # Configure large char allowance so only token constraint matters.
     splitter = SentenceTextSplitter(max_tokens_per_section=50)
     splitter.max_section_length = 5000  # very high to avoid char-based fits() failure
@@ -565,10 +565,10 @@ def test_normalization_trims_trailing_space_overflow():


 def test_cross_page_fragment_shortening_path():
-    """Exercise fragment shift after a complete sentence; ensures part of trailing fragment is moved and retained sentence stays."""
+    """Exercise trailing fragment carry-forward after a complete sentence; ensures part of trailing fragment is moved and retained sentence stays."""
     splitter = SentenceTextSplitter(max_tokens_per_section=55)
     splitter.max_section_length = 120
-    # Previous ends with two sentences, second incomplete to encourage fragment shift
+    # Previous ends with two sentences, second incomplete to encourage trailing fragment carry-forward
     prev = "Complete sentence one. Asecondpartthatshouldbeshortened because it is long"
    page1 = Page(page_num=0, offset=0, text=prev)
     nxt = "continues here with lowercase start."  # triggers merge attempt (lowercase)
