Commit a0c3b41

markdown issues
1 parent a2ff14a commit a0c3b41

1 file changed, +1 -2 lines changed
docs/textsplitter.md

Lines changed: 1 addition & 2 deletions
@@ -1,6 +1,6 @@
 # RAG Chat: Text splitting algorithm overview
 
-This document explains the chunking logic implemented in the [data ingestion pipeline](./data_ingestion.md). The [splitter module](../backend/prepdocslib/textsplitter.py) contains both a `SimpleTextSplitter` (used only for JSON files) and a `SentenceTextSplitter` (used for all other formats). This document focuses on the `SentenceTextSplitter` since its approach is far more complicated, and it can be difficult to follow the code.
+This document explains the chunking logic implemented in the [data ingestion pipeline](./data_ingestion.md). The [splitter module](../app/backend/prepdocslib/textsplitter.py) contains both a `SimpleTextSplitter` (used only for JSON files) and a `SentenceTextSplitter` (used for all other formats). This document focuses on the `SentenceTextSplitter` since its approach is far more complicated, and it can be difficult to follow the code.
 
 * [High-level overview](#high-level-overview)
 * [Splitting algorithm](#splitting-algorithm)
@@ -81,7 +81,6 @@ Steps:
 7. Recurse until all pieces are within the token cap.
 
 > Note: The 10% overlap is computed on raw character length (`len(text)`), not tokens, so the duplicated region is 2 × floor(0.10 * character_count) characters. Token counts can differ across the two halves.
-
 > Clarification: Recursion is triggered only when the *span itself* exceeds the token cap. If adding a span to the current accumulator would overflow but the span alone fits, the accumulator is flushed—recursion is not used in that case.
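To make the note's arithmetic concrete, here is a minimal Python sketch of a midpoint split with a character-based overlap, followed by the recursion from step 7. It is an illustration under assumptions, not the code in `textsplitter.py`: the midpoint split is inferred from the note's "two halves" wording, and `split_with_overlap`, `split_until_fits`, and the whitespace-based `count_tokens` are hypothetical names (a real pipeline would count tokens with its model's tokenizer).

```python
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_with_overlap(text: str, overlap_percent: int = 10) -> Tuple[str, str]:
    """Split text at its character midpoint, extending each half past the midpoint."""
    middle = len(text) // 2
    # The overlap is measured in characters, not tokens:
    # floor(overlap_percent / 100 * len(text)), computed with integer arithmetic.
    overlap = len(text) * overlap_percent // 100
    first_half = text[: middle + overlap]
    second_half = text[max(middle - overlap, 0):]
    return first_half, second_half


def split_until_fits(
    text: str,
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Recurse only while a piece on its own exceeds the token cap."""
    # The length guard stops recursion on pieces too short to split further.
    if counter(text) <= max_tokens or len(text) < 2:
        return [text]
    first_half, second_half = split_with_overlap(text)
    return (
        split_until_fits(first_half, max_tokens, counter)
        + split_until_fits(second_half, max_tokens, counter)
    )


if __name__ == "__main__":
    sample = "".join(chr(ord("a") + i % 26) for i in range(100))
    first_half, second_half = split_with_overlap(sample)
    print(len(first_half), len(second_half))
    # The tail of the first half and the head of the second half are the same text.
    print(first_half[-20:] == second_half[:20])
```

For a 100-character input, each half in this sketch is 60 characters long and the two halves share the middle 20 characters, which matches 2 × floor(0.10 × 100). Because the overlap is fixed in characters, the token counts of the two halves are not guaranteed to match.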
 
 ```mermaid