Commit 622bb05

fix(langchain): class HTMLSemanticPreservingSplitter ignores the text inside the div tag (#32213)
**Description:** The splitter now also collects any text that sits directly inside the "html", "body", "div", and "main" nodes. Previously that text was ignored.

**Issue:** Fixes #32206.
1 parent 56dde3a commit 622bb05

1 file changed, +4 -0 lines changed

libs/text-splitters/langchain_text_splitters/html.py

Lines changed: 4 additions & 0 deletions
@@ -842,6 +842,10 @@ def _process_element(
                     preserved_elements,
                     placeholder_count,
                 )
+                content = " ".join(elem.find_all(string=True, recursive=False))
+                if content:
+                    content = self._normalize_and_clean_text(content)
+                    current_content.append(content)
                 continue

             if elem.name in [h[0] for h in self._headers_to_split_on]:
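
The added lines rely on `find_all(string=True, recursive=False)`, which in Beautiful Soup returns only the text nodes that are direct children of an element and skips text nested in child tags. The sketch below illustrates that behaviour in isolation. It is not the splitter's code: the `direct_text` helper and the sample HTML are hypothetical, and the whitespace collapsing is only a stand-in for the private `_normalize_and_clean_text` method, whose exact behaviour is not shown in this diff.

```python
from bs4 import BeautifulSoup


def direct_text(elem) -> str:
    """Return the text that is a direct child of `elem` (hypothetical helper)."""
    # Same call as in the diff: direct string children only, no descent into tags.
    content = " ".join(elem.find_all(string=True, recursive=False))
    # Stand-in for the library's private normalization: collapse whitespace.
    return " ".join(content.split())


# Sample markup (assumed for illustration): text sits next to a nested <p>.
html = "<div>Intro text <p>Nested paragraph</p> trailing text</div>"
div = BeautifulSoup(html, "html.parser").find("div")

own_text = direct_text(div)
print(own_text)

# A container with no direct text yields an empty string, which is why the
# diff guards the append with `if content:`.
empty_div = BeautifulSoup("<div><p>only nested</p></div>", "html.parser").find("div")
no_text = direct_text(empty_div)
```

In this sketch the nested paragraph's text is left out of `own_text` on purpose: in the patched method, child elements are handled by the recursive `_process_element` call, so only the container's own text needs to be appended afterwards.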
