Commit fcb681a

ctoth and claude committed
Fix full HTML document parsing with HTML entities
Detect full HTML documents (starting with <!doctype or <html) and parse them with lxml.html.fromstring() instead of fragment_fromstring(). This preserves document structure for XHTML files with HTML entities like &nbsp; that fail strict XML parsing.

Fragment parsing with span wrapper is still used for actual fragments to maintain platform consistency fix from ea79811.

Fixes EPUB parsing where XHTML documents were being wrapped in span, breaking body element detection.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1 parent 3d6a8c7 commit fcb681a
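
The commit message's point about `&nbsp;` failing strict XML parsing can be illustrated with a minimal sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for a strict XML parser (an assumption for illustration; the commit itself uses lxml), and `html.unescape` only to show what the entity means in HTML. The helper name `strict_xml_accepts` is hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def strict_xml_accepts(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # XML predefines only &amp; &lt; &gt; &quot; &apos;.
        # HTML-only entities such as &nbsp; are undefined without a DTD.
        return False
    return True


# An HTML named entity is rejected by the strict XML parser.
strict_ok = strict_xml_accepts("<p>a&nbsp;b</p>")

# A predefined XML entity parses without trouble.
predefined_ok = strict_xml_accepts("<p>a&amp;b</p>")

# In HTML, &nbsp; is simply a non-breaking space (U+00A0).
decoded = html.unescape("a&nbsp;b")
```

This is why XHTML content that uses HTML entities generally needs an HTML-aware parser rather than a strict XML one.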

1 file changed: +17 -5 lines changed

html_to_text.py

Lines changed: 17 additions & 5 deletions
@@ -523,11 +523,23 @@ def tree_from_string(html: str) -> _Element:
     # fragment_fromstring is more forgiving, so check for empty/whitespace first
     if not html or not html.strip():
         raise lxml.etree.ParserError("Document is empty")
-    # Use fragment_fromstring with explicit parent container to ensure
-    # consistent parsing behavior. lxml.html.fromstring() has unpredictable
-    # auto-correction that wraps fragments differently across platforms.
-    # Using 'span' as parent since it's inline and won't add extra spacing.
-    return lxml.html.fragment_fromstring(html, create_parent="span")
+
+    # Detect if this is a full HTML document vs a fragment
+    html_stripped = html.strip()
+    is_full_document = (
+        html_stripped.lower().startswith('<!doctype') or
+        html_stripped.lower().startswith('<html')
+    )
+
+    if is_full_document:
+        # Full HTML documents should be parsed as documents to preserve structure
+        return lxml.html.fromstring(html)
+    else:
+        # Use fragment_fromstring with explicit parent container to ensure
+        # consistent parsing behavior. lxml.html.fromstring() has unpredictable
+        # auto-correction that wraps fragments differently across platforms.
+        # Using 'span' as parent since it's inline and won't add extra spacing.
+        return lxml.html.fragment_fromstring(html, create_parent="span")
 
 
 def main() -> int:
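
To see how the prefix check in this change classifies inputs, here is a small standalone sketch. The function name `is_full_document` is hypothetical (the commit inlines the check as a local variable), and the sample strings are made up for illustration. It uses no lxml, so it only exercises the detection step, not the parsing.

```python
def is_full_document(html: str) -> bool:
    """Mirror the commit's check: a leading doctype or <html> tag,
    ignoring surrounding whitespace and letter case."""
    head = html.strip().lower()
    # str.startswith accepts a tuple of candidate prefixes.
    return head.startswith(("<!doctype", "<html"))


# Hypothetical inputs and how the check classifies each one.
samples = {
    "<!DOCTYPE html><html><body><p>x</p></body></html>": True,
    "  \n<HTML lang='en'><body></body></html>": True,
    "<p>just a fragment</p>": False,
    # Starts with an XML declaration, so the prefix check sees a fragment.
    '<?xml version="1.0"?><html><body/></html>': False,
}

results = {text: is_full_document(text) for text in samples}
```

Note the last sample: markup that begins with an XML declaration (or a comment) before the doctype is treated as a fragment by a pure prefix check. Whether such inputs reach `tree_from_string` in this codebase is not shown in the diff.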
