Commit fef66a4

ctoth and claude committed
Detect XHTML documents with XML declarations
Include <?xml declarations when detecting full HTML documents.

XHTML files from EPUBs typically start with XML declarations before the DOCTYPE, and should be parsed as documents not fragments.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 79887a6 commit fef66a4

1 file changed: +2 -0 lines changed

html_to_text.py

Lines changed: 2 additions & 0 deletions
@@ -529,12 +529,14 @@ def tree_from_string(html: Union[str, bytes]) -> _Element:
     if isinstance(html, bytes):
         html_stripped = html.strip()
         is_full_document = (
+            html_stripped.lower().startswith(b'<?xml') or
             html_stripped.lower().startswith(b'<!doctype') or
             html_stripped.lower().startswith(b'<html')
         )
     else:
         html_stripped = html.strip()
         is_full_document = (
+            html_stripped.lower().startswith('<?xml') or
             html_stripped.lower().startswith('<!doctype') or
             html_stripped.lower().startswith('<html')
         )
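
The check in this hunk can be tried in isolation. The sketch below is not part of the commit: the helper name `looks_like_full_document` and the `_DOCUMENT_PREFIXES` constant are hypothetical, introduced only to illustrate the same three prefix tests for both `str` and `bytes` input. It uses only the standard library and does not call the parser.

```python
from typing import Union

# Hypothetical constant: the three prefixes tested in the diff above.
_DOCUMENT_PREFIXES = ("<?xml", "<!doctype", "<html")


def looks_like_full_document(html: Union[str, bytes]) -> bool:
    """Return True if the input starts like a full (X)HTML document.

    Illustrative helper only; the commit inlines this logic in
    tree_from_string rather than defining a separate function.
    """
    # Strip surrounding whitespace, then lowercase a short head so the
    # comparison is case-insensitive without lowercasing the whole input.
    head = html.strip()[:16].lower()
    if isinstance(head, bytes):
        # bytes.startswith needs bytes prefixes, so encode the tuple.
        return head.startswith(tuple(p.encode("ascii") for p in _DOCUMENT_PREFIXES))
    return head.startswith(_DOCUMENT_PREFIXES)


# An EPUB-style XHTML file usually opens with an XML declaration.
xhtml = '<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html>\n<html></html>'
print(looks_like_full_document(xhtml))
print(looks_like_full_document(b"<p>just a fragment</p>"))
```

One limitation worth noting under these assumptions: `strip()` removes whitespace only, so input that begins with a byte-order mark would probably not match any prefix and would still be treated as a fragment.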

0 commit comments
