fix: Improve HTML entity sanitization for invalid XML characters (#348) by Rithboss · Pull Request #1683 · freelawproject/juriscraper

Rithboss · 2025-12-02T01:48:14Z

Fixes #348

Summary

This PR enhances the HTML sanitization logic to properly handle invalid XML character entities, preventing XMLSyntaxError when parsing PACER dockets that contain escape sequences or other invalid XML characters.

Problem

PACER dockets sometimes contain invalid XML characters (such as &#27;, the ESC character) that cause parsing to fail with this error:

lxml.etree.XMLSyntaxError: PCDATA invalid Char value 27

The existing clean_html() function had limited support for removing invalid character entities and handled only a narrow range of them.
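
The failure mode can be reproduced in miniature. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (so the exception type and message differ from the lxml error quoted above), and both the is_well_formed helper and the sample snippets are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical docket snippets, not taken from a real PACER docket.
samples = {
    "entity for ESC": "<entry>Order&#27; granted</entry>",
    "raw ESC character": "<entry>Order\x1b granted</entry>",
    "entity for tab": "<entry>Order&#9;granted</entry>",
}

for label, snippet in samples.items():
    print(label, "->", "parses" if is_well_formed(snippet) else "rejected")
```

Both ESC variants should be rejected, while the tab entity should parse, since tab is one of the few control characters XML 1.0 allows.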

Solution

Enhanced the regex pattern in clean_html() to remove ALL invalid XML character entities
Handles decimal entities: &#0; through &#8;, &#11;, &#12;, &#14; through &#31;
Handles hexadecimal entities: &#x0; through &#x1F; (excluding valid ones)
Preserves valid XML characters: tab (0x09), LF (0x0A), CR (0x0D)
The existing Unicode character filtering already handles raw invalid characters
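
The entity-removal rule described above can be sketched as follows. This is not the regex from the PR: it is a minimal, callback-based approximation, and the name strip_invalid_xml_entities is hypothetical. It assumes that the only entities to drop are numeric references to code points below 0x20 other than tab, LF, and CR.

```python
import re

# Code points below 0x20 that XML 1.0 permits: tab, line feed, carriage return.
_VALID_CONTROL_CODES = {0x09, 0x0A, 0x0D}

# Matches decimal (&#27;) and hexadecimal (&#x1B;) numeric character references.
_NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def strip_invalid_xml_entities(text: str) -> str:
    """Hypothetical helper: drop numeric entities for invalid control characters."""

    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code < 0x20 and code not in _VALID_CONTROL_CODES:
            return ""  # invalid in XML 1.0, so remove the whole entity
        return match.group(0)  # leave every other entity untouched

    return _NUMERIC_ENTITY.sub(_replace, text)


print(strip_invalid_xml_entities("Order&#27; granted &amp; filed&#x1B;"))
```

Decoding the number in a callback instead of enumerating ranges in the pattern keeps the valid exceptions (0x09, 0x0A, 0x0D) in one place and treats zero-padded forms such as &#0027; the same way.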

Testing

Added comprehensive test suite (tests/local/test_xml_character_sanitization.py) with 6 test cases:

✅ ESC character (\x1b) removal
✅ Various invalid XML characters (NULL, SOH, STX, BS, VT, FF, SO, ESC, US)
✅ Valid XML character preservation (tab, newline, carriage return)
✅ HTML entity handling (&#27;, etc.)
✅ Integration with strip_bad_html_tags_insecure()
✅ Real-world docket text scenarios

All tests pass (6/6).
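
For readers without the test file at hand, a few of the cases listed above might look roughly like this. It is a sketch only: sanitize is a hypothetical stand-in that deletes raw control characters, not juriscraper's clean_html(), and the class and test names are invented for illustration.

```python
import unittest

# Stand-in sanitizer (an assumption, not juriscraper code): delete raw control
# characters below 0x20 except tab (0x09), LF (0x0A), and CR (0x0D).
_INVALID_RAW = {c: None for c in range(0x20) if c not in (0x09, 0x0A, 0x0D)}


def sanitize(text: str) -> str:
    return text.translate(_INVALID_RAW)


class XmlCharacterSanitizationSketch(unittest.TestCase):
    def test_esc_character_removed(self):
        self.assertEqual(sanitize("Order\x1b granted"), "Order granted")

    def test_various_invalid_characters_removed(self):
        # NULL, SOH, STX, BS, VT, FF, SO, ESC, US
        for ch in "\x00\x01\x02\x08\x0b\x0c\x0e\x1b\x1f":
            self.assertEqual(sanitize(f"a{ch}b"), "ab")

    def test_valid_whitespace_preserved(self):
        self.assertEqual(sanitize("a\tb\nc\rd"), "a\tb\nc\rd")
```

The cases can be run with python -m unittest against whichever module holds them.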

Changes

Modified: juriscraper/lib/html_utils.py
Modified: CHANGES.md
Added: tests/local/test_xml_character_sanitization.py

Backward Compatibility

This change is fully backward compatible. It only removes invalid characters that would have caused parsing errors anyway.

Fixes freelawproject#348

- Enhanced clean_html() to remove HTML entities for all invalid XML characters
- Added comprehensive regex to handle decimal entities (&#0; through &#31;)
- Added support for hexadecimal entities (&#x0; through &#x1F;)
- Excludes valid XML characters: tab (0x09), LF (0x0A), CR (0x0D)
- Added comprehensive test suite with 6 test cases covering:
  - ESC character (\x1b) removal
  - Various invalid XML characters (NULL, SOH, STX, BS, VT, FF, SO, ESC, US)
  - Valid XML character preservation (tab, newline, carriage return)
  - HTML entity handling (&#27;, etc.)
  - Integration with strip_bad_html_tags_insecure()
  - Real-world docket text scenarios

This prevents XMLSyntaxError when parsing PACER dockets that contain invalid XML characters like escape sequences.

CLAassistant · 2025-12-02T01:48:21Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

[pre-commit.ci] auto fixes from pre-commit.com hooks

f1e5ddb

for more information, see https://pre-commit.ci

Rithboss wants to merge 2 commits into freelawproject:main from Rithboss:feature/issue-348-xml-sanitization
