fix: Improve HTML entity sanitization for invalid XML characters (#348) #1683

Open
Rithboss wants to merge 2 commits into freelawproject:main from Rithboss:feature/issue-348-xml-sanitization

Conversation

@Rithboss Rithboss commented Dec 2, 2025

Fixes #348

Summary

This PR enhances the HTML sanitization logic to properly handle invalid XML character entities, preventing failures when parsing PACER dockets that contain escape sequences or other invalid XML characters.

Problem

PACER dockets sometimes contain invalid XML characters (such as &#27;, the ESC character) that cause parsing failures with the error:

lxml.etree.XMLSyntaxError: PCDATA invalid Char value 27
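
The failure mode can be reproduced in a few lines. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies; the exception type and message differ from the lxml error above, but the cause is the same. The helper name parses_as_xml and the sample docket text are hypothetical.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(snippet: str) -> bool:
    """Return True if the snippet is accepted by a strict XML parser."""
    try:
        ET.fromstring(snippet)
        return True
    except ET.ParseError:
        return False


# Ordinary text parses fine.
print(parses_as_xml("<td>Order granted</td>"))        # True
# A numeric reference to ESC (code point 27) is rejected.
print(parses_as_xml("<td>Order &#27; granted</td>"))  # False
# So is a raw ESC character.
print(parses_as_xml("<td>Order \x1b granted</td>"))   # False
# Tab (code point 9) is a valid XML character and is accepted.
print(parses_as_xml("<td>Order &#9; granted</td>"))   # True
```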

The existing clean_html() function had limited support for removing invalid character entities, handling only a small range of them.

Solution

  • Enhanced the regex pattern in clean_html() to remove ALL invalid XML character entities
  • Handles decimal entities: &#0; through &#8;, &#11;, &#12;, &#14; through &#31;
  • Handles hexadecimal entities: &#x0; through &#x1F; (excluding valid ones)
  • Preserves valid XML characters: tab (0x09), LF (0x0A), CR (0x0D)
  • The existing Unicode character filtering already handles raw invalid characters
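
To illustrate the rule described in the list above, here is a minimal sketch of entity removal. It is not the regex from this PR: it matches any numeric character reference and decides by code point, which is an alternative way to express the same rule. The names strip_invalid_control_entities, NUMERIC_ENTITY, and VALID_CONTROL_CODE_POINTS are hypothetical.

```python
import re

# Code points below 0x20 that XML 1.0 allows: tab, line feed, carriage return.
VALID_CONTROL_CODE_POINTS = {0x09, 0x0A, 0x0D}

# Matches decimal (&#27;) and hexadecimal (&#x1B;) numeric character references.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def strip_invalid_control_entities(text: str) -> str:
    """Remove numeric entities that refer to invalid XML control characters."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code_point < 0x20 and code_point not in VALID_CONTROL_CODE_POINTS:
            return ""  # invalid control character: drop the entity
        return match.group(0)  # anything else is left untouched

    return NUMERIC_ENTITY.sub(replace, text)


print(strip_invalid_control_entities("Order &#27;granted"))   # Order granted
print(strip_invalid_control_entities("Order &#x1B;granted"))  # Order granted
print(strip_invalid_control_entities("Col1&#9;Col2"))         # Col1&#9;Col2
```

Deciding by code point rather than enumerating ranges in the pattern also covers zero-padded forms such as &#027;, at the cost of a callback per match.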

Testing

Added a comprehensive test suite (tests/local/test_xml_character_sanitization.py) with 6 test cases:

  • ✅ ESC character (\x1b) removal
  • ✅ Various invalid XML characters (NULL, SOH, STX, BS, VT, FF, SO, ESC, US)
  • ✅ Valid XML character preservation (tab, newline, carriage return)
  • ✅ HTML entity handling (&#27;, etc.)
  • ✅ Integration with strip_bad_html_tags_insecure()
  • ✅ Real-world docket text scenarios

All tests pass (6/6).
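
For readers who want a feel for the shape of such tests, the sketch below covers two of the cases listed above (invalid raw characters removed, valid whitespace preserved). It is not the test file from this PR, and it does not import juriscraper: sanitize is a hypothetical stand-in that drops raw C0 control characters, so the example stays self-contained.

```python
import re
import unittest

# Stand-in sanitizer, NOT juriscraper's clean_html(): drops raw C0 control
# characters except tab, line feed, and carriage return.
RAW_INVALID = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def sanitize(text: str) -> str:
    return RAW_INVALID.sub("", text)


class XmlCharacterSanitizationSketch(unittest.TestCase):
    def test_invalid_control_characters_are_removed(self):
        # NULL, SOH, STX, BS, VT, FF, SO, ESC, US
        for char in "\x00\x01\x02\x08\x0b\x0c\x0e\x1b\x1f":
            with self.subTest(code_point=ord(char)):
                self.assertEqual(sanitize(f"Docket{char}text"), "Dockettext")

    def test_valid_whitespace_is_preserved(self):
        # Tab, newline, and carriage return must survive unchanged.
        for char in "\t\n\r":
            with self.subTest(code_point=ord(char)):
                text = f"Docket{char}text"
                self.assertEqual(sanitize(text), text)
```

Saved to a file, the sketch can be run with python -m unittest.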

Changes

  • Modified: juriscraper/lib/html_utils.py
  • Modified: CHANGES.md
  • Added: tests/local/test_xml_character_sanitization.py

Backward Compatibility

This change is fully backward compatible. It only removes invalid characters that would have caused parsing errors anyway.

Fixes freelawproject#348

- Enhanced clean_html() to remove HTML entities for all invalid XML characters
- Added comprehensive regex to handle decimal entities (&#0; through &#31;)
- Added support for hexadecimal entities (&#x0; through &#x1F;)
- Excludes valid XML characters: tab (0x09), LF (0x0A), CR (0x0D)
- Added comprehensive test suite with 6 test cases covering:
  - ESC character (\x1b) removal
  - Various invalid XML characters (NULL, SOH, STX, BS, VT, FF, SO, ESC, US)
  - Valid XML character preservation (tab, newline, carriage return)
  - HTML entity handling (&#27;, etc.)
  - Integration with strip_bad_html_tags_insecure()
  - Real-world docket text scenarios

This prevents XMLSyntaxError when parsing PACER dockets that contain
invalid XML characters like escape sequences.
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Development

Successfully merging this pull request may close these issues.

Invalid XML character break docket parsers
