Refactor Lexbor empty text detection to ASCII whitespace only, changing it from Lexbor's original lxb_dom_node_is_empty check! #198

Merged: rushter merged 7 commits into rushter:master from pygarap:fix-skip-empty on Nov 29, 2025
pygarap (Contributor) commented Nov 27, 2025

This pull request updates the definition of "empty" text nodes in the Lexbor backend of selectolax, changing it from Lexbor's original lxb_dom_node_is_empty check to a stricter check for nodes containing only ASCII whitespace (space, tab, newline, form feed, or carriage return). It also adds a new test to verify this behavior on a full HTML document and updates documentation throughout the codebase to reflect the new definition.

Important

lxb_dom_node_is_empty is overkill for our needs, and it also caused some bugs in the previous implementation.

Behavioral changes:

  • The _is_empty_text_node method in LexborNode now checks if a text node consists solely of ASCII whitespace, rather than using Lexbor's lxb_dom_node_is_empty. This is implemented via a new helper method _is_whitespace_only, which does an inline scan for whitespace characters.
  • All code and documentation references to skipping or identifying empty text nodes now specify that only ASCII whitespace nodes are considered empty, rather than relying on Lexbor's broader or different definition.
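
To make the new rule concrete, here is a minimal pure-Python sketch of an ASCII-whitespace-only check. It is not the code from this PR (the real helper lives in the Lexbor backend and scans the node's raw text); the name `is_whitespace_only` and the choice to treat an empty string as whitespace-only are assumptions made for illustration.

```python
# The five characters named in the PR description:
# space, tab, newline, form feed, carriage return.
ASCII_WHITESPACE = frozenset(" \t\n\f\r")


def is_whitespace_only(text: str) -> bool:
    """Hypothetical analogue of _is_whitespace_only.

    Returns True when every character is ASCII whitespace.
    Assumption: an empty string also counts as whitespace-only.
    """
    return all(ch in ASCII_WHITESPACE for ch in text)


print(is_whitespace_only("\n  \t"))   # only ASCII whitespace
print(is_whitespace_only("\u00a0"))   # non-breaking space is not in the set
print(is_whitespace_only(" a "))      # contains a visible character
```

Note that this differs from Python's `str.isspace()`, which also accepts Unicode whitespace such as the non-breaking space and the vertical tab; under the definition above those characters make a text node non-empty.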

Testing:

  • A new test, test_traverse_with_skip_empty_on_a_full_html_document, is added to verify that traversal with and without skip_empty=True produces the expected results, confirming the new whitespace-only logic.
  • Minor utility code is added to clean up docstrings for test inputs.
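
The effect of skipping whitespace-only text during traversal can be sketched with the standard library alone. The example below is a stand-in, not the test added in this PR: it uses `html.parser` rather than selectolax or Lexbor, and `TextCollector`, `collect`, and the sample document are hypothetical names chosen for illustration. A real DOM parser may also place whitespace differently in the tree.

```python
from html.parser import HTMLParser

ASCII_WS = " \t\n\f\r"  # space, tab, newline, form feed, carriage return


class TextCollector(HTMLParser):
    """Records start tags and text runs in document order (illustrative only)."""

    def __init__(self, skip_empty: bool = False):
        super().__init__()
        self.skip_empty = skip_empty
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(tag)

    def handle_data(self, data):
        # Drop text runs made up solely of ASCII whitespace when asked to.
        if self.skip_empty and data.strip(ASCII_WS) == "":
            return
        self.events.append("-text")


def collect(html: str, skip_empty: bool = False) -> list:
    parser = TextCollector(skip_empty)
    parser.feed(html)
    parser.close()
    return parser.events


DOC = "<html>\n<body>\n<p>Hi</p>\n</body>\n</html>"

print(collect(DOC))                   # includes the newline-only text runs
print(collect(DOC, skip_empty=True))  # keeps only the text inside <p>
```

With `skip_empty=True` the indentation and newlines between tags disappear from the result, which is the kind of before-and-after comparison the new test makes on a full HTML document.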

These changes make the definition of "empty" text nodes more predictable and easier to reason about, matching common expectations for HTML whitespace handling.

…nction, simplify `_is_whitespace_only`, and replace internal calls for improved clarity and reusability.
@pygarap pygarap requested a review from rushter November 28, 2025 16:20
@rushter rushter merged commit 51efb93 into rushter:master Nov 29, 2025
7 checks passed