Refactor Lexbor empty text detection to ASCII whitespace only, changing it from Lexbor's original `lxb_dom_node_is_empty` check! by pygarap · Pull Request #198 · rushter/selectolax

pygarap · 2025-11-27T22:59:25Z

This pull request updates the definition of "empty" text nodes in the Lexbor backend of selectolax, changing it from Lexbor's original lxb_dom_node_is_empty check to a stricter check for nodes containing only ASCII whitespace (space, tab, newline, form feed, or carriage return). It also adds a new test to verify this behavior on a full HTML document and updates documentation throughout the codebase to reflect the new definition.

Important

lxb_dom_node_is_empty is overkill for our needs, and relying on it caused some bugs in the previous implementation.

Behavioral changes:

The _is_empty_text_node method in LexborNode now checks whether a text node consists solely of ASCII whitespace, rather than using Lexbor's lxb_dom_node_is_empty. This is implemented via a new helper method, _is_whitespace_only, which does an inline scan for whitespace characters.
All code and documentation references to skipping or identifying empty text nodes now specify that only ASCII-whitespace-only nodes are considered empty, rather than relying on Lexbor's broader or different definition.
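
As an illustration of the rule described above, here is a minimal pure-Python sketch of an ASCII-whitespace-only scan. It is not the Cython code from this PR: the function name `is_whitespace_only`, the `bytes` input, and the treatment of an empty buffer as "empty" are all assumptions made for the example.

```python
# Hypothetical model of the check; the real helper lives in node.pxi
# and operates on Lexbor's character data, not on Python bytes.

# The five ASCII whitespace bytes named in the description:
# space, tab, newline, form feed, carriage return.
ASCII_WHITESPACE = frozenset(b" \t\n\x0c\r")


def is_whitespace_only(data: bytes) -> bool:
    """Return True if every byte in `data` is ASCII whitespace.

    Assumption: an empty buffer counts as whitespace-only, because
    all() over an empty sequence is True.
    """
    return all(byte in ASCII_WHITESPACE for byte in data)
```

Under this definition a buffer such as `b" \n\t"` counts as empty, while a UTF-8 encoded non-breaking space (`b"\xc2\xa0"`) or a vertical tab (`b"\x0b"`) does not, since neither is in the five-character set.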

Testing:

A new test, test_traverse_with_skip_empty_on_a_full_html_document, is added to verify that traversal with and without skip_empty=True produces the expected results, confirming the new whitespace-only logic.
Minor utility code is added to clean up docstrings for test inputs.
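
The idea behind the test can be sketched without selectolax. The example below uses the standard library's `html.parser` as a stand-in for the Lexbor tree, and `inspect.cleandoc` as a stand-in for the docstring-cleaning utility; `NodeCollector`, `collect`, and the `"-text"` label are hypothetical names for this sketch, and the node sequence a real selectolax traversal produces may differ (for example, a full HTML parser can insert elements that are missing from the input).

```python
from html.parser import HTMLParser
from inspect import cleandoc

ASCII_WHITESPACE = " \t\n\x0c\r"


class NodeCollector(HTMLParser):
    """Record start tags and text nodes in document order."""

    def __init__(self, skip_empty=False):
        super().__init__()
        self.skip_empty = skip_empty
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append(tag)

    def handle_data(self, data):
        # A text node is "empty" only if nothing remains after
        # stripping the five ASCII whitespace characters.
        if self.skip_empty and data.strip(ASCII_WHITESPACE) == "":
            return
        self.nodes.append("-text")


def collect(html, skip_empty=False):
    parser = NodeCollector(skip_empty=skip_empty)
    parser.feed(html)
    parser.close()
    return parser.nodes


# cleandoc removes the common indentation of the triple-quoted input,
# which keeps test documents readable inside indented test code.
HTML = cleandoc("""
    <html>
      <body>
        <p>Hello</p>
      </body>
    </html>
""")

with_empty = collect(HTML)
without_empty = collect(HTML, skip_empty=True)
```

With `skip_empty=True` only the text node containing `Hello` survives; without it, the indentation between tags shows up as additional whitespace-only text nodes.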

These changes make the definition of "empty" text nodes more predictable and easier to reason about, matching common expectations for HTML whitespace handling.

pygarap added 6 commits November 27, 2025 06:42

Refine skip_empty behavior to consider nodes with ASCII whitespace only.

7a07d12

Refactor character data handling and add _buffer_is_whitespace helper method.

f726033

Refactor _is_empty_text_node logic by integrating `_buffer_is_whitespace` for improved readability and code reuse.

fcbe916

Optimize _buffer_is_whitespace by using pointers for improved performance and marking it as `nogil`.

c4209be

Refactor _is_empty_text_node to directly utilize `_is_whitespace_only`, add `clean_doc` utility in tests, and enhance traversal test coverage.

49b014e

Refactor whitespace handling in _is_empty_text_node and `_is_whitespace_only`, add docstrings, and enhance test formatting.

5b4aac9

rushter reviewed Nov 28, 2025

selectolax/lexbor/node.pxi (outdated, resolved)

Refactor _is_empty_text_node by extracting logic to a standalone function, simplify `_is_whitespace_only`, and replace internal calls for improved clarity and reusability.

4b27e02

pygarap requested a review from rushter November 28, 2025 16:20

rushter merged commit 51efb93 into rushter:master Nov 29, 2025
7 checks passed

rushter merged 7 commits into rushter:master from pygarap:fix-skip-empty

2 participants
