Added `is_fragment` parameter to the `LexborHTMLParser` constructor, allowing users to specify whether to parse input as a full HTML document or as a fragment! by pygarap · Pull Request #188 · rushter/selectolax

pygarap · 2025-11-21T16:00:20Z

This pull request introduces significant enhancements to the LexborHTMLParser API, focusing on improved HTML fragment parsing, expanded type stubs, and better documentation. The changes add support for parsing HTML fragments (not just full documents), clarify and expand the Python interface, and improve the maintainability of the codebase.

HTML Fragment Parsing Support:

Added an `is_fragment` parameter to the `LexborHTMLParser` constructor, allowing users to specify whether the input should be parsed as a full HTML document or as an HTML fragment. When parsing as a fragment, required elements like `<html>`, `<head>`, and `<body>` are not automatically inserted.
Introduced new internal methods (`_new_html_document`, `_parse_html_document`, and `_parse_html_fragment`) to handle the creation and parsing of both full documents and fragments.
Updated Cython and C header definitions to expose fragment parsing functions from Lexbor.
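
A minimal usage sketch of the flag described above, assuming a selectolax release that includes this pull request. `LexborHTMLParser`, its `is_fragment` argument, and the `.html` property come from selectolax; `serialize`, `has_top_level_tags`, and `demo` are hypothetical helper names introduced only for illustration. The selectolax import is deferred into the function so the checker remains usable without the package.

```python
import re


def serialize(html: str, is_fragment: bool = False) -> str:
    """Parse `html` with selectolax and return the serialized tree.

    Assumes a selectolax release that contains this pull request;
    older releases do not accept the `is_fragment` keyword.
    """
    # Deferred import: only needed when this function is actually called.
    from selectolax.lexbor import LexborHTMLParser

    return LexborHTMLParser(html, is_fragment=is_fragment).html or ""


def has_top_level_tags(markup: str) -> bool:
    """Return True if the markup contains <html>, <head>, and <body> tags."""
    return all(
        re.search(rf"<{tag}[\s>]", markup, flags=re.IGNORECASE)
        for tag in ("html", "head", "body")
    )


def demo() -> None:
    """Print how the same input serializes in document and fragment mode."""
    source = "<div><p>Hello</p></div>"
    for flag in (False, True):
        output = serialize(source, is_fragment=flag)
        print(f"is_fragment={flag}: wrapped={has_top_level_tags(output)} -> {output}")
```

Going by the description above, calling `demo()` should report `wrapped=True` for the default mode and `wrapped=False` for `is_fragment=True`.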

Python Type Stubs and Documentation Improvements:

Expanded and clarified the `selectolax/lexbor.pyi` type stubs for `LexborHTMLParser`, including detailed docstrings for all public methods and properties, improved parameter types, and return value annotations.
Added new properties and methods to the type stubs, such as `inner_html` (with getter and setter), and improved the documentation and signatures for existing methods like `text`, `tags`, `strip_tags`, `merge_text_nodes`, and others.
Improved the accuracy of default values and parameter types (e.g., `default` is now `Any` instead of `bool` in several methods).
Updated docstrings to clarify return types, error conditions, and usage examples.
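
To illustrate why a fallback argument is better typed as `Any` than as `bool`, here is a small sketch. `first_or_default` is a hypothetical stand-in, not a selectolax function; it only mirrors the shape of a lookup that returns a caller-supplied `default` when nothing matches.

```python
from typing import Any, Sequence


def first_or_default(matches: Sequence[str], default: Any = None) -> Any:
    """Return the first match, or `default` when there are none.

    `default` is returned unchanged, so callers may pass any object
    (a string, a sentinel, None). Annotating it as `bool` would make
    a type checker reject those otherwise valid calls.
    """
    return matches[0] if matches else default
```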

Code Quality and Linting:

Added cython-lint to the Makefile lint target to ensure Cython code style consistency.

These changes collectively make the parser more flexible, easier to use, and better documented for both full HTML documents and fragments.


rushter · 2025-11-21T18:16:15Z

I will check this later, but please avoid adding new features/parameters without creating an issue first.
We don't want to introduce a ton of new APIs. That makes it harder to support in the future and confuses users when there are multiple options available.

rushter · 2025-11-21T18:17:38Z

selectolax/lexbor.pxd


    ctypedef enum lxb_dom_node_type_t:
-        LXB_DOM_NODE_TYPE_ELEMENT                = 0x01
-        LXB_DOM_NODE_TYPE_ATTRIBUTE              = 0x02


Such formatting is intentional. Please avoid changing formatting everywhere. It makes it harder to review and keeps unrelated changes in the same commits.

@rushter Thanks for pointing this out.

I have reverted the formatting changes and kept this PR focused only on the relevant code changes. I will avoid broad formatting edits in future commits to make diffs more straightforward to review.

pygarap · 2025-11-21T18:32:23Z

> I will check this later, but please avoid adding new features/parameters without creating an issue first. We don't want to introduce a ton of new APIs. That makes it harder to support in the future and confuses users when there are multiple options available.

@rushter Thank you for the feedback!

To clarify, `with_top_level_tags` is enabled by default, and with the default value the behavior stays exactly the same as before. The behavior only changes if a user explicitly sets `with_top_level_tags=False`, so this is an entirely optional switch.
My goal with this parameter is to address:

refactor: replace selectolax with beautifulsoup django-components/django-components#823 (comment)

Parse partial HTML as-is? #160 (comment)

If you prefer, I can open an issue for this and update the pull request to reference it.

rushter · 2025-11-21T18:48:06Z

> > I will check this later, but please avoid adding new features/parameters without creating an issue first. We don't want to introduce a ton of new APIs. That makes it harder to support in the future and confuses users when there are multiple options available.
>
> @rushter Thank you for the feedback!
>
> To clarify, `with_top_level_tags` is enabled by default, and with the default value the behavior stays exactly the same as before. The behavior only changes if a user explicitly sets `with_top_level_tags=False`, so this is an entirely optional switch. My goal with this parameter is to address:
>
> refactor: replace selectolax with beautifulsoup django-components/django-components#823 (comment)
>
> Parse partial HTML as-is? #160 (comment)
>
> If you prefer, I can open an issue for this and update the pull request to reference it.

We already have parse_fragment. Can you please explain the difference? Can we just improve that method instead?
https://github.com/rushter/selectolax/blob/9d27a521ac800e02bfc5a2f2c4287fa025cc7257/docs/examples.rst#basic-html-parsing

UPD: parse_fragment actually returns a list, so it won't work. I will review your changes tomorrow.

pygarap · 2025-11-21T19:29:22Z

> We already have parse_fragment. Can you please explain the difference? Can we just improve that method instead? https://github.com/rushter/selectolax/blob/9d27a521ac800e02bfc5a2f2c4287fa025cc7257/docs/examples.rst#basic-html-parsing
>
> UPD: parse_fragment actually returns a list, so it won't work. I will review your changes tomorrow.

@rushter Thank you for pointing to parse_fragment!

I had already looked at selectolax.lexbor.util.parse_fragment. As it is today, it is a pure Python helper that is not part of the public parser API, and it returns a list of nodes, which is slightly different from the parser methods that return a single LexborNode.

The new method selectolax.lexbor.LexborHTMLParser._parse_without_top_level_tags is implemented in Cython and calls lexbor's lxb_html_document_parse_fragment directly. It is completely inlined and all calls run with nogil, so all parsing work happens in C.

There is also a structural difference in how the fragment is processed. selectolax.lexbor.util.parse_fragment first constructs a full document by creating a LexborHTMLParser instance, and during that initialization, the parser calls lxb_html_document_parse. Only after that does it apply the fragment-related logic.

In my approach, lxb_html_document_parse is never called, and no full document is built, so the input is treated as a fragment from start to finish, which should be faster and use less memory when users only need fragments.
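
The speed claim above is stated as an expectation rather than a measurement, so a small timing harness may help readers check it on their own inputs. This is a sketch under assumptions: `time_parsers` and `selectolax_parsers` are hypothetical names, `parse_fragment` is assumed to be importable from `selectolax.lexbor`, and `is_fragment` requires a release containing this pull request. Only wall-clock time is measured; memory allocated in C is not captured.

```python
import timeit
from typing import Callable, Dict


def time_parsers(
    parsers: Dict[str, Callable[[str], object]],
    html: str,
    number: int = 1000,
) -> Dict[str, float]:
    """Return the average seconds per call for each parser callable."""
    results: Dict[str, float] = {}
    for name, parse in parsers.items():
        # timeit runs the callable `number` times and returns total seconds.
        total = timeit.timeit(lambda: parse(html), number=number)
        results[name] = total / number
    return results


def selectolax_parsers() -> Dict[str, Callable[[str], object]]:
    """Build the two candidates discussed above (import path is an assumption)."""
    # Deferred import so the harness itself has no third-party dependency.
    from selectolax.lexbor import LexborHTMLParser, parse_fragment

    return {
        "parse_fragment helper": parse_fragment,
        "LexborHTMLParser(is_fragment=True)": lambda html: LexborHTMLParser(
            html, is_fragment=True
        ),
    }
```

For example, `time_parsers(selectolax_parsers(), "<div><p>Hello</p></div>")` returns a mapping from each label to its average time per parse.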

rushter · 2025-11-22T11:21:09Z

@pygarap Let's rename the argument to is_fragment and flip the boolean. with_top_level_tags is too confusing.

And we need to document it to something like this:


When ``True``, treats input as an HTML fragment. Does not add required HTML5 tags when they are missing.

When set to `False`, it parses HTML fragments as well, but differently. 
Adds missing required tags (such as '<html>', '<head>', '<body>') to the tree according to the HTML5 specification.
Same as in browsers when they render HTML fragments as pages.

pygarap · 2025-11-22T15:25:19Z

> @pygarap Let's rename the argument to is_fragment and flip the boolean. with_top_level_tags is too confusing.
>
> And we need to document it to something like this:
>
> When ``True``, treats input as an HTML fragment. Does not add required HTML5 tags when they are missing.
>
> When set to `False`, it parses HTML fragments as well, but differently.
> Adds missing required tags (such as '<html>', '<head>', '<body>') to the tree according to the HTML5 specification.
> Same as in browsers when they render HTML fragments as pages.

@rushter Thank you for the suggestions.
I have renamed the argument to is_fragment, flipped the boolean, and updated the documentation to match your wording. I have pushed the changes.
Please let me know if you would like me to change or clarify anything else.
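
The rename also inverts the flag's meaning, which is easy to get wrong when updating call sites. A tiny sketch of the mapping; `to_is_fragment` is a hypothetical helper used only to make the inversion explicit and is not part of selectolax.

```python
def to_is_fragment(with_top_level_tags: bool = True) -> bool:
    """Map the earlier `with_top_level_tags` flag to `is_fragment`.

    The old default (True, wrap with top-level tags) corresponds to
    the new default (False, parse as a full document).
    """
    return not with_top_level_tags
```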

rushter · 2025-11-22T16:49:10Z

selectolax/lexbor.pyx

    def clone(self):
-        """Clone the current node.
-
-        You can use to do temporary modifications without affecting the original HTML tree.


Please keep those comments.

@rushter Sorry about removing them earlier, and thanks for pointing that out.

I have restored those comments as requested and pushed the updated changes.
Please let me know if you would like me to adjust anything else.

pygarap added 13 commits November 20, 2025 17:10

Add skip_empty parameter to text() method in LexborNode

add47af

Add with_top_level_tags parameter to LexborHTMLParser

ed0622b

This update introduces the `with_top_level_tags` parameter, allowing users to control whether the parser automatically wraps content with top-level HTML tags. Includes relevant method adjustments and tests.

Refactor lxb_html_document_parse_fragment implementation in `Lexbor…

9d22d77

…HTMLParser` Updated method to correctly handle returning fragments and improved error handling for parsing failures. Includes minor comment adjustments for future enhancements.

Improve fragment error handling in LexborHTMLParser

d62e03b

Refined `lxb_html_document_parse_fragment` logic to better handle NULL fragment scenarios, aligning with Lexbor status API. Added comments for clarity.

Improve handling of fragment parsing in LexborHTMLParser

fb3dcef

Refactored `_parse_html` to correctly handle scenarios with empty top-level tags, ensuring compatibility with Lexbor's fragment parsing logic. Updated test cases to reflect changes.

Refactor _parse_html by delegating parsing logic to `_parse_with_to…

caf04d2

…p_level_tags` and `_parse_without_top_level_tags` methods for improved readability and modularity. Updated tests to validate changes.

Remove commented-out _parse_html method in LexborHTMLParser and a…

d8abd9d

…dd a test case Deleted the long-commented `_parse_html` method to declutter the codebase. Added a new test case to ensure proper parsing behavior with `with_top_level_tags=False`.

Reorganize selector property in LexborHTMLParser and improve form…

ecd6bdc

…atting in annotations Moved the `selector` property for better structure in `lexbor.pyx`. Enhanced `lexbor.pyi` with consistent spacing and improved formatting for better readability.

Add detailed docstrings to LexborHTMLParser methods and refine type…

f2d9609

… hints Enhanced clarity by adding comprehensive docstrings across `lexbor.pyx` and `lexbor.pyi`, improving developer understanding. Refined type hints for better type safety and consistency.

Clean up lexbor.pyi formatting and integrate cython-lint in Makefile

0c0c784

Add detailed docstrings for HTML parsing methods in lexbor.pyx

79b79a1

Introduced comprehensive docstrings for `_new_html_document`, `_parse_html`, `_parse_with_top_level_tags`, and `_parse_without_top_level_tags` methods to improve clarity and developer understanding.

Add detailed docstrings for __dealloc__ and from_document methods…

5ed6089

… in `lexbor.pyx`

rushter reviewed Nov 21, 2025

Improve formatting and spacing consistency across lexbor.pxd, `lexb…

4b03c8d

…or.pyi`, and related files.

pygarap requested a review from rushter November 21, 2025 19:42

pygarap and others added 2 commits November 21, 2025 22:27

Clarify docstring for with_top_level_tags parameter in HTML parsing…

1f110cf

… methods.

Merge branch 'master' into match_top_level_tags

530d53b

pygarap added 3 commits November 22, 2025 17:14

Refactor: Rename with_top_level_tags to is_fragment in `LexborHTM…

d95ac2e

…LParser` for improved clarity and consistency. Update associated methods, docstrings, and tests.

Refine docstring formatting for HTML parsing methods to improve reada…

7fe547b

…bility and align with style consistency. Adjust tests to use proper `is_fragment` parameter values.

Fix html method logic by reordering fragment check for consistent b…

4506650

…ehavior

rushter reviewed Nov 22, 2025

Improve formatting consistency in lexbor.pxd, lexbor.pyx, and `le…

4e9ce89

…xbor.pyi`. Enhance docstring of `clone` method for clarity and usage details.

Add skip_empty parameter to text method and update docstring for …

b62036e

…clarity

pygarap requested a review from rushter November 22, 2025 23:28

Minor edits

7c86f32

rushter merged commit 6773f75 into rushter:master Nov 23, 2025
7 checks passed

rushter merged 22 commits into rushter:master from pygarap:match_top_level_tags
