Added is_fragment parameter to the LexborHTMLParser constructor, allowing users to specify whether to parse input as a full HTML document or as a fragment! #188

Merged: rushter merged 22 commits into rushter:master from pygarap:match_top_level_tags on Nov 23, 2025

pygarap (Contributor) commented Nov 21, 2025

This pull request introduces significant enhancements to the LexborHTMLParser API, focusing on improved HTML fragment parsing, expanded type stubs, and better documentation. The changes add support for parsing HTML fragments (not just full documents), clarify and expand the Python interface, and improve the maintainability of the codebase.

HTML Fragment Parsing Support:

  • Added an is_fragment parameter to the LexborHTMLParser constructor, allowing users to specify whether the input should be parsed as a full HTML document or as an HTML fragment. When parsing as a fragment, required elements like <html>, <head>, and <body> are not automatically inserted. [1] [2]
  • Introduced new internal methods (_new_html_document, _parse_html_document, and _parse_html_fragment) to handle the creation and parsing of both full documents and fragments. [1] [2]
  • Updated Cython and C header definitions to expose fragment parsing functions from Lexbor. [1] [2]
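
For illustration, a minimal usage sketch of the two parsing modes described above. It assumes a selectolax release that already includes this change; `SNIPPET` and `serialize` are hypothetical names used only for this example. No output is shown because the exact serialization depends on the installed version.

```python
# Sketch only: needs a selectolax build that contains this PR.
try:
    from selectolax.lexbor import LexborHTMLParser
except ImportError:  # selectolax is not installed
    LexborHTMLParser = None

SNIPPET = "<div><p>Hello</p></div>"  # hypothetical sample input


def serialize(html, is_fragment):
    """Parse `html` and return the serialized tree, or None if unavailable."""
    if LexborHTMLParser is None:
        return None
    try:
        parser = LexborHTMLParser(html, is_fragment=is_fragment)
    except TypeError:
        # Releases that predate this PR do not accept `is_fragment`.
        return None
    return parser.html


if __name__ == "__main__":
    # Document mode: missing <html>, <head>, <body> are expected to be added.
    print("document:", serialize(SNIPPET, is_fragment=False))
    # Fragment mode: the input is expected to be kept without those wrappers.
    print("fragment:", serialize(SNIPPET, is_fragment=True))
```

Comparing the two printed strings side by side is a quick way to confirm which mode a given call site needs.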

Python Type Stubs and Documentation Improvements:

  • Expanded and clarified the selectolax/lexbor.pyi type stubs for LexborHTMLParser, including detailed docstrings for all public methods and properties, improved parameter types, and return value annotations. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]
  • Added new properties and methods to the type stubs, such as inner_html (with getter and setter), and improved the documentation and signatures for existing methods like text, tags, strip_tags, merge_text_nodes, and others.
  • Improved the accuracy of default values and parameter types (e.g., default is now Any instead of bool in several methods). [1] [2] [3]
  • Updated docstrings to clarify return types, error conditions, and usage examples. [1] [2] [3] [4] [5] [6] [7]
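
To show why annotating a fallback parameter as `Any` rather than `bool` is more accurate, here is a small sketch. `first_match` is a hypothetical helper, not part of selectolax; it only mirrors the general pattern of a lookup that returns a caller-supplied default.

```python
from typing import Any, Sequence


def first_match(matches: Sequence[str], default: Any = None) -> Any:
    """Return the first match, or `default` when nothing matched."""
    return matches[0] if matches else default


# The fallback can be any object the caller chooses, not just True/False,
# which is why `default: bool` would be too narrow an annotation.
fallback_text = first_match([], default="n/a")
fallback_none = first_match([])
first_item = first_match(["a", "b"], default="n/a")
```

A type checker would reject the string fallback above if `default` were annotated as `bool`, even though the call is valid at runtime.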

Code Quality and Linting:

  • Added cython-lint to the Makefile lint target to ensure Cython code style consistency.

These changes collectively make the parser more flexible, easier to use, and better documented for both full HTML documents and fragments.


This update introduces the `with_top_level_tags` parameter, allowing users to control whether the parser automatically wraps content with top-level HTML tags. Includes relevant method adjustments and tests.
…HTMLParser`

Updated method to correctly handle returning fragments and improved error handling for parsing failures. Includes minor comment adjustments for future enhancements.
Refined `lxb_html_document_parse_fragment` logic to better handle NULL fragment scenarios, aligning with Lexbor status API. Added comments for clarity.
Refactored `_parse_html` to correctly handle scenarios with empty top-level tags, ensuring compatibility with Lexbor's fragment parsing logic. Updated test cases to reflect changes.
…p_level_tags` and `_parse_without_top_level_tags` methods for improved readability and modularity. Updated tests to validate changes.
Streamlined class implementations by removing unnecessary lines and improving formatting consistency. Moved `__dealloc__` method to appropriate location and ensured proper cleanup logic. Simplified Python interface by removing redundant spacing and updated test cases to match changes.
…dd a test case

Deleted the long-commented `_parse_html` method to declutter the codebase. Added a new test case to ensure proper parsing behavior with `with_top_level_tags=False`.
…atting in annotations

Moved the `selector` property for better structure in `lexbor.pyx`. Enhanced `lexbor.pyi` with consistent spacing and improved formatting for better readability.
… hints

Enhanced clarity by adding comprehensive docstrings across `lexbor.pyx` and `lexbor.pyi`, improving developer understanding. Refined type hints for better type safety and consistency.
Introduced comprehensive docstrings for `_new_html_document`, `_parse_html`, `_parse_with_top_level_tags`, and `_parse_without_top_level_tags` methods to improve clarity and developer understanding.

rushter (Owner) commented Nov 21, 2025

I will check this later, but please avoid adding new features/parameters without creating an issue first.
We don't want to introduce a ton of new APIs. That makes it harder to support in the future and confuses users when there are multiple options available.


ctypedef enum lxb_dom_node_type_t:
    LXB_DOM_NODE_TYPE_ELEMENT = 0x01
    LXB_DOM_NODE_TYPE_ATTRIBUTE = 0x02

Review comment by rushter (Owner):

Such formatting is intentional. Please avoid changing formatting everywhere. It makes it harder to review and keeps unrelated changes in the same commits.

Reply by pygarap (Contributor, Author):

@rushter Thanks for pointing this out.

I have reverted the formatting changes and kept this PR focused only on the relevant code changes. I will avoid broad formatting edits in future commits to make diffs more straightforward to review.

pygarap (Contributor, Author) commented Nov 21, 2025

> I will check this later, but please avoid adding new features/parameters without creating an issue first. We don't want to introduce a ton of new APIs. That makes it harder to support in the future and confuses users when there are multiple options available.

@rushter Thank you for the feedback!

To clarify, with_top_level_tags is enabled by default, but with the default value, the behavior stays exactly the same as before. The behavior only changes if a user explicitly sets with_top_level_tags=False, so this is an entirely optional switch.
My goal with this parameter is to address:

  1. refactor: replace selectolax with beautifulsoup django-components/django-components#823 (comment)
  2. Parse partial HTML as-is? #160 (comment)

If you prefer, I can open an issue for this and update the pull request to reference it.

rushter (Owner) commented Nov 21, 2025

> > I will check this later, but please avoid adding new features/parameters without creating an issue first. We don't want to introduce a ton of new APIs. That makes it harder to support in the future and confuses users when there are multiple options available.
>
> @rushter Thank you for the feedback!
>
> To clarify, with_top_level_tags is enabled by default, but with the default value, the behavior stays exactly the same as before. The behavior only changes if a user explicitly sets with_top_level_tags=False, so this is an entirely optional switch. My goal with this parameter is to address:
>
>   1. refactor: replace selectolax with beautifulsoup django-components/django-components#823 (comment)
>   2. Parse partial HTML as-is? #160 (comment)
>
> If you prefer, I can open an issue for this and update the pull request to reference it.

We already have parse_fragment. Can you please explain the difference? Can we just improve that method instead?
https://github.com/rushter/selectolax/blob/9d27a521ac800e02bfc5a2f2c4287fa025cc7257/docs/examples.rst#basic-html-parsing

UPD: parse_fragment actually returns a list, so it won't work. I will review your changes tomorrow.

pygarap (Contributor, Author) commented Nov 21, 2025

> We already have parse_fragment. Can you please explain the difference? Can we just improve that method instead? https://github.com/rushter/selectolax/blob/9d27a521ac800e02bfc5a2f2c4287fa025cc7257/docs/examples.rst#basic-html-parsing
>
> UPD: parse_fragment actually returns a list, so it won't work. I will review your changes tomorrow.

@rushter Thank you for pointing to parse_fragment!

I had already looked at selectolax.lexbor.util.parse_fragment. As it is today, it is a pure Python helper that is not part of the public parser API, and it returns a list of nodes, which is slightly different from the parser methods that return a single LexborNode.

The new method selectolax.lexbor.LexborHTMLParser._parse_without_top_level_tags is implemented in Cython and calls lexbor's lxb_html_document_parse_fragment directly. It is completely inlined and all calls run with nogil, so all parsing work happens in C.

There is also a structural difference in how the fragment is processed. selectolax.lexbor.util.parse_fragment first constructs a full document by creating a LexborHTMLParser instance, and during that initialization, the parser calls lxb_html_document_parse. Only after that does it apply the fragment-related logic.

In my approach, lxb_html_document_parse is never called, and no full document is built, so the input is treated as a fragment from start to finish, which should be faster and use less memory when users only need fragments.
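
To make the difference in return shape concrete, here is a rough sketch that calls both entry points on the same input. It uses the parameter name this PR was eventually merged with (`is_fragment`) and assumes a selectolax release that includes it; `SNIPPET` and `compare` are hypothetical names for this example only.

```python
# Sketch only: needs selectolax; the constructor call needs a build with this PR.
try:
    from selectolax.lexbor import LexborHTMLParser, parse_fragment
except ImportError:  # selectolax is not installed
    LexborHTMLParser = parse_fragment = None

SNIPPET = "<li>one</li><li>two</li>"  # hypothetical sample input


def compare(html):
    """Return (nodes, parser), or None when selectolax is unavailable."""
    if LexborHTMLParser is None:
        return None
    # Existing helper: returns a plain Python list of nodes.
    nodes = parse_fragment(html)
    try:
        # New constructor flag: returns a parser object for the fragment.
        parser = LexborHTMLParser(html, is_fragment=True)
    except TypeError:
        # Releases that predate this PR do not accept `is_fragment`.
        parser = None
    return nodes, parser


if __name__ == "__main__":
    result = compare(SNIPPET)
    if result is not None:
        nodes, parser = result
        print(type(nodes).__name__, type(parser).__name__)
```

The helper hands back a list to iterate over, while the constructor keeps the usual parser interface, so existing code that calls parser methods would not need to change shape.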

pygarap requested a review from rushter on November 21, 2025 19:42

rushter (Owner) commented Nov 22, 2025

@pygarap Let's rename the argument to is_fragment and flip the boolean. with_top_level_tags is too confusing.

And we need to document it with something like this:


When ``True``, treats input as an HTML fragment. Does not add required HTML5 tags when they are missing.

When set to `False`, it parses HTML fragments as well, but differently. 
Adds missing required tags (such as '<html>', '<head>', '<body>') to the tree according to the HTML5 specification.
Same as in browsers when they render HTML fragments as pages.

…LParser` for improved clarity and consistency. Update associated methods, docstrings, and tests.
…bility and align with style consistency. Adjust tests to use proper `is_fragment` parameter values.

pygarap (Contributor, Author) commented Nov 22, 2025

> @pygarap Let's rename the argument to is_fragment and flip the boolean. with_top_level_tags is too confusing.
>
> And we need to document it with something like this:
>
> When ``True``, treats input as an HTML fragment. Does not add required HTML5 tags when they are missing.
>
> When set to `False`, it parses HTML fragments as well, but differently.
> Adds missing required tags (such as '<html>', '<head>', '<body>') to the tree according to the HTML5 specification.
> Same as in browsers when they render HTML fragments as pages.

@rushter Thank you for the suggestions.
I have renamed the argument to is_fragment, flipped the boolean, and updated the documentation to match your wording. I have pushed the changes.
Please let me know if you would like me to change or clarify anything else.
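
For anyone who tried the earlier revision of this branch, the rename and the flipped boolean can be summarized with a small sketch. `legacy_to_is_fragment` is a hypothetical helper written only for this explanation and is not part of the PR; the `False` default for `is_fragment` is inferred from the statement that default behavior stays unchanged.

```python
def legacy_to_is_fragment(with_top_level_tags: bool = True) -> bool:
    """Translate the old flag into the renamed, inverted `is_fragment` flag."""
    return not with_top_level_tags


# Old default (with_top_level_tags=True) corresponds to is_fragment=False,
# so callers who never passed the flag see no change in behavior.
default_is_fragment = legacy_to_is_fragment()

# Old opt-out (with_top_level_tags=False) corresponds to is_fragment=True.
opt_in_is_fragment = legacy_to_is_fragment(with_top_level_tags=False)
```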

def clone(self):
    """Clone the current node.

    You can use to do temporary modifications without affecting the original HTML tree.

Review comment by rushter (Owner):

Please keep those comments.

Reply by pygarap (Contributor, Author):

@rushter Sorry about removing them earlier, and thanks for pointing that out.

I have restored those comments as requested and pushed the updated changes.
Please let me know if you would like me to adjust anything else.

pygarap changed the title from "Added a with_top_level_tags parameter to the LexborHTMLParser constructor, allowing users to specify whether to parse input as a full HTML document or as a fragment!" to "Added a is_fragment parameter to the LexborHTMLParser constructor, allowing users to specify whether to parse input as a full HTML document or as a fragment!" on Nov 22, 2025
…xbor.pyi`. Enhance docstring of `clone` method for clarity and usage details.
pygarap changed the title from "Added a is_fragment parameter to the LexborHTMLParser constructor, allowing users to specify whether to parse input as a full HTML document or as a fragment!" to "Added is_fragment parameter to the LexborHTMLParser constructor, allowing users to specify whether to parse input as a full HTML document or as a fragment!" on Nov 22, 2025
pygarap requested a review from rushter on November 22, 2025 23:28
rushter merged commit 6773f75 into rushter:master on Nov 23, 2025
7 checks passed

2 participants