Added `is_fragment` parameter to the `LexborHTMLParser` constructor, allowing users to specify whether to parse input as a full HTML document or as a fragment! #188
This update introduces the `with_top_level_tags` parameter, allowing users to control whether the parser automatically wraps content with top-level HTML tags. Includes relevant method adjustments and tests.
…HTMLParser` Updated method to correctly handle returning fragments and improved error handling for parsing failures. Includes minor comment adjustments for future enhancements.
Refined `lxb_html_document_parse_fragment` logic to better handle NULL fragment scenarios, aligning with Lexbor status API. Added comments for clarity.
Refactored `_parse_html` to correctly handle scenarios with empty top-level tags, ensuring compatibility with Lexbor's fragment parsing logic. Updated test cases to reflect changes.
…p_level_tags` and `_parse_without_top_level_tags` methods for improved readability and modularity. Updated tests to validate changes.
Streamlined class implementations by removing unnecessary lines and improving formatting consistency. Moved `__dealloc__` method to appropriate location and ensured proper cleanup logic. Simplified Python interface by removing redundant spacing and updated test cases to match changes.
…dd a test case Deleted the long-commented `_parse_html` method to declutter the codebase. Added a new test case to ensure proper parsing behavior with `with_top_level_tags=False`.
…atting in annotations Moved the `selector` property for better structure in `lexbor.pyx`. Enhanced `lexbor.pyi` with consistent spacing and improved formatting for better readability.
… hints Enhanced clarity by adding comprehensive docstrings across `lexbor.pyx` and `lexbor.pyi`, improving developer understanding. Refined type hints for better type safety and consistency.
Introduced comprehensive docstrings for `_new_html_document`, `_parse_html`, `_parse_with_top_level_tags`, and `_parse_without_top_level_tags` methods to improve clarity and developer understanding.
… in `lexbor.pyx`
I will check this later, but please avoid adding new features/parameters without creating an issue first.
    ctypedef enum lxb_dom_node_type_t:
        LXB_DOM_NODE_TYPE_ELEMENT = 0x01
        LXB_DOM_NODE_TYPE_ATTRIBUTE = 0x02
Such formatting is intentional. Please avoid changing formatting everywhere. It makes it harder to review and keeps unrelated changes in the same commits.
@rushter Thanks for pointing this out.
I have reverted the formatting changes and kept this PR focused only on the relevant code changes. I will avoid broad formatting edits in future commits to make diffs more straightforward to review.
@rushter Thank you for the feedback! To clarify,
If you prefer, I can open an issue for this and update the pull request to reference it.
We already have UPD:
…or.pyi`, and related files.
@rushter Thank you for pointing to I had already looked at The new method There is also a structural difference in how the fragment is processed. In my approach,
@pygarap Let's rename the argument to And we need to document it to something like this:
…LParser` for improved clarity and consistency. Update associated methods, docstrings, and tests.
…bility and align with style consistency. Adjust tests to use proper `is_fragment` parameter values.
@rushter Thank you for the suggestions.
    def clone(self):
        """Clone the current node.

        You can use to do temporary modifications without affecting the original HTML tree.
@rushter Sorry about removing them earlier, and thanks for pointing that out.
I have restored those comments as requested and pushed the updated changes.
Please let me know if you would like me to adjust anything else.
Retitled from "Added `with_top_level_tags` parameter to the `LexborHTMLParser` constructor, allowing users to specify whether to parse input as a full HTML document or as a fragment!" to "Added `is_fragment` parameter to the `LexborHTMLParser` constructor, allowing users to specify whether to parse input as a full HTML document or as a fragment!"
…xbor.pyi`. Enhance docstring of `clone` method for clarity and usage details.
This pull request introduces significant enhancements to the `LexborHTMLParser` API, focusing on improved HTML fragment parsing, expanded type stubs, and better documentation. The changes add support for parsing HTML fragments (not just full documents), clarify and expand the Python interface, and improve the maintainability of the codebase.

HTML Fragment Parsing Support:
- Added an `is_fragment` parameter to the `LexborHTMLParser` constructor, allowing users to specify whether the input should be parsed as a full HTML document or as an HTML fragment. When parsing as a fragment, required elements like `<html>`, `<head>`, and `<body>` are not automatically inserted. [1] [2]
- Introduced internal methods (`_new_html_document`, `_parse_html_document`, and `_parse_html_fragment`) to handle the creation and parsing of both full documents and fragments. [1] [2]

Python Type Stubs and Documentation Improvements:
- Expanded the `selectolax/lexbor.pyi` type stubs for `LexborHTMLParser`, including detailed docstrings for all public methods and properties, improved parameter types, and return value annotations. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]
- Added `inner_html` (with getter and setter), and improved the documentation and signatures for existing methods like `text`, `tags`, `strip_tags`, `merge_text_nodes`, and others.
- Updated type annotations (e.g., `default` is now `Any` instead of `bool` in several methods). [1] [2] [3]

Code Quality and Linting:
- Added `cython-lint` to the `Makefile` lint target to ensure Cython code style consistency.

These changes collectively make the parser more flexible, easier to use, and better documented for both full HTML documents and fragments.
References:
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17]
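To make the document-versus-fragment distinction from the description concrete, here is a minimal standard-library sketch. It is not the selectolax or Lexbor implementation: `toy_parse` and `_TagCollector` are hypothetical names invented for illustration, and the only thing borrowed from the PR is the meaning of the `is_fragment` flag. The toy also assumes the input either contains none of the top-level tags or already contains the ones it needs; a real HTML parser handles partial cases as well.

```python
from html.parser import HTMLParser

TOP_LEVEL_TAGS = {"html", "head", "body"}


class _TagCollector(HTMLParser):
    """Hypothetical helper: records the names of all start tags in the input."""

    def __init__(self) -> None:
        super().__init__()
        self.seen: set[str] = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag names.
        self.seen.add(tag)


def toy_parse(html: str, is_fragment: bool = False) -> str:
    """Toy model of the two modes (an assumption, not the library's API).

    is_fragment=True  -> keep the markup exactly as given.
    is_fragment=False -> wrap markup that has no top-level tags in
                         <html>, <head>, and <body>.
    """
    if is_fragment:
        return html

    collector = _TagCollector()
    collector.feed(html)
    collector.close()

    # Simplification: if any top-level tag is present, leave the input alone.
    if collector.seen & TOP_LEVEL_TAGS:
        return html
    return f"<html><head></head><body>{html}</body></html>"


if __name__ == "__main__":
    print(toy_parse("<p>hi</p>", is_fragment=True))
    print(toy_parse("<p>hi</p>"))
```

In the PR itself the flag is passed to the `LexborHTMLParser` constructor; the sketch only mirrors the observable difference between the two modes, not how Lexbor builds the tree.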