Commit ec00965

Merge pull request #152 from Gallaecio/document-script-html-comments
Document how the contents of script and style tags is parsed differently
2 parents 0691799 + 129713d commit ec00965

1 file changed: +60 -0 lines changed

docs/usage.rst

Lines changed: 60 additions & 0 deletions
@@ -730,6 +730,66 @@ you can just select by class using CSS and then switch to XPath when needed::
 This is cleaner than using the verbose XPath trick shown above. Just remember
 to use the ``.`` in the XPath expressions that will follow.
 
+
+Beware of how script and style tags differ from other tags
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`Following the standard`__, the contents of ``script`` and ``style`` elements
+are parsed as plain text.
+
+__ https://www.w3.org/TR/html401/types.html#type-cdata
+
+This means that XML-like structures found within them, including comments, are
+all treated as part of the element text, and not as separate nodes.
+
+For example::
+
+    >>> from parsel import Selector
+    >>> selector = Selector(text="""
+    ...     <script>
+    ...         text
+    ...         <!-- comment -->
+    ...         <br/>
+    ...     </script>
+    ...     <style>
+    ...         text
+    ...         <!-- comment -->
+    ...         <br/>
+    ...     </style>
+    ...     <div>
+    ...         text
+    ...         <!-- comment -->
+    ...         <br/>
+    ...     </div>""")
+    >>> for tag in selector.xpath('//*[contains(text(), "text")]'):
+    ...     print(tag.xpath('name()').get())
+    ...     print('    Text: ' + (tag.xpath('text()').get() or ''))
+    ...     print('    Comment: ' + (tag.xpath('comment()').get() or ''))
+    ...     print('    Children: ' + ''.join(tag.xpath('*').getall()))
+    ...
+    script
+        Text:
+            text
+            <!-- comment -->
+            <br/>
+
+        Comment:
+        Children:
+    style
+        Text:
+            text
+            <!-- comment -->
+            <br/>
+
+        Comment:
+        Children:
+    div
+        Text:
+            text
+
+        Comment: <!-- comment -->
+        Children: <br>
+
 .. _old-extraction-api:
 
 extract() and extract_first()
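
The documentation added in this commit says that the contents of ``script`` and ``style`` elements are parsed as plain text, so comments and tags inside them never become separate nodes. The sketch below tries to show the same rule without parsel, using Python's standard-library ``html.parser`` as a stand-in. It is not how parsel works internally (parsel builds on lxml), and the names ``RawTextProbe``, ``seen`` and ``HTML`` are hypothetical, introduced only for this illustration.

```python
from html.parser import HTMLParser

# Shortened version of the markup used in the documented example.
HTML = """
<script>
    text
    <!-- comment -->
    <br/>
</script>
<div>
    text
    <!-- comment -->
    <br/>
</div>
"""


class RawTextProbe(HTMLParser):
    """Hypothetical helper: record what the parser reports inside each element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # names of the elements that are currently open
        self.seen = {}   # element name -> text, comments and child tags found in it

    def _record(self, kind, value):
        # Attach an event to the innermost open element, if there is one.
        if self.stack:
            self.seen[self.stack[-1]][kind].append(value)

    def handle_starttag(self, tag, attrs):
        self._record("children", tag)
        self.seen.setdefault(tag, {"text": [], "comments": [], "children": []})
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        self._record("text", data)

    def handle_comment(self, data):
        self._record("comments", data)


probe = RawTextProbe()
probe.feed(HTML)
probe.close()

for name in ("script", "div"):
    found = probe.seen[name]
    print(name)
    print("    Text:", repr("".join(found["text"])))
    print("    Comments:", found["comments"])
    print("    Children:", found["children"])
```

Under these assumptions, the ``script`` entry should report no comments and no children, with the comment and the ``<br/>`` markup appearing inside its text, while the ``div`` entry reports the comment and the ``br`` child separately.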
