@@ -385,6 +385,41 @@ XPath specification.
385385.. _Location Paths: https://www.w3.org/TR/xpath#location-paths
386386
387387
388+ Removing elements
389+ -----------------
390+
391+ To remove the elements matched by a ``Selector`` or a ``SelectorList``,
392+ call the ``remove()`` method, which is available on both
393+ classes.
394+
395+ .. warning:: This is a destructive action and cannot be undone. The original
396+     content of the selector is removed from the element tree. This could be useful
397+     when trying to reduce the memory footprint of Responses.
398+
399+ Example removing an ad from a blog post:
400+
401+ >>> from parsel import Selector
402+ >>> doc = u"""
403+ ... <article>
404+ ... <div class="row">Content paragraph...</div>
405+ ... <div class="row">
406+ ... <div class="ad">
407+ ... Ad content...
408+ ... <a href="http://...">Link</a>
409+ ... </div>
410+ ... </div>
411+ ... <div class="row">More content...</div>
412+ ... </article>
413+ ... """
414+ >>> sel = Selector(text=doc)
415+ >>> sel.xpath('//div//text()').getall()
416+ ['Content paragraph...', 'Ad content...', 'Link', 'More content...']
417+ >>> sel.xpath('//div[@class="ad"]').remove()
418+ >>> sel.xpath('//div//text()').getall()
419+ ['Content paragraph...', 'More content...']
420+ >>>
421+
422+
388423Using EXSLT extensions
389424----------------------
390425
@@ -695,6 +730,66 @@ you can just select by class using CSS and then switch to XPath when needed::
695730This is cleaner than using the verbose XPath trick shown above. Just remember
696731to use the ``.`` in the XPath expressions that will follow.
697732
733+
734+ Beware of how script and style tags differ from other tags
735+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
736+
737+ `Following the standard`__, the contents of ``script`` and ``style`` elements
738+ are parsed as plain text.
739+
740+ __ https://www.w3.org/TR/html401/types.html#type-cdata
741+
742+ This means that XML-like structures found within them, including comments, are
743+ all treated as part of the element text, and not as separate nodes.
744+
745+ For example::
746+
747+ >>> from parsel import Selector
748+ >>> selector = Selector(text="""
749+ ... <script>
750+ ... <!-- comment -->
751+ ... text
752+ ... <br/>
753+ ... </script>
754+ ... <style>
755+ ... <!-- comment -->
756+ ... text
757+ ... <br/>
758+ ... </style>
759+ ... <div>
760+ ... <!-- comment -->
761+ ... text
762+ ... <br/>
763+ ... </div>""")
764+ >>> for tag in selector.xpath('//*[contains(text(), "text")]'):
765+ ... print(tag.xpath('name()').get())
766+ ... print(' Text: ' + (tag.xpath('text()').get() or ''))
767+ ... print(' Comment: ' + (tag.xpath('comment()').get() or ''))
768+ ... print(' Children: ' + ''.join(tag.xpath('*').getall()))
769+ ...
770+ script
771+ Text:
772+ text
773+ <!-- comment -->
774+ <br/>
775+
776+ Comment:
777+ Children:
778+ style
779+ Text:
780+ text
781+ <!-- comment -->
782+ <br/>
783+
784+ Comment:
785+ Children:
786+ div
787+ Text:
788+ text
789+
790+ Comment: <!-- comment -->
791+ Children: <br>
792+
698793.. _old-extraction-api:
699794
700795extract() and extract_first()
@@ -745,6 +840,23 @@ are more predictable: ``.get()`` always returns a single result,
745840``.getall()`` always returns a list of all extracted results.
746841
747842
843+ Command-Line Interface Tools
844+ ============================
845+
846+ There are third-party tools that allow using Parsel from the command line:
847+
848+ - `Parsel CLI <https://github.com/rmax/parsel-cli>`_ allows applying
849+   Parsel selectors to the standard input. For example, you can apply a Parsel
850+   selector to the output of cURL_.
851+
852+ - `parselcli
853+   <https://github.com/Granitosaurus/parsel-cli>`_ provides an interactive
854+   shell that allows applying Parsel selectors to a remote URL or a local
855+   file.
856+
857+ .. _cURL: https://curl.haxx.se/
858+
859+
748860.. _topics-selectors-ref:
749861
750862API reference