Commit cefa6c7

Merge branch 'master' into documentation-introduction
2 parents b6de168 + ec00965 commit cefa6c7

File tree

10 files changed: +307 -19 lines changed

.bumpversion.cfg

Lines changed: 1 addition & 1 deletion

@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 1.5.1
+current_version = 1.5.2
 commit = True
 tag = True
 tag_name = v{new_version}

NEWS

Lines changed: 11 additions & 0 deletions

@@ -3,6 +3,16 @@
 History
 -------

+1.5.2 (2019-08-09)
+~~~~~~~~~~~~~~~~~~
+
+* ``Selector.remove_namespaces`` received a significant performance improvement
+* The value of ``data`` within the printable representation of a selector
+  (``repr(selector)``) now ends in ``...`` when truncated, to make the
+  truncation obvious.
+* Minor documentation improvements.
+
+
 1.5.1 (2018-10-25)
 ~~~~~~~~~~~~~~~~~~


@@ -12,6 +22,7 @@ History
 * documentation improvements;
 * Python 3.7 tests are run on CI; other test improvements.

+
 1.5.0 (2018-07-04)
 ~~~~~~~~~~~~~~~~~~

README.rst

Lines changed: 3 additions & 3 deletions

@@ -15,8 +15,8 @@ Parsel
     :alt: Coverage report


-Parsel is a BSD-licensed Python_ library to extract data from HTML_ and XML_
-using XPath_ and CSS_ selectors, optionally combined with
+Parsel is a BSD-licensed Python_ library to extract and remove data from HTML_
+and XML_ using XPath_ and CSS_ selectors, optionally combined with
 `regular expressions`_.

 Find the Parsel online documentation at https://parsel.readthedocs.org.

@@ -30,7 +30,7 @@ Find the Parsel online documentation at https://parsel.readthedocs.org.
         <ul>
             <li><a href="http://example.com">Link 1</a></li>
             <li><a href="http://scrapy.org">Link 2</a></li>
-        </ul
+        </ul>
     </body>
 </html>""")
 >>> selector.css('h1::text').get()

docs/usage.rst

Lines changed: 112 additions & 0 deletions

@@ -385,6 +385,41 @@ XPath specification.
 .. _Location Paths: https://www.w3.org/TR/xpath#location-paths


+Removing elements
+-----------------
+
+If for any reason you need to remove elements based on a Selector or
+a SelectorList, you can do it with the ``remove()`` method, available for both
+classes.
+
+.. warning:: this is a destructive action and cannot be undone. The original
+   content of the selector is removed from the elements tree. This could be useful
+   when trying to reduce the memory footprint of Responses.
+
+Example removing an ad from a blog post:
+
+    >>> from parsel import Selector
+    >>> doc = u"""
+    ... <article>
+    ...     <div class="row">Content paragraph...</div>
+    ...     <div class="row">
+    ...         <div class="ad">
+    ...             Ad content...
+    ...             <a href="http://...">Link</a>
+    ...         </div>
+    ...     </div>
+    ...     <div class="row">More content...</div>
+    ... </article>
+    ... """
+    >>> sel = Selector(text=doc)
+    >>> sel.xpath('//div/text()').getall()
+    ['Content paragraph...', 'Ad content...', 'Link', 'More content...']
+    >>> sel.xpath('//div[@class="ad"]').remove()
+    >>> sel.xpath('//div//text()').getall()
+    ['Content paragraph...', 'More content...']
+    >>>
+
+
 Using EXSLT extensions
 ----------------------


@@ -695,6 +730,66 @@ you can just select by class using CSS and then switch to XPath when needed::
 This is cleaner than using the verbose XPath trick shown above. Just remember
 to use the ``.`` in the XPath expressions that will follow.

+
+Beware of how script and style tags differ from other tags
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`Following the standard`__, the contents of ``script`` and ``style`` elements
+are parsed as plain text.
+
+__ https://www.w3.org/TR/html401/types.html#type-cdata
+
+This means that XML-like structures found within them, including comments, are
+all treated as part of the element text, and not as separate nodes.
+
+For example::
+
+    >>> from parsel import Selector
+    >>> selector = Selector(text="""
+    ... <script>
+    ...     text
+    ...     <!-- comment -->
+    ...     <br/>
+    ... </script>
+    ... <style>
+    ...     text
+    ...     <!-- comment -->
+    ...     <br/>
+    ... </style>
+    ... <div>
+    ...     text
+    ...     <!-- comment -->
+    ...     <br/>
+    ... </div>""")
+    >>> for tag in selector.xpath('//*[contains(text(), "text")]'):
+    ...     print(tag.xpath('name()').get())
+    ...     print(' Text: ' + (tag.xpath('text()').get() or ''))
+    ...     print(' Comment: ' + (tag.xpath('comment()').get() or ''))
+    ...     print(' Children: ' + ''.join(tag.xpath('*').getall()))
+    ...
+    script
+     Text:
+        text
+        <!-- comment -->
+        <br/>
+
+     Comment:
+     Children:
+    style
+     Text:
+        text
+        <!-- comment -->
+        <br/>
+
+     Comment:
+     Children:
+    div
+     Text:
+        text
+
+     Comment: <!-- comment -->
+     Children: <br>
+
 .. _old-extraction-api:

 extract() and extract_first()

@@ -745,6 +840,23 @@ are more predictable: ``.get()`` always returns a single result,
 ``.getall()`` always returns a list of all extracted results.


+Command-Line Interface Tools
+============================
+
+There are third-party tools that allow using Parsel from the command line:
+
+- `Parsel CLI <https://github.com/rmax/parsel-cli>`_ allows applying
+  Parsel selectors to the standard input. For example, you can apply a Parsel
+  selector to the output of cURL_.
+
+- `parselcli
+  <https://github.com/Granitosaurus/parsel-cli>`_ provides an interactive
+  shell that allows applying Parsel selectors to a remote URL or a local
+  file.
+
+.. _cURL: https://curl.haxx.se/
+
+
 .. _topics-selectors-ref:

 API reference
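
The contrast documented in the new "script and style" section can be checked by querying for the ``<br/>`` elements directly. The following is a minimal sketch, not taken from the commit; it assumes lxml's default HTML serialization, which renders the element as ``<br>``::

    >>> from parsel import Selector
    >>> selector = Selector(text="<script>text <br/></script><div>text <br/></div>")
    >>> selector.xpath('//script/br').getall()  # the <br/> inside <script> is plain text
    []
    >>> selector.xpath('//div/br').getall()     # the <br/> inside <div> is a real child node
    ['<br>']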

parsel/__init__.py

Lines changed: 1 addition & 1 deletion

@@ -5,7 +5,7 @@

 __author__ = 'Scrapy project'
 __email__ = '[email protected]'
-__version__ = '1.5.1'
+__version__ = '1.5.2'

 from parsel.selector import Selector, SelectorList  # NOQA
 from parsel.csstranslator import css2xpath  # NOQA

parsel/selector.py

Lines changed: 43 additions & 4 deletions

@@ -7,10 +7,18 @@
 import six
 from lxml import etree, html

-from .utils import flatten, iflatten, extract_regex
+from .utils import flatten, iflatten, extract_regex, shorten
 from .csstranslator import HTMLTranslator, GenericTranslator


+class CannotRemoveElementWithoutRoot(Exception):
+    pass
+
+
+class CannotRemoveElementWithoutParent(Exception):
+    pass
+
+
 class SafeXMLParser(etree.XMLParser):
     def __init__(self, *args, **kwargs):
         kwargs.setdefault('resolve_entities', False)

@@ -150,6 +158,13 @@ def attrib(self):
         else:
             return {}

+    def remove(self):
+        """
+        Remove matched nodes from the parent for each element in this list.
+        """
+        for x in self:
+            x.remove()
+

 class Selector(object):
     """

@@ -339,8 +354,32 @@ def remove_namespaces(self):
             for an in el.attrib.keys():
                 if an.startswith('{'):
                     el.attrib[an.split('}', 1)[1]] = el.attrib.pop(an)
-            # remove namespace declarations
-            etree.cleanup_namespaces(self.root)
+        # remove namespace declarations
+        etree.cleanup_namespaces(self.root)
+
+    def remove(self):
+        """
+        Remove matched nodes from the parent element.
+        """
+        try:
+            parent = self.root.getparent()
+        except AttributeError:
+            # 'str' object has no attribute 'getparent'
+            raise CannotRemoveElementWithoutRoot(
+                "The node you're trying to remove has no root, "
+                "are you trying to remove a pseudo-element? "
+                "Try to use 'li' as a selector instead of 'li::text' or "
+                "'//li' instead of '//li/text()', for example."
+            )
+
+        try:
+            parent.remove(self.root)
+        except AttributeError:
+            # 'NoneType' object has no attribute 'remove'
+            raise CannotRemoveElementWithoutParent(
+                "The node you're trying to remove has no parent, "
+                "are you trying to remove a root element?"
+            )

     @property
     def attrib(self):

@@ -358,6 +397,6 @@ def __bool__(self):
     __nonzero__ = __bool__

     def __str__(self):
-        data = repr(self.get()[:40])
+        data = repr(shorten(self.get(), width=40))
         return "<%s xpath=%r data=%s>" % (type(self).__name__, self._expr, data)
     __repr__ = __str__
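
A short sketch of how the new ``remove()`` API behaves (illustrative only; the serialized output assumes lxml's HTML output). ``SelectorList.remove()`` calls ``Selector.remove()`` on every match, so removed nodes no longer show up in later queries on the same document; per the error messages above, selecting a pseudo-element such as ``li::text`` raises ``CannotRemoveElementWithoutRoot``, and removing the root element raises ``CannotRemoveElementWithoutParent``::

    >>> from parsel import Selector
    >>> sel = Selector(text='<ul><li>one</li><li>two</li></ul>')
    >>> sel.css('li').getall()
    ['<li>one</li>', '<li>two</li>']
    >>> sel.css('li').remove()  # drops both <li> elements from the underlying tree
    >>> sel.css('li').getall()
    []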

parsel/utils.py

Lines changed: 12 additions & 1 deletion

@@ -80,4 +80,15 @@ def extract_regex(regex, text, replace_entities=True):
     strings = flatten(strings)
     if not replace_entities:
         return strings
-    return [w3lib_replace_entities(s, keep=['lt', 'amp']) for s in strings]
\ No newline at end of file
+    return [w3lib_replace_entities(s, keep=['lt', 'amp']) for s in strings]
+
+
+def shorten(text, width, suffix='...'):
+    """Truncate the given text to fit in the given width."""
+    if len(text) <= width:
+        return text
+    if width > len(suffix):
+        return text[:width-len(suffix)] + suffix
+    if width >= 0:
+        return suffix[len(suffix)-width:]
+    raise ValueError('width must be equal or greater than 0')
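
The new ``shorten()`` helper is what puts the ``...`` suffix on truncated ``repr`` output (see the NEWS entry and the ``parsel/selector.py`` change above). A quick sketch of its three branches, written as a doctest::

    >>> from parsel.utils import shorten
    >>> shorten('foobar', width=10)  # fits: returned unchanged
    'foobar'
    >>> shorten('foobar', width=5)   # truncated, the suffix marks the cut
    'fo...'
    >>> shorten('foobar', width=2)   # width smaller than the suffix itself
    '..'
    >>> shorten('foobar', width=-1)
    Traceback (most recent call last):
        ...
    ValueError: width must be equal or greater than 0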

setup.py

Lines changed: 3 additions & 2 deletions

@@ -27,7 +27,8 @@ def has_environment_marker_platform_impl_support():

 install_requires = [
     'w3lib>=1.19.0',
-    'lxml>=2.3',
+    'lxml;python_version!="3.4"',
+    'lxml<=4.3.5;python_version=="3.4"',
     'six>=1.5.2',
     'cssselect>=0.9'
 ]

@@ -41,7 +42,7 @@ def has_environment_marker_platform_impl_support():

 setup(
     name='parsel',
-    version='1.5.1',
+    version='1.5.2',
     description="Parsel is a library to extract data from HTML and XML using XPath and CSS selectors",
     long_description=readme + '\n\n' + history,
     author="Scrapy project",
