Commit 4c3d914

Merge branch 'master' into documentation-introduction
2 parents cefa6c7 + 93719a1 commit 4c3d914

File tree

13 files changed: +161, -115 lines changed


.travis.yml

Lines changed: 5 additions & 20 deletions
@@ -8,10 +8,8 @@ matrix:
   include:
     - python: 2.7
       env: TOXENV=py27
-    - python: 2.7
+    - python: pypy
       env: TOXENV=pypy
-    - python: 2.7
-      env: TOXENV=pypy3
     - python: 3.4
       env: TOXENV=py34
     - python: 3.5
@@ -20,24 +18,11 @@ matrix:
       env: TOXENV=py36
     - python: 3.7
       env: TOXENV=py37
-      dist: xenial
-      sudo: true
+    - python: pypy3
+      env: TOXENV=pypy3
+    - python: 3.7
+      env: TOXENV=docs
 install:
-  - |
-    if [ "$TOXENV" = "pypy" ]; then
-      export PYPY_VERSION="pypy-6.0.0-linux_x86_64-portable"
-      wget "https://bitbucket.org/squeaky/portable-pypy/downloads/${PYPY_VERSION}.tar.bz2"
-      tar -jxf ${PYPY_VERSION}.tar.bz2
-      virtualenv --python="$PYPY_VERSION/bin/pypy" "$HOME/virtualenvs/$PYPY_VERSION"
-      source "$HOME/virtualenvs/$PYPY_VERSION/bin/activate"
-    fi
-    if [ "$TOXENV" = "pypy3" ]; then
-      export PYPY_VERSION="pypy3.5-6.0.0-linux_x86_64-portable"
-      wget "https://bitbucket.org/squeaky/portable-pypy/downloads/${PYPY_VERSION}.tar.bz2"
-      tar -jxf ${PYPY_VERSION}.tar.bz2
-      virtualenv --python="$PYPY_VERSION/bin/pypy3" "$HOME/virtualenvs/$PYPY_VERSION"
-      source "$HOME/virtualenvs/$PYPY_VERSION/bin/activate"
-    fi
   - pip install -U pip tox twine wheel codecov
 script: tox
 after_success:

README.rst

Lines changed: 4 additions & 0 deletions
@@ -21,6 +21,8 @@ and XML_ using XPath_ and CSS_ selectors, optionally combined with
 
 Find the Parsel online documentation at https://parsel.readthedocs.org.
 
+Example (`open online demo`_):
+
 .. code-block:: python
 
     >>> from parsel import Selector
@@ -42,8 +44,10 @@ Find the Parsel online documentation at https://parsel.readthedocs.org.
     http://example.com
     http://scrapy.org
 
+
 .. _CSS: https://en.wikipedia.org/wiki/Cascading_Style_Sheets
 .. _HTML: https://en.wikipedia.org/wiki/HTML
+.. _open online demo: https://colab.research.google.com/drive/149VFa6Px3wg7S3SEnUqk--TyBrKplxCN#forceEdit=true&sandboxMode=true
 .. _Python: https://www.python.org/
 .. _regular expressions: https://docs.python.org/library/re.html
 .. _XML: https://en.wikipedia.org/wiki/XML
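For readers without parsel installed, the ``.get()`` / ``.getall()`` contract shown in the README example can be imitated with the standard library alone. This is a sketch, not parsel's implementation: ``H1Text`` is a made-up helper built on ``html.parser``, collecting ``<h1>`` text the way ``selector.css('h1::text')`` would.

```python
from html.parser import HTMLParser

class H1Text(HTMLParser):
    """Collect the text of every <h1>, imitating selector.css('h1::text')."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.texts.append(data)

parser = H1Text()
parser.feed("<html><body><h1>Hello, Parsel!</h1></body></html>")
print(parser.texts)     # every match, like .getall() -> ['Hello, Parsel!']
print(parser.texts[0])  # first match, like .get()
```

The point of the analogy: ``getall`` corresponds to the whole collected list, ``get`` to its first element (or ``None`` when the list is empty, which parsel handles for you).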

docs/conf.py

Lines changed: 24 additions & 1 deletion
@@ -40,7 +40,11 @@
 
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
-extensions = ['sphinx.ext.autodoc', 'sphinx.ext.viewcode']
+extensions = [
+    'sphinx.ext.autodoc',
+    'sphinx.ext.intersphinx',
+    'sphinx.ext.viewcode',
+]
 
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ['_templates']
@@ -273,3 +277,22 @@
 
 # If true, do not generate a @detailmenu in the "Top" node's menu.
 #texinfo_no_detailmenu = False
+
+
+# -- Options for the InterSphinx extension ------------------------------------
+
+intersphinx_mapping = {
+    'cssselect': ('https://cssselect.readthedocs.io/en/latest', None),
+    'python': ('https://docs.python.org/3', None),
+}
+
+
+# --- Nitpicking options ------------------------------------------------------
+
+nitpicky = True
+nitpick_ignore = [
+    ('py:class', 'cssselect.xpath.GenericTranslator'),
+    ('py:class', 'cssselect.xpath.HTMLTranslator'),
+    ('py:class', 'cssselect.xpath.XPathExpr'),
+    ('py:class', 'lxml.etree.XMLParser'),
+]
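To see what the nitpicking options added in this hunk do: ``nitpicky = True`` makes Sphinx warn on every cross-reference it cannot resolve, and ``nitpick_ignore`` whitelists specific ``(role, target)`` pairs. A toy Python sketch of the shape of that filtering logic (not Sphinx's actual code):

```python
# Toy model of Sphinx's nitpicky warning filter: an unresolved reference
# produces a warning unless its (role, target) pair is whitelisted.
nitpick_ignore = [
    ('py:class', 'cssselect.xpath.HTMLTranslator'),
    ('py:class', 'lxml.etree.XMLParser'),
]

def should_warn(role, target, nitpicky=True):
    """Return True if an unresolved cross-reference should emit a warning."""
    return nitpicky and (role, target) not in nitpick_ignore

print(should_warn('py:class', 'lxml.etree.XMLParser'))  # False: whitelisted
print(should_warn('py:class', 'parsel.Selector'))       # True: would warn
```

The ignore list here presumably covers cssselect and lxml classes whose docs publish no intersphinx inventory entry, so references to them would otherwise fail the nitpicky check.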

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ Contents:
 
    installation
    usage
+   parsel
    history
 
 Indices and tables

docs/modules.rst

Lines changed: 0 additions & 7 deletions
This file was deleted.

docs/parsel.rst

Lines changed: 11 additions & 19 deletions
@@ -1,38 +1,30 @@
-parsel package
-==============
+API reference
+=============
 
-Submodules
-----------
-
-parsel.csstranslator module
----------------------------
+parsel.csstranslator
+--------------------
 
 .. automodule:: parsel.csstranslator
     :members:
     :undoc-members:
     :show-inheritance:
 
-parsel.selector module
-----------------------
 
-.. automodule:: parsel.selector
-    :members:
-    :undoc-members:
-    :show-inheritance:
+.. _topics-selectors-ref:
 
-parsel.utils module
--------------------
+parsel.selector
+---------------
 
-.. automodule:: parsel.utils
+.. automodule:: parsel.selector
     :members:
     :undoc-members:
     :show-inheritance:
 
 
-Module contents
----------------
+parsel.utils
+------------
 
-.. automodule:: parsel
+.. automodule:: parsel.utils
     :members:
    :undoc-members:
     :show-inheritance:

docs/readme.rst

Lines changed: 0 additions & 2 deletions
This file was deleted.

docs/usage.rst

Lines changed: 70 additions & 52 deletions
@@ -4,51 +4,57 @@
 Usage
 =====
 
-Getting started
-===============
-
-If you already know how to write `CSS`_ or `XPath`_ expressions, using Parsel
-is straightforward: you just need to create a
-:class:`~parsel.selector.Selector` object for the HTML or XML text you want to
-parse, and use the available methods for selecting parts from the text and
-extracting data out of the result.
-
-Creating a :class:`~parsel.selector.Selector` object is simple::
+Create a :class:`~parsel.selector.Selector` object for the HTML or XML text
+that you want to parse::
 
     >>> from parsel import Selector
    >>> text = u"<html><body><h1>Hello, Parsel!</h1></body></html>"
-    >>> sel = Selector(text=text)
+    >>> selector = Selector(text=text)
 
-.. note::
-    One important thing to note is that if you're using Python 2,
-    make sure to use an `unicode` object for the text argument.
-    :class:`~parsel.selector.Selector` expects text to be an `unicode`
-    object in Python 2 or an `str` object in Python 3.
+.. note:: In Python 2, the ``text`` argument must be a ``unicode`` string.
 
-Once you have created the Selector object, you can use `CSS`_ or
-`XPath`_ expressions to select elements::
+Then use `CSS`_ or `XPath`_ expressions to select elements::
 
-    >>> sel.css('h1')
+    >>> selector.css('h1')
     [<Selector xpath='descendant-or-self::h1' data='<h1>Hello, Parsel!</h1>'>]
-    >>> sel.xpath('//h1')  # the same, but now with XPath
+    >>> selector.xpath('//h1')  # the same, but now with XPath
     [<Selector xpath='//h1' data='<h1>Hello, Parsel!</h1>'>]
 
 And extract data from those elements::
 
-    >>> sel.css('h1::text').get()
+    >>> selector.css('h1::text').get()
     'Hello, Parsel!'
-    >>> sel.xpath('//h1/text()').getall()
+    >>> selector.xpath('//h1/text()').getall()
     ['Hello, Parsel!']
 
+.. _CSS: http://www.w3.org/TR/selectors
+.. _XPath: http://www.w3.org/TR/xpath
+
+Learning CSS and XPath
+======================
+
+`CSS`_ is a language for applying styles to HTML documents. It defines
+selectors to associate those styles with specific HTML elements. Resources to
+learn CSS_ selectors include:
+
+- `CSS selectors in the MDN`_
+
+- `XPath/CSS Equivalents in Wikibooks`_
+
 `XPath`_ is a language for selecting nodes in XML documents, which can also be
-used with HTML. `CSS`_ is a language for applying styles to HTML documents. It
-defines selectors to associate those styles with specific HTML elements.
+used with HTML. Resources to learn XPath_ include:
 
-You can use either language you're more comfortable with, though you may find
-that in some specific cases `XPath`_ is more powerful than `CSS`_.
+- `XPath Tutorial in W3Schools`_
 
-.. _XPath: http://www.w3.org/TR/xpath
-.. _CSS: http://www.w3.org/TR/selectors
+- `XPath cheatsheet`_
+
+You can use either CSS_ or XPath_. CSS_ is usually more readable, but some
+things can only be done with XPath_.
+
+.. _CSS selectors in the MDN: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors
+.. _XPath cheatsheet: https://devhints.io/xpath
+.. _XPath Tutorial in W3Schools: https://www.w3schools.com/xml/xpath_intro.asp
+.. _XPath/CSS Equivalents in Wikibooks: https://en.wikibooks.org/wiki/XPath/CSS_Equivalents
 
 
 Using selectors
@@ -840,6 +846,29 @@ are more predictable: ``.get()`` always returns a single result,
 ``.getall()`` always returns a list of all extracted results.
 
 
+Using CSS selectors in multi-root documents
+-------------------------------------------
+
+Some webpages may have multiple root elements. It can happen, for example, when
+a webpage has broken code, such as missing closing tags.
+
+You can use XPath to determine if a page has multiple root elements::
+
+    >>> len(selector.xpath('/*')) > 1
+    True
+
+CSS selectors only work on the first root element, because the first root
+element is always used as the starting current element, and CSS selectors do
+not allow selecting parent elements (XPath's ``..``) or elements relative to
+the document root (XPath's ``/``).
+
+If you want to use a CSS selector that takes into account all root elements,
+you need to precede your CSS query by an XPath query that reaches all root
+elements::
+
+    selector.xpath('/*').css('<your CSS selector>')
+
 Command-Line Interface Tools
 ============================
 
@@ -857,27 +886,11 @@ There are third-party tools that allow using Parsel from the command line:
 .. _cURL: https://curl.haxx.se/
 
 
-.. _topics-selectors-ref:
-
-API reference
-=============
-
-Selector objects
-----------------
-
-.. autoclass:: parsel.selector.Selector
-    :members:
-
-
-SelectorList objects
---------------------
-
-.. autoclass:: parsel.selector.SelectorList
-    :members:
-
-
 .. _selector-examples-html:
 
+Examples
+========
+
 Working on HTML
 ---------------
 
@@ -936,7 +949,8 @@ Removing namespaces
 When dealing with scraping projects, it is often quite convenient to get rid of
 namespaces altogether and just work with element names, to write more
 simple/convenient XPaths. You can use the
-:meth:`Selector.remove_namespaces` method for that.
+:meth:`Selector.remove_namespaces <parsel.selector.Selector.remove_namespaces>`
+method for that.
 
 Let's show an example that illustrates this with the Python Insider blog atom feed.
 
@@ -947,10 +961,12 @@ Let's download the atom feed using `requests`_ and create a selector::
     >>> text = requests.get('https://feeds.feedburner.com/PythonInsider').text
     >>> sel = Selector(text=text, type='xml')
 
-This is how the file starts::
+This is how the file starts:
+
+.. code-block:: xml
 
     <?xml version="1.0" encoding="UTF-8"?>
-    <?xml-stylesheet ...
+    <?xml-stylesheet ... ?>
     <feed xmlns="http://www.w3.org/2005/Atom"
           xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
           xmlns:blogger="http://schemas.google.com/blogger/2008"
@@ -959,6 +975,7 @@ This is how the file starts::
           xmlns:thr="http://purl.org/syndication/thread/1.0"
           xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
     ...
+    </feed>
 
 You can see several namespace declarations including a default
 "http://www.w3.org/2005/Atom" and another one using the "gd:" prefix for
@@ -970,8 +987,9 @@ We can try selecting all ``<link>`` objects and then see that it doesn't work
     >>> sel.xpath("//link")
     []
 
-But once we call the :meth:`Selector.remove_namespaces` method, all
-nodes can be accessed directly by their names::
+But once we call the :meth:`Selector.remove_namespaces
+<parsel.selector.Selector.remove_namespaces>` method, all nodes can be accessed
+directly by their names::
 
     >>> sel.remove_namespaces()
     >>> sel.xpath("//link")
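The namespace behaviour that the rewritten ``remove_namespaces`` section describes can be reproduced without parsel. A sketch using only ``xml.etree.ElementTree``, with a made-up two-link Atom snippet standing in for the real feed:

```python
import xml.etree.ElementTree as ET

# Minimal Atom-like document with a default namespace, as in the docs example.
FEED = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://example.com/a"/>
  <link href="http://example.com/b"/>
</feed>"""

root = ET.fromstring(FEED)

# A bare name matches nothing: every element lives in the Atom namespace,
# which is the same reason sel.xpath("//link") returns [] above.
print(root.findall('link'))  # []

# Qualifying the name with the namespace URI finds both <link> elements.
links = root.findall('{http://www.w3.org/2005/Atom}link')
print([link.get('href') for link in links])
# ['http://example.com/a', 'http://example.com/b']
```

parsel's ``remove_namespaces()`` takes the opposite route: instead of qualifying every query, it strips the namespaces from the parsed tree so that bare names match.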

parsel/csstranslator.py

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
 from cssselect import HTMLTranslator as OriginalHTMLTranslator
 from cssselect.xpath import XPathExpr as OriginalXPathExpr
 from cssselect.xpath import _unicode_safe_getattr, ExpressionError
-from cssselect.parser import parse, FunctionalPseudoElement
+from cssselect.parser import FunctionalPseudoElement
 
 
 class XPathExpr(OriginalXPathExpr):
