 Usage
 =====

-Getting started
-===============
-
-If you already know how to write `CSS`_ or `XPath`_ expressions, using Parsel
-is straightforward: you just need to create a
-:class:`~parsel.selector.Selector` object for the HTML or XML text you want to
-parse, and use the available methods for selecting parts from the text and
-extracting data out of the result.
-
-Creating a :class:`~parsel.selector.Selector` object is simple::
+Create a :class:`~parsel.selector.Selector` object for the HTML or XML text
+that you want to parse::

     >>> from parsel import Selector
     >>> text = u"<html><body><h1>Hello, Parsel!</h1></body></html>"
-    >>> sel = Selector(text=text)
+    >>> selector = Selector(text=text)

-.. note::
-    One important thing to note is that if you're using Python 2,
-    make sure to use an `unicode` object for the text argument.
-    :class:`~parsel.selector.Selector` expects text to be an `unicode`
-    object in Python 2 or an `str` object in Python 3.
+.. note:: In Python 2, the ``text`` argument must be a ``unicode`` string.

-Once you have created the Selector object, you can use `CSS`_ or
-`XPath`_ expressions to select elements::
+Then use `CSS`_ or `XPath`_ expressions to select elements::

-    >>> sel.css('h1')
+    >>> selector.css('h1')
     [<Selector xpath='descendant-or-self::h1' data='<h1>Hello, Parsel!</h1>'>]
-    >>> sel.xpath('//h1') # the same, but now with XPath
+    >>> selector.xpath('//h1') # the same, but now with XPath
     [<Selector xpath='//h1' data='<h1>Hello, Parsel!</h1>'>]

 And extract data from those elements::

-    >>> sel.css('h1::text').get()
+    >>> selector.css('h1::text').get()
     'Hello, Parsel!'
-    >>> sel.xpath('//h1/text()').getall()
+    >>> selector.xpath('//h1/text()').getall()
     ['Hello, Parsel!']

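The create, select, and extract steps shown in this hunk can be combined into
one standalone script. This is a minimal sketch, assuming Parsel is installed;
the variable names ``heading`` and ``headings`` are illustrative and do not
come from the documentation.

.. code-block:: python

    from parsel import Selector

    text = "<html><body><h1>Hello, Parsel!</h1></body></html>"
    selector = Selector(text=text)

    # .get() returns the first match as a string, or None if nothing matches.
    heading = selector.css("h1::text").get()

    # .getall() returns every match as a list of strings.
    headings = selector.xpath("//h1/text()").getall()

    print(heading)
    print(headings)

Both queries address the same text node, so the CSS and XPath variants are
interchangeable for this document.
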
+.. _CSS: http://www.w3.org/TR/selectors
+.. _XPath: http://www.w3.org/TR/xpath
+
+Learning CSS and XPath
+======================
+
+`CSS`_ is a language for applying styles to HTML documents. It defines
+selectors to associate those styles with specific HTML elements. Resources to
+learn CSS_ selectors include:
+
+- `CSS selectors in the MDN`_
+
+- `XPath/CSS Equivalents in Wikibooks`_
+
 `XPath`_ is a language for selecting nodes in XML documents, which can also be
-used with HTML. `CSS`_ is a language for applying styles to HTML documents. It
-defines selectors to associate those styles with specific HTML elements.
+used with HTML. Resources to learn XPath_ include:

-You can use either language you're more comfortable with, though you may find
-that in some specific cases `XPath`_ is more powerful than `CSS`_.
+- `XPath Tutorial in W3Schools`_

-.. _XPath: http://www.w3.org/TR/xpath
-.. _CSS: http://www.w3.org/TR/selectors
+- `XPath cheatsheet`_
+
+You can use either CSS_ or XPath_. CSS_ is usually more readable, but some
+things can only be done with XPath_.
+
+.. _CSS selectors in the MDN: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors
+.. _XPath cheatsheet: https://devhints.io/xpath
+.. _XPath Tutorial in W3Schools: https://www.w3schools.com/xml/xpath_intro.asp
+.. _XPath/CSS Equivalents in Wikibooks: https://en.wikibooks.org/wiki/XPath/CSS_Equivalents


 Using selectors
@@ -840,6 +846,29 @@ are more predictable: ``.get()`` always returns a single result,
 ``.getall()`` always returns a list of all extracted results.


+Using CSS selectors in multi-root documents
+-------------------------------------------
+
+Some webpages may have multiple root elements. It can happen, for example, when
+a webpage has broken code, such as missing closing tags.
+
+You can use XPath to determine if a page has multiple root elements::
+
+    >>> len(selector.xpath('/*')) > 1
+    True
+
+CSS selectors only work on the first root element, because the first root
+element is always used as the starting current element, and CSS selectors do
+not allow selecting parent elements (XPath’s ``..``) or elements relative to
+the document root (XPath’s ``/``).
+
+If you want to use a CSS selector that takes into account all root elements,
+you need to precede your CSS query by an XPath query that reaches all root
+elements::
+
+    selector.xpath('/*').css('<your CSS selector>')
+
+
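The ``xpath('/*')`` prefix described in the added section can be wrapped in a
small helper. This is a sketch only: ``css_from_all_roots`` is a hypothetical
name, not part of Parsel, and the sample document has a single root, so the
helper returns the same matches a plain ``selector.css()`` call would. No
multi-root input is constructed here.

.. code-block:: python

    from parsel import Selector


    def css_from_all_roots(selector, query):
        """Run a CSS query from every root element (hypothetical helper)."""
        # '/*' selects all root elements; .css() then runs on each of them.
        return selector.xpath("/*").css(query)


    selector = Selector(text="<html><body><p>one</p><p>two</p></body></html>")

    # Number of root elements found in this (well-formed) document.
    root_count = len(selector.xpath("/*"))

    paragraphs = css_from_all_roots(selector, "p::text").getall()
    print(root_count, paragraphs)
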
 Command-Line Interface Tools
 ============================

@@ -857,27 +886,11 @@ There are third-party tools that allow using Parsel from the command line:
 .. _cURL: https://curl.haxx.se/


-.. _topics-selectors-ref:
-
-API reference
-=============
-
-Selector objects
-----------------
-
-.. autoclass:: parsel.selector.Selector
-    :members:
-
-
-SelectorList objects
---------------------
-
-.. autoclass:: parsel.selector.SelectorList
-    :members:
-
-
 .. _selector-examples-html:

+Examples
+========
+
 Working on HTML
 ---------------

@@ -936,7 +949,8 @@ Removing namespaces
 When dealing with scraping projects, it is often quite convenient to get rid of
 namespaces altogether and just work with element names, to write more
 simple/convenient XPaths. You can use the
-:meth:`Selector.remove_namespaces` method for that.
+:meth:`Selector.remove_namespaces <parsel.selector.Selector.remove_namespaces>`
+method for that.

 Let's show an example that illustrates this with the Python Insider blog atom feed.

@@ -947,10 +961,12 @@ Let's download the atom feed using `requests`_ and create a selector::
     >>> text = requests.get('https://feeds.feedburner.com/PythonInsider').text
     >>> sel = Selector(text=text, type='xml')

-This is how the file starts::
+This is how the file starts:
+
+.. code-block:: xml

     <?xml version="1.0" encoding="UTF-8"?>
-    <?xml-stylesheet ...
+    <?xml-stylesheet ... ?>
     <feed xmlns="http://www.w3.org/2005/Atom"
           xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
           xmlns:blogger="http://schemas.google.com/blogger/2008"
@@ -959,6 +975,7 @@ This is how the file starts::
           xmlns:thr="http://purl.org/syndication/thread/1.0"
           xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
       ...
+    </feed>

 You can see several namespace declarations including a default
 "http://www.w3.org/2005/Atom" and another one using the "gd:" prefix for
@@ -970,8 +987,9 @@ We can try selecting all ``<link>`` objects and then see that it doesn't work
     >>> sel.xpath("//link")
     []

-But once we call the :meth:`Selector.remove_namespaces` method, all
-nodes can be accessed directly by their names::
+But once we call the :meth:`Selector.remove_namespaces
+<parsel.selector.Selector.remove_namespaces>` method, all nodes can be accessed
+directly by their names::

     >>> sel.remove_namespaces()
     >>> sel.xpath("//link")
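
The namespace behaviour in this example can be reproduced without downloading
the feed. The sketch below assumes Parsel is installed and uses a shortened,
made-up Atom document (the ``example.com`` URLs are placeholders, not content
from the Python Insider feed). It also shows the ``namespaces`` argument of
``xpath()`` as an alternative that leaves the document unchanged.

.. code-block:: python

    from parsel import Selector

    # Hypothetical, shortened Atom document; not the real feed.
    text = """<feed xmlns="http://www.w3.org/2005/Atom">
      <link rel="self" href="https://example.com/feed"/>
      <entry>
        <link rel="alternate" href="https://example.com/post"/>
      </entry>
    </feed>
    """
    sel = Selector(text=text, type="xml")

    # The elements live in the default Atom namespace, so a bare name
    # matches nothing.
    before = sel.xpath("//link").getall()

    # Alternative: bind a prefix (here "atom", an arbitrary choice) to the
    # namespace URI for this one query.
    with_prefix = sel.xpath(
        "//atom:link/@href",
        namespaces={"atom": "http://www.w3.org/2005/Atom"},
    ).getall()

    # remove_namespaces() modifies the parsed document in place, after which
    # bare element names match.
    sel.remove_namespaces()
    after = sel.xpath("//link/@href").getall()

    print(before, with_prefix, after)

Passing ``namespaces`` keeps the namespace information intact, which may
matter when two vocabularies in one document reuse the same element name.
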