@@ -17,29 +17,19 @@ HTML to Text
1717
1818Extract text from HTML
1919
20-
2120* Free software: MIT license
2221
23-
2422How is html_text different from ``.xpath('//text()')`` from lxml
2523or ``.get_text()`` from Beautiful Soup?
26- Text extracted with ``html_text `` does not contain inline styles,
27- javascript, comments and other text that is not normally visible to the users.
28- It normalizes whitespace, but is also smarter than
29- ``.xpath('normalize-space()) ``, adding spaces around inline elements
30- (which are often used as block elements in html markup),
31- tries to avoid adding extra spaces for punctuation and
32- can add newlines so that the output text looks like how it is rendered in
33- browsers.
34-
35- Apart from just getting text from the page (e.g. for display or search),
36- one intended usage of this library is for machine learning (feature extraction).
37- If you want to use the text of the html page as a feature (e.g. for classification),
38- this library gives you plain text that you can later feed into a standard text
39- classification pipeline.
40- If you feel that you need html structure as well, check out
41- `webstruct <http://webstruct.readthedocs.io/en/latest/ >`_ library.
4224
25+ * Text extracted with ``html_text`` does not contain inline styles,
26+ JavaScript, comments, or other text that is not normally visible to users;
27+ * ``html_text`` normalizes whitespace, but in a smarter way than
28+ ``.xpath('normalize-space()')``: it adds spaces around inline elements
29+ (which are often used as block elements in HTML markup) and tries to
30+ avoid adding extra spaces around punctuation;
31+ * ``html_text`` can add newlines (e.g. after headers or paragraphs), so
32+ that the output text looks more like how it is rendered in browsers.
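
The three points above can be illustrated with a small standard-library sketch. This is not how ``html_text`` is implemented (the library builds on lxml); the names ``INVISIBLE_TAGS``, ``NEWLINE_TAGS``, ``VisibleTextParser`` and ``visible_text`` are hypothetical and exist only for this example. It drops style, script, and comment content, collapses whitespace, and emits a newline around a few block-level tags.

```python
import re
from html.parser import HTMLParser

# Hypothetical tag sets chosen for this sketch; html_text uses its own rules.
INVISIBLE_TAGS = {"style", "script", "noscript"}
NEWLINE_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "div", "li", "br"}


class VisibleTextParser(HTMLParser):
    """Collect text that a browser would normally show to the user."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.hidden_depth = 0  # > 0 while inside an invisible element

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE_TAGS:
            self.hidden_depth += 1
        elif tag in NEWLINE_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in INVISIBLE_TAGS:
            self.hidden_depth = max(0, self.hidden_depth - 1)
        elif tag in NEWLINE_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self.hidden_depth == 0:
            # Whitespace in the source (including newlines) becomes one space,
            # so line breaks in the output come only from NEWLINE_TAGS.
            self.chunks.append(re.sub(r"\s+", " ", data))

    # Comments are dropped: the default handle_comment does nothing.


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.chunks).split("\n"))
    return "\n".join(line for line in lines if line)


print(visible_text("<h1>Hello</h1> <style>.foo{}</style><!-- note --> world!"))
```

The sketch ignores the finer points mentioned above, such as spacing around inline elements and punctuation, which is where a dedicated library is most useful.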
4333
4434Install
4535-------
@@ -48,7 +38,7 @@ Install with pip::
4838
4939 pip install html-text
5040
51- The package depends on lxml, so you might need to install some additional
41+ The package depends on lxml, so you might need to install additional
5242packages: http://lxml.de/installation.html
5343
5444
@@ -64,31 +54,46 @@ Extract text from HTML::
6454 >>> html_text.extract_text('<h1>Hello</h1> world!', guess_layout=False)
6555 'Hello world!'
6656
57+ The passed HTML is first cleaned of invisible non-text content, such
58+ as styles, and then the text is extracted.
6759
68-
69- You can also pass already parsed ``lxml.html.HtmlElement ``:
60+ You can also pass an already parsed ``lxml.html.HtmlElement``:
7061
7162 >>> import html_text
7263 >>> tree = html_text.parse_html('<h1>Hello</h1> world!')
7364 >>> html_text.extract_text(tree)
7465 'Hello\n\nworld!'
7566
76- Or define a selector to extract text only from specific elements:
67+ To handle cleaning manually, use the lower-level
68+ ``html_text.etree_to_text`` function:
69+
70+ >>> import html_text
71+ >>> tree = html_text.parse_html('<h1>Hello<style>.foo{}</style>!</h1>')
72+ >>> cleaned_tree = html_text.cleaner.clean_html(tree)
73+ >>> html_text.etree_to_text(cleaned_tree)
74+ 'Hello!'
75+
76+ parsel.Selector objects are also supported; you can define
77+ a parsel.Selector to extract text only from specific elements:
7778
7879 >>> import html_text
7980 >>> sel = html_text.cleaned_selector('<h1>Hello</h1> world!')
8081 >>> subsel = sel.xpath('//h1')
8182 >>> html_text.selector_to_text(subsel)
8283 'Hello'
8384
84- Passed html will be first cleaned from invisible non-text content such
85- as styles, and then text would be extracted.
86- NB Selectors are not cleaned automatically you need to call
85+ NB: parsel.Selector objects are not cleaned automatically; you need to call
8786``html_text.cleaned_selector`` first.
8887
89- Main functions:
88+ Main functions and objects:
9089
9190* ``html_text.extract_text`` accepts html and returns extracted text.
91+ * ``html_text.etree_to_text`` accepts a parsed lxml Element and returns
92+ extracted text; it is a lower-level function, and cleaning is not handled
93+ here.
94+ * ``html_text.cleaner`` is an ``lxml.html.clean.Cleaner`` instance which
95+ can be used with ``html_text.etree_to_text``; its options are tuned for
96+ speed and text extraction quality.
9297* ``html_text.cleaned_selector`` accepts html as text or as
9398 ``lxml.html.HtmlElement``, and returns cleaned ``parsel.Selector``.
9499* ``html_text.selector_to_text`` accepts ``parsel.Selector`` and returns
@@ -111,10 +116,13 @@ after ``<div>`` tags:
111116 ... newline_tags=newline_tags)
112117 'Hello world!'
113118
114- Credits
115- -------
116-
117- The code is extracted from utilities used in several projects, written by Mikhail Korobov.
119+ Apart from just getting text from the page (e.g. for display or search),
120+ one intended use of this library is machine learning (feature extraction).
121+ If you want to use the text of an HTML page as a feature (e.g. for classification),
122+ this library gives you plain text that you can later feed into a standard text
123+ classification pipeline.
124+ If you feel that you need the HTML structure as well, check out the
125+ `webstruct <http://webstruct.readthedocs.io/en/latest/>`_ library.
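
As a rough illustration of the feature-extraction use case, the sketch below turns already extracted plain text into bag-of-words counts using only the standard library. The ``bag_of_words`` helper is hypothetical and not part of ``html_text``; the input string stands in for the output of ``html_text.extract_text``, and a real pipeline would more likely use a dedicated vectorizer from a machine learning library.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens (a hypothetical helper)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Stand-in for text returned by html_text.extract_text.
extracted = "Hello\n\nworld!"
features = bag_of_words(extracted)
print(features)
```

Because the extracted text carries no markup, the tokenizer does not need to know anything about HTML; that separation is the main point of extracting text first.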
118126
119127----
120128