Commit 80289f1

Merge pull request #15 from TeamHG-Memex/remove-parsel-dependency
Remove parsel dependency
2 parents b5cd26a + d8fa17a commit 80289f1

8 files changed: 92 additions, 59 deletions

.travis.yml

Lines changed: 4 additions & 0 deletions
@@ -9,10 +9,14 @@ matrix:
   include:
     - python: 2.7
       env: TOXENV=py27
+    - python: 2.7
+      env: TOXENV=py27-parsel
     - python: 3.5
       env: TOXENV=py35
     - python: 3.6
       env: TOXENV=py36
+    - python: 3.6
+      env: TOXENV=py36-parsel
     - python: 3.7
       env: TOXENV=py37
       dist: xenial

README.rst

Lines changed: 38 additions & 30 deletions
@@ -17,29 +17,19 @@ HTML to Text
 
 Extract text from HTML
 
-
 * Free software: MIT license
 
-
 How is html_text different from ``.xpath('//text()')`` from LXML
 or ``.get_text()`` from Beautiful Soup?
-Text extracted with ``html_text`` does not contain inline styles,
-javascript, comments and other text that is not normally visible to the users.
-It normalizes whitespace, but is also smarter than
-``.xpath('normalize-space())``, adding spaces around inline elements
-(which are often used as block elements in html markup),
-tries to avoid adding extra spaces for punctuation and
-can add newlines so that the output text looks like how it is rendered in
-browsers.
-
-Apart from just getting text from the page (e.g. for display or search),
-one intended usage of this library is for machine learning (feature extraction).
-If you want to use the text of the html page as a feature (e.g. for classification),
-this library gives you plain text that you can later feed into a standard text
-classification pipeline.
-If you feel that you need html structure as well, check out
-`webstruct <http://webstruct.readthedocs.io/en/latest/>`_ library.
 
+* Text extracted with ``html_text`` does not contain inline styles,
+  javascript, comments and other text that is not normally visible to users;
+* ``html_text`` normalizes whitespace, but in a way smarter than
+  ``.xpath('normalize-space())``, adding spaces around inline elements
+  (which are often used as block elements in html markup), and trying to
+  avoid adding extra spaces for punctuation;
+* ``html-text`` can add newlines (e.g. after headers or paragraphs), so
+  that the output text looks more like how it is rendered in browsers.
 
 Install
 -------
@@ -48,7 +38,7 @@ Install with pip::
 
     pip install html-text
 
-The package depends on lxml, so you might need to install some additional
+The package depends on lxml, so you might need to install additional
 packages: http://lxml.de/installation.html
 
 
@@ -64,31 +54,46 @@ Extract text from HTML::
     >>> html_text.extract_text('<h1>Hello</h1> world!', guess_layout=False)
     'Hello world!'
 
+Passed html is first cleaned from invisible non-text content such
+as styles, and then text is extracted.
 
-
-You can also pass already parsed ``lxml.html.HtmlElement``:
+You can also pass an already parsed ``lxml.html.HtmlElement``:
 
     >>> import html_text
     >>> tree = html_text.parse_html('<h1>Hello</h1> world!')
     >>> html_text.extract_text(tree)
     'Hello\n\nworld!'
 
-Or define a selector to extract text only from specific elements:
+If you want, you can handle cleaning manually; use lower-level
+``html_text.etree_to_text`` in this case:
+
+    >>> import html_text
+    >>> tree = html_text.parse_html('<h1>Hello<style>.foo{}</style>!</h1>')
+    >>> cleaned_tree = html_text.cleaner.clean_html(tree)
+    >>> html_text.etree_to_text(cleaned_tree)
+    'Hello!'
+
+parsel.Selector objects are also supported; you can define
+a parsel.Selector to extract text only from specific elements:
 
     >>> import html_text
     >>> sel = html_text.cleaned_selector('<h1>Hello</h1> world!')
     >>> subsel = sel.xpath('//h1')
     >>> html_text.selector_to_text(subsel)
     'Hello'
 
-Passed html will be first cleaned from invisible non-text content such
-as styles, and then text would be extracted.
-NB Selectors are not cleaned automatically you need to call
+NB parsel.Selector objects are not cleaned automatically, you need to call
 ``html_text.cleaned_selector`` first.
 
-Main functions:
+Main functions and objects:
 
 * ``html_text.extract_text`` accepts html and returns extracted text.
+* ``html_text.etree_to_text`` accepts parsed lxml Element and returns
+  extracted text; it is a lower-level function, cleaning is not handled
+  here.
+* ``html_text.cleaner`` is an ``lxml.html.clean.Cleaner`` instance which
+  can be used with ``html_text.etree_to_text``; its options are tuned for
+  speed and text extraction quality.
 * ``html_text.cleaned_selector`` accepts html as text or as
   ``lxml.html.HtmlElement``, and returns cleaned ``parsel.Selector``.
 * ``html_text.selector_to_text`` accepts ``parsel.Selector`` and returns
@@ -111,10 +116,13 @@ after ``<div>`` tags:
     ...                        newline_tags=newline_tags)
     'Hello world!'
 
-Credits
--------
-
-The code is extracted from utilities used in several projects, written by Mikhail Korobov.
+Apart from just getting text from the page (e.g. for display or search),
+one intended usage of this library is for machine learning (feature extraction).
+If you want to use the text of the html page as a feature (e.g. for classification),
+this library gives you plain text that you can later feed into a standard text
+classification pipeline.
+If you feel that you need html structure as well, check out
+`webstruct <http://webstruct.readthedocs.io/en/latest/>`_ library.
 
 ----

codecov.yml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+comment:
+  layout: "header, diff, tree"
+
+coverage:
+  status:
+    project: false

html_text/__init__.py

Lines changed: 3 additions & 2 deletions
@@ -1,5 +1,6 @@
 # -*- coding: utf-8 -*-
 __version__ = '0.4.1'
 
-from .html_text import (extract_text, parse_html, cleaned_selector,
-                        selector_to_text, NEWLINE_TAGS, DOUBLE_NEWLINE_TAGS)
+from .html_text import (etree_to_text, extract_text, selector_to_text,
+                        parse_html, cleaned_selector, cleaner,
+                        NEWLINE_TAGS, DOUBLE_NEWLINE_TAGS)

html_text/html_text.py

Lines changed: 27 additions & 15 deletions
@@ -4,8 +4,6 @@
 import lxml
 import lxml.etree
 from lxml.html.clean import Cleaner
-import parsel
-from parsel.selector import create_root_node
 
 
 NEWLINE_TAGS = frozenset([
@@ -18,7 +16,7 @@
     'p', 'pre', 'title', 'ul'
 ])
 
-_clean_html = Cleaner(
+cleaner = Cleaner(
     scripts=True,
     javascript=False,  # onclick attributes are fine
     comments=True,
@@ -33,21 +31,27 @@
     annoying_tags=False,
     remove_unknown_tags=False,
     safe_attrs_only=False,
-).clean_html
+)
 
 
 def _cleaned_html_tree(html):
     if isinstance(html, lxml.html.HtmlElement):
         tree = html
     else:
         tree = parse_html(html)
-    return _clean_html(tree)
+    return cleaner.clean_html(tree)
 
 
 def parse_html(html):
     """ Create an lxml.html.HtmlElement from a string with html.
+    XXX: mostly copy-pasted from parsel.selector.create_root_node
     """
-    return create_root_node(html, lxml.html.HTMLParser)
+    body = html.strip().replace('\x00', '').encode('utf8') or b'<html/>'
+    parser = lxml.html.HTMLParser(recover=True, encoding='utf8')
+    root = lxml.etree.fromstring(body, parser=parser)
+    if root is None:
+        root = lxml.etree.fromstring(b'<html/>', parser=parser)
+    return root
 
 
 _whitespace = re.compile(r'\s+')
@@ -60,15 +64,18 @@ def _normalize_whitespace(text):
     return _whitespace.sub(' ', text.strip())
 
 
-def _html_to_text(tree,
+def etree_to_text(tree,
                   guess_punct_space=True,
                   guess_layout=True,
                   newline_tags=NEWLINE_TAGS,
                   double_newline_tags=DOUBLE_NEWLINE_TAGS):
     """
-    Convert a cleaned html tree to text.
-    See html_text.extract_text docstring for description of the approach
-    and options.
+    Convert a html tree to text. Tree should be cleaned with
+    ``html_text.html_text.cleaner.clean_html`` before passing to this
+    function.
+
+    See html_text.extract_text docstring for description of the
+    approach and options.
     """
     chunks = []
 
@@ -131,31 +138,33 @@ def traverse_text_fragments(tree, context, handle_tail=True):
 
 
 def selector_to_text(sel, guess_punct_space=True, guess_layout=True):
-    """ Convert a cleaned selector to text.
+    """ Convert a cleaned parsel.Selector to text.
     See html_text.extract_text docstring for description of the approach
     and options.
     """
+    import parsel
     if isinstance(sel, parsel.SelectorList):
         # if selecting a specific xpath
         text = []
         for s in sel:
-            extracted = _html_to_text(
+            extracted = etree_to_text(
                 s.root,
                 guess_punct_space=guess_punct_space,
                 guess_layout=guess_layout)
             if extracted:
                 text.append(extracted)
         return ' '.join(text)
     else:
-        return _html_to_text(
+        return etree_to_text(
             sel.root,
             guess_punct_space=guess_punct_space,
             guess_layout=guess_layout)
 
 
 def cleaned_selector(html):
-    """ Clean selector.
+    """ Clean parsel.selector.
     """
+    import parsel
     try:
         tree = _cleaned_html_tree(html)
         sel = parsel.Selector(root=tree, type='html')
@@ -183,6 +192,9 @@ def extract_text(html,
 
     html should be a unicode string or an already parsed lxml.html element.
 
+    ``html_text.etree_to_text`` is a lower-level function which only accepts
+    an already parsed lxml.html Element, and is not doing html cleaning itself.
+
     When guess_punct_space is True (default), no extra whitespace is added
     for punctuation. This has a slight (around 10%) performance overhead
     and is just a heuristic.
@@ -198,7 +210,7 @@ def extract_text(html,
     if html is None:
         return ''
     cleaned = _cleaned_html_tree(html)
-    return _html_to_text(
+    return etree_to_text(
         cleaned,
         guess_punct_space=guess_punct_space,
         guess_layout=guess_layout,
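
The input handling at the top of the new `parse_html` can be tried without lxml. The following is a minimal sketch, assuming a hypothetical helper `prepare_body` (not part of html_text) that isolates the first statement of the new function body:

```python
def prepare_body(html):
    """Hypothetical helper mirroring the first statement of the new parse_html.

    Strips surrounding whitespace, drops NUL characters, encodes the result
    as UTF-8, and falls back to a minimal document when nothing is left.
    """
    return html.strip().replace('\x00', '').encode('utf8') or b'<html/>'


print(prepare_body('  <h1>Hello</h1>\n'))  # b'<h1>Hello</h1>'
print(prepare_body('a\x00b'))              # b'ab'
print(prepare_body('   '))                 # b'<html/>'
```

The `or b'<html/>'` fallback only covers input that is empty after stripping. The separate `if root is None` check in the diff appears to cover input that is non-empty but produces no root element, which is the case the new `test_comment` exercises.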

setup.py

Lines changed: 2 additions & 10 deletions
@@ -9,14 +9,6 @@
 with open('CHANGES.rst') as history_file:
     history = history_file.read()
 
-requirements = [
-    'lxml',
-    'parsel',
-]
-
-test_requirements = [
-    'pytest',
-]
 
 setup(
     name='html_text',
@@ -28,7 +20,7 @@
     url='https://github.com/TeamHG-Memex/html-text',
     packages=['html_text'],
     include_package_data=True,
-    install_requires=requirements,
+    install_requires=['lxml'],
     license="MIT license",
     zip_safe=False,
     classifiers=[
@@ -44,5 +36,5 @@
         'Programming Language :: Python :: 3.7',
     ],
     test_suite='tests',
-    tests_require=test_requirements
+    tests_require=['pytest'],
 )
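
With parsel dropped from `install_requires`, the diff moves `import parsel` into the bodies of `selector_to_text` and `cleaned_selector`, so importing `html_text` itself no longer needs parsel. The general pattern can be sketched as below. `load_optional` and the module names used here are illustrative assumptions, not part of html_text, and the sketch uses Python 3 syntax:

```python
import importlib


def load_optional(name):
    """Import an optional dependency at call time (hypothetical helper).

    The ImportError is raised only when a feature that needs the module
    is actually used, not when the surrounding package is imported.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            "%s is required for this feature; install it separately" % name
        ) from exc


json_mod = load_optional("json")      # stdlib module, always importable
print(json_mod.dumps({"ok": True}))   # {"ok": true}
```

The diff itself uses the plainer form, a bare `import parsel` as the first statement of each function that needs it, which gives the same deferred failure with the default error message.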

tests/test_html_text.py

Lines changed: 10 additions & 1 deletion
@@ -6,7 +6,8 @@
 import pytest
 
 from html_text import (extract_text, parse_html, cleaned_selector,
-                       selector_to_text, NEWLINE_TAGS, DOUBLE_NEWLINE_TAGS)
+                       etree_to_text, cleaner, selector_to_text, NEWLINE_TAGS,
+                       DOUBLE_NEWLINE_TAGS)
 
 
 ROOT = os.path.dirname(os.path.abspath(__file__))
@@ -48,6 +49,10 @@ def test_empty(all_options):
     assert extract_text(None, **all_options) == ''
 
 
+def test_comment(all_options):
+    assert extract_text(u"<!-- hello world -->", **all_options) == ''
+
+
 def test_extract_text_from_tree(all_options):
     html = (u'<html><style>.div {}</style>'
             '<body><p>Hello, world!</body></html>')
@@ -96,6 +101,7 @@ def test_bad_punct_whitespace():
 
 
 def test_selectors(all_options):
+    pytest.importorskip("parsel")
     html = (u'<span><span id="extract-me">text<a>more</a>'
             '</span>and more text <a> and some more</a> <a></a> </span>')
     # Selector
@@ -184,3 +190,6 @@ def test_webpages(page, extracted):
         html = html.replace('&nbsp;', ' ')
     expected = _load_file(extracted)
     assert extract_text(html) == expected
+
+    tree = cleaner.clean_html(parse_html(html))
+    assert etree_to_text(tree) == expected
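
The `pytest.importorskip("parsel")` call added to `test_selectors` skips that test when parsel is not installed, which is what lets the plain `py27`/`py35`/`py36`/`py37` environments pass without it. A rough standard-library analogue of the same idea is sketched below. The class and test names are hypothetical, and this is not how the project's test suite is written:

```python
import importlib.util
import unittest

# True only when the optional package can be found in this environment.
HAS_PARSEL = importlib.util.find_spec("parsel") is not None


class SelectorTests(unittest.TestCase):
    @unittest.skipUnless(HAS_PARSEL, "parsel is not installed")
    def test_needs_parsel(self):
        # Imported inside the test, so collection works without parsel.
        import parsel
        self.assertTrue(hasattr(parsel, "Selector"))

    def test_always_runs(self):
        self.assertEqual("Hello world!".split(), ["Hello", "world!"])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SelectorTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(len(result.skipped))  # 1 without parsel installed, 0 with it
```

Either way, the optional test is reported as skipped rather than failed, so one test file serves both the parsel and non-parsel tox environments.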

tox.ini

Lines changed: 2 additions & 1 deletion
@@ -1,10 +1,11 @@
 [tox]
-envlist = py27,py35,py36,py37
+envlist = py27,py35,py36,py37,{py27,py36}-parsel
 
 [testenv]
 deps =
     pytest
     pytest-cov
+    {py27,py36}-parsel: parsel
 
 commands =
     pip install -U pip
