This repository was archived by the owner on Jul 11, 2023. It is now read-only.

Commit 3b06de8

Add support for raw_html extraction in html parser (#341)
* Add support for raw_html extraction in html parser
* Adhere better to the tabulator standard
1 parent 609563f commit 3b06de8

4 files changed (+18 -9 lines changed)

README.md

Lines changed: 3 additions & 1 deletion
@@ -668,13 +668,15 @@ Supports simple tables (no merged cells) with any legal combination of the td, t
 Usually `format='html'` would need to be specified explicitly as web URLs don't always use the `.html` extension.

 ```python
-stream = Stream('http://example.com/some/page.aspx', format='html', selector='.content .data table#id1')
+stream = Stream('http://example.com/some/page.aspx', format='html', selector='.content .data table#id1', raw_html=True)
 ```

 **Options**

 - **selector**: CSS selector for specifying which `table` element to extract. By default it's `table`, which takes the first `table` element in the document. If empty, will assume the entire page is the table to be extracted (useful with some Excel formats).

+- **raw_html**: False (default) to extract the textual contents of each cell. True to return the inner html without modification.
+
 ### Custom file sources and formats

 Tabulator is written with extensibility in mind, allowing you to add support for
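
Reviewer note: the difference between the two `raw_html` modes documented in this hunk can be shown without tabulator installed. The sketch below is not the library's implementation (the parser in this commit uses pyquery); `CellExtractor` and `extract` are hypothetical helpers built on the standard library's `html.parser`, and they only handle simple tables with well-formed, non-void tags inside cells.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect table cells row by row (hypothetical helper, not tabulator code).

    raw_html=False keeps only the text inside each cell;
    raw_html=True also keeps the tags nested inside the cell.
    """

    def __init__(self, raw_html=False):
        super().__init__(convert_charrefs=True)
        self.raw_html = raw_html
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text/markup fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._cell = []
        elif self._cell is not None and self.raw_html:
            # Nested tag inside a cell: keep it as written, raw mode only
            self._cell.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            if self._cell is not None and self._row is not None:
                self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr':
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
        elif self._cell is not None and self.raw_html:
            self._cell.append('</%s>' % tag)

    def handle_data(self, data):
        # Text is kept in both modes, but only while inside a cell
        if self._cell is not None:
            self._cell.append(data)


def extract(html, raw_html=False):
    """Return the table cells as a list of rows (lists of strings)."""
    parser = CellExtractor(raw_html=raw_html)
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = """
<table>
  <tr><th>id</th><th>name</th></tr>
  <tr><td>1</td><td><b>english</b></td></tr>
</table>
"""

print(extract(SAMPLE))                 # second cell of row 2 is 'english'
print(extract(SAMPLE, raw_html=True))  # second cell of row 2 is '<b>english</b>'
```

The sample mirrors the `<b>english</b>` cell that this commit adds to `data/table3.html`.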

data/table3.html

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@
 </tr>
 <tr>
 <td>1</td>
-<td>english</td>
+<td><b>english</b></td>
 </tr>
 <tr>
 <td>2</td>

tabulator/parsers/html.py

Lines changed: 6 additions & 7 deletions
@@ -19,15 +19,17 @@ class HTMLTableParser(Parser):

     options = [
         'selector',
+        'raw_html'
     ]

-    def __init__(self, loader, force_parse=False, selector='table'):
+    def __init__(self, loader, force_parse=False, selector='table', raw_html=False):
         self.__loader = loader
         self.__selector = selector
         self.__force_parse = force_parse
         self.__extended_rows = None
         self.__encoding = None
         self.__chars = None
+        self.__extractor = (lambda x: x.html()) if raw_html else (lambda x: x.text())

     @property
     def closed(self):
@@ -78,14 +80,11 @@ def __iter_extended_rows(self):
             table.children('tbody').children('tr')
         )
         rows = [pq(r) for r in rows if len(r) > 0]
-        first_row = rows.pop(0)
-        headers = [pq(th).text() for th in first_row.find('th,td')]
-
         # Extract rows
-        rows = [pq(tr).find('td') for tr in rows]
-        rows = [[pq(td).text() for td in tr]
+        rows = [pq(tr).children('td,th') for tr in rows]
+        rows = [[self.__extractor(pq(td)) for td in tr]
                 for tr in rows if len(tr) > 0]

         # Yield rows
         for row_number, row in enumerate(rows, start=1):
-            yield (row_number, headers, row)
+            yield (row_number, None, row)
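
Reviewer note: after this hunk the parser no longer pops the first row to build `headers`. It emits every row, header row included, as `(row_number, None, row)`, which appears to leave header detection to the stream (the new test passes `headers=1`). The sketch below illustrates that division of labour under that assumption; `iter_extended_rows` and `read_keyed` are hypothetical stand-ins and do not reproduce tabulator's `Stream` logic.

```python
def iter_extended_rows(rows):
    """Yield (row_number, headers, row) tuples with headers left as None."""
    for row_number, row in enumerate(rows, start=1):
        yield (row_number, None, row)


def read_keyed(extended_rows, headers_row=1):
    """Hypothetical consumer: take headers from one row, key the rows after it."""
    headers = None
    keyed = []
    for row_number, _, row in extended_rows:
        if row_number < headers_row:
            continue  # rows before the header row are skipped
        if row_number == headers_row:
            headers = row
            continue
        keyed.append(dict(zip(headers, row)))
    return headers, keyed


# Cell values as the parser would produce them with raw_html=True
table = [
    ['id', 'name'],
    ['1', '<b>english</b>'],
    ['2', '中国人'],
]

headers, records = read_keyed(iter_extended_rows(table), headers_row=1)
print(headers)
print(records)
```

Because the header row travels through the same extraction path as data rows, it is also subject to the `raw_html` setting.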

tests/formats/test_html.py

Lines changed: 8 additions & 0 deletions
@@ -26,3 +26,11 @@ def test_stream_html(source, selector):
             {'id': '1', 'name': 'english'},
             {'id': '2', 'name': '中国人'}]

+def test_stream_html_raw_html():
+    with Stream('data/table3.html', selector='.mememe', headers=1, encoding='utf8', raw_html=True) as stream:
+        assert stream.headers == ['id', 'name']
+        assert stream.read(keyed=True) == [
+            {'id': '1', 'name': '<b>english</b>'},
+            {'id': '2', 'name': '中国人'}]
+
+
