
HTML parse error #16

@ciciaip


Hi, I am using wikiprep-esa to process a Wikiprep dump in Zemanta format, but I encountered an error in scanData.py when it processed the page "Tying (commerce)":
  File "scanData.py", line 372, in <module>
    recordArticle(doc)
  File "scanData.py", line 318, in recordArticle
    t = html.fromstring(text)
  File "D:\Python\lib\site-packages\lxml-3.4.0-py2.7-win32.egg\lxml\html\__init__.py", line 723, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "D:\Python\lib\site-packages\lxml-3.4.0-py2.7-win32.egg\lxml\html\__init__.py", line 616, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty

I checked the "text" of this page carefully and found that it really does contain Unicode content, so it is evidently lxml that cannot parse the "text" of this page. Part of the "text" is listed below:

<!-- Part of WikiProject Law. Most of this is ripped off from [[Template:Intellectual prop
- collusion is the sale of hot nuts - Eoin "balls" Devlin
! style="padding: 0 7px 0 7px; background:#00FA9A" align="center"
-
! style=" font-size: 95%; padding: 0 7px 0 7px; background:#98FB98" align="center"
-
style=" font-size: 90%; padding: 0 5px 0 5px; text-align: left;"
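
One possible workaround, given here only as a minimal sketch: catch the ParserError around the html.fromstring call and skip the page. The helper name parse_or_none is hypothetical and is not part of scanData.py. The sketch assumes that skipping an unparseable page is acceptable, and it does not explain why lxml treats this page as empty.

```python
from lxml import etree, html


def parse_or_none(text):
    """Parse `text` with lxml.html, returning None instead of raising
    when lxml reports that the document is empty.

    Hypothetical helper for illustration; not part of scanData.py.
    """
    # Blank input can never yield a document, so skip the parser entirely.
    if not text or not text.strip():
        return None
    try:
        return html.fromstring(text)
    except etree.ParserError:
        # lxml raises ParserError("Document is empty") when no root
        # element could be built from the input.
        return None


if __name__ == "__main__":
    element = parse_or_none("<p>hi</p>")
    print(element.tag if element is not None else "skipped")
    print(parse_or_none("") is None)
```

A caller such as recordArticle could then check for None and log the page title before moving on, rather than aborting the whole run.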
