HTML parse error

Hi, I am using wikiprep-esa to process a Wikiprep dump in Zemanta format, but scanData.py fails with an error when it reaches the page "Tying (commerce)":
  File "scanData.py", line 372, in <module>
    recordArticle(doc)
  File "scanData.py", line 318, in recordArticle
    t = html.fromstring(text)
  File "D:\Python\lib\site-packages\lxml-3.4.0-py2.7-win32.egg\lxml\html\__init__.py", line 723, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "D:\Python\lib\site-packages\lxml-3.4.0-py2.7-win32.egg\lxml\html\__init__.py", line 616, in document_fromstring
    "Document is empty")
lxml.etree.ParserError: Document is empty
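
The traceback ends in `html.fromstring(text)` raising `lxml.etree.ParserError`. A minimal sketch of guarding that call, so that a single unparsable page can be skipped rather than stopping the whole run, might look like the following. The helper name `parse_fragment` is hypothetical and is not part of scanData.py or lxml; only `lxml.html.fromstring` and `lxml.etree.ParserError` come from the traceback.

```python
from lxml import etree, html


def parse_fragment(text):
    """Return the parsed lxml element, or None if no document can be built.

    Hypothetical helper for illustration; not part of scanData.py.
    """
    # Empty or whitespace-only input cannot produce a document, so skip it early.
    if not text or not text.strip():
        return None
    try:
        return html.fromstring(text)
    except etree.ParserError:
        # This is the exception type shown in the traceback ("Document is empty").
        return None
```

A caller would then check for `None` and skip or log the page title. This only works around the failure; it does not explain why the page's text yields an empty document.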

I checked the "text" of this page carefully and confirmed that it does contain Unicode content, so it seems that lxml is unable to parse this page's "text". Part of the "text" is listed below:
<text>
&lt;!-- Part of WikiProject Law.  Most of this is ripped off from [[Template:Intellectual prop
|-
|<w namespace="File" title="Scale of justice 2.svg"></w>
|- collusion is the sale of hot nuts - Eoin &quot;balls&quot; Devlin
! style=&quot;padding: 0 7px 0 7px; background:#00FA9A&quot; align=&quot;center&quot; | <a id="666256">Competition law</a>
|-
! style=&quot; font-size: 95%; padding: 0 7px 0 7px; background:#98FB98&quot; align=&quot;center&quot; | Basic concepts 
|-
| style=&quot; font-size: 90%; padding: 0 5px 0 5px; text-align: left;&quot; |
- <a id="12870157">History of competition law</a>
- <a id="18878">Monopoly</a>
  *\* <a id="397654">Coercive monopoly</a>
  *\* <a id="21143">Natural monopoly</a>
  ......
The first part of the text contains many anchor elements, which is a bit different from the usual page text. I can't tell whether this is the cause of the problem. Have you encountered this kind of problem before? I'm really confused by it.
Best regards!
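
One hypothesis worth checking, which nothing in this report confirms: the excerpt begins with an HTML comment (`<!--` once the `&lt;` entity is unescaped), and if that comment is never closed, an HTML parser may treat everything after it as comment text and be left with no elements. The sketch below uses only the standard library and assumes `text` has already been unescaped; the function name `find_unclosed_comment` is hypothetical.

```python
def find_unclosed_comment(text):
    """Return the index of the first '<!--' that has no later '-->', or -1."""
    pos = 0
    while True:
        start = text.find("<!--", pos)
        if start == -1:
            return -1  # no further comment openers
        end = text.find("-->", start + 4)
        if end == -1:
            return start  # this opener is never closed
        pos = end + 3  # continue scanning after the closed comment
```

If this returns a non-negative index for the "Tying (commerce)" text, closing or stripping that comment before parsing would be a reasonable next experiment.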



HTML parse error #16
