Bill 1900145 Parsing Error Due to Crawling Error #34

@hunkim

Running html2json failed with an error, so I looked into what happened:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/home/ubuntu/crawlers/bills/specific/html2json.py", line 242, in parse_page
    d = extract_specifics(assembly_id, bill_id, meta)
  File "/home/ubuntu/crawlers/bills/specific/html2json.py", line 166, in extract_specifics
    table       = utils.get_elems(page, X['spec_table'])[1]
IndexError: list index out of range
<Greenlet at 0x7f27e79417d0: parse_page(19, '1900145',        bill_id  status                            , u'./json/19')> failed with IndexError

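It looks like an error occurred while downloading the file sources/specifics/19/1900145.html. The saved file contains a server error page instead of the bill page: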

<SCRIPT LANGUAGE="javascript">
<!--
        function onLoad() {
                alert(document.all["MSG"].innerText);
        }
-->
</SCRIPT>

<HTML>
<BODY ONLOAD="javascript:onLoad()">
        <TEXTAREA ID="MSG" STYLE="display:none">[SQLException] Code[24757] Msg[ORA-24757: duplicate transaction identifier
ORA-02063: preceding line from NALAW_LINK
][Database error]</TEXTAREA>
</BODY>
</HTML>
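
One way the parser could recognize such a page before indexing into the table list is sketched below. This is only a sketch: it assumes the hidden `TEXTAREA` with `ID="MSG"` shown above is a reliable marker of a server error page, which has been observed for this one file only. The name `find_server_error` is hypothetical and is not part of html2json.py.

```python
import re

# Assumed marker: the hidden TEXTAREA with ID="MSG" seen in the saved file.
# Real bill pages are assumed not to contain such an element.
ERROR_TEXTAREA = re.compile(
    r'<textarea[^>]*\bid\s*=\s*["\']?MSG["\']?[^>]*>(.*?)</textarea>',
    re.IGNORECASE | re.DOTALL,
)


def find_server_error(html):
    """Return the server's error message if html looks like an error page, else None."""
    match = ERROR_TEXTAREA.search(html)
    if match is None:
        return None
    return match.group(1).strip()
```

A caller could check `find_server_error(page_source)` first and skip or re-queue the bill when it returns a message, rather than failing with `IndexError`.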

What should we do in a case like this? The server returned an SQL exception, so the crawler probably needs a way to download the page again when this happens.
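
A minimal sketch of such a re-download step follows, assuming the crawler can pass in its own download function and its own error-page check. `fetch_with_retry`, `fetch`, and `is_error_page` are hypothetical names, not functions from the existing crawler, and the retry count and delay are arbitrary defaults.

```python
import time


def fetch_with_retry(fetch, url, is_error_page, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) until the result is not an error page.

    fetch and is_error_page are callables supplied by the caller (hypothetical).
    Raises RuntimeError once all attempts fail, so a bad page is never saved.
    """
    for attempt in range(1, retries + 1):
        html = fetch(url)
        if not is_error_page(html):
            return html
        if attempt < retries:
            # Linear backoff: wait a little longer before each new attempt.
            sleep(delay * attempt)
    raise RuntimeError("still an error page after %d attempts: %s" % (retries, url))
```

For example, `is_error_page` could be as simple as `lambda html: '[SQLException]' in html`. Raising on repeated failure lets the crawler log the bill ID and move on without writing the error page to sources/specifics.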

