Collect the supplementary information ("부가정보") of bills #39

<img width="984" alt="screen shot 2015-12-03 at 7 31 30 pm" src="https://cloud.githubusercontent.com/assets/1205890/11558188/87749a26-99f4-11e5-82f6-361dcb6054ee.png">

[[image source]](http://likms.assembly.go.kr/bill/jsp/BillDetail.jsp?bill_id=PRC_Y1Z5A1I1C3U0J1M4T5N2W4F0W6L5U7)

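The bill crawler currently does not collect the "부가정보" (supplementary information) of a bill. For alternative bills (대안), this area lists the related bills, so we are missing very important information. To collect this data as well, the file that parses the HTML into JSON needs to be modified.
- Current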
  
  ``` python
  for i, r in enumerate(elem_row_contents):
      if row_titles[i]!='부가정보':  # handle every area (row) other than "부가정보"
          status_dict[row_titles[i]] = extract_row_contents(r)
      else:  # handle the "부가정보" (supplementary information) area
          t = r.xpath('span[@class="text8"]/text()')
          c = filter(None, (t.strip() for t in r.xpath('text()')))
          status_dict[row_titles[i]] = dict(zip(t, c))
  ```
- Improvement: in the snippet above, the xpath in [the part that handles the "부가정보" area](https://github.com/teampopong/crawlers/blob/master/bills/specific/html2json.py#L189-L191) is probably not matching as intended. Debugging it should not be very hard, but it does require carefully going through the HTML file again.

If you are not familiar with xpath, please see the following link: http://www.slideshare.net/lucypark/the-beginners-guide-to-54279917/49
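
As a starting point for debugging the "부가정보" row, here is a minimal sketch of an alternative way to pair labels with values. Instead of zipping two separate xpath results, it walks the row's children in order and attaches every piece of text to the most recent label. Everything here is an assumption for illustration: the function name `extract_labeled_values` is hypothetical, the sample markup is made up (only the `span` with class `text8` is taken from the current code), and the real page may well be structured differently. The demo uses the standard library's `xml.etree.ElementTree` so it runs on its own; lxml elements expose the same `.text`, `.tail`, `.get()` and `.itertext()` members.

``` python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the "부가정보" row.
# The real HTML has to be inspected; this only mimics label spans
# followed by plain text or by nested elements such as links.
SAMPLE_ROW = """
<td>
  <span class="text8">Label A</span> value one
  <span class="text8">Label B</span> <a href="#">linked value</a> trailing text
</td>
"""


def extract_labeled_values(row):
    """Pair each label span with all text that follows it, up to the next label."""
    result = {}
    label = None
    for child in row:
        is_label = child.tag == "span" and child.get("class") == "text8"
        if is_label:
            label = "".join(child.itertext()).strip()
            result[label] = []
            # Text sitting directly after the label span.
            pieces = [child.tail]
        else:
            # Text inside a nested element plus the text right after it.
            pieces = list(child.itertext()) + [child.tail]
        if label is not None:
            result[label].extend(p.strip() for p in pieces if p and p.strip())
    return {key: " ".join(values) for key, values in result.items()}


if __name__ == "__main__":
    row = ET.fromstring(SAMPLE_ROW)
    print(extract_labeled_values(row))
```

If the values on the real page turn out to live inside nested elements rather than in direct text nodes, `r.xpath('text()')` would miss them, which is one possible (unconfirmed) reason the current code comes back empty.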


