|
| 1 | +# Usage |
| 2 | + |
| 3 | +`chompjs` can be used in web scraping to turn JavaScript objects embedded in pages into valid Python dictionaries. |
| 4 | + |
| 5 | +```python |
| 6 | +>>> import chompjs |
| 7 | +>>> chompjs.parse_js_object('{"my_data": "test"}') |
| 8 | +{u'my_data': u'test'} |
| 9 | +``` |
| 10 | + |
| 11 | +Think of it as a more powerful `json.loads`. For example, it can handle JavaScript objects containing embedded methods by storing their code in a string: |
| 12 | + |
| 13 | +```python |
| 14 | +>>> import chompjs |
| 15 | +>>> js = """ |
| 16 | +... var myObj = { |
| 17 | +... myMethod: function(params) { |
| 18 | +... // ... |
| 19 | +... }, |
| 20 | +... myValue: 100 |
| 21 | +... } |
| 22 | +... """ |
| 23 | +>>> chompjs.parse_js_object(js, json_params={'strict': False}) |
| 24 | +{'myMethod': 'function(params) {\n // ...\n }', 'myValue': 100} |
| 25 | +``` |
| 26 | + |
| 27 | +An example usage with `scrapy`: |
| 28 | + |
| 29 | +```python |
| 30 | +import chompjs |
| 31 | +import scrapy |
| 32 | + |
| 33 | + |
| 34 | +class MySpider(scrapy.Spider): |
| 35 | + # ... |
| 36 | + |
| 37 | + def parse(self, response): |
| 38 | + script_css = 'script:contains("__NEXT_DATA__")::text' |
| 39 | + script_pattern = r'__NEXT_DATA__ = (.*);' |
| 40 | + # warning: for some pages you need to pass replace_entities=True |
| 41 | + # into re_first to have JSON escaped properly |
| 42 | + script_text = response.css(script_css).re_first(script_pattern) |
| 43 | + try: |
| 44 | + json_data = chompjs.parse_js_object(script_text) |
| 45 | + except ValueError: |
| 46 | + self.log('Failed to extract data from {}'.format(response.url)) |
| 47 | + return |
| 48 | + |
| 49 | + # work on json_data |
| 50 | +``` |
| 51 | + |
| 52 | +If the input string is not yet escaped and contains a lot of `\\` characters, then the `unicode_escape=True` argument might help to sanitize it: |
| 53 | + |
| 54 | +```python |
| 55 | +>>> chompjs.parse_js_object('{\\\"a\\\": 12}', unicode_escape=True) |
| 56 | +{u'a': 12} |
| 57 | +``` |
| 58 | + |
| 59 | +`jsonlines=True` can be used to parse JSON Lines: |
| 60 | + |
| 61 | +```python |
| 62 | +>>> chompjs.parse_js_object('[1,2]\n[2,3]\n[3,4]', jsonlines=True) |
| 63 | +[[1, 2], [2, 3], [3, 4]] |
| 64 | +``` |
| 65 | + |
| 66 | +By default, `chompjs` starts parsing at the first `{` or `[` character it finds and ignores everything before it: |
| 67 | + |
| 68 | +```python |
| 69 | +>>> chompjs.parse_js_object('<div>...</div><script>foo = [1, 2, 3];</script><div>...</div>') |
| 70 | +[1, 2, 3] |
| 71 | +``` |
| 72 | + |
| 73 | +The `json_params` argument can be used to pass options to the underlying `json.loads`, such as `strict`, `object_hook` or `parse_float`: |
| 74 | + |
| 75 | +```python |
| 76 | +>>> import decimal |
| 77 | +>>> import chompjs |
| 78 | +>>> chompjs.parse_js_object('[23.2]', json_params={'parse_float': decimal.Decimal}) |
| 79 | +[Decimal('23.2')] |
| 80 | +``` |
| 81 | + |
| 82 | +# Rationale |
| 83 | + |
| 84 | +In web scraping, data is often not present directly in the HTML, but is instead provided as an embedded JavaScript object that is later used to initialize the page, for example: |
| 85 | + |
| 86 | +```html |
| 87 | +<html> |
| 88 | +<head>...</head> |
| 89 | +<body> |
| 90 | +... |
| 91 | +<script type="text/javascript">window.__PRELOADED_STATE__={"foo": "bar"}</script> |
| 92 | +... |
| 93 | +</body> |
| 94 | +</html> |
| 95 | +``` |
| 96 | + |
| 97 | +The standard library function `json.loads` is usually sufficient to extract this data: |
| 98 | + |
| 99 | +```python |
| 100 | +>>> # scrapy shell file:///tmp/test.html |
| 101 | +>>> import json |
| 102 | +>>> script_text = response.css('script:contains(__PRELOADED_STATE__)::text').re_first('__PRELOADED_STATE__=(.*)') |
| 103 | +>>> json.loads(script_text) |
| 104 | +{u'foo': u'bar'} |
| 105 | + |
| 106 | +``` |
| 107 | +The problem is that not every valid JavaScript object is also valid JSON. For example, all of the following strings are valid JavaScript objects but not valid JSON: |
| 108 | + |
| 109 | +* `"{'a': 'b'}"` is not valid JSON because it uses the `'` character for quoting |
| 110 | +* `'{a: "b"}'` is not valid JSON because the property name is not quoted at all |
| 111 | +* `'{"a": [1, 2, 3,]}'` is not valid JSON because there is an extra `,` character at the end of the array |
| 112 | +* `'{"a": .99}'` is not valid JSON because the float value lacks a leading 0 |
| 113 | + |
| 114 | +As a result, `json.loads` fails to parse any of them: |
| 115 | + |
| 116 | +``` |
| 117 | +>>> json.loads("{'a': 'b'}") |
| 118 | +Traceback (most recent call last): |
| 119 | + File "<console>", line 1, in <module> |
| 120 | + File "/usr/lib/python2.7/json/__init__.py", line 339, in loads |
| 121 | + return _default_decoder.decode(s) |
| 122 | + File "/usr/lib/python2.7/json/decoder.py", line 364, in decode |
| 123 | + obj, end = self.raw_decode(s, idx=_w(s, 0).end()) |
| 124 | + File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode |
| 125 | + obj, end = self.scan_once(s, idx) |
| 126 | +ValueError: Expecting property name: line 1 column 2 (char 1) |
| 127 | +>>> json.loads('{a: "b"}') |
| 128 | +Traceback (most recent call last): |
| 129 | + File "<console>", line 1, in <module> |
| 130 | + File "/usr/lib/python2.7/json/__init__.py", line 339, in loads |
| 131 | + return _default_decoder.decode(s) |
| 132 | + File "/usr/lib/python2.7/json/decoder.py", line 364, in decode |
| 133 | + obj, end = self.raw_decode(s, idx=_w(s, 0).end()) |
| 134 | + File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode |
| 135 | + obj, end = self.scan_once(s, idx) |
| 136 | +ValueError: Expecting property name: line 1 column 2 (char 1) |
| 137 | +>>> json.loads('{"a": [1, 2, 3,]}') |
| 138 | +Traceback (most recent call last): |
| 139 | + File "<console>", line 1, in <module> |
| 140 | + File "/usr/lib/python2.7/json/__init__.py", line 339, in loads |
| 141 | + return _default_decoder.decode(s) |
| 142 | + File "/usr/lib/python2.7/json/decoder.py", line 364, in decode |
| 143 | + obj, end = self.raw_decode(s, idx=_w(s, 0).end()) |
| 144 | + File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode |
| 145 | + raise ValueError("No JSON object could be decoded") |
| 146 | +ValueError: No JSON object could be decoded |
| 147 | +>>> json.loads('{"a": .99}') |
| 148 | +Traceback (most recent call last): |
| 149 | + File "<stdin>", line 1, in <module> |
| 150 | + File "/usr/lib/python3.7/json/__init__.py", line 348, in loads |
| 151 | + return _default_decoder.decode(s) |
| 152 | + File "/usr/lib/python3.7/json/decoder.py", line 337, in decode |
| 153 | + obj, end = self.raw_decode(s, idx=_w(s, 0).end()) |
| 154 | + File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode |
| 155 | + raise JSONDecodeError("Expecting value", s, err.value) from None |
| 156 | +json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6) |
| 157 | + |
| 158 | +``` |
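The tracebacks above come from a mix of Python 2.7 and Python 3.7 sessions, so the exact messages differ between versions. As a minimal sketch that uses only the standard library (the helper name `is_valid_json` is introduced here for illustration and is not part of `chompjs`), the same four strings can be checked in one go on Python 3:

```python
import json

# The four strings from the list above: valid JavaScript, invalid JSON.
samples = [
    "{'a': 'b'}",         # single-quoted strings
    '{a: "b"}',           # unquoted property name
    '{"a": [1, 2, 3,]}',  # trailing comma
    '{"a": .99}',         # float without a leading zero
]


def is_valid_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Collect every sample that the standard library parser refuses.
rejected = [s for s in samples if not is_valid_json(s)]
print(len(rejected), "of", len(samples), "samples rejected")
```

On Python 3, `json.JSONDecodeError` is a subclass of `ValueError`, so code that catches `ValueError` also handles these failures.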
| 159 | +The `chompjs` library was designed to bypass this limitation: it parses such JavaScript objects into proper Python dictionaries: |
| 160 | + |
| 161 | +``` |
| 162 | +>>> import chompjs |
| 163 | +>>> |
| 164 | +>>> chompjs.parse_js_object("{'a': 'b'}") |
| 165 | +{u'a': u'b'} |
| 166 | +>>> chompjs.parse_js_object('{a: "b"}') |
| 167 | +{u'a': u'b'} |
| 168 | +>>> chompjs.parse_js_object('{"a": [1, 2, 3,]}') |
| 169 | +{u'a': [1, 2, 3]} |
| 170 | +``` |
| 171 | + |
| 172 | +Internally, `chompjs` uses a parser written in C to iterate over the raw string, fixing its issues along the way. The result is then passed to the standard library's `json.loads`, which keeps it fast compared to full-blown JavaScript parsers such as `demjson`. |
| 173 | + |
| 174 | +``` |
| 175 | +>>> import json |
| 176 | +>>> import _chompjs |
| 177 | +>>> |
| 178 | +>>> _chompjs.parse('{a: 1}') |
| 179 | +'{"a":1}' |
| 180 | +>>> json.loads(_) |
| 181 | +{u'a': 1} |
| 182 | +>>> chompjs.parse_js_object('{"a": .99}') |
| 183 | +{'a': 0.99} |
| 184 | +``` |
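The "fix the string, then hand it to `json.loads`" idea can be illustrated with a deliberately naive pure-Python sketch. This is not how `chompjs` is implemented (its parser is written in C); the function name `loads_lenient` is hypothetical, and the regular expressions below handle only trailing commas and missing leading zeros. They also do not distinguish string contents from structure, so they can corrupt strings that happen to contain similar patterns.

```python
import json
import re


def loads_lenient(text, **json_params):
    """Naive illustration: repair two JavaScript-isms, then call json.loads."""
    # Drop a trailing comma that directly precedes a closing ] or }.
    text = re.sub(r',\s*([\]}])', r'\1', text)
    # Add the missing leading zero to floats such as .99, leaving
    # ordinary numbers like 1.5 untouched.
    text = re.sub(r'(?<![\w.])\.(\d)', r'0.\1', text)
    return json.loads(text, **json_params)


print(loads_lenient('{"a": [1, 2, 3,]}'))
print(loads_lenient('{"a": .99}'))
```

A real parser has to track whether it is inside a string, a comment or a function body, which is why a regex-based approach like this does not generalize.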
| 185 | + |
| 186 | +# Installation |
| 187 | +From PIP: |
| 188 | + |
| 189 | +```bash |
| 190 | +$ python3 -m venv venv |
| 191 | +$ . venv/bin/activate |
| 192 | +$ pip install chompjs |
| 193 | +``` |
| 194 | +From sources: |
| 195 | +```bash |
| 196 | +$ git clone https://github.com/Nykakin/chompjs |
| 197 | +$ cd chompjs |
| 198 | +$ python setup.py build |
| 199 | +$ python setup.py install |
| 200 | +``` |
| 201 | + |
| 202 | +To run the unit tests: |
| 203 | + |
| 204 | +``` |
| 205 | +$ python -m unittest |
| 206 | +``` |