Commit 8d161b0

[script.module.chompjs@matrix] 1.0.0
1 parent 6b07e3c commit 8d161b0

12 files changed

+1014
-0
lines changed

script.module.chompjs/CHANGELOG

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
1+
1.1.9
2+
* Handle NaN in input (#37)
3+
4+
1.1.8
5+
* Fixed previous release (package couldn't be installed)
6+
7+
1.1.7
8+
* Handle unquoted properties starting with reserved JS keywords (#34)
9+
10+
1.1.6
11+
* Handle bug with parsing arrays like `["","/"]` (#33)
12+
13+
1.1.5
14+
* Correctly handle malformed quotations (#31)
15+
16+
1.1.4
17+
* Performance improvement (#19)
18+
* Handle numeric keys (#20)
19+
* Refactor error handling (#29)
20+
21+
1.1.3
22+
* Avoid an infinite loop on malformed input (#27)
23+
24+
1.1.2
25+
* Handle comments in JavaScript code (#22)
26+
27+
1.1.1
28+
* Fix installation bug (headers moved to a different dir)
29+
30+
1.1.0
31+
* Parser refactored and rewritten in order to simplify code and improve speed
32+
* Allow handling JavaScript functions and other strange stuff such as regexes (#16)
33+
* Allow passing down json.loads parameters
34+
* Allow handling hexadecimal, octal and binary literals (#12)
35+
36+
1.0.17
37+
* Handle memory corruption on unclosed quotations (#13)
38+
39+
1.0.16
40+
* Handle floats with leading zeros (#10)
41+
42+
1.0.15
43+
* Handle $ and _ characters at the beginning of keys (#9)
44+
45+
1.0.14
46+
* Handle "undefined" keyword in JavaScript objects (#7)
47+
48+
1.0.13
49+
* Handle escaped quotations correctly (#6)
50+
51+
1.0.12
52+
* Handle windows newlines (#5)
53+
54+
1.0.11
55+
* Handle jsonlines (#3)
56+
57+
1.0.1
58+
* Handle Unicode in keys (#2)

script.module.chompjs/LICENSE

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
1+
Copyright 2020 Mariusz Obajtek
2+
3+
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4+
5+
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6+
7+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

script.module.chompjs/MANIFEST.in

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
graft _chompjs

script.module.chompjs/README.md

Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
1+
# Usage
2+
3+
`chompjs` can be used in web scraping to turn JavaScript objects embedded in pages into valid Python dictionaries.
4+
5+
```python
6+
>>> import chompjs
7+
>>> chompjs.parse_js_object('{"my_data": "test"}')
8+
{u'my_data': u'test'}
9+
```
10+
11+
Think of it as a more powerful `json.loads`. For example, it can handle JavaScript objects containing embedded methods by storing their code as a string:
12+
13+
```python
14+
>>> import chompjs
15+
>>> js = """
16+
... var myObj = {
17+
... myMethod: function(params) {
18+
... // ...
19+
... },
20+
... myValue: 100
21+
... }
22+
... """
23+
>>> chompjs.parse_js_object(js, json_params={'strict': False})
24+
{'myMethod': 'function(params) {\n // ...\n }', 'myValue': 100}
25+
```
26+
27+
An example usage with `scrapy`:
28+
29+
```python
30+
import chompjs
31+
import scrapy
32+
33+
34+
class MySpider(scrapy.Spider):
35+
# ...
36+
37+
def parse(self, response):
38+
script_css = 'script:contains("__NEXT_DATA__")::text'
39+
script_pattern = r'__NEXT_DATA__ = (.*);'
40+
# warning: for some pages you need to pass replace_entities=True
41+
# into re_first to have JSON escaped properly
42+
script_text = response.css(script_css).re_first(script_pattern)
43+
try:
44+
json_data = chompjs.parse_js_object(script_text)
45+
except ValueError:
46+
self.log('Failed to extract data from {}'.format(response.url))
47+
return
48+
49+
# work on json_data
50+
```
51+
52+
If the input string has not yet been unescaped and contains a lot of `\\` characters, the `unicode_escape=True` argument might help to sanitize it:
53+
54+
```python
55+
>>> chompjs.parse_js_object('{\\\"a\\\": 12}', unicode_escape=True)
56+
{u'a': 12}
57+
```
58+
59+
`jsonlines=True` can be used to parse JSON Lines:
60+
61+
```python
62+
>>> chompjs.parse_js_object('[1,2]\n[2,3]\n[3,4]', jsonlines=True)
63+
[[1, 2], [2, 3], [3, 4]]
64+
```
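
For input that is already strict JSON, the same line-by-line idea can be sketched with the standard library alone. The helper name `parse_json_lines` is hypothetical and not part of `chompjs`; unlike `chompjs`, this sketch does not repair JavaScript-only syntax.

```python
import json


def parse_json_lines(text):
    """Parse strict JSON Lines: one JSON document per non-empty line."""
    results = []
    # splitlines() handles both Unix (\n) and Windows (\r\n) newlines
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            results.append(json.loads(line))
    return results


print(parse_json_lines('[1,2]\n[2,3]\n[3,4]'))
```

Each line is decoded independently, so a malformed line raises `ValueError` without any state carried over from the lines before it.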
65+
66+
By default, `chompjs` starts at the first `{` or `[` character it finds and ignores the rest:
67+
68+
```python
69+
>>> chompjs.parse_js_object('<div>...</div><script>foo = [1, 2, 3];</script><div>...</div>')
70+
[1, 2, 3]
71+
```
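
A rough standard-library analogue of this behaviour, for strict JSON only, is `json.JSONDecoder.raw_decode`, which decodes a value starting at a given index and ignores whatever follows it. The function name `decode_first_json_value` is hypothetical, and retrying later candidates after a failed decode is an assumption of this sketch, not documented `chompjs` behaviour.

```python
import json
import re


def decode_first_json_value(text):
    """Decode the first strict-JSON object or array embedded in text."""
    decoder = json.JSONDecoder()
    # Try every '{' or '[' as a candidate start until one decodes cleanly
    for match in re.finditer(r'[{\[]', text):
        try:
            value, _end = decoder.raw_decode(text, match.start())
            return value
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            continue
    raise ValueError('no JSON object or array found')


html = '<div>...</div><script>foo = [1, 2, 3];</script><div>...</div>'
print(decode_first_json_value(html))
```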
72+
73+
The `json_params` argument can be used to pass options to the underlying `json.loads`, such as `strict` or `object_hook`:
74+
75+
```python
76+
>>> import decimal
77+
>>> import chompjs
78+
>>> chompjs.parse_js_object('[23.2]', json_params={'parse_float': decimal.Decimal})
79+
[Decimal('23.2')]
80+
```
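
These options are ordinary `json.loads` keyword arguments, so their effect can be checked with the standard library alone. The sketch below uses only `json` and `decimal`; the variable names are illustrative.

```python
import decimal
import json

# parse_float: build Decimal objects instead of binary floats
prices = json.loads('[23.2]', parse_float=decimal.Decimal)

# object_hook: post-process every decoded JSON object (here: upper-case keys)
upper = json.loads(
    '{"a": {"b": 1}}',
    object_hook=lambda d: {k.upper(): v for k, v in d.items()},
)

# strict=False: accept raw control characters (such as newlines) inside strings
multiline = json.loads('["line1\nline2"]', strict=False)

print(prices, upper, multiline)
```

The `strict=False` case is the one relevant to embedded function bodies, since a multi-line function stored as a string contains raw newline characters.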
81+
82+
# Rationale
83+
84+
In web scraping, data is often not present directly in the HTML but is instead provided as an embedded JavaScript object that is later used to initialize the page, for example:
85+
86+
```html
87+
<html>
88+
<head>...</head>
89+
<body>
90+
...
91+
<script type="text/javascript">window.__PRELOADED_STATE__={"foo": "bar"}</script>
92+
...
93+
</body>
94+
</html>
95+
```
96+
97+
The standard library function `json.loads` is usually sufficient to extract this data:
98+
99+
```python
100+
>>> # scrapy shell file:///tmp/test.html
101+
>>> import json
102+
>>> script_text = response.css('script:contains(__PRELOADED_STATE__)::text').re_first('__PRELOADED_STATE__=(.*)')
103+
>>> json.loads(script_text)
104+
{u'foo': u'bar'}
105+
106+
```
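
Outside of a `scrapy` shell, the same extraction can be sketched with the standard library only. The function name `extract_preloaded_state` and the exact regular expression are assumptions made for this illustration; real pages may need a more careful pattern.

```python
import json
import re


def extract_preloaded_state(html):
    """Return the decoded __PRELOADED_STATE__ object, or None if absent."""
    # Capture everything between the assignment and the closing script tag
    match = re.search(r'__PRELOADED_STATE__\s*=\s*(.*?)</script>', html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


html = '''<html>
<body>
<script type="text/javascript">window.__PRELOADED_STATE__={"foo": "bar"}</script>
</body>
</html>'''

state = extract_preloaded_state(html)
print(state)
```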
107+
The problem is that not all valid JavaScript objects are also valid JSON. For example, all of the following strings are valid JavaScript objects but not valid JSON:
108+
109+
* `"{'a': 'b'}"` is not valid JSON because it uses the `'` character for quoting
110+
* `'{a: "b"}'` is not valid JSON because the property name is not quoted at all
111+
* `'{"a": [1, 2, 3,]}'` is not valid JSON because there is an extra `,` character at the end of the array
112+
* `'{"a": .99}'` is not valid JSON because the float value lacks a leading 0
113+
114+
As a result, `json.loads` fails to parse any of them:
115+
116+
```
117+
>>> json.loads("{'a': 'b'}")
118+
Traceback (most recent call last):
119+
File "<console>", line 1, in <module>
120+
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
121+
return _default_decoder.decode(s)
122+
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
123+
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
124+
File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
125+
obj, end = self.scan_once(s, idx)
126+
ValueError: Expecting property name: line 1 column 2 (char 1)
127+
>>> json.loads('{a: "b"}')
128+
Traceback (most recent call last):
129+
File "<console>", line 1, in <module>
130+
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
131+
return _default_decoder.decode(s)
132+
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
133+
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
134+
File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
135+
obj, end = self.scan_once(s, idx)
136+
ValueError: Expecting property name: line 1 column 2 (char 1)
137+
>>> json.loads('{"a": [1, 2, 3,]}')
138+
Traceback (most recent call last):
139+
File "<console>", line 1, in <module>
140+
File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
141+
return _default_decoder.decode(s)
142+
File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
143+
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
144+
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
145+
raise ValueError("No JSON object could be decoded")
146+
ValueError: No JSON object could be decoded
147+
>>> json.loads('{"a": .99}')
148+
Traceback (most recent call last):
149+
File "<stdin>", line 1, in <module>
150+
File "/usr/lib/python3.7/json/__init__.py", line 348, in loads
151+
return _default_decoder.decode(s)
152+
File "/usr/lib/python3.7/json/decoder.py", line 337, in decode
153+
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
154+
File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
155+
raise JSONDecodeError("Expecting value", s, err.value) from None
156+
json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)
157+
158+
```
159+
The `chompjs` library was designed to work around this limitation; it parses such JavaScript objects into proper Python dictionaries:
160+
161+
```
162+
>>> import chompjs
163+
>>>
164+
>>> chompjs.parse_js_object("{'a': 'b'}")
165+
{u'a': u'b'}
166+
>>> chompjs.parse_js_object('{a: "b"}')
167+
{u'a': u'b'}
168+
>>> chompjs.parse_js_object('{"a": [1, 2, 3,]}')
169+
{u'a': [1, 2, 3]}
170+
```
171+
172+
Internally, `chompjs` uses a parser written in C to iterate over the raw string, fixing its issues along the way. The final result is then passed down to the standard library's `json.loads`, which keeps it fast compared to full-blown JavaScript parsers such as `demjson`.
173+
174+
```
175+
>>> import json
176+
>>> import _chompjs
177+
>>>
178+
>>> _chompjs.parse('{a: 1}')
179+
'{"a":1}'
180+
>>> json.loads(_)
181+
{u'a': 1}
182+
>>> chompjs.parse_js_object('{"a": .99}')
183+
{'a': 0.99}
184+
```
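
The two-stage idea (rewrite the text into strict JSON, then hand it to `json.loads`) can be illustrated with a toy pure-Python version. This is not the `chompjs` algorithm, which is a character-by-character parser written in C; the function `toy_js_to_json` is hypothetical and its regular expressions are deliberately naive.

```python
import json
import re


def toy_js_to_json(text):
    """Naively rewrite a few JavaScript-only constructs into strict JSON.

    Deliberately simplistic: it can corrupt string values that happen
    to contain quotes, colons, commas, or dots in these patterns.
    """
    # 'single quoted' -> "double quoted"
    text = re.sub(r"'([^'\\]*)'", r'"\1"', text)
    # {a: 1} -> {"a": 1}: quote bare identifiers used as keys
    text = re.sub(r'([{,]\s*)([A-Za-z_$][\w$]*)(\s*:)', r'\1"\2"\3', text)
    # [1, 2, 3,] -> [1, 2, 3]: drop trailing commas
    text = re.sub(r',\s*([}\]])', r'\1', text)
    # .99 -> 0.99: add the missing leading zero
    text = re.sub(r'(?<![\w.])\.(\d)', r'0.\1', text)
    return text


for source in ("{'a': 'b'}", '{a: "b"}', '{"a": [1, 2, 3,]}', '{"a": .99}'):
    print(json.loads(toy_js_to_json(source)))
```

Regular expressions cannot tell whether a match sits inside a string literal, which is one reason a real tokenizing parser is the more robust choice.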
185+
186+
# Installation
187+
From PIP:
188+
189+
```bash
190+
$ python3 -m venv venv
191+
$ . venv/bin/activate
192+
$ pip install chompjs
193+
```
194+
From sources:
195+
```bash
196+
$ git clone https://github.com/Nykakin/chompjs
197+
$ cd chompjs
198+
$ python setup.py build
199+
$ python setup.py install
200+
```
201+
202+
To run the unit tests:
203+
204+
```
205+
$ python -m unittest
206+
```

script.module.chompjs/addon.xml

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
1+
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
2+
<addon id="script.module.chompjs"
3+
name="Chompjs"
4+
version="1.0.0"
5+
provider-name="Joaopa00 ([email protected])">
6+
<requires>
7+
<import addon="xbmc.python"
8+
version="3.0.0"/>
9+
</requires>
10+
<extension point="xbmc.python.module"
11+
library="lib" />
12+
<extension point="xbmc.addon.metadata">
13+
<summary lang="en_GB">Convert JavaScript objects into Python objects</summary>
14+
<description lang="en_GB">Chompjs can be used in web scraping to turn JavaScript objects embedded in pages into valid Python dictionaries.</description>
15+
<license>MIT</license>
16+
<platform>all</platform>
17+
<website>https://github.com/Nykakin/chompjs</website>
18+
<assets>
19+
<icon>resources/icon.png</icon>
20+
</assets>
21+
</extension>
22+
</addon>
23+
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
1+
/*
2+
* Copyright 2020-2021 Mariusz Obajtek. All rights reserved.
3+
* License: https://github.com/Nykakin/chompjs/blob/master/LICENSE
4+
*/
5+
6+
#include <string.h>
7+
#include <stdio.h>
8+
#include <stdlib.h>
9+
10+
#include "buffer.h"
11+
12+
void init_char_buffer(struct CharBuffer* buffer, size_t initial_depth_buffer_size) {
13+
buffer->data = malloc(initial_depth_buffer_size);
14+
buffer->memory_buffer_length = initial_depth_buffer_size;
15+
buffer->index = 0;
16+
}
17+
18+
void release_char_buffer(struct CharBuffer* buffer) {
19+
free(buffer->data);
20+
}
21+
22+
void push(struct CharBuffer* buffer, char value) {
23+
buffer->data[buffer->index] = value;
24+
buffer->index += 1;
25+
if(buffer->index >= buffer->memory_buffer_length) {
26+
buffer->data = realloc(buffer->data, 2*buffer->memory_buffer_length);
27+
buffer->memory_buffer_length *= 2;
28+
}
29+
}
30+
31+
void push_string(struct CharBuffer* buffer, const char* value, size_t len) {
32+
while(buffer->index + len >= buffer->memory_buffer_length) {
33+
buffer->data = realloc(buffer->data, 2*buffer->memory_buffer_length);
34+
buffer->memory_buffer_length *= 2;
35+
}
36+
memcpy(buffer->data + buffer->index, value, len);
37+
buffer->index += len;
38+
}
39+
40+
void pop(struct CharBuffer* buffer) {
41+
buffer->index -= 1;
42+
}
43+
44+
char top(struct CharBuffer* buffer) {
45+
return buffer->data[buffer->index-1];
46+
}
47+
48+
bool empty(struct CharBuffer* buffer) {
49+
return buffer->index <= 0;
50+
}
51+
52+
void clear(struct CharBuffer* buffer) {
53+
buffer->index = 0;
54+
}
55+
56+
size_t size(struct CharBuffer* buffer) {
57+
return buffer->index;
58+
}
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
/*
2+
* Copyright 2020-2021 Mariusz Obajtek. All rights reserved.
3+
* License: https://github.com/Nykakin/chompjs/blob/master/LICENSE
4+
*/
5+
6+
#ifndef CHOMPJS_BUFFER_H
7+
#define CHOMPJS_BUFFER_H
8+
9+
#include <stdbool.h>
10+
#include <stddef.h>
11+
12+
/**
13+
Implements a safe, dynamically growing char buffer
14+
*/
15+
struct CharBuffer {
16+
char* data;
17+
size_t memory_buffer_length;
18+
size_t index;
19+
};
20+
21+
void init_char_buffer(struct CharBuffer* buffer, size_t initial_depth_buffer_size);
22+
23+
void release_char_buffer(struct CharBuffer* buffer);
24+
25+
void push(struct CharBuffer* buffer, char value);
26+
27+
void push_string(struct CharBuffer* buffer, const char* value, size_t len);
28+
29+
void pop(struct CharBuffer* buffer);
30+
31+
char top(struct CharBuffer* buffer);
32+
33+
bool empty(struct CharBuffer* buffer);
34+
35+
void clear(struct CharBuffer* buffer);
36+
37+
size_t size(struct CharBuffer* buffer);
38+
39+
#endif
