Description
My pipeline detected 3 invalid JSON lines in the data. Here is one such record, together with the check that failed:
Check failed: reader_.parse(json_line, root, false) {"user_id": null, "user_text": "216.84.45.194", "timestamp": "2007-02-01T21:50:19Z", "authors": "[[null, "216.84.45.194"]]", "content": "* Thank you Beland for updating the page, and for making it clear that XML dumps are the way of the future.\n* Is there a pre-existing way that anyone knows of to load the XML file into MySQL '''without''' having to deal with MediaWiki? (What I and presumably most people want is to get the data into a database with minimum pain and as quickly as possible.)\n* Shouldn't this generate no errors?\nxmllint 20050909_pages_current.xml\nCurrently for me it generates errors like this:\n20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 55296\n[[got:\ud800\udf37\ud800\udf3b\ud800\udf30\ud800\udf39\ud800&\n^\n20050909_pages_current.xml:2771209: error: xmlParseCharRef: invalid xmlChar value 57158\n\ud800\udf37\ud800\udf3b\ud800\udf30\ud800\udf39\ud800\udf46\n^\nAll the best,\n", "parent_id": null, "replyTo_id": "104935801.2734.0", "indentation": 1, "type": "COMMENT_ADDING", "conversation_id": "104935801.2734.0", "page_title": "Wikipedia talk:Database download", "page_id": "83068", "id": "104935801.2750.0", "rev_id": "104935801"}
Here is the corresponding wiki page revision: https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=104935801
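It looks like the Unicode escaping might be the problem: most of the `\ud800\udf..` escapes in the content form valid surrogate pairs, but one `\ud800` is followed directly by `&`, which leaves a lone high surrogate with no low half.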
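
To check whether unpaired surrogate escapes account for the rejected lines, the raw text of each line can be scanned before parsing. The following is a minimal sketch, not part of the pipeline: the function name `find_lone_surrogates` is hypothetical, and it assumes each record is available as a raw string. It inspects the raw text rather than a parsed value because Python's `json` module accepts lone surrogates without complaint, so it would not reproduce the failure.

```python
import re

# Matches one JSON escape at a time: either \uXXXX (hex digits captured)
# or any other backslash escape such as \\ or \n.
_ESCAPE = re.compile(r'\\(?:u([0-9a-fA-F]{4})|.)')


def find_lone_surrogates(json_line):
    """Return (offset, escape) for every unpaired UTF-16 surrogate escape."""
    lone = []
    pending = None  # (offset, text, end) of a high surrogate awaiting its low half
    for m in _ESCAPE.finditer(json_line):
        code = int(m.group(1), 16) if m.group(1) else None
        is_high = code is not None and 0xD800 <= code <= 0xDBFF
        is_low = code is not None and 0xDC00 <= code <= 0xDFFF
        if pending is not None:
            if is_low and m.start() == pending[2]:
                pending = None  # valid pair: high immediately followed by low
                continue
            lone.append(pending[:2])  # high surrogate was never completed
            pending = None
        if is_high:
            pending = (m.start(), m.group(0), m.end())
        elif is_low:
            lone.append((m.start(), m.group(0)))  # low surrogate with no high before it
    if pending is not None:
        lone.append(pending[:2])
    return lone


if __name__ == "__main__":
    # Shortened stand-in for the content field of the record above.
    sample = r'{"content": "\ud800\udf37\ud800\udf39\ud800&"}'
    for offset, escape in find_lone_surrogates(sample):
        print(f"lone surrogate {escape} at offset {offset}")
```

Running this over the 3 rejected lines would show whether they all contain an unpaired surrogate. If some do not, the cause lies elsewhere in those records.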