-
Notifications
You must be signed in to change notification settings - Fork 180
Open
Labels
Description
When running the git backend to analyze the git repo of the torvalds/linux GitHub repository, and uploading it to ElasticSearch (using the Python elasticsearch module) i get an error which apparently is due to trying to UTF-8 encode a "surrogated" string. The relevant part of the debugging messages and the exception I get is:
...
'message': "[PATCH] intel8x0: AC'97 audio patch for Intel ESB2\n\nThis
patch adds the Intel ESB2 DID's to the intel8x0.c file for AC'97
audio\nsupport.\n\nSigned-off-by: \udca0Jason Gaston
<[email protected]>\nSigned-off-by: Andrew Morton
<[email protected]>\nSigned-off-by: Linus Torvalds <[email protected]>",
'commit': 'c4c8ea948aa21527d502e87227b2f1d951bc506d'}
...
Traceback (most recent call last):
...
File "/home/jgb/venvs/grimoirelab/lib/python3.5/site-packages/elasticsearch/transport.py", line 312, in perform_request
body = body.encode('utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udca0' in position 984: surrogates not allowed
The code causing the error is:
res = self.es.index(index = self.index, doc_type = self.type,
id = self._id(item), body = item)The problem seems to be that there is a character, decoded from the git log by Perceval into a Unicode string, which is using surrogates (the character is '\udca0'), and it cannot be properly encoded in UTF-8 by the elasticsearch code, thus raising the exception.