Closed
Great project!
I just ran into one small problem with text containing emojis. These are currently not encoded correctly by preprocess.py. For example, the input
Test 😀!
outputs the following JSON:
{"idx_to_token": {"1": "T", "2": "e", "3": "s", "4": "t", "5": " ", "6": "\ud83d", "7": "\ude00", "8": "!", "9": "\n"}, "token_to_idx": {"!": 8, " ": 5, "e": 2, "\ude00": 7, "\n": 9, "s": 3, "T": 1, "\ud83d": 6, "t": 4}}
As you can see, the emoji has been broken into two characters: \ud83d and \ude00. cjson throws an error when it attempts to decode this, since \ud83d on its own is a lone surrogate and not a valid Unicode character.
I prototyped a fix for Python 3.3+ based on this SO question and can submit a pull request for it, but it also requires updating print and other unrelated code for Python 3. I'm not sure what the proper fix is for Python 2.x.
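For reference, this is roughly the behaviour I would expect on Python 3.3+, where strings iterate by code point. It is only a minimal sketch and not the actual preprocess.py code: `build_vocab` is a hypothetical name, and the 1-based indexing in order of first appearance is inferred from the JSON output above.

```python
import json


def build_vocab(text):
    """Map each distinct code point to a 1-based index (hypothetical helper)."""
    token_to_idx = {}
    for ch in text:  # Python 3.3+ yields whole code points, never surrogate halves
        if ch not in token_to_idx:
            token_to_idx[ch] = len(token_to_idx) + 1
    idx_to_token = {str(idx): ch for ch, idx in token_to_idx.items()}
    return {"idx_to_token": idx_to_token, "token_to_idx": token_to_idx}


vocab = build_vocab("Test \U0001F600!\n")

# ensure_ascii=False writes the emoji as a literal character instead of
# an escaped surrogate pair.
encoded = json.dumps(vocab, ensure_ascii=False)
decoded = json.loads(encoded)
```

With this, the emoji is a single token, so the vocabulary has 8 entries instead of 9. When reading the input from a file, it would also need to be opened with an explicit encoding such as UTF-8.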