encoding issue '❤️' != '❤' #8231
A Unicode character seems to disappear when spaCy processes an emoji. Every '❤️' is turned into two tokens: '❤' followed by '' (what looks like an empty character). I tested 3 models and got the same result with each. '❤️' and '❤' share the same first Unicode character, but '❤️' has one more, and that is the one that disappears. Here is how to reproduce it:
Info about spaCy
Replies: 2 comments
Actually, the character that disappears is still present; it is just not visible. It is '\U0000fe0f'.
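To make the invisible character visible, here is a minimal sketch using only Python's standard `unicodedata` module. The helper name `describe` is made up for illustration and is not part of spaCy.

```python
import unicodedata

heart_emoji = "\u2764\ufe0f"  # '❤️': two code points
heart_text = "\u2764"         # '❤': one code point


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


print(describe(heart_emoji))
print(len(heart_emoji), len(heart_text))
print(heart_emoji == heart_text)
```

The second code point of '❤️' has no glyph of its own, which is why a token containing only that character prints like an empty string.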
This seems to be the correct behavior. U+FE0F (VARIATION SELECTOR-16) is an invisible character that requests that the preceding character be rendered as a colorful emoji instead of a dingbat.
https://codepoints.net/U+FE0F?lang=en
Whether it should be part of the same token by default is arguable, but splitting it into its own token is consistent with the way spaCy treats other emoji modifiers. You might want to look at spacymoji, which will let you merge tokens like this.
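As a rough illustration of what merging means here, the sketch below works on a plain list of token strings. It is not spacymoji's implementation and does not use spaCy at all; `merge_variation_selectors` is a hypothetical helper, and it assumes the selector arrives as a token of its own, as described in the question.

```python
VARIATION_SELECTOR_16 = "\ufe0f"


def merge_variation_selectors(tokens):
    """Attach each lone U+FE0F token to the token that precedes it."""
    merged = []
    for tok in tokens:
        if tok == VARIATION_SELECTOR_16 and merged:
            # Glue the selector back onto the character it modifies.
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


tokens = ["I", "\u2764", "\ufe0f", "spaCy"]
print(merge_variation_selectors(tokens))
```

A selector with nothing before it is left untouched, since there is no character for it to modify.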