encoding issue '❤️' != '❤' #8231

nicolashernandez · 2021-05-28T15:20:22Z


A Unicode character seems to disappear when spaCy processes an emoji. Every '❤️' is turned into two tokens, '❤' and '' (i.e. '❤' followed by what looks like an empty string).

I tested three models and got the same result with each.

Both '❤️' and '❤' start with the same Unicode code point, but '❤️' has a second one, and that is the one that disappears.

Here is how to reproduce the problem:

import spacy

spacy_models = ['fr_core_news_sm', 'fr_dep_news_trf', 'en_core_web_sm']

corpus = ["Ah oui 😊 J'aime 😍😘 Merci ❤ !", 
            "'Bah oui ❤️ bien plus !! ❤️' 💕💕💕💕💕💕"]

for spacy_model in spacy_models:
    nlp = spacy.load(spacy_model)  # each pipeline must already be installed
    for tweet in corpus:
        # Print the token texts to see how each emoji is split.
        print(spacy_model, [token.text for token in nlp(tweet)])

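# The two strings look alike but differ by one invisible code point.
print('❤️' == '❤')                     # False
print('❤️')
print('❤')
print('❤️'.encode("unicode_escape"))   # b'\\u2764\\ufe0f'
print('❤'.encode("unicode_escape"))    # b'\\u2764'
print('\U00002764\U0000fe0f')          # same string as '❤️'
print('\U00002764')                    # same string as '❤'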

Info about spaCy

spaCy version: 3.0.4
Platform: Linux-5.4.0-73-generic-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9
Pipelines: en_core_web_sm (3.0.0), fr_dep_news_trf (3.0.0), fr_core_news_sm (3.0.0)

Answered by polm


nicolashernandez · 2021-05-28T16:08:08Z

Author

Actually, the character that seemed to disappear is still present; it is just not visible. The character is '\U0000fe0f'.
So this may be a tokenization issue rather than an encoding one.
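
For anyone who wants to check this without spaCy, here is a minimal sketch that lists the code points in each string using only the standard library. The helper name `describe` is made up for this illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


heart_emoji = "\u2764\ufe0f"  # the heart followed by the invisible character
heart_plain = "\u2764"        # the heart on its own

# The first string has two code points, the second only one.
print(len(heart_emoji), describe(heart_emoji))
print(len(heart_plain), describe(heart_plain))
```

The second code point is named VARIATION SELECTOR-16, which is why nothing shows up when it is printed by itself.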


polm · 2021-05-29T05:35:33Z


This seems to be the correct behavior. FE0F (U+FE0F, VARIATION SELECTOR-16) is a special invisible character that asks for the preceding character to be rendered as a colorful emoji instead of a text-style dingbat.

https://codepoints.net/U+FE0F?lang=en
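
If the goal is only to compare '❤️' and '❤' as the same symbol, one possible approach is to strip the selector before comparing. This is a rough sketch, not something spaCy does for you; `strip_vs16` is a hypothetical helper, and removing the selector throws away the emoji-versus-text rendering hint.

```python
import unicodedata

VS16 = "\ufe0f"  # VARIATION SELECTOR-16


def strip_vs16(text):
    """Remove every U+FE0F variation selector from text."""
    return text.replace(VS16, "")


heart_emoji = "\u2764\ufe0f"
heart_plain = "\u2764"

print(heart_emoji == heart_plain)              # False
print(strip_vs16(heart_emoji) == heart_plain)  # True

# Unicode normalization alone leaves the selector in place.
print(unicodedata.normalize("NFKC", heart_emoji) == heart_emoji)  # True
```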

Whether this should be part of the same token or not by default is arguable, but splitting it into its own token by default is consistent with the way spaCy treats other emoji modifiers. You might want to look at spacymoji, which will let you merge tokens like this.
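
If only the token strings are needed, the selector can also be glued back onto the preceding token as a post-processing step. The sketch below works on a plain list of token texts like the one printed by the reproduction script; it does not use spaCy or spacymoji, and `merge_variation_selectors` is a made-up name for this illustration.

```python
VS16 = "\ufe0f"  # VARIATION SELECTOR-16


def merge_variation_selectors(tokens):
    """Attach each standalone U+FE0F token to the token before it."""
    merged = []
    for tok in tokens:
        if tok == VS16 and merged:
            merged[-1] += tok  # glue the selector back onto the previous token
        else:
            merged.append(tok)
    return merged


# Token texts written by hand to mimic the split described in this thread.
tokens = ["Bah", "oui", "\u2764", "\ufe0f", "bien", "plus"]
print(merge_variation_selectors(tokens))
```

A selector with nothing before it is left alone, so the function never drops characters from the input.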

