This seems to be the correct behavior. U+FE0F (VARIATION SELECTOR-16) is a special character that requests that the preceding character be rendered as a colorful emoji instead of a text-style dingbat.

https://codepoints.net/U+FE0F?lang=en
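To make that concrete, here is a minimal sketch using only Python's standard library. It assumes the heart character U+2764 as the example (the original report's exact input isn't shown here), and `describe` is just an illustrative helper name, not part of any library.

```python
import unicodedata

heart_text = "\u2764"          # base character only, no selector
heart_emoji = "\u2764\ufe0f"   # same base character followed by U+FE0F


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The emoji form is two code points, even though it displays as one glyph.
print(len(heart_text), len(heart_emoji))
for code_point, name in describe(heart_emoji):
    print(code_point, name)
```

So from the tokenizer's point of view there really are two characters in the string, and the second one is an invisible modifier rather than anything with its own glyph.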

Whether the selector should be part of the same token by default is arguable, but splitting it into its own token is consistent with the way spaCy treats other emoji modifiers. You might want to look at spacymoji, which will let you merge tokens like this.
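If you only need the idea of the merge, a rough plain-Python sketch looks like the following. This is not spaCy's or spacymoji's API: `merge_variation_selectors` is a hypothetical helper that works on a list of token strings, and the token list is an assumed example of what a split might look like. In spaCy itself you would do the equivalent on a `Doc`, for example with the retokenizer.

```python
VARIATION_SELECTOR_16 = "\ufe0f"


def merge_variation_selectors(tokens):
    """Attach each standalone U+FE0F token to the token before it.

    Hypothetical helper for illustration only; operates on plain strings.
    """
    merged = []
    for tok in tokens:
        if tok == VARIATION_SELECTOR_16 and merged:
            # Glue the selector onto the previous token.
            merged[-1] += tok
        else:
            # Ordinary token, or a selector with nothing before it.
            merged.append(tok)
    return merged


# Assumed example of a tokenization where the selector was split off.
tokens = ["I", "\u2764", "\ufe0f", "spaCy"]
print(merge_variation_selectors(tokens))
```

A selector at the very start of the list has nothing to attach to, so the sketch leaves it as its own token rather than dropping it.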

Answer selected by svlandeg
Labels: feat / tokenizer
This discussion was converted from issue #8227 on May 29, 2021 05:19.