It should be fine to add those characters as infixes, or otherwise modify the tokenizer to get the tokens you need.

It's important to understand that the tokenizer doesn't know anything about your entity annotations. Entity annotations are applied after the tokenizer has already done its work, which is why you have issues if your annotations aren't whole tokens. For the same reason, adding EntityRulers or Matchers will not change tokenization and will not fix your problem.
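As a rough illustration of the infix approach, here is a minimal sketch. It assumes a blank English pipeline and uses `|` as a stand-in for whatever separator appears in your data; the sample text and the `COLOR` label are hypothetical and not taken from the original question.

```python
import spacy
from spacy.util import compile_infix_regex

# Blank English pipeline: tokenizer only, no trained model needed.
nlp = spacy.blank("en")

# Hypothetical input; "|" stands in for the separator in your own data.
text = "red|blue"

# Extend the default infix patterns with the extra separator,
# then swap the compiled regex into the existing tokenizer.
infixes = list(nlp.Defaults.infixes) + [r"\|"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp(text)
tokens = [t.text for t in doc]

# char_span returns None when the character offsets do not line up
# with token boundaries, so a non-None result means the annotation
# now covers whole tokens.
span = doc.char_span(0, 3, label="COLOR")
print(tokens, span)
```

With the separator registered as an infix, the entity offsets should fall on token boundaries. Because the change lives in the tokenizer rather than in the annotations, new unannotated text is split the same way.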

Consider the case where you have new data coming in. For new data you don't have entity annotations yet. If you expect spaCy to use your annotations at training time, how would it be able to get the same tokenization wi…

Answer selected by marzooq-unbxd
Labels: lang / en (English language data and models), feat / tokenizer (Feature: Tokenizer)