How to predict multiword ids in tokenization? #8185
Hi, I wonder how to predict a UD-like tokenization, for example:
So how can I predict
? Thank you.
Replies: 2 comments
I'm not sure what you mean by "predict". Do you want the tokenizer to split the full tokens into subtokens? You can write custom tokenizer rules to split up tokens, but the subtokens have to add up to the original token text exactly, which doesn't look like it fits your tokens as they are. spaCy has a policy that the raw input text never changes, so transformations like that are somewhat tricky. There's a feature in the cli convert command called
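To make the "subtokens have to add up to the original token" constraint concrete, here is a minimal sketch using a tokenizer special case on a blank Spanish pipeline. The rule for "del" and the example sentence are made up for illustration and are not from the original question. Setting `NORM` on a subtoken is one possible way to record an underlying form without changing the text. The rejection of a mismatched rule with a `ValueError` is what I'd expect from recent spaCy versions, so treat that part as an assumption.

```python
import spacy
from spacy.symbols import NORM, ORTH

# Blank pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("es")

# Hypothetical rule: split "del" into "de" + "l".
# The ORTH values must concatenate back to "del" exactly.
# NORM can carry the underlying form ("el") without touching the text.
nlp.tokenizer.add_special_case("del", [{ORTH: "de"}, {ORTH: "l", NORM: "el"}])

text = "Vengo del mercado"
doc = nlp(text)
tokens = [t.text for t in doc]
norms = [t.norm_ for t in doc]
print(tokens)
print(norms)

# The raw text is preserved: the Doc still round-trips to the input.
assert doc.text == text

# A rule whose pieces do not add up to the original string ("de" + "el"
# is not "del") is expected to be rejected (assumption: recent spaCy
# versions raise ValueError here).
try:
    nlp.tokenizer.add_special_case("del", [{ORTH: "de"}, {ORTH: "el"}])
    rejected = False
except ValueError:
    rejected = True
print("mismatched rule rejected:", rejected)
```

This is why a UD-style expansion such as "del" to "de" + "el" cannot be expressed directly as tokenizer output: the surface pieces would no longer match the input text.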
@wangxinyu0922 I think your question may be related to issue #1460. Perhaps you already saw that thread, but I wanted to make sure you didn't miss it. Like @polm said, creating a custom tokenizer is an option, and I wanted to point out that you can extract morphological information from the Morphologizer, so you can get some information from there. See for instance the Label Scheme documentation for the Spanish models: https://spacy.io/models/es