How to predict multiword ids in tokenization? #8185

wangxinyu0922 · 2021-05-22T13:13:45Z

Your Environment

Operating System:
Python Version Used:
spaCy Version Used:
Environment Information:

Hi, I would like to know how to predict a UD-style (Universal Dependencies) tokenization, for example:

1-2    vámonos   _
1      vamos     ir
2      nos       nosotros
3-4    al        _
3      a         a
4      el        el
5      mar       mar

So how can I predict the multiword token lines 1-2 vámonos _ and 3-4 al _, given only the following?

doc = nlp(text)

Thank you.

Answered by damian-romero

polm · 2021-05-24T04:00:39Z

Not sure what you mean by "predict". Do you want the tokenizer to split the full tokens into subtokens?

You can make custom tokenizer rules to split up tokens, but the subtokens have to add up to the original token exactly, which doesn't look like it fits your tokens as they are. spaCy has a policy that the raw input text never changes, so transformations like that are somewhat tricky.
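
To make the "add up exactly" constraint concrete, here is a minimal sketch in plain Python. It is not spaCy code: `is_valid_split` is a hypothetical helper that only mirrors the rule described above, namely that the pieces of a split must concatenate back to the original text. In spaCy itself, such splits would be registered as tokenizer special cases.

```python
def is_valid_split(token, parts):
    """Return True if the pieces concatenate back to the original token text."""
    return "".join(parts) == token


# A pure character split keeps the raw text intact.
print(is_valid_split("al", ["a", "l"]))
# The UD-style expansions from the question change the text, so they fail.
print(is_valid_split("al", ["a", "el"]))
print(is_valid_split("vámonos", ["vamos", "nos"]))
```

Under this rule, "al" could be split into "a" + "l", with the intended word form kept in a separate attribute, but not directly into "a" + "el".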

The spacy convert CLI command has an option called --merge-subtokens. It is off by default, so when you train a model on data in the notation you showed, the model is trained on the subtokens, not the big tokens. Maybe that's a helpful option for you?
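
The difference between the "big tokens" and the subtokens can be illustrated with a small plain-Python sketch. This is not how spacy convert is implemented; `read_views` is a hypothetical helper, and it assumes simplified rows with only ID, FORM and LEMMA columns (real CoNLL-U has ten columns and also allows empty-node IDs such as 1.1, which this sketch does not handle).

```python
# Simplified ID / FORM / LEMMA rows, taken from the example in the question.
CONLLU = """\
1-2    vámonos   _
1      vamos     ir
2      nos       nosotros
3-4    al        _
3      a         a
4      el        el
5      mar       mar
"""


def read_views(conllu):
    """Return (surface_tokens, words) from simplified CoNLL-U rows."""
    surface, words = [], []
    covered_until = 0  # highest word ID covered by the latest range line
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        tok_id, form, _lemma = line.split()[:3]
        if "-" in tok_id:
            # Range line such as "1-2": one surface token spanning several words.
            surface.append(form)
            covered_until = int(tok_id.split("-")[1])
        else:
            words.append(form)
            # A word outside any range is also its own surface token.
            if int(tok_id) > covered_until:
                surface.append(form)
    return surface, words


surface, words = read_views(CONLLU)
print(surface)  # ['vámonos', 'al', 'mar']
print(words)    # ['vamos', 'nos', 'a', 'el', 'mar']
```

The first list corresponds to the merged view, the second to the subtoken view that the model sees when merging is off.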

damian-romero · 2021-05-28T14:56:52Z

@wangxinyu0922 I think your question may be related to issue #1460. Perhaps you already saw that thread, but I wanted to make sure you didn't miss it. Like @polm said, creating a custom tokenizer is an option and I wanted to point out that you can extract morphological information from the Morphologizer, so you can get some information from there. See for instance the Label Scheme documentation for the Spanish models: https://spacy.io/models/es
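
As a rough illustration of the "custom" route, the range lines could be produced in a post-processing step outside the tokenizer. The sketch below is plain Python and rests on loud assumptions: `MWT_TABLE` is a hypothetical, hand-written lookup table with only three entries, and `to_ud_rows` is an illustrative helper, not part of spaCy. A real system would need much broader coverage or a trained component.

```python
# Hypothetical hand-written table: surface form -> (form, lemma) for each word.
MWT_TABLE = {
    "vámonos": [("vamos", "ir"), ("nos", "nosotros")],
    "al": [("a", "a"), ("el", "el")],
    "del": [("de", "de"), ("el", "el")],
}


def to_ud_rows(tokens):
    """Turn surface tokens into simplified (ID, FORM, LEMMA) rows."""
    rows = []
    next_id = 1
    for tok in tokens:
        parts = MWT_TABLE.get(tok.lower())
        if parts is None:
            # Not a known multiword token; the lemma is unknown here, so use "_".
            rows.append((str(next_id), tok, "_"))
            next_id += 1
            continue
        # Emit the range line first, then one row per word it covers.
        last_id = next_id + len(parts) - 1
        rows.append((f"{next_id}-{last_id}", tok, "_"))
        for form, lemma in parts:
            rows.append((str(next_id), form, lemma))
            next_id += 1
    return rows


rows = to_ud_rows(["vámonos", "al", "mar"])
for row in rows:
    print("\t".join(row))
```

The input list could come from something like `[t.text for t in doc]`, so the raw text and spaCy's own tokens stay untouched while the UD-style rows are built alongside them.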
