What does "Expands multi-word tokens (MWT) predicted by the TokenizeProcessor. This is only applicable to some languages." mean? #1498

otakutyrant · 2025-05-30T09:08:35Z

otakutyrant
May 30, 2025

The source is https://stanfordnlp.github.io/stanza/pipeline.html#processors

Does this mean that some languages absolutely do not have any multi-word tokens, so MWTProcessor simply treats every word as a token directly?

Answered by AngledLuffa

AngledLuffa · 2025-05-30T15:04:52Z

AngledLuffa
May 30, 2025
Maintainer

Yes, that's it exactly. The MWT processor only applies to languages whose training data contains multi-word tokens. So, for example, English has "don't" and similar contractions, but Chinese doesn't have anything like that.

We can word it differently if you suggest how, but I will say the linked article does a good job of explaining it (using French as an example, not English). Maybe we could put a modal window there with an example?
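
To make the token/word distinction concrete, here is a minimal toy sketch in plain Python. It is not how Stanza does it: the real processor relies on a trained model, not a lookup table. The `EXPANSIONS` table and the `expand_tokens` helper are hypothetical names made up for this illustration.

```python
# Toy illustration only: NOT Stanza's implementation.
# EXPANSIONS is a hypothetical, hand-written table of a few French contractions.
EXPANSIONS = {
    "du": ["de", "le"],
    "au": ["à", "le"],
    "aux": ["à", "les"],
}


def expand_tokens(tokens, expansions):
    """Pair each token with its list of words.

    A token found in `expansions` is a multi-word token and maps to several
    words; any other token maps to a one-element list containing itself.
    """
    return [(tok, expansions.get(tok.lower(), [tok])) for tok in tokens]


# French: "du" is one token in the text but two syntactic words.
french = expand_tokens(["Je", "parle", "du", "livre"], EXPANSIONS)

# A language without multi-word tokens has nothing to expand,
# so every token is exactly one word.
chinese = expand_tokens(["我", "喜欢", "书"], {})

for tok, words in french + chinese:
    print(tok, "->", words)
```

With an empty table the function returns every token unchanged, which is the "not applicable to this language" case the documentation sentence refers to.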

1 reply

otakutyrant May 30, 2025
Author

I suggest changing "This is only applicable to some languages." to "For some languages like Chinese, which do not have any multi-word tokens at all, the processor simply treats each word as a token directly."
