What's the right way to make tokenization exception in Chinese tokenizer? #11238
-
The infixes, special cases, and so on are for languages that use the spaCy tokenizer implementation. They aren't used for languages with an external word segmenter like Chinese. I suspect the best way to handle this is by modifying the pkuseg settings. Another option is to customize the pkuseg wrapper in spaCy, or to add a retokenizing component to the front of your pipeline.
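For what it's worth, a minimal sketch (assuming spaCy v3) of the last option, a retokenizing component added to the front of the pipeline, using the hash-splitting case from the question below; the component name `hash_splitter` is made up and the model name is just an example:

```python
import spacy
from spacy.language import Language

@Language.component("hash_splitter")
def hash_splitter(doc):
    # Collect tokens where '#' is glued to other characters, e.g. "王#".
    to_split = [t for t in doc if "#" in t.text and len(t.text) > 1]
    with doc.retokenize() as retokenizer:
        for token in to_split:
            # Split the token at the first '#'; the heads point at the new
            # subtokens themselves, so no dependency parse is required.
            i = token.text.index("#")
            if i == 0:
                orths = ["#", token.text[1:]]
            else:
                orths = [token.text[:i], token.text[i:]]
            retokenizer.split(token, orths, heads=[(token, 0), (token, 0)])
    return doc

nlp = spacy.load("zh_core_web_sm")         # or your own pkuseg-based pipeline
nlp.add_pipe("hash_splitter", first=True)  # run right after tokenization
doc = nlp("#剩者为王#:")
print([t.text for t in doc])
```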
-
I also like the more general approach of a retokenizer component. I have used it for merging tokens. The problem with the `split` function is that it requires the dependency parser, which may not be present in many users' pipelines. Does that mean retokenization has to occur after the dependency parse? If it runs prior to the parser, would it not require the full parser?
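As far as I can tell, `Retokenizer.split` only needs the explicit `heads` argument rather than an actual parse, and those heads can point at the new subtokens themselves, so the component can run before (or without) the parser. A minimal sketch (my illustration, not from the thread), building the doc directly to mimic a bad segmentation:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("zh")                     # no tagger, no parser
# Build the doc directly so we control the (faulty) segmentation.
doc = Doc(nlp.vocab, words=["剩者为", "王#"], spaces=[False, False])

with doc.retokenize() as retokenizer:
    token = doc[1]                          # the badly segmented token "王#"
    retokenizer.split(
        token,
        ["王", "#"],
        heads=[(token, 0), (token, 0)],     # heads reference the new subtokens,
    )                                       # so no dependency parse is needed

print([t.text for t in doc])                # ['剩者为', '王', '#']
```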
-
I have another problem when merge and split are used in a row. The code looks like the following: I have a rule_segmenter that processes patterns recognized by a Matcher, and the to_be_merged and to_be_split lists collect the spans/tokens to be merged and split.
Function definition:
The problem is that, after merge_span() is applied, the values in to_be_split may have changed, so the tokens to be split are no longer the intended tokens. Is that because, when the merge is applied, the token offsets in the doc change and no longer match the original offsets collected into to_be_merged? If that's the case, why doesn't the same problem appear when only merge is used? After the first pair of tokens is merged, the offsets in the doc should also change accordingly, so the later tokens to be merged should no longer have the correct offsets, yet I don't see the same issue when only merge is used.
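One detail that may explain this: merges and splits queued inside a single `doc.retokenize()` block are only applied when the block exits, so merges alone don't invalidate anything while you are still collecting; but once the merges have been applied, Token objects collected earlier refer to the old token indices. A minimal sketch (my reconstruction, not the original rule_segmenter) of one way around it, collecting the split targets only after the merges have been applied:

```python
from spacy.language import Language

@Language.component("rule_segmenter")
def rule_segmenter(doc):
    # 1. Collect the spans to merge (e.g. from Matcher results); details omitted.
    to_be_merged = []                          # hypothetical: filled from matcher output
    with doc.retokenize() as retokenizer:      # all queued merges are applied on exit
        for span in to_be_merged:
            retokenizer.merge(span)

    # 2. Only now collect the tokens to split, from the doc as it looks *after*
    #    the merges, so the references are still valid when split is applied.
    to_be_split = [t for t in doc if "#" in t.text and len(t.text) > 1]
    with doc.retokenize() as retokenizer:
        for token in to_be_split:
            i = token.text.index("#")
            orths = ["#", token.text[1:]] if i == 0 else [token.text[:i], token.text[i:]]
            retokenizer.split(token, orths, heads=[(token, 0), (token, 0)])
    return doc
```

An alternative would be to record character offsets instead of Token/Span objects and re-resolve them with `doc.char_span()` after the merges.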
-
For example, if the hash '#' is tokenized as part of a token, I always want to split it off from the real token itself. This is a typical tokenization error from the pkuseg tokenizer:
#剩者为王#:
Here '王#' should be split into two tokens. I tried the following (see my CustomChinese definition in #11133):
I added a punctuation.py to my custom Chinese definition:
Then I added it to CustomChinese:
But this doesn't fix the tokenization issue. In English, the tokenizer has add_special_case (see the sketch below); something similar for Chinese would help too.
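For reference, this is roughly what `add_special_case` looks like on the rule-based English tokenizer (adapted from the spaCy docs); as the first reply explains, the Chinese pipeline swaps in the pkuseg wrapper, which doesn't go through these rules:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
# Force "gimme" to always be split into two tokens.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])   # ['gim', 'me', 'that']
```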