What's the right way to make tokenization exception in Chinese tokenizer? #11238
-
The infixes, special cases, and so on are for languages that use the spaCy tokenizer implementation. They aren't used for languages with an external word segmenter like Chinese. I suspect the best way to handle this is by modifying the pkuseg settings. Another option is to customize the pkuseg wrapper in spaCy, or to add a retokenizing component to the front of your pipeline.
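For what it's worth, a minimal sketch (assuming spaCy v3) of the last option, a retokenizing component added to the front of the pipeline, using the hash-splitting case from the question below; the component name `hash_splitter` is made up and the model name is just an example:

```python
import spacy
from spacy.language import Language

@Language.component("hash_splitter")
def hash_splitter(doc):
    # Collect tokens where '#' is glued to other characters, e.g. "王#".
    to_split = [t for t in doc if "#" in t.text and len(t.text) > 1]
    with doc.retokenize() as retokenizer:
        for token in to_split:
            # Split the token at the first '#'; the heads point at the new
            # subtokens themselves, so no dependency parse is required.
            i = token.text.index("#")
            if i == 0:
                orths = ["#", token.text[1:]]
            else:
                orths = [token.text[:i], token.text[i:]]
            retokenizer.split(token, orths, heads=[(token, 0), (token, 0)])
    return doc

nlp = spacy.load("zh_core_web_sm")         # or your own pkuseg-based pipeline
nlp.add_pipe("hash_splitter", first=True)  # run right after tokenization
doc = nlp("#剩者为王#:")
print([t.text for t in doc])
```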
-
I also like the more general approach of a retokenizer component. I have used it for merging tokens. The problem with the `split` function is that it requires the dependency parser, which may not be present in many users' pipelines. Does that mean retokenization has to occur after the dependency parse? If it runs prior to the parser, would it not require the full parser?
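As far as I can tell, `Retokenizer.split` only needs the explicit `heads` argument rather than an actual parse, and those heads can point at the new subtokens themselves, so the component can run before (or without) the parser. A minimal sketch (my illustration, not from the thread), building the doc directly to mimic a bad segmentation:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("zh")                     # no tagger, no parser
# Build the doc directly so we control the (faulty) segmentation.
doc = Doc(nlp.vocab, words=["剩者为", "王#"], spaces=[False, False])

with doc.retokenize() as retokenizer:
    token = doc[1]                          # the badly segmented token "王#"
    retokenizer.split(
        token,
        ["王", "#"],
        heads=[(token, 0), (token, 0)],     # heads reference the new subtokens,
    )                                       # so no dependency parse is needed

print([t.text for t in doc])                # ['剩者为', '王', '#']
```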
-
I have another problem when merge and split are used in a row. The code looks like the following: I have a rule_segmenter that processes patterns recognized by a Matcher, and the to_be_merged and to_be_split lists collect the spans/tokens to be merged and split.
Function definition:
The problem is that, after merge_span() is applied, the values in to_be_split may have changed, so the tokens to be split are no longer the intended tokens. Is that because, when the merge is applied, the token offsets in the doc change and no longer match the original offsets collected into to_be_merged? If that's the case, why doesn't the same problem appear when only merge is used? After the first pair of tokens is merged, the offsets in the doc should also change accordingly, so the later tokens to be merged should no longer have the correct offsets, yet I don't see the same issue when only merge is used.
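One detail that may explain this: merges and splits queued inside a single `doc.retokenize()` block are only applied when the block exits, so merges alone don't invalidate anything while you are still collecting; but once the merges have been applied, Token objects collected earlier refer to the old token indices. A minimal sketch (my reconstruction, not the original rule_segmenter) of one way around it, collecting the split targets only after the merges have been applied:

```python
from spacy.language import Language

@Language.component("rule_segmenter")
def rule_segmenter(doc):
    # 1. Collect the spans to merge (e.g. from Matcher results); details omitted.
    to_be_merged = []                          # hypothetical: filled from matcher output
    with doc.retokenize() as retokenizer:      # all queued merges are applied on exit
        for span in to_be_merged:
            retokenizer.merge(span)

    # 2. Only now collect the tokens to split, from the doc as it looks *after*
    #    the merges, so the references are still valid when split is applied.
    to_be_split = [t for t in doc if "#" in t.text and len(t.text) > 1]
    with doc.retokenize() as retokenizer:
        for token in to_be_split:
            i = token.text.index("#")
            orths = ["#", token.text[1:]] if i == 0 else [token.text[:i], token.text[i:]]
            retokenizer.split(token, orths, heads=[(token, 0), (token, 0)])
    return doc
```

An alternative would be to record character offsets instead of Token/Span objects and re-resolve them with `doc.char_span()` after the merges.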
-
For example, if the hash '#' is tokenized as part of a token, I always want to split it off from the real token itself. This is a typical tokenization error from the pkuseg tokenizer:
#剩者为王#:
Here '王#' should be split into two tokens. I tried the following (see my CustomChinese definition in #11133):
I added a punctuation.py to my custom Chinese definition:
Then I added it to CustomChinese:
But this doesn't fix the tokenization issue. In English, the tokenizer has add_special_case (see the sketch below); something similar for Chinese would help too.
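For reference, this is roughly what `add_special_case` looks like on the rule-based English tokenizer (adapted from the spaCy docs); as the first reply explains, the Chinese pipeline swaps in the pkuseg wrapper, which doesn't go through these rules:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
# Force "gimme" to always be split into two tokens.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
print([t.text for t in nlp("gimme that")])   # ['gim', 'me', 'that']
```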