Tokenization quality won't affect PhraseMatcher much? #9902
-
You are correct that the PhraseMatcher is comparatively resilient to changes in tokenization quality, but it's not guaranteed to always work. Inconsistent tokenization, as opposed to general tokenization quality, isn't usually an issue for normal prose, but it can come up for text that isn't normal prose, like OCR of a table of numbers or product codes, where spaces get deleted and keywords run together. If you are having problems with it, you would need to do a character-based match.
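As a rough illustration, a character-based match could be done with a plain regex search over `doc.text`, mapping the hits back onto tokens with `Doc.char_span`. This is only a minimal sketch: the product-code pattern and the sample text are made up for the example.

```python
import re
import spacy

nlp = spacy.blank("en")
doc = nlp("Order code AB12-XY34shipped to Washington, D.C. yesterday.")

# Character-level search on the raw text, independent of token boundaries.
for match in re.finditer(r"AB12-XY34", doc.text):
    # Map the character offsets back onto tokens; "expand" snaps the span
    # to the nearest token boundaries if the match falls inside a token.
    span = doc.char_span(match.start(), match.end(), alignment_mode="expand")
    if span is not None:
        print(span.text, span.start_char, span.end_char)
```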
-
By char_span, do you mean not merging the characters recognized as being within one NE?
-
I don't use regex to recognize entities in my case. I have to use the PhraseMatcher for large-dictionary matching, and to use the PhraseMatcher I need a tokenizer. The original FlashText algorithm doesn't need tokenization and builds a character trie. I am wondering how the PhraseMatcher builds its trie and performs matching given that it is token-based. Secondly, is it possible for the PhraseMatcher to perform a character-based match without tokenizing first?
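For comparison, here is a minimal sketch of the usual PhraseMatcher workflow (standard spaCy v3 usage; the terms and sentence are just illustrative). The patterns are Doc objects produced by the same tokenizer as the document, so matching happens over token sequences rather than raw characters.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Patterns are Doc objects, i.e. token sequences from the same tokenizer.
terms = ["Washington, D.C.", "New York City"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("PLACES", patterns)

doc = nlp("She moved from New York City to Washington, D.C. last year.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```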
-
Since the PhraseMatcher tokenizes both the term list and the doc, as long as they use the same tokenizer, the quality of the tokenization shouldn't affect phrase matching too much. Is that right? For example, if 'Washington, D.C.' were tokenized into ['Washing', 'ton', 'D.C.'], since both the vocabulary and the doc go through the same tokenization, the whole 'Washington, D.C.' would still be matched as one unit by the PhraseMatcher. Is that true?
Despite that, there is still a possibility that 'Washington, D.C.' will be tokenized differently in the vocabulary and in the doc, since the two pieces of text are not exactly the same. So a Doc might fail to match an entry in the vocabulary list due to tokenization differences, even though the entry does occur in the Doc.
To match all possibilities, a character-based match is needed.
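To make that concrete, here is a small sketch showing the same tokenizer being applied to both the pattern text and the doc text (assuming spaCy v3 and a blank English pipeline; the token splits shown in the comments are illustrative and depend on the tokenizer version):

```python
import spacy

nlp = spacy.blank("en")

# The same tokenizer is applied to the pattern text and to the doc text.
pattern = nlp.make_doc("Washington, D.C.")
doc = nlp.make_doc("The capital is Washington, D.C. and it is crowded.")

print([t.text for t in pattern])  # e.g. ['Washington', ',', 'D.C.']
print([t.text for t in doc])

# A PhraseMatcher hit requires the pattern's token sequence to appear
# verbatim inside the doc's token sequence, so the exact splits only
# matter when they differ between the pattern and the doc.
```

The mismatch scenario described above is exactly the case where the surrounding characters differ between the term list and the document, which can make the two tokenizations diverge.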