Tokenization quality won't affect PhraseMatcher much? #9902
-
You are correct that the PhraseMatcher is comparatively resilient to changes in tokenization quality, but it's not guaranteed to always work. Inconsistent tokenization, as opposed to general tokenization quality, isn't usually an issue for normal prose, but it can come up for text that isn't normal prose, like OCR of a table of numbers or product codes, where spaces get deleted and keywords run together. If you are having problems with it, you would need to do a character-based match.
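As a rough illustration, a character-based match could be done with a plain regex search over `doc.text`, mapping the hits back onto tokens with `Doc.char_span`. This is only a minimal sketch: the product-code pattern and the sample text are made up for the example.

```python
import re
import spacy

nlp = spacy.blank("en")
doc = nlp("Order code AB12-XY34shipped to Washington, D.C. yesterday.")

# Character-level search on the raw text, independent of token boundaries.
for match in re.finditer(r"AB12-XY34", doc.text):
    # Map the character offsets back onto tokens; "expand" snaps the span
    # to the nearest token boundaries if the match falls inside a token.
    span = doc.char_span(match.start(), match.end(), alignment_mode="expand")
    if span is not None:
        print(span.text, span.start_char, span.end_char)
```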
-
By char_span, do you mean not merging the characters recognized as being within one NE?
-
I don't use regex to recognize entities in my case. I have to use the PhraseMatcher for large-dictionary matching, and to use the PhraseMatcher I need a tokenizer. The original FlashText algorithm doesn't need tokenization and builds a character trie. I am wondering how the PhraseMatcher builds its trie and performs matching given that it is token-based. Secondly, is it possible for the PhraseMatcher to perform a character-based match without tokenizing first?
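For comparison, here is a minimal sketch of the usual PhraseMatcher workflow (standard spaCy v3 usage; the terms and sentence are just illustrative). The patterns are Doc objects produced by the same tokenizer as the document, so matching happens over token sequences rather than raw characters.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Patterns are Doc objects, i.e. token sequences from the same tokenizer.
terms = ["Washington, D.C.", "New York City"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("PLACES", patterns)

doc = nlp("She moved from New York City to Washington, D.C. last year.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```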
-
Since the PhraseMatcher tokenizes both the term list and the doc, as long as they use the same tokenizer, the quality of the tokenization shouldn't affect phrase matching too much. Is that right? For example, if 'Washington, D.C.' were tokenized into ['Washing', 'ton', 'D.C.'], since both the vocabulary and the doc go through the same tokenization, the whole 'Washington, D.C.' would still be matched as one unit by the PhraseMatcher. Is that true?
Despite that, there is still a possibility that 'Washington, D.C.' will be tokenized differently in the vocabulary and in the doc, since the two pieces of text are not exactly the same. So a Doc might fail to match an entry in the vocabulary list due to tokenization differences, even though the entry does occur in the Doc.
To match all possibilities, a character-based match is needed.
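To make that concrete, here is a small sketch showing the same tokenizer being applied to both the pattern text and the doc text (assuming spaCy v3 and a blank English pipeline; the token splits shown in the comments are illustrative and depend on the tokenizer version):

```python
import spacy

nlp = spacy.blank("en")

# The same tokenizer is applied to the pattern text and to the doc text.
pattern = nlp.make_doc("Washington, D.C.")
doc = nlp.make_doc("The capital is Washington, D.C. and it is crowded.")

print([t.text for t in pattern])  # e.g. ['Washington', ',', 'D.C.']
print([t.text for t in doc])

# A PhraseMatcher hit requires the pattern's token sequence to appear
# verbatim inside the doc's token sequence, so the exact splits only
# matter when they differ between the pattern and the doc.
```

The mismatch scenario described above is exactly the case where the surrounding characters differ between the term list and the document, which can make the two tokenizations diverge.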