Entity Ruler in Chinese #10205
Hi, I am trying to label organizations in Chinese with spaCy. Currently I am using only a gazetteer with the EntityRuler (the `ner` component is disabled). I did the same in English and Russian and it worked well, but for Chinese it fails to find entities that are in the text and appear in the gazetteer. This is how I use the EntityRuler (a fuller sketch follows below):

```python
patterns = []
# ...
```

Even if I process the exact same text as the ORG, meaning only "IT业界", it fails to find any entities in the text. I can see the relevant pattern in the `entity_ruler` component, but no extracted entities. I'm loading spaCy with "zh_core_web_sm". What am I missing? Thanks.
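A minimal sketch of the setup described above, assuming the patterns are built from a gazetteer list; the `orgs` contents and the `ORG` label are illustrative placeholders, not the original code:

```python
import spacy

# Load the Chinese pipeline with the statistical NER disabled, as described above.
nlp = spacy.load("zh_core_web_sm", disable=["ner"])

orgs = ["IT业界"]  # placeholder gazetteer entries (assumption)
patterns = [{"label": "ORG", "pattern": org} for org in orgs]

ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

doc = nlp("IT业界")
print(doc.ents)  # reported to come back empty with zh_core_web_sm
```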
---
I think the underlying problem is that the word segmentation from pkuseg in `zh_core_web_sm` depends on the context. When you add a phrase pattern to the entity ruler, it's tokenized without any context to create the patterns, but this might not match the tokenization that you see in your actual doc.

One issue is that full-width ＩＴ业界 is not treated as equivalent to half-width IT业界, and I can only replicate the problem with ＩＴ业界 and not IT业界.

If you tokenize just the phrases, you get:

```python
import spacy

nlp = spacy.load("zh_core_web_sm")
assert [t.text for t in nlp('ＩＴ业界')] == ['Ｉ', 'Ｔ业界']
assert [t.text for t in nlp('IT业界')] == ['IT', '业界']
```

But if ＩＴ业界 appears with more context (example from https://zh.wikipedia.org/wiki/IT%E4%B9%…), pkuseg can segment it differently than it segments the isolated phrase, so the pattern created from the bare phrase no longer lines up with the doc's tokens.
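A quick way to check this on your own data is to compare how the pipeline segments the bare phrase versus a sentence containing it (the sentence below is an illustrative stand-in; the actual splits depend on pkuseg):

```python
import spacy

nlp = spacy.load("zh_core_web_sm")

phrase = "ＩＴ业界"
# Illustrative context sentence (assumption); use a sentence from your own data.
sentence = "ＩＴ业界新闻,电子设备评测,论坛,广告推广等。"

print([t.text for t in nlp(phrase)])    # segmentation of the isolated phrase
print([t.text for t in nlp(sentence)])  # in-context segmentation may differ,
                                        # in which case the phrase pattern won't match
```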
If you are only interested in matching the entity spans and you don't care about the doc tokenization or the tags/parses from the other components in `zh_core_web_sm`, you can switch to character-based segmentation:

```python
import spacy

nlp = spacy.blank("zh")  # char-based tokenization by default rather than pkuseg
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "IT业界", "pattern": "IT业界"}])
doc = nlp("IT业界新闻,电子设备评测,论坛,广告推广等。")
print(doc.ents)  # (IT业界,)
assert [t.text for t in doc] == ['I', 'T', '业', '界', '新', '闻', ',', '电', '子', '设', '备', '评', '测', ',', '论', '坛', ',', '广', '告', '推', '广', '等', '。']
```

But the doc tokens are all individual characters. The phrase patterns will always match correctly, but you might not want this tokenization.

If you use jieba, I still think you could see some mismatched segmentation, but probably less than with pkuseg, and you'd have more usable tokens in the end, although they're not the same as from pkuseg and it also won't work with the tagger or parser from `zh_core_web_sm`:

```python
import spacy

nlp = spacy.blank("zh", config={"nlp": {"tokenizer": {"segmenter": "jieba"}}})
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "IT业界", "pattern": "IT业界"}])
doc = nlp("IT业界新闻,电子设备评测,论坛,广告推广等。")
print(doc.ents)  # (IT业界,)
assert [t.text for t in doc] == ['IT', '业界', '新闻', ',', '电子设备', '评测', ',', '论坛', ',', '广告', '推广', '等', '。']
```

The entity ruler was designed for rule-based tokenization rather than statistical tokenization, and there's currently no way to give the entity ruler phrase patterns with the right tokenization: you can only provide each phrase pattern as a string. If you use the phrase matcher directly rather than the entity ruler, you can control the tokenization when you add the patterns by creating the pattern docs manually.
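For example, a minimal sketch of that approach, where the pattern doc is built by hand; the `['IT', '业界']` split and the `ORG` label are illustrative assumptions:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span

nlp = spacy.load("zh_core_web_sm", disable=["ner"])
matcher = PhraseMatcher(nlp.vocab)

# Construct the pattern doc manually so its tokens match the segmentation
# you actually observe in context (the split here is an assumption).
pattern_doc = Doc(nlp.vocab, words=["IT", "业界"])
matcher.add("ORG", [pattern_doc])

doc = nlp("IT业界新闻,电子设备评测,论坛,广告推广等。")
doc.ents = [Span(doc, start, end, label="ORG") for _, start, end in matcher(doc)]
print(doc.ents)
```

Because you create the pattern docs yourself, you can make their tokens match whatever segmentation pkuseg actually produces in your documents.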