Entity Ruler in Chinese #10205
Hi, I am trying to label organizations in Chinese with spaCy. Currently I am using only a gazetteer with the EntityRuler (the `ner` component is disabled). I did the same in English and Russian and it worked well, but for Chinese it fails to find entities that are in the text and appear in the gazetteer. This is how I use the EntityRuler (a fuller sketch follows below):

```python
patterns = []
# ...
```

Even if I process the exact same text as the ORG, meaning only "IT业界", it fails to find any entities in the text. I can see the relevant pattern in the `entity_ruler` component, but no extracted entities. I'm loading spaCy with "zh_core_web_sm". What am I missing? Thanks.
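A minimal sketch of the setup described above, assuming the patterns are built from a gazetteer list; the `orgs` contents and the `ORG` label are illustrative placeholders, not the original code:

```python
import spacy

# Load the Chinese pipeline with the statistical NER disabled, as described above.
nlp = spacy.load("zh_core_web_sm", disable=["ner"])

orgs = ["IT业界"]  # placeholder gazetteer entries (assumption)
patterns = [{"label": "ORG", "pattern": org} for org in orgs]

ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

doc = nlp("IT业界")
print(doc.ents)  # reported to come back empty with zh_core_web_sm
```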
---
I think the underlying problem is that the word segmentation from pkuseg in `zh_core_web_sm` depends on the context. When you add a phrase pattern to the entity ruler, it's tokenized without any context to create the patterns, but this might not match the tokenization that you see in your actual doc.

One issue is that full-width ＩＴ业界 is not treated as equivalent to half-width IT业界, and I can only replicate the problem with ＩＴ业界 and not IT业界.

If you tokenize just the phrases, you get:

```python
import spacy

nlp = spacy.load("zh_core_web_sm")
assert [t.text for t in nlp('ＩＴ业界')] == ['Ｉ', 'Ｔ业界']
assert [t.text for t in nlp('IT业界')] == ['IT', '业界']
```

But if ＩＴ业界 appears with more context (example from https://zh.wikipedia.org/wiki/IT%E4%B9%…), pkuseg can segment it differently than it segments the isolated phrase, so the pattern created from the bare phrase no longer lines up with the doc's tokens.
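A quick way to check this on your own data is to compare how the pipeline segments the bare phrase versus a sentence containing it (the sentence below is an illustrative stand-in; the actual splits depend on pkuseg):

```python
import spacy

nlp = spacy.load("zh_core_web_sm")

phrase = "ＩＴ业界"
# Illustrative context sentence (assumption); use a sentence from your own data.
sentence = "ＩＴ业界新闻,电子设备评测,论坛,广告推广等。"

print([t.text for t in nlp(phrase)])    # segmentation of the isolated phrase
print([t.text for t in nlp(sentence)])  # in-context segmentation may differ,
                                        # in which case the phrase pattern won't match
```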
If you are only interested in matching the entity spans and you don't care about the doc tokenization or the tags/parses from the other components in `zh_core_web_sm`, you can switch to character-based segmentation:

```python
import spacy

nlp = spacy.blank("zh")  # char-based tokenization by default rather than pkuseg
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "IT业界", "pattern": "IT业界"}])
doc = nlp("IT业界新闻,电子设备评测,论坛,广告推广等。")
print(doc.ents)  # (IT业界,)
assert [t.text for t in doc] == ['I', 'T', '业', '界', '新', '闻', ',', '电', '子', '设', '备', '评', '测', ',', '论', '坛', ',', '广', '告', '推', '广', '等', '。']
```

But the doc tokens are all individual characters. The phrase patterns will always match correctly, but you might not want this tokenization.

If you use jieba, I still think you could see some mismatched segmentation, but probably less than with pkuseg, and you'd have more usable tokens in the end, although they're not the same as from pkuseg and it also won't work with the tagger or parser from `zh_core_web_sm`:

```python
import spacy

nlp = spacy.blank("zh", config={"nlp": {"tokenizer": {"segmenter": "jieba"}}})
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "IT业界", "pattern": "IT业界"}])
doc = nlp("IT业界新闻,电子设备评测,论坛,广告推广等。")
print(doc.ents)  # (IT业界,)
assert [t.text for t in doc] == ['IT', '业界', '新闻', ',', '电子设备', '评测', ',', '论坛', ',', '广告', '推广', '等', '。']
```

The entity ruler was designed for rule-based tokenization rather than statistical tokenization, and there's currently no way to give the entity ruler phrase patterns with the right tokenization: you can only provide each phrase pattern as a string. If you use the phrase matcher directly rather than the entity ruler, you can control the tokenization when you add the patterns by creating the pattern docs manually.
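For example, a minimal sketch of that approach, where the pattern doc is built by hand; the `['IT', '业界']` split and the `ORG` label are illustrative assumptions:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span

nlp = spacy.load("zh_core_web_sm", disable=["ner"])
matcher = PhraseMatcher(nlp.vocab)

# Construct the pattern doc manually so its tokens match the segmentation
# you actually observe in context (the split here is an assumption).
pattern_doc = Doc(nlp.vocab, words=["IT", "业界"])
matcher.add("ORG", [pattern_doc])

doc = nlp("IT业界新闻,电子设备评测,论坛,广告推广等。")
doc.ents = [Span(doc, start, end, label="ORG") for _, start, end in matcher(doc)]
print(doc.ents)
```

Because you create the pattern docs yourself, you can make their tokens match whatever segmentation pkuseg actually produces in your documents.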