I think the underlying problem is that the word segmentation from pkuseg in zh_core_web_sm depends on the context. When you add a phrase pattern to the entity ruler, it's tokenized without any context to create the patterns, but this might not match the tokenization that you see in your actual doc.
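To make the mismatch concrete, here is a minimal sketch in plain Python of why a pattern can fail even though the phrase is clearly present in the text. It is only a simplified model of token-level phrase matching, not spaCy's actual matcher code. The helper `find_token_span` and both token lists are hypothetical and chosen for illustration; they are not real pkuseg output.

```python
from typing import List, Optional


def find_token_span(doc_tokens: List[str], pattern_tokens: List[str]) -> Optional[int]:
    """Return the index where pattern_tokens occurs as a contiguous run in doc_tokens, else None."""
    n = len(pattern_tokens)
    if n == 0:
        return None
    for start in range(len(doc_tokens) - n + 1):
        if doc_tokens[start:start + n] == pattern_tokens:
            return start
    return None


# Hypothetical token lists, for illustration only (not real pkuseg output).
pattern_tokens = ["I", "T业界"]             # phrase segmented in isolation
doc_tokens = ["IT", "业界", "的", "发展"]    # same phrase segmented inside a sentence

# The characters are identical, but the token boundaries differ.
print(find_token_span(doc_tokens, pattern_tokens))   # None: no token-level match
print(find_token_span(doc_tokens, ["IT", "业界"]))    # 0: matches from the first token
```

The joined text is the same in both cases, so a plain substring search would succeed; it is only the comparison of token sequences that fails when the segmentation differs.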

One issue is that IT业界 is not treated as equivalent to IT业界, and I can only replicate the problem with IT业界 and not IT业界.

If you tokenize just the phrases, you get:

import spacy

nlp = spacy.load("zh_core_web_sm")
# Each phrase is tokenized on its own, with no surrounding context
assert [t.text for t in nlp('IT业界')] == ['I', 'T业界']
assert [t.text for t in nlp('IT业界')] == ['IT', '业界']

But if IT业界 appears with more context (example from https://zh.wikipedia.org/wiki/IT%E4%B9%…

@mosheziat
Answer selected by adrianeboyd
Labels: lang / zh (Chinese language data and models), feat / ner (Feature: Named Entity Recognizer), feat / matcher (Feature: Token, phrase and dependency matcher)