Matching 17.3nch with spaCy Entity Ruler #12802
Hi, I'm trying to match screen sizes with spaCy's entity ruler, for example: `['14"', '7.1-ch', '15.2nch']`

The first and second are easy, because they are tokenized like this: `'14' '"'` and `'7.1' '-' 'ch'`. The problem is matching `'15.2nch'`, because it is a single token. I tried:

`{"label": "inches", "pattern": [{"SHAPE": {"IN": ["dd.d", "dd", "d.d", "d"]}, "ORTH": "nch"}]}`

But when I apply two patterns to the same token, it does not work. Any idea how I can solve this problem? Thanks
Replies: 1 comment 1 reply
Hi!

You're right that you can't match the token `"15.2nch"` with a generic pattern like that, identifying first the `dd.d` part and then `nch`, as long as the tokenizer sees this as one token. For this to work, you'll either need to preprocess your texts so there's a space there (`"15.2 nch"`), or you'll need to adjust your tokenizer by fiddling with the tokenizer rules. I worry that the latter will be tricky though, because the current English tokenizer is set up such that it will keep numbers (like `15.2`) together, which is a very sensible thing to do. What you need in your case is some sort of rule that will identify that your token is made up of numbers+punctuation on the one hand, and normal …

Trying to do this in preprocessing will probably be your best bet.
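The preprocessing route mentioned in this reply could look roughly like the sketch below. It is only an illustration, not something from the thread: the helper name `split_number_from_letters` and the regular expression are assumptions, and the rule is deliberately narrow (a number with an optional decimal part, directly followed by a letter).

```python
import re

# Hypothetical pattern (not from the thread): a standalone number, optionally
# with one decimal part, that is immediately followed by an ASCII letter.
# The lookbehind stops the match from starting in the middle of a word or
# in the middle of a decimal number.
_NUM_THEN_LETTERS = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)(?=[A-Za-z])")


def split_number_from_letters(text: str) -> str:
    """Insert a space between a number and the letters glued onto it."""
    return _NUM_THEN_LETTERS.sub(r"\1 ", text)


if __name__ == "__main__":
    for raw in ['14"', "7.1-ch", "15.2nch"]:
        print(repr(raw), "->", repr(split_number_from_letters(raw)))
```

Once the text is split this way, the number and `nch` should end up as two tokens, so the original pattern can presumably be written as two token dicts, e.g. `[{"SHAPE": {"IN": ["dd.d", "dd", "d.d", "d"]}}, {"ORTH": "nch"}]`. Keep in mind that this regex would also split strings such as `3rd` or `4K`, and that the inserted spaces shift character offsets relative to the raw text.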