Matching 17.3nch with spaCy Entity Ruler #12802

joaomsimoes · 2023-07-07T07:53:01Z

Hi,

I'm trying to match screen sizes with the entity ruler from spaCy.

For example: ['14"', '7.1-ch', '15.2nch']

The first and second are easy, because they will be tokenized like this: '14' '"' '7.1' '-' 'ch'

The problem is matching '15.2nch', because it is tokenized as a single token. I tried:

{"label": "inches", "pattern": [ {"SHAPE": {"IN": ["dd.d", "dd", "d.d", "d"]}, "ORTH": 'nch'}]}

But applying two conditions to the same token does not work.

Any idea how I can solve this problem?

Thanks

Answered by svlandeg

svlandeg · 2023-07-13T15:28:36Z

Hi!

You're right that you can't match the token "15.2nch" with a generic pattern like that, matching first the dd.d part and then nch, as long as the tokenizer treats it as one token. For this to work, you'll either need to preprocess your texts so there's a space there ("15.2 nch"), or adjust your tokenizer by changing the tokenizer rules. I worry that the latter will be tricky though, because the current English tokenizer is set up to keep numbers (like 15.2) together, which is a very sensible thing to do. What you need in your case is some sort of rule that identifies that your token is made up of numbers and punctuation on the one hand and normal letters on the other, and splits there.

Trying to do this in preprocessing will probably be your best bet.
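
As a rough illustration of the preprocessing route, here is a minimal sketch that inserts a space between a number and a unit glued to it, using only Python's `re` module. The `UNITS` tuple and the `split_number_unit` name are hypothetical and not from this thread; the list of unit spellings would need to be adapted to the data.

```python
import re

# Hypothetical list of unit spellings that may appear glued to a number.
UNITS = ("nch", "inch")

# Match a unit that directly follows a digit and ends at a word boundary.
# Longer spellings are tried first so "inch" wins over "nch".
_GLUED_UNIT = re.compile(
    r"(?<=\d)("
    + "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
    + r")\b"
)


def split_number_unit(text: str) -> str:
    """Insert a space between a number and a unit glued to it."""
    return _GLUED_UNIT.sub(r" \1", text)


print(split_number_unit("A 15.2nch screen"))  # A 15.2 nch screen
print(split_number_unit("Eat lunch at 12"))   # unchanged: no digit before "nch"
```

Once the text reads "15.2 nch", the number and the unit should come out as separate tokens, so a two-token pattern (one token dict with the `SHAPE` condition, followed by one with `"ORTH": "nch"`) has something to match. Restricting the split to a known list of units, rather than any digit followed by letters, is a deliberate choice here to avoid breaking strings such as "3D" or "1st".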

1 reply

joaomsimoes Jul 13, 2023
Author

Thanks for the reply.

I already tried changing the tokenizer rules and it works, but my solution is a bit ugly: I iterate over every possible pattern. I'm afraid it will cause some unexpected results.

Thanks for the tip on preprocessing. I will give it a try.
