Hi!

You're right that you can't match the token "15.2nch" with a generic pattern like that (first the dd.d part, then nch) as long as the tokenizer treats it as a single token. For this to work, you'll either need to preprocess your texts so there's a space there ("15.2 nch"), or adjust the tokenizer rules. I worry that the latter will be tricky, though, because the current English tokenizer is set up to keep numbers (like 15.2) together, which is a very sensible thing to do. What you need in your case is some sort of rule that will identify that your token is made up of numbers+punctuation on the one hand, and normal …
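The preprocessing route mentioned above can be sketched with the standard library alone. This is only a minimal illustration, not part of spaCy: the helper name `split_number_unit` and the regex are assumptions, and the rule (insert a space wherever a digit is directly followed by an ASCII letter) is deliberately crude.

```python
import re

# Assumption: a digit immediately followed by an ASCII letter marks the
# boundary between the numeric part and the trailing letters.
_DIGIT_THEN_LETTER = re.compile(r"(\d)([A-Za-z])")


def split_number_unit(text: str) -> str:
    """Insert a space between a digit and a directly following letter.

    Hypothetical helper for illustration; run it on the raw text
    before handing the text to the tokenizer.
    """
    return _DIGIT_THEN_LETTER.sub(r"\1 \2", text)


if __name__ == "__main__":
    print(split_number_unit("15.2nch"))  # 15.2 nch
```

Once the text reads "15.2 nch", the tokenizer should see two tokens, so a two-part Matcher pattern along the lines of `[{"LIKE_NUM": True}, {"LOWER": "nch"}]` has something to match against. Note that this rule also splits strings such as "3D" or "1e5", so it may need narrowing for real data.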

Answer selected by svlandeg
Labels: usage (General spaCy usage); feat / matcher (Feature: Token, phrase and dependency matcher); feat / tokenizer (Feature: Tokenizer)