Possible to configure Tokenizer to revert to pre-3.2 behaviour? #9787
-
Hi there,

We're finding that the Tokenizer changes in version 3.2, where prefixes are removed before suffix matches are applied, lead to differences in the output compared to 3.1.4 that we'd prefer not to have. It would be great to have a configuration option to choose between the 3.1.4 Tokenizer behaviour and the 3.2 behaviour, if possible. For now we're having to stick with 3.1.4, because the 3.2 changes are too significant for us, causing differences in POS tagging, and in some cases the process hangs. I can post specifics if needed. We are parsing text extracted from PDF documents, mostly research papers.

Many thanks,
Phil
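(A side note for anyone trying to pin down where the two versions diverge: spaCy's `Tokenizer.explain` reports which prefix/suffix/infix/exception rule produced each substring, without running the rest of the pipeline, so diffing its output under 3.1.4 and 3.2 in two environments can isolate the affected rules. A minimal sketch follows; the model name and sample text are placeholders.)

```python
# Sketch: inspect which tokenizer rules fire for a given string.
# Run once under spaCy 3.1.4 and once under 3.2.x, then diff the output.
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model
text = "(e.g. pre-trained models, see Fig. 2a)."  # stand-in for PDF-extracted text

# Each entry is (pattern type, substring), e.g. ("PREFIX", "("), ("TOKEN", "e.g.").
for pattern, substring in nlp.tokenizer.explain(text):
    print(f"{pattern:12} {substring!r}")
```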
-
If you think the tokenizer is hanging due to this change, we'd be interested in a bug report. (Maybe there's a bad regex combination that leads to this? In the past we had problems with the URL regex appearing to hang on extremely long tokens because there was a slow lookbehind; let's see, this was first mentioned in #4362. My first attempt at fixing this was reverted for being too breaking, and then we introduced `url_match` instead in #5121.)

I don't think that we will want to provide an option for this in the default tokenizer (although I can discuss it with the team), but you can use a custom tokenizer with the exact v3.1 behavior if you'd like. The main hassle in this is that the tokenizer requires cython, so you can't do it with plain Python code; there's a demo project here: https://github.com/adrianeboyd/custom-cython-tokenizer/ You'd want to override the affix-handling logic there to restore the v3.1 ordering.

Did you retrain your models with v3.2, or are the POS tagging problems due to running a v3.1-trained model in v3.2?
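For reference, a minimal sketch of how a custom tokenizer can be wired into the pipeline through the registry, assuming you've already built a compiled Cython subclass of `spacy.tokenizer.Tokenizer` along the lines of the demo project above (the class name `V31Tokenizer`, its module path, and the registry name are all hypothetical):

```python
# Sketch only: V31Tokenizer stands in for a compiled Cython subclass of
# spacy.tokenizer.Tokenizer that restores the v3.1 prefix/suffix ordering.
import spacy
from spacy.util import (
    compile_infix_regex,
    compile_prefix_regex,
    compile_suffix_regex,
)

from my_package.tokenizer import V31Tokenizer  # hypothetical compiled module


@spacy.registry.tokenizers("v31_tokenizer")
def create_v31_tokenizer():
    def make_tokenizer(nlp):
        # Reuse the language's default rules and affix patterns, only the
        # splitting behaviour of the custom class differs.
        return V31Tokenizer(
            nlp.vocab,
            rules=nlp.Defaults.tokenizer_exceptions,
            prefix_search=compile_prefix_regex(nlp.Defaults.prefixes).search,
            suffix_search=compile_suffix_regex(nlp.Defaults.suffixes).search,
            infix_finditer=compile_infix_regex(nlp.Defaults.infixes).finditer,
            url_match=nlp.Defaults.url_match,
        )

    return make_tokenizer
```

You would then point the training/config file at it by setting `@tokenizers = "v31_tokenizer"` under `[nlp.tokenizer]`.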