Possible to configure Tokenizer to revert to pre-3.2 behaviour? #9787
-
Hi there,

We're finding that the Tokenizer changes in version 3.2, where prefixes are removed before suffix matches are applied, lead to differences in the output compared to 3.1.4 that we'd prefer not to have. It would be great to have a configuration option to choose between the 3.1.4 Tokenizer behaviour and the 3.2 behaviour, if possible. For now we're having to stick with 3.1.4, because the 3.2 changes are too significant for us, causing differences in POS tagging, and in some cases the process hangs. I can post specifics if needed. We are parsing text extracted from PDF documents, mostly research papers.

Many thanks,
Phil
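(A side note for anyone trying to pin down where the two versions diverge: spaCy's `Tokenizer.explain` reports which prefix/suffix/infix/exception rule produced each substring, without running the rest of the pipeline, so diffing its output under 3.1.4 and 3.2 in two environments can isolate the affected rules. A minimal sketch follows; the model name and sample text are placeholders.)

```python
# Sketch: inspect which tokenizer rules fire for a given string.
# Run once under spaCy 3.1.4 and once under 3.2.x, then diff the output.
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model
text = "(e.g. pre-trained models, see Fig. 2a)."  # stand-in for PDF-extracted text

# Each entry is (pattern type, substring), e.g. ("PREFIX", "("), ("TOKEN", "e.g.").
for pattern, substring in nlp.tokenizer.explain(text):
    print(f"{pattern:12} {substring!r}")
```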
-
If you think the tokenizer is hanging due to this change, we'd be interested in a bug report. (Maybe there's a bad regex combination that leads to this? In the past we had problems with the URL regex appearing to hang on extremely long tokens because there was a slow lookbehind; let's see, this was first mentioned in #4362. My first attempt at fixing this was reverted for being too breaking, and then we introduced `url_match` instead in #5121.)

I don't think that we will want to provide an option for this in the default tokenizer (although I can discuss it with the team), but you can use a custom tokenizer with the exact v3.1 behavior if you'd like. The main hassle in this is that the tokenizer requires cython, so you can't do it with plain Python code; there's a demo project here: https://github.com/adrianeboyd/custom-cython-tokenizer/ You'd want to override the affix-handling logic there to restore the v3.1 ordering.

Did you retrain your models with v3.2, or are the POS tagging problems due to running a v3.1-trained model in v3.2?
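For reference, a minimal sketch of how a custom tokenizer can be wired into the pipeline through the registry, assuming you've already built a compiled Cython subclass of `spacy.tokenizer.Tokenizer` along the lines of the demo project above (the class name `V31Tokenizer`, its module path, and the registry name are all hypothetical):

```python
# Sketch only: V31Tokenizer stands in for a compiled Cython subclass of
# spacy.tokenizer.Tokenizer that restores the v3.1 prefix/suffix ordering.
import spacy
from spacy.util import (
    compile_infix_regex,
    compile_prefix_regex,
    compile_suffix_regex,
)

from my_package.tokenizer import V31Tokenizer  # hypothetical compiled module


@spacy.registry.tokenizers("v31_tokenizer")
def create_v31_tokenizer():
    def make_tokenizer(nlp):
        # Reuse the language's default rules and affix patterns, only the
        # splitting behaviour of the custom class differs.
        return V31Tokenizer(
            nlp.vocab,
            rules=nlp.Defaults.tokenizer_exceptions,
            prefix_search=compile_prefix_regex(nlp.Defaults.prefixes).search,
            suffix_search=compile_suffix_regex(nlp.Defaults.suffixes).search,
            infix_finditer=compile_infix_regex(nlp.Defaults.infixes).finditer,
            url_match=nlp.Defaults.url_match,
        )

    return make_tokenizer
```

You would then point the training/config file at it by setting `@tokenizers = "v31_tokenizer"` under `[nlp.tokenizer]`.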