Parser output affected by Python version? #10415

kilianfoth7920 · 2022-03-02T14:33:24Z


I've noticed that the behaviour of the dependency parser is affected by the version of Python that you run it under, even when you load the same language model.

For instance, in "The last train to Dallas leaves at 6 o'clock.", the word "leaves" is tagged as a verb (VBZ) when run under Python 3.7.7, but as a noun (NNS) when run under Python 3.8.0. I'm loading the "en_core_web_lg" pipeline every time.

Is this expected behaviour? Is there something I can do to keep the behaviour that 3.7.7 exhibited (because it seems to me that it's usually the correct alternative)?

Answered by adrianeboyd


polm · 2022-03-03T06:15:28Z


We can't guarantee exactly the same results between Python versions because a lot of internal implementation details in Python can change, unfortunately. Because the differences are not due to anything in our code, there's also not really a way to specify behavior matching one particular Python version.

While you might see a lot of different individual predictions, it would be weird if there were significant changes in accuracy between Python versions. Have you been able to measure a difference in the quality of predictions between Python versions?

4 replies

kilianfoth7920 Mar 3, 2022
Author

I've only made a qualitative analysis of problems, but the difference is always that the newer Python gives worse output. Here are some simple examples of parse trees under Python 3.8.0:

For some reason, the parser now always seems to choose the noun reading of verbs whenever there is even the slightest ambiguity in word class, even when this should be very unlikely. That difference is certainly systematic.

polm Mar 3, 2022

Thanks for the extra details!

We'll take a look at this, but if you do have some kind of quantitative analysis that confirms what you've noticed, that would be helpful too.
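
One way to put a number on the "noun instead of verb" impression is to tally which tag pairs differ between the two runs. The sketch below is only an illustration under stated assumptions: `tag_confusions` is a hypothetical helper, and the tag lists are hand-written for the sentence from the original post (only the VBZ/NNS difference on "leaves" comes from this thread), not actual spaCy output.

```python
from collections import Counter


def tag_confusions(tags_a, tags_b):
    """Count (tag_a, tag_b) pairs at positions where two taggings disagree.

    Both inputs must be tag sequences over the same tokenization.
    """
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must have the same length")
    return Counter((a, b) for a, b in zip(tags_a, tags_b) if a != b)


# Hand-written illustration, not real parser output.
tokens = ["The", "last", "train", "to", "Dallas", "leaves", "at", "6", "o'clock", "."]
run_a = ["DT", "JJ", "NN", "IN", "NNP", "VBZ", "IN", "CD", "NN", "."]
run_b = ["DT", "JJ", "NN", "IN", "NNP", "NNS", "IN", "CD", "NN", "."]

confusions = tag_confusions(run_a, run_b)
for (tag_a, tag_b), count in confusions.most_common():
    print(f"{tag_a} -> {tag_b}: {count}")
```

Summed over a whole test corpus, a systematic bias would show up as one pair such as VBZ -> NNS dominating the counts, while random drift would spread across many pairs in both directions.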

kilianfoth7920 Mar 4, 2022
Author

Okay, I've compared our entire test corpus and the picture becomes less clear... of the 29 substantial discrepancies out of 290 parses, 14 are definitely more correct in Python 3.7 and 15 in Python 3.8, so it looks like essentially random changes (I attach the collected output).

It's a pity that the semantics of the Python engine affect the output so much; you'd really like a language model to be deterministic even if it's statistical in nature.

parses37.txt
parses38.txt
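
As a rough check on the "essentially random" reading, the 14 vs. 15 split can be run through a two-sided sign test. This is a sketch that assumes the 29 discrepancies are independent and that, if there is no real difference, each one is equally likely to favor either version; `sign_test_p` is a helper name introduced here for illustration.

```python
from math import comb


def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test p-value under a fair-coin null hypothesis."""
    n = wins_a + wins_b
    probs = [comb(n, k) / 2 ** n for k in range(n + 1)]
    observed = probs[wins_a]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-12)))


discrepancy_rate = 29 / 290      # share of parses that differ substantially
p_value = sign_test_p(14, 15)    # counts reported in the comment above

print(f"discrepancy rate: {discrepancy_rate:.0%}")
print(f"sign test p-value: {p_value:.3f}")
```

A 14/15 split is the most balanced outcome possible for 29 items, so the p-value comes out at 1: these counts give no evidence that either set of parses is better overall.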

polm Mar 6, 2022

Thanks for the extra data!

We do try to make the models reproducible, and they should give the same results if the environment is exactly the same. But with different programming language versions in the mix, it gets hard to control everything.

If you (or anyone reading this later) ever find out the exact cause of changes like this, we're happy to make changes to mitigate them.

adrianeboyd · 2022-03-07T07:22:15Z


I think it is unexpected to get different results on inference where the only difference is the Python version. For training, there will be cross-platform and CPU vs. GPU differences in the exact results due to float rounding, etc., but I wouldn't have expected this on inference.

I also couldn't immediately replicate this with Python 3.7 vs. 3.8, spaCy v3.2.0 and en_core_web_lg v3.2.0 on Linux. So that we can look into the details, could you provide the output of `spacy info --markdown` for both environments and also the exact versions of numpy, blis, and thinc?

My first guess is that you're comparing en_core_web_lg v3.1.0 to en_core_web_lg v3.2.0. The provided pipelines are retrained from scratch for each spaCy minor release, often with small adjustments to parameters, training data, etc. The overall accuracy should be similar (or better) in each new release, but minor details may change and the output is not expected to be identical across pipeline package versions.
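
To compare two environments quickly, the installed versions of the relevant packages can also be listed with the standard library. This is a minimal sketch, not a replacement for `spacy info --markdown`: `installed_versions` is a helper name made up here, it assumes the pipeline was installed as a pip package named en_core_web_lg, and on Python 3.7 it assumes the `importlib_metadata` backport is available.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7: assumes the importlib_metadata backport is installed
    import importlib_metadata as metadata

PACKAGES = ("spacy", "en_core_web_lg", "thinc", "numpy", "blis")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both environments and diffing the output should make a pipeline version mismatch visible at a glance. Inside Python, `nlp.meta["version"]` reports the version of the pipeline that was actually loaded.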

2 replies

kilianfoth7920 Mar 7, 2022
Author

Right on the money, Adriane. My bad - I was convinced I had installed all three versions with the same install script, but the original 3.7 Python must have been installed so long ago that it had still gotten version 3.1.0 of en_core_web_lg by default. After upgrading that, the difference disappears.
Still disappointing that a simple N V N construction such as "Cats chase mice" fails, but it is consistent.

adrianeboyd Mar 8, 2022

The parser itself wouldn't have any trouble with "Cats chase mice" given similar training data, but this kind of sentence is rare in OntoNotes. (I suspect "Chase" as a bank and topics like "police chase" may be more common. There were more occurrences of "cats" and "mice" than I expected, though.)
