Skip to content

Conversation

@EthanSteinberg
Copy link
Collaborator

Description

We added some additional GIL status assertion in 2.10.2 (#4246).

This change seems fine, but is probably too big for a bugfix release and causing a lot of problems for our users.

This disables those assertions by default. Users can always manually opt-in with PYBIND11_ASSERT_GIL_HELD_INCREF_DECREF.

Helps with #4420

Suggested changelog entry:

Temporarily disabled our GIL status assertions to be re-enabled later.

@EthanSteinberg
Copy link
Collaborator Author

@henryiii If we want to quickly revert the GIL error check stuff, this is how I would do it.

@rwgk
Copy link
Collaborator

rwgk commented Dec 28, 2022

We went from not knowing I have a bug,

to knowing I have a bug,

to masking that I know I have a bug.

Because: there are N people that complained

because they have

  1. a debug build

  2. an unhealthy (IMO) toolchain that segfaults (MingW, what else?) instead of showing a useful stack trace with the C++ exception message, like some other toolchains do. — Side note: we already have a mitigation, PR Improve the error reporting for inc_ref GIL failures #4427, for the toolchains that don't, complete with a tutorial-kind-of addition to the documentation.

In the meantime, and going forward, M people discover/ed "oh I have a bug", fix/ed it, and save/d D hours debugging in other ways, especially D_f hours debugging flaky multi-threaded applications. I know from first-hand experience that D_f can easily go to being measured in weeks for things like long-running RPC services.

To be socially responsible, we need to do the math:

  • Inconveniencing N people, basically prodding them to fix their bugs sooner rather than later, or to pin the pybind11 version until they get to fixing their bugs.

  • Vs helping M people saving D hours of finding out in much more costly ways that they have a bug.

Now assume this PR is merged. The clock starts ticking:

  • A dubious (IMO) benefit for N people ("yes it's UB, but it works in practice") vs accumulating lost time for M people: D steadily increasing.

Note the history of #4246, which had been sitting there since Oct 17, getting little to no attention from other pybind11 maintainers, despite several attempts to get attention, and being in Google production use since Oct 17, which means a large number of 3rd party packages effectively "saw" #4246 already. — That was already accumulating lost time (D increasing), but that's water under the bridge.

When will the 2.11 release be made?

As a pybind11 maintainer I feel a strong responsibility for letting D increase — again.

We cannot possibly do the math exactly, but I'd say:

  • If the 2.11 release is only a week out: Fine.
  • If the 2.11 release is a couple months out: No, that's irresponsible. — And why? Is it more than a short moment of extra thinking to get the version pinning right one way or another?

@rwgk rwgk mentioned this pull request Dec 28, 2022
3 tasks
@henryiii
Copy link
Collaborator

henryiii commented Dec 29, 2022

A couple of major packages (PyTorch, SciPy) see this change as breaking. I don't approve of the pinning SciPy is doing and am not fond of disabling the check - but I think this is the most practical path that will cause the least effort and hours waisted. Modern existing versions of SciPy no longer build with their <2.11 upper cap (I'm not fond of the practice but I'm not going to actively punish it). We can revert this for a quick 2.10.3 release (but we don't need to target master, we can target v2.10), improve the error reporting for 2.11, and move on. This bug has been there for a long time, and the UB behavior wasn't that bad (SciPy and PyTorch work for the most part ;) ). It'll be in 2.11, and by then hopefully most of the fixes needed will be in place.

I don't think we are releasing 2.11 next week - as far as I know, there's pretty much nothing in it yet. There's no need to force improvements into 2.10.x just because 2.11 isn't out yet. Eventually 2.11 will be out and this will be in place.

We don't need to fix every bug we know about in every release - and this isn't even a bug fix, just turning (working for someone1) UB into an error. We just need to fix everything we know about in master, and make practical patch releases when we can that try not to break anyone.

Footnotes

  1. I'm guessing the py::none obtained before Python was initialized worked. If it doesn't, then I'd say this was a real bug and we are fine elevating it to a compile time error in a patch release. But I'm pretty sure that would have been in SciPy's / PyTorch's test suite.

@henryiii henryiii changed the base branch from master to v2.10 December 29, 2022 03:12
@henryiii
Copy link
Collaborator

Ahh, and what's even nicer, with this PR, you can still opt in to the new check manually in 2.10.3. :)

@EthanSteinberg
Copy link
Collaborator Author

@henryiii I think in the long term, it might be a good idea to do at least a scipy and pytorch build before every release as an extra unit test.

@rwgk
Copy link
Collaborator

rwgk commented Dec 30, 2022

@henryiii I think in the long term, it might be a good idea to do at least a scipy and pytorch build before every release as an extra unit test.

Google-global testing does that, but only with the Google toolchain.

The question is then: who has the free bandwidth to - ultimately - do scipy and pytorch a favor, testing with all toolchains. It's pretty thankless effort I'm afraid. In particular in this case: we'd just be telling them a little earlier that they have long-standing bugs. I've done some of that in preparation for the internal deployment of PR #4246 (I was forced to fix all failures first). Sometimes people are thankful. Sometimes it's more hey why are you bugging me, it works in practice. So I tend to not volunteer myself for such projects.

Copy link
Collaborator

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

henryiii changed the base branch from master to v2.10 yesterday

Thanks for doing that!

I'm not really happy TBH, but it's a reasonable compromise!

@henryiii henryiii changed the title Revert the addition of the GIL check feature [v2.10] Revert the addition of the GIL check feature Dec 30, 2022
@henryiii henryiii merged commit 2de6e39 into pybind:v2.10 Dec 30, 2022
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Dec 30, 2022
@EthanSteinberg EthanSteinberg deleted the revert_gil_check branch December 30, 2022 18:47
@Skylion007
Copy link
Collaborator

@henryiii I think in the long term, it might be a good idea to do at least a scipy and pytorch build before every release as an extra unit test.

Google-global testing does that, but only with the Google toolchain.

The question is then: who has the free bandwidth to - ultimately - do scipy and pytorch a favor, testing with all toolchains. It's pretty thankless effort I'm afraid. In particular in this case: we'd just be telling them a little earlier that they have long-standing bugs. I've done some of that in preparation for the internal deployment of PR #4246 (I was forced to fix all failures first). Sometimes people are thankful. Sometimes it's more hey why are you bugging me, it works in practice. So I tend to not volunteer myself for such projects.

I am PyTorch contributor, I brought up the breaking change as an issue, but I don't have time to fix all the related bugs. Happy to test RCs on PyTorch to look for obvious errors / edge cases though.

@henryiii henryiii removed the needs changelog Possibly needs a changelog entry label Jan 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants