- 
                Notifications
    You must be signed in to change notification settings 
- Fork 2.2k
          Add PYBIND11_SIMPLE_GIL_MANAGEMENT option (cmake, C++ define)
          #4216
        
          New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @rwgk You should take a look given your interest in the GIL recently. | 
| 
 I don't feel like an expert here at all, but @jbms is! IIUC a-while-ago-comments from Jeremy, for recent Python versions, the complicated version of  I think adding a define is a great idea, but the naming seems unfortunate IIUC: the "simple" (as I look at it) version of  How about  Doing something to ci.yml to use the define seems important, pulling in something like the code in #1276 as a new test. | 
| I was not aware of #1276 but I just took a look at it. In regards to my comment about  Feature 1: pybind11/include/pybind11/gil.h Line 29 in 9c04c7b 
 As of Python 3.8, that is no longer true (https://docs.python.org/3/c-api/init.html#c.PyEval_AcquireThread). The calling thread still blindly "terminates", where "terminate" on non-Windows platforms might actually mean "crash the program" (python/cpython#28525). Therefore we can ignore this difference. Feature 2: pybind11/include/pybind11/gil.h Line 33 in 9c04c7b 
 This comment briefly explains the  Feature 3: pybind11/include/pybind11/gil.h Line 37 in 9c04c7b 
 I can see that repeatedly constructing an destroying the thread state may be expensive (and may be an issue in tensorstore and other code at Google).  It would be interesting to get some performance numbers on this.  However, it seems like it may be possible to control the reference count in a similar way just by using CPython APIs, since you can mix calls to  In any case, per the discussion in issue  #1276, it sounds like the  It seems like we should definitely have this incompatibility fixed in pybind11., at least with the default build options. Given that feature 2 is an extremely niche use case (perhaps it is used exclusively by nanoGUI), we certainly do not want to sacrifice safety of pybind11 by default for it. Given that whatever GIL mechanism is used also needs to be compatible with other code that just uses the regular CPython GIL mechanism, it is not clear to me why this feature, if it is possible to implement safely, even needs to be in pybind11 at all. My recommendation is that instead of adding this option, we just eliminate pybind11's custom gil logic and use  | 
| @wjakob for comment Wenzel is the original author of the code that now lives in gil.h: Wenzel: All these years later, do we still need these features? What would be lost if we removed it and only used the simpler version? | 
| Thank you for taking this so seriously! I just wanted to mention that finding this solution was a team effort with two of my colleagues from [SeeChange], Tigran and @tjrin. I'll hold fire with editing the pull request just yet, until it gets clearer if it will be needed or not. | 
| Prompted by the discussion here, I went ahead and stripped down gil.h to the simple version only (PR #4218), then ran Google-global testing, which passed. My feeling: The extra complexity is known to cause trouble (pytorch/pytorch#83101), but it is unclear what it gives us on the positive side (@wjakob?). I'd lean towards either: 
 | 
…version), to avoid breaking ABI compatibility. (pybind#4216 (comment))
| How about we name this  | 
| Naive question that crossed my mind: Is there a (reasonable) way to detect the situations that I think of as a traps ("fatal interpreter error", "results in GIL deadlock")? At least when  | 
| 
 That's a new feature. We can work on it for 2.11 if desired. 2.10.1 is releasing some iterative improvements, not fixes for perceived traps that have existed for years. This is not the only trap in pybind11, or Python, or C++ for that matter. Please approve this improvement before 1:00 PM Eastern tomorrow, or this will be postponed to the next release. | 
| @rwgk Still no approval on this? | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll flip the default on the smart_holder branch when merging this there, but that's easy.
| @henryiii @Skylion007 could you please review this PR again? I made quite a few changes. Some miscellaneous notes: 
 While I was at it, I added this to common.h, so that we don't forget to remove the   | 
|  | ||
| # endif // PYBIND11_SIMPLE_GIL_MANAGEMENT | ||
|  | ||
| #else // WITH_THREAD | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: Do we follow this style anywhere else? I find it quite confusing, because it reads like this is the "WITH_THREAD" block, while it's actually the not WITH_THREAD block. For things like this, they should be enforced (such as the comment at the end of a namespace is forced). I don't think it can, because this style can't be expanded to an elif chain.
Also this block goes away when we drop 3.6, right? We should be able to deduce that when we remove the WITH_THREAD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: Do we follow this style anywhere else?
In the wild I see negations, or not. I don't think it's consistent in the world, and that's clearly confusing.
What do we want to follow here?
I decided even with that ambiguity, having the markers is more helpful than not, because the first scope (non-simple) is quite large.
Ultimately, something like this would be best:
// No ifdefs
gil_scoped_aquire_non_simple ...
...
gil_scoped_acquire_simple ...
#if defined PYBIND11_SIMPLE_GIL_MANAGEMENT
using gil_scoped_acquire = gil_scoped_acquire_simple;
#else
using gil_scoped_acquire = gil_scoped_acquire_non_simple;
#endif
Although if we push it there I'd want to find better names than simple and non-simple.
Please let me know what you prefer:
- What's here.
- Different style.
- The (or a different) fully formal solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with it as is (that's why it's a nit, can be ignored). This code will simplify later when we remove 3.6. I'm just not a fan of commented out code, it can be confusing (on the else, it's ambiguous), and it can easy become wrong if things get moved around (no test for it matching up). I'd rather use indentation and then visually check, just like I already have to do if I don't trust comments like this (which I don't), and everyone has to do on most other if/else blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately, something like this would be best
This would be wasteful. You'd be processing and compiling both sides, while you'd only use one. Unless we added gil_scoped_acquire_simple/gil_scoped_acquire_notsimple to our API, which I don't think we want to do, then it's much harder to remove the complex one later.
        
          
                docs/upgrade.rst
              
                Outdated
          
        
      | ``py::gil_scoped_acquire`` & ``py::gil_scoped_release`` in pybind11 versions | ||
| < v2.10.1 do not support nested access. In v2.10.1, a configuration option | ||
| ``PYBIND11_SIMPLE_GIL_MANAGEMENT`` was added, defaulting to ``OFF``. The | ||
| simpler implementations support nested access, but do not support dissociation | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| simpler implementations support nested access, but do not support dissociation | |
| simpler implementations support nested access, but do not support dissociation | 
I'm a little worried about promising "nested access is supported" when we have exactly 0 tests for it. Possibly "should support nested access" for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a huge difference: the simple version is just a pair of calls to PyEval_SaveThread() and PyEval_RestoreThread(). Those are exhaustively exercised where they are implemented, in core Python.
What you are asking for is similar to "Py_INCREF() should be supported."
The non-simple version is a very complicated implementation that does not have corresponding unit tests in the project that introduces it (i.e. in pybind11).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That still doesn't mean we know it works in pybind11 - we didn't add tests. I've been at this too long to believe we can promise untested features. There could very well be some weird behavior on some weird compiler that is pybind11 specific. I'd prefer the word "should" be included if we don't test - we can't promise without tests. If we didn't test Py_INCREF(), I'd put "should" on that too (though that would be on the pybind11 wrapping, not the upstream feature - that would be more like adding a py::incref() and not testing it and promising it works).
I'd say it probably works, but we did not earn the right to say it works with certainty - that requires a pybind11 test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I also agree it's more likely to work than dissociation without tests - but to be fair, there are at least two major projects with thousands of users each that use this feature in pybind11, while there are no nested access pybind11 users at all yet - the feature wasn't supported / didn't work before.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS: this is also a nit that you have permission to ignore - I think it should be "should", but it's not going to make or break anything. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without using this change, the above would not work at all, or just be flaky (probably when threading)?
I'd much rather have a test than runs in pybind11's CI for something we promise, but okay, downstream works too (and is no worse what what we used for the dissociate feature in the past, apparently).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my comment above about nesting was supposed to be more of a question rather than a statement, but I left off a question mark.
That is the simplest form of nesting, and that should work both with the "non-simple" and "simple" forms of gil_scoped_{acquire,release}.
The real test is nesting where we combine both the pybind11 apis and the CPython PyGILState_Ensure, as in
#1276 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to @wjakob in that thread, the custom implementation is supposed to be compatible (at least it was as of 2.7? Maybe not now). The issue that prompted this is pytorch/pytorch#83101 - that's why we started on this "simple version" in the first place. There it looks exactly like what you have above, except it is running a bunch of Python to import PyTorch inside the "do stuff here" part. Maybe it's due to raw CPython usage inside what it's importing. Or maybe just the simple release then acquire is broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@henryiii I'm thinking we could unblock the release by completely omitting this section here. I think this PR is otherwise good to go as-is: it's nice to have the extra option, but because we don't have a reproducer (in the CI) for the reported problem, it would be premature to say anything about the future. For this release, we could just briefly mention this PR in the changelog, like any other PR.
In summary: Release first, continue work on a reproducer, make new plans based on the outcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a good idea (and why I didn't have a timeline in the original version). This is still a nice option, having it in a released version will make it much easier for downstream projects to test it out, and even if our complex version is perfect or fixable, I rather like that the simple version is also already there for PyPy. We can announce upgrade plans in 2.11.0 if we go with it.
        
          
                docs/upgrade.rst
              
                Outdated
          
        
      | (``py::gil_scoped_release(true)``). In pybind11 2.11, we plan to change the | ||
| default to ``ON``, to avoid pitfalls of the implementations with dissociation | ||
| (see #4216 for more information). Note that the dissociation feature is very | ||
| rarely used and not exercised in any pybind11 unit tests. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And so is nested access. ;)
…ess likely to mask unrelated ThreadSanitizer issues in the future).
…ANAGEMENT` For the tests in the github CI this does not matter, because `PYBIND11_SIMPLE_GIL_MANAGEMENT` is always defined from the command line, but when monkey-patching common.h locally, it matters.
…ting `DEADLOCK`, additionally exercised via added `intentional_deadlock()`
| I added another commit: ea8b132 
 This is after learning the hard way that reducing to debug deadlocks is a silly idea. Long story omitted. I'm hoping the CI is green again, except for expected  If that's true we need to decide: 
 I'll torture github CI a few times to see how much noise the  | 
| The last CI run was green except one  I'll move on with the plan explained previously: repeatedly run the CI to estimate the  | 
…` compatibility. ``` > ForkingPickler(file, protocol).dump(obj) E TypeError: cannot pickle 'PyCapsule' object ``` Observed with all Windows builds including mingw but not PyPy, and macos-latest with Python 3.9, 3.10, 3.11 but not 3.6.
| The last force push only changed the description of the last commit for accuracy, everything else is still the same. | 
…ith expected `DEADLOCK` failures while we figure out what to do about them.
…that we can easily know when harvesting deadlock information from the CI logs).
| @henryiii wrote: 
 I'm glad you asked: with the latest version the "flakes" — which really were deadlocks all the while — will not pollute the CI results anymore at all. That's clearly an improvement all by itself. But it also sets us up for harvesting deadlock information from the CI logs. After this PR is merged, we can harvest from all CI logs, to get good statistics. I'm explaining more in the revised PR description. But I just see this PR is triggering a new flake, observed twice thus far, both times with  Maybe just Python 3.6? Maybe that has a bug that the complex version sidesteps? Thinking at the moment: let's get this merged, monitor those flakes for a few days, then decide what to do. Python 3.6 macos-latest x64: Python 3.6 windows-2022 x64:  | 
| I agree then, merge and watch sounds fine. We are not making pybind11 itself any worse and it's just a new experimental option. | 
| 
 Thanks! Merging now, to then integrate into smart_holder. Having this Google internally will make it easier to get help debugging the deadlocks. | 
| 谢谢,您的邮件我已收到! | 
* Add option to force the use of the PYPY GIL scoped acquire/release logic to support nested gil access, see #1276 and pytorch/pytorch#83101 * Apply suggestions from code review * Update CMakeLists.txt * docs: update upgrade guide * Update docs/upgrade.rst * All bells & whistles. * Add Reminder to common.h, so that we will not forget to purge `!WITH_THREAD` branches when dropping Python 3.6 * New sentence instead of semicolon. * Temporarily pull in snapshot of PR #4246 * Add `test_release_acquire` * Add more unit tests for nested gil locking * Add test_report_builtins_internals_keys * Very minor enhancement: sort list only after filtering. * Revert change in docs/upgrade.rst * Add test_multi_acquire_release_cross_module, while also forcing unique PYBIND11_INTERNALS_VERSION for cross_module_gil_utils.cpp * Hopefully fix apparently new ICC error. ``` 2022-10-28T07:57:54.5187728Z -- The CXX compiler identification is Intel 2021.7.0.20220726 ... 2022-10-28T07:58:53.6758994Z icpc: remark #10441: The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product release in the second half of 2023. The Intel(R) oneAPI DPC++/C++ Compiler (ICX) is the recommended compiler moving forward. Please transition to use this compiler. Use '-diag-disable=10441' to disable this message. 2022-10-28T07:58:54.5801597Z In file included from /home/runner/work/pybind11/pybind11/include/pybind11/detail/../detail/type_caster_base.h(15), 2022-10-28T07:58:54.5803794Z from /home/runner/work/pybind11/pybind11/include/pybind11/detail/../cast.h(15), 2022-10-28T07:58:54.5805740Z from /home/runner/work/pybind11/pybind11/include/pybind11/detail/../attr.h(14), 2022-10-28T07:58:54.5809556Z from /home/runner/work/pybind11/pybind11/include/pybind11/detail/class.h(12), 2022-10-28T07:58:54.5812154Z from /home/runner/work/pybind11/pybind11/include/pybind11/pybind11.h(13), 2022-10-28T07:58:54.5948523Z from /home/runner/work/pybind11/pybind11/tests/cross_module_gil_utils.cpp(13): 2022-10-28T07:58:54.5949009Z /home/runner/work/pybind11/pybind11/include/pybind11/detail/../detail/internals.h(177): error #2282: unrecognized GCC pragma 2022-10-28T07:58:54.5949374Z PYBIND11_TLS_KEY_INIT(tstate) 2022-10-28T07:58:54.5949579Z ^ 2022-10-28T07:58:54.5949695Z ``` * clang-tidy fixes * Workaround for PYPY WIN exitcode None * Revert "Temporarily pull in snapshot of PR #4246" This reverts commit 23ac16e. * Another workaround for PYPY WIN exitcode None * Clean up how the tests are run "run in process" Part 1: uniformity * Clean up how the tests are run "run in process" Part 2: use `@pytest.mark.parametrize` and clean up the naming. * Skip some tests `#if defined(THREAD_SANITIZER)` (tested with TSAN using the Google-internal toolchain). * Run all tests again but ignore ThreadSanitizer exitcode 66 (this is less likely to mask unrelated ThreadSanitizer issues in the future). * bug fix: missing common.h include before using `PYBIND11_SIMPLE_GIL_MANAGEMENT` For the tests in the github CI this does not matter, because `PYBIND11_SIMPLE_GIL_MANAGEMENT` is always defined from the command line, but when monkey-patching common.h locally, it matters. * if process.exitcode is None: assert t_delta > 9.9 * More sophisiticated `_run_in_process()` implementation, clearly reporting `DEADLOCK`, additionally exercised via added `intentional_deadlock()` * Wrap m.intentional_deadlock in a Python function, for `ForkingPickler` compatibility. ``` > ForkingPickler(file, protocol).dump(obj) E TypeError: cannot pickle 'PyCapsule' object ``` Observed with all Windows builds including mingw but not PyPy, and macos-latest with Python 3.9, 3.10, 3.11 but not 3.6. * Add link to potential solution for WOULD-BE-NICE-TO-HAVE feature. * Add `SKIP_IF_DEADLOCK = True` option, to not pollute the CI results with expected `DEADLOCK` failures while we figure out what to do about them. * Add COPY-PASTE-THIS: gdb ... command (to be used for debugging the detected deadlock) * style: pre-commit fixes * Do better than automatic pre-commit fixes. * Add `PYBIND11_SIMPLE_GIL_MANAGEMENT` to `pytest_report_header()` (so that we can easily know when harvesting deadlock information from the CI logs). Co-authored-by: Arnim Balzer <[email protected]> Co-authored-by: Henry Schreiner <[email protected]> Co-authored-by: Ralf W. Grosse-Kunstleve <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Description
This PR adds a
PYBIND11_SIMPLE_GIL_MANAGEMENToption (cmake, C++ define). The new option may be useful to try when debugging GIL-related issues, to determine if the more complex default implementation is or is not to blame. See comments here for background.This PR was triggered by pytorch/pytorch#83101. While we could not reproduce the original problem in a unit test, many tests were added to test_gil_scoped.py trying to do so. That work exposed that test_gil_scoped.py has been finding deadlocks for a long time (years), but those were thus far ignored as "flakes". test_gil_scoped.py was updated to make it crystal clear when a
DEADLOCKis detected. It turns out deadlocks are detected at a fairly high rate (ballpark 1 in 1000). This needs to be worked on separately. To not continue to pollute the CI results in the meantime,SKIP_IF_DEADLOCK = Truewas added in test_gil_scoped.py. The GitHub Actions CI logs can be harvested in the future to determine what platforms exactly are affected, and at what rate. Hopefully the information accumulating over time will lead to a better understanding and fixes, so thatSKIP_IF_DEADLOCKcan be removed eventually.This PR also adds ThreadSanitizer compatibility to test_gil_scoped.py (closes #2754).
WARNING: Please be careful to not create ODR violations when using the new option: everything that is linked together with mutual symbol visibility needs to be rebuilt.
Suggested changelog entry:
* A `PYBIND11_SIMPLE_GIL_MANAGEMENT` option was added (cmake, C++ define), along with many additional tests in test_gil_scoped.py. The option may be useful to try when debugging GIL-related issues, to determine if the more complex default implementation is or is not to blame. See #4216 for background. WARNING: Please be careful to not create ODR violations when using the option: everything that is linked together with mutual symbol visibility needs to be rebuilt.