-
Couldn't load subscription status.
- Fork 2.2k
ci: ios #5705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci: ios #5705
Conversation
4fe7197 to
bf663be
Compare
6c19acd to
bcbab64
Compare
bcbab64 to
23e1a72
Compare
|
@b-pass how hard would it be to work around |
5b568e8 to
9875955
Compare
The easiest thing would be to just exclude the platform from Does CPython's If so, then I can probably make something that uses that.... |
|
This might just be a problem with it defaulting to iOS 1, hopefully a new build with the upstream fix will be out soon. What I did for now was allow this to be manually overridden. 13 is was CPython itself is compiled against. If 13+ turns out to be fine, it's okay. If it does require 17+, we can document it at least, and develop a workaround if it wasn't too hard. |
|
iOS test pass! (Minus the sub interpreter ones which I disabled when building) |
|
I think it must handle |
eec1d51 to
d5ef814
Compare
Signed-off-by: Henry Schreiner <[email protected]>
d5ef814 to
cb6ef95
Compare
|
It would probably be easier to set up a |
Signed-off-by: Henry Schreiner <[email protected]>
As explained in a new code comment, loader_life_support needs to be thread_local but does not need to be isolated to a particular interpreter because any given function call is already going to only happen on a single interpreter by definiton. Performance before on M4 Max using pybind/pybind11_benchmark unmodified repo: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 63.8 nsec per loop ``` After: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 53.1 nsec per loop ``` Open questions: - How do we determine whether we can safely use `thread_local`? I see concerns about old iOS versions on pybind#5705 (comment) and pybind#5709; is there anything else? - Do we have a test that covers "function called in one interpreter calls a C++ function that causes a function call in another interpreter??
As explained in a new code comment, `loader_life_support` needs to be `thread_local` but does not need to be isolated to a particular interpreter because any given function call is already going to only happen on a single interpreter by definiton. Performance before: - on M4 Max using pybind/pybind11_benchmark unmodified repo: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 63.8 nsec per loop ``` - Linux server: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 120 nsec per loop ``` After: - M4 Max: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 53.1 nsec per loop ``` - Linux server: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 101 nsec per loop ``` A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone. Open questions: - How do we determine whether we can safely use `thread_local`? I see concerns about old iOS versions on pybind#5705 (comment) and pybind#5709; is there anything else? - Do we have a test that covers "function called in one interpreter calls a C++ function that causes a function call in another interpreter"? I think it's fine, but can it happen? - Are we happy with what we think will happen in the case where multiple extensions compiled with and without this PR interoperate? I think it's fine -- each dispatch pushes and cleans up its own state -- but a second opinion is certainly welcome.
* Use thread_local for loader_life_support to improve performance As explained in a new code comment, `loader_life_support` needs to be `thread_local` but does not need to be isolated to a particular interpreter because any given function call is already going to only happen on a single interpreter by definiton. Performance before: - on M4 Max using pybind/pybind11_benchmark unmodified repo: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 63.8 nsec per loop ``` - Linux server: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 120 nsec per loop ``` After: - M4 Max: ``` python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' 5000000 loops, best of 5: 53.1 nsec per loop ``` - Linux server: ``` > python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)' (pytorch) 2000000 loops, best of 5: 101 nsec per loop ``` A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone. Open questions: - How do we determine whether we can safely use `thread_local`? I see concerns about old iOS versions on #5705 (comment) and #5709; is there anything else? - Do we have a test that covers "function called in one interpreter calls a C++ function that causes a function call in another interpreter"? I think it's fine, but can it happen? - Are we happy with what we think will happen in the case where multiple extensions compiled with and without this PR interoperate? I think it's fine -- each dispatch pushes and cleans up its own state -- but a second opinion is certainly welcome. * Remove PYBIND11_CAN_USE_THREAD_LOCAL * clarify comment * Simplify loader_life_support TLS storage Replace the `fake_thread_specific_storage` struct with a direct thread-local pointer managed via a function-local static: static loader_life_support *& tls_current_frame() This retains the "stack of frames" behavior via the `parent` link. It also reduces indirection and clarifies intent. Note: this form is C++11-compatible; once pybind11 requires C++17, the helper can be simplified to: inline static thread_local loader_life_support *tls_current_frame = nullptr; * loader_life_support: avoid duplicate tls_current_frame() calls Replace repeated calls with a single local reference: auto &frame = tls_current_frame(); This ensures the thread_local initialization guard is checked only once per constructor/destructor call site, avoids potential clang-tidy complaints, and makes the code more readable. Functional behavior is unchanged. * Add REMINDER for next version bump in internals.h --------- Co-authored-by: Ralf W. Grosse-Kunstleve <[email protected]>
Description
Builds on #5708.
Suggested changelog entry: