Skip to content

Conversation

@henryiii
Copy link
Collaborator

@henryiii henryiii commented May 30, 2025

Description

Builds on #5708.

Suggested changelog entry:

  • Test on iOS in CI.

@henryiii henryiii force-pushed the henryiii/fix/ios branch from 4fe7197 to bf663be Compare May 30, 2025 06:15
@henryiii henryiii changed the title fix: ios ci: ios May 30, 2025
@henryiii henryiii force-pushed the henryiii/fix/ios branch 2 times, most recently from 6c19acd to bcbab64 Compare May 30, 2025 15:48
@henryiii henryiii force-pushed the henryiii/fix/ios branch from bcbab64 to 23e1a72 Compare May 30, 2025 16:09
@henryiii
Copy link
Collaborator Author

@b-pass how hard would it be to work around thread_local, possibly via an opt-in mechanism? iOS requires an increased target to get thread_local, but you can't require it currently due to python/cpython#133183 not being in a release yet. It would also be nice to have a way to target iOS 13 (I think 17 is required for thread-local). Ideally we could detect the the target iOS version and select the correct mechanism.

@henryiii henryiii force-pushed the henryiii/fix/ios branch 2 times, most recently from 5b568e8 to 9875955 Compare May 30, 2025 23:04
@b-pass
Copy link
Contributor

b-pass commented May 30, 2025

@b-pass how hard would it be to work around thread_local, possibly via an opt-in mechanism?

The easiest thing would be to just exclude the platform from PYBIND11_HAS_SUBINTERPRETER_SUPPORT.

Does CPython's PyThread_tss_set (and friends) work on this target? pybind11 is using that else where so I am guessing yes?

If so, then I can probably make something that uses that....

@henryiii
Copy link
Collaborator Author

This might just be a problem with it defaulting to iOS 1, hopefully a new build with the upstream fix will be out soon. What I did for now was allow this to be manually overridden. 13 is was CPython itself is compiled against. If 13+ turns out to be fine, it's okay. If it does require 17+, we can document it at least, and develop a workaround if it wasn't too hard.

@henryiii
Copy link
Collaborator Author

iOS test pass! (Minus the sub interpreter ones which I disabled when building)

@henryiii
Copy link
Collaborator Author

henryiii commented May 31, 2025

I think it must handle PYBIND11_TLS_*, happy to test something in the iOS job if you have an idea.

@henryiii henryiii force-pushed the henryiii/fix/ios branch 6 times, most recently from eec1d51 to d5ef814 Compare May 31, 2025 03:49
Signed-off-by: Henry Schreiner <[email protected]>
@henryiii henryiii force-pushed the henryiii/fix/ios branch from d5ef814 to cb6ef95 Compare May 31, 2025 19:45
@henryiii henryiii marked this pull request as ready for review June 1, 2025 04:00
@henryiii
Copy link
Collaborator Author

henryiii commented Jun 1, 2025

It would probably be easier to set up a PYBIND11_TLS_* version if this was in so it would be part of the CI, so taking out of draft.

Signed-off-by: Henry Schreiner <[email protected]>
@henryiii henryiii merged commit 7da1d53 into pybind:master Jun 2, 2025
83 checks passed
@henryiii henryiii deleted the henryiii/fix/ios branch June 2, 2025 04:29
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Jun 2, 2025
@rwgk rwgk removed the needs changelog Possibly needs a changelog entry label Jul 10, 2025
swolchok added a commit to swolchok/pybind11 that referenced this pull request Sep 5, 2025
As explained in a new code comment, loader_life_support needs to be
thread_local but does not need to be isolated to a particular
interpreter because any given function call is already going to only
happen on a single interpreter by definiton.

Performance before on M4 Max using pybind/pybind11_benchmark unmodified repo:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 63.8 nsec per loop
```

After:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 53.1 nsec per loop
```

Open questions:

- How do we determine whether we can safely use `thread_local`? I see
  concerns about old iOS versions on
  pybind#5705 (comment)
  and pybind#5709; is there anything
  else?
- Do we have a test that covers "function called in one interpreter
  calls a C++ function that causes a function call in another
  interpreter??
swolchok added a commit to swolchok/pybind11 that referenced this pull request Sep 5, 2025
As explained in a new code comment, `loader_life_support` needs to be
`thread_local` but does not need to be isolated to a particular
interpreter because any given function call is already going to only
happen on a single interpreter by definiton.

Performance before:
- on M4 Max using pybind/pybind11_benchmark unmodified repo:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 63.8 nsec per loop
```

- Linux server:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'                                                                                                                                        (pytorch)
2000000 loops, best of 5: 120 nsec per loop
```

After:
- M4 Max:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 53.1 nsec per loop
```

- Linux server:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'                                                                                                                                        (pytorch)
2000000 loops, best of 5: 101 nsec per loop
```

A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone.

Open questions:

- How do we determine whether we can safely use `thread_local`? I see
  concerns about old iOS versions on
  pybind#5705 (comment)
  and pybind#5709; is there anything
  else?
- Do we have a test that covers "function called in one interpreter
  calls a C++ function that causes a function call in another
  interpreter"? I think it's fine, but can it happen?
- Are we happy with what we think will happen in the case where
  multiple extensions compiled with and without this PR interoperate?
  I think it's fine -- each dispatch pushes and cleans up its own
  state -- but a second opinion is certainly welcome.
rwgk added a commit that referenced this pull request Sep 12, 2025
* Use thread_local for loader_life_support to improve performance

As explained in a new code comment, `loader_life_support` needs to be
`thread_local` but does not need to be isolated to a particular
interpreter because any given function call is already going to only
happen on a single interpreter by definiton.

Performance before:
- on M4 Max using pybind/pybind11_benchmark unmodified repo:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 63.8 nsec per loop
```

- Linux server:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'                                                                                                                                        (pytorch)
2000000 loops, best of 5: 120 nsec per loop
```

After:
- M4 Max:
```
python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'
5000000 loops, best of 5: 53.1 nsec per loop
```

- Linux server:
```
> python -m timeit --setup 'from pybind11_benchmark import collatz' 'collatz(4)'                                                                                                                                        (pytorch)
2000000 loops, best of 5: 101 nsec per loop
```

A quick profile with perf shows that pthread_setspecific and pthread_getspecific are gone.

Open questions:

- How do we determine whether we can safely use `thread_local`? I see
  concerns about old iOS versions on
  #5705 (comment)
  and #5709; is there anything
  else?
- Do we have a test that covers "function called in one interpreter
  calls a C++ function that causes a function call in another
  interpreter"? I think it's fine, but can it happen?
- Are we happy with what we think will happen in the case where
  multiple extensions compiled with and without this PR interoperate?
  I think it's fine -- each dispatch pushes and cleans up its own
  state -- but a second opinion is certainly welcome.

* Remove PYBIND11_CAN_USE_THREAD_LOCAL

* clarify comment

* Simplify loader_life_support TLS storage

Replace the `fake_thread_specific_storage` struct with a direct
thread-local pointer managed via a function-local static:

    static loader_life_support *& tls_current_frame()

This retains the "stack of frames" behavior via the `parent` link. It also
reduces indirection and clarifies intent.

Note: this form is C++11-compatible; once pybind11 requires C++17, the
helper can be simplified to:

    inline static thread_local loader_life_support *tls_current_frame = nullptr;

* loader_life_support: avoid duplicate tls_current_frame() calls

Replace repeated calls with a single local reference:

    auto &frame = tls_current_frame();

This ensures the thread_local initialization guard is checked only once
per constructor/destructor call site, avoids potential clang-tidy
complaints, and makes the code more readable. Functional behavior is
unchanged.

* Add REMINDER for next version bump in internals.h

---------

Co-authored-by: Ralf W. Grosse-Kunstleve <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants