Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions include/pybind11/detail/common.h
Original file line number Diff line number Diff line change
Expand Up @@ -1344,5 +1344,8 @@ constexpr
# define PYBIND11_BACKWARD_COMPATIBILITY_TP_DICTOFFSET
#endif

// TODO: determine which platforms cannot use thread_local.
Copy link
Contributor

@b-pass b-pass Sep 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was one of the iOS platforms. But that should've failed one of the PR Checks (since this is unconditionally 1 right now) and didn't. Meaning all of the Checks pybind11 currently has support thread_local anyway.... can that be right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might have fixed the issue with iOS, you have to set some minimum version to be able to use it and it wasn't letting us set the minimum version. But now I think it's working.

#define PYBIND11_CAN_USE_THREAD_LOCAL 1

PYBIND11_NAMESPACE_END(detail)
PYBIND11_NAMESPACE_END(PYBIND11_NAMESPACE)
37 changes: 34 additions & 3 deletions include/pybind11/detail/type_caster_base.h
Original file line number Diff line number Diff line change
Expand Up @@ -45,17 +45,48 @@ class loader_life_support {
loader_life_support *parent = nullptr;
std::unordered_set<PyObject *> keep_alive;

#if defined(PYBIND11_CAN_USE_THREAD_LOCAL)
struct fake_thread_specific_storage {
loader_life_support *instance = nullptr;
loader_life_support *get() const { return instance; }

fake_thread_specific_storage &operator=(loader_life_support *pval) {
instance = pval;
return *this;
}
};
using loader_storage = fake_thread_specific_storage;
#else
using loader_storage = thread_specific_storage<loader_life_support>;
#endif

static loader_storage &get_stack_top() {
#if defined(PYBIND11_CAN_USE_THREAD_LOCAL)
// Without this branch, loader_life_support destruction is a
// significant cost per function call.
//
// Observation: loader_life_support needs to be thread-local, but
// we don't need to go to extra effort to keep it per-interpreter
// (i.e., by putting it in internals) since function calls are
// already isolated to a single interpreter.
static thread_local fake_thread_specific_storage storage;
return storage;
#else
return get_internals().loader_life_support_tls;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pybind11's thread_specific_storage should be the same performance as a static thread_local (just a little less convienient). On most platforms it ends up just being pthread's TLS which is the same thing GCC uses to do thread_local.

So, why not just use static thread_specific_storage<loader_life_support> here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pybind11's thread_specific_storage should be the same performance as a static thread_local (just a little less convienient). On most platforms it ends up just being pthread's TLS which is the same thing GCC uses to do thread_local.

I don't think this is correct, but I'll get clearer data about it and come back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the code generated for static thread_local std::string* p = nullptr; return p; by a smattering of clang and gcc versions. no pthread TLS. https://godbolt.org/z/jvTG4jEso

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough....

Then, I think then someone should fix the thread_specific_storage class. It doesn't make sense to do this here in this function but let the other places using thread_specific_storage continue to suffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix the thread_specific_storage class. It doesn't make sense to do this here in this function but let the other places using thread_specific_storage continue to suffer.

Wouldn't that break ABI? Is that OK?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that break ABI?

Two uses of thread_specific_storage are members of internals (this loader_life_support is one, tstate is another, which is used in GIL management). So, yes, changing those would break ABI. This PR would abandon one of them.

I suppose one could fix the other usages of thread_specific_storage and leave these two until its time for an ABI break.... internals_pp_manager can be changed without an ABI break AFAICT, each different module would have its own local copy of that code and its thread locals, sharing the actual internals pointer amongst them through the InterpreterState dict.

I'm willing to refactor that in a separate PR using whatever method is decided here for conditionally compiling these thread_locals out.

In the past we gave people an option to override PYBIND11_INTERNALS_VERSION. It was a bit messy, but maybe worth a consideration here?

It might make sense to add some mechanism to accumulate ABI breaks within a pp flag like that, rather than forget about them. Maybe out of scope for this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to add some mechanism to accumulate ABI breaks within a pp flag like that, rather than forget about them. Maybe out of scope for this PR.

What we had was pretty basic, easy to see in one view here (when support for old internals versions was removed):

https://github.com/pybind/pybind11/pull/5530/files

Not pretty, but it served the purpose; for ~2 years actually.

We'll have to be careful about test coverage, in ci.yml.

I think it would fit in this PR.

Another route is to get #5800 reviewed and merged. It's an XXL review job though, and we have to determine who's taking on maintaining that code long term. @oremanj for viz (btw @oremanj: did you see my direct email from Aug 21?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the code generated for static thread_local std::string* p = nullptr; return p; by a smattering of clang and gcc versions. no pthread TLS. https://godbolt.org/z/jvTG4jEso

This was misleading; adding -fPIC (as you would generally do to build a shared library such Python extension IIUC, though pybind11_benchmark doesn't seem to do it) causes this code to use _tls_get_addr() (apparently _tlv_get_addr on Mac). It's not pthread_getspecific though, and runtime measurement shows that we are actually getting an improvement despite my incorrect assembly reference.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rwgk I did not see such an email, nor have I been able to find it searching just now - can you resend it and/or tell me what subject to search for?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oremanj I sent another email, from my corp address to your corp address, subject "pybind11 hello".

#endif
}

public:
/// A new patient frame is created when a function is entered
loader_life_support() {
auto &stack_top = get_internals().loader_life_support_tls;
auto &stack_top = get_stack_top();
parent = stack_top.get();
stack_top = this;
}

/// ... and destroyed after it returns
~loader_life_support() {
auto &stack_top = get_internals().loader_life_support_tls;
auto &stack_top = get_stack_top();
if (stack_top.get() != this) {
pybind11_fail("loader_life_support: internal error");
}
Expand All @@ -68,7 +99,7 @@ class loader_life_support {
/// This can only be used inside a pybind11-bound function, either by `argument_loader`
/// at argument preparation time or by `py::cast()` at execution time.
PYBIND11_NOINLINE static void add_patient(handle h) {
loader_life_support *frame = get_internals().loader_life_support_tls.get();
loader_life_support *frame = get_stack_top().get();
if (!frame) {
// NOTE: It would be nice to include the stack frames here, as this indicates
// use of pybind11::cast<> outside the normal call framework, finding such
Expand Down
Loading