GregoryComer (Member) commented Oct 7, 2025

Summary

We are seeing pthreadpool-related crashes on Mac when running with pybindings. This appears to be due to XNNPACK using the Google fork of pthreadpool and extension/threadpool using the pthreadpool in libtorch_cpu. See #14321 for more details.

Beyond the obvious one definition rule issues, the specific failure happens because the pthreadpool functions in the copy of pthreadpool built with ET are marked as weak on Apple platforms. The functions are not marked as weak in source code or in the build, and the behavior appears to be specific to Apple's toolchain.

Weak symbols are compiled as indirect calls and can be overridden at runtime by strong symbols in another dylib. For reasons that I don't fully understand, the pthreadpool symbols in libtorch_cpu are strong, so they take precedence over the weak copies built with ET. The calls in XNNPACK, by contrast, prefer the symbols from the local pthreadpool.
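
To make the weak-versus-strong distinction concrete, here is a minimal sketch. It is an illustration only: `demo_pool_backend_id` and `call_backend` are hypothetical names that do not exist in pthreadpool, and the weak attribute is spelled out explicitly here, whereas in this PR the symbols ended up weak without any such attribute in the source.

```cpp
// Hypothetical symbol for illustration; not part of pthreadpool or ExecuTorch.
// A weak definition is a fallback: if another linked image supplies a
// non-weak (strong) definition with the same name, the linker or dynamic
// loader may bind callers to that definition instead of this one.
extern "C" __attribute__((weak)) int demo_pool_backend_id() {
  return 1;  // identifies the "local" copy
}

// The caller cannot tell from the source which definition it will reach.
// With no competing strong definition in the process, it reaches the one above.
int call_backend() {
  return demo_pool_backend_id();
}
```

When this file is built on its own, `call_backend()` returns 1. The failure mode described above corresponds to a second library exporting a strong symbol with the same name, at which point the same call may resolve elsewhere.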

This PR works around the issue by building pthreadpool with -fvisibility=hidden, which causes the symbols to not be exposed in the final dylib, and thus not end up in the symbol table as an indirect symbol. Instead, the call to pthreadpool_create in extension_threadpool is compiled as a direct call to the pthreadpool_create in the pthreadpool built by executorch.
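
As a rough sketch of what hidden visibility means at the symbol level, the per-symbol attribute below has the same effect that `-fvisibility=hidden` gives by default to every symbol lacking an explicit visibility. All `demo_pool` names are hypothetical stand-ins, not the real pthreadpool API, and the PR itself applies the compiler flag to the pthreadpool build rather than annotating functions one by one.

```cpp
#include <cstddef>

// Hypothetical stand-in for a pool handle; not the real pthreadpool type.
struct demo_pool {
  std::size_t threads;
};

// Hidden: callable from code linked into the same library or executable,
// but not exported from it, so other images cannot bind to or replace it.
extern "C" __attribute__((visibility("hidden")))
demo_pool* demo_pool_create(std::size_t threads) {
  return new demo_pool{threads};
}

extern "C" __attribute__((visibility("hidden")))
void demo_pool_destroy(demo_pool* pool) {
  delete pool;
}

// Default visibility: this is the only symbol of the three that would be
// exported. Its calls to the hidden functions are resolved at link time.
extern "C" __attribute__((visibility("default")))
std::size_t demo_pool_roundtrip(std::size_t threads) {
  demo_pool* pool = demo_pool_create(threads);
  const std::size_t observed = pool->threads;
  demo_pool_destroy(pool);
  return observed;
}
```

Building this into a shared library and listing its exported symbols should show only `demo_pool_roundtrip`, which is the property the workaround relies on.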

This isn't a proper fix for the issue, as there are still two pthreadpool implementations in the process whenever we link libtorch_cpu. However, it does appear to mitigate the symptoms and thus prevent crashes. Long-term, we'll need to find a proper solution, such as namespacing the pthreadpool fork.

Test plan

In addition to validating this change on CI (including trunk CI), I manually verified the fix by testing the repro in #14321 before and after the change. I verified that ASan does not trip upon resetting the threadpool. I also verified with nm and otool that pthreadpool_create does not show up in the indirect symbol table, and thus cannot (to my knowledge) be overridden at runtime by the implementation in libtorch_cpu.

pytorch-bot bot commented Oct 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14838

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV.

❌ 1 New Failure, 1 Cancelled Job, 1 Unrelated Failure

As of commit eba8e39 with merge base bba9d26: one job failed (new failure), one job was cancelled and should be retried, and one job failed that was already failing on the merge base (broken trunk).

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label Oct 7, 2025
GregoryComer added the release notes: none and ciflow/trunk labels Oct 7, 2025
GregoryComer (Member Author) commented

Samsung model failure is due to running on a fork. Demo app build is broken on trunk, so CI is otherwise green on this PR.

@GregoryComer GregoryComer marked this pull request as ready for review October 7, 2025 04:55
@mergennachin mergennachin requested a review from metascroy October 7, 2025 13:37
mergennachin (Contributor) left a comment

In

threadpool_.reset(pthreadpool_create(new_thread_count));

What are your thoughts on adding this logging to see whether the old and new threadpool addresses are drastically different?

+  void* old_pool = threadpool_.get();
  threadpool_.reset(pthreadpool_create(new_thread_count));
+  ET_LOG(Info, "Reset threadpool from %p to %p (size %u)", old_pool, threadpool_.get(), new_thread_count);

kimishpatel (Contributor) commented

> In
>
> threadpool_.reset(pthreadpool_create(new_thread_count));
>
> What are your thoughts on adding this logging to see whether the old and new threadpool addresses are drastically different?
>
> +  void* old_pool = threadpool_.get();
>   threadpool_.reset(pthreadpool_create(new_thread_count));
> +  ET_LOG(Info, "Reset threadpool from %p to %p (size %u)", old_pool, threadpool_.get(), new_thread_count);

What does that help with?

GregoryComer (Member Author) commented Oct 7, 2025

> In
>
> threadpool_.reset(pthreadpool_create(new_thread_count));
>
> What are your thoughts on adding this logging to see whether the old and new threadpool addresses are drastically different?
>
> +  void* old_pool = threadpool_.get();
>   threadpool_.reset(pthreadpool_create(new_thread_count));
> +  ET_LOG(Info, "Reset threadpool from %p to %p (size %u)", old_pool, threadpool_.get(), new_thread_count);

Since the Google and upstream pthreadpool use the same allocator for the threadpool state, I wouldn't expect to be able to extract much from the addresses. We should probably have some logging though, so I'll add some. Thanks.
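
For reference, the logging idea discussed above could look roughly like the following self-contained sketch. It is not the code that landed: `demo_pool`, `demo_pool_create`, `demo_pool_destroy`, and `DemoThreadpool` are hypothetical stand-ins, and `std::fprintf` is swapped in for `ET_LOG` so that the example builds without ExecuTorch.

```cpp
#include <cstddef>
#include <cstdio>
#include <memory>

// Hypothetical stand-ins for the pthreadpool handle and its create/destroy API.
struct demo_pool {
  std::size_t threads;
};
demo_pool* demo_pool_create(std::size_t threads) { return new demo_pool{threads}; }
void demo_pool_destroy(demo_pool* pool) { delete pool; }

class DemoThreadpool {
 public:
  explicit DemoThreadpool(std::size_t threads)
      : pool_(demo_pool_create(threads), &demo_pool_destroy) {}

  // Replaces the underlying pool and logs the old and new addresses.
  bool reset(std::size_t new_thread_count) {
    // Captured for logging only: after reset() the old pool has been
    // destroyed, so this pointer must never be dereferenced.
    void* old_pool = pool_.get();
    pool_.reset(demo_pool_create(new_thread_count));
    std::fprintf(stderr, "Reset threadpool from %p to %p (size %zu)\n",
                 old_pool, static_cast<void*>(pool_.get()), new_thread_count);
    return pool_ != nullptr;
  }

  std::size_t size() const { return pool_ ? pool_->threads : 0; }

 private:
  // The custom deleter ties destruction to the same API that created the pool.
  std::unique_ptr<demo_pool, void (*)(demo_pool*)> pool_;
};
```

As noted in the reply, the two addresses come from the same allocator, so the log line is mostly useful for confirming that a reset happened and with which size.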

@GregoryComer GregoryComer force-pushed the fix-apple-pthreadpool branch from 644a021 to 3dec676 Compare October 7, 2025 21:21
@GregoryComer GregoryComer force-pushed the fix-apple-pthreadpool branch from 3dec676 to 8e76199 Compare October 7, 2025 21:39
@GregoryComer GregoryComer merged commit 1da530d into pytorch:main Oct 8, 2025
360 of 373 checks passed
GregoryComer added a commit that referenced this pull request Oct 9, 2025
… (#14842)

This reverts commit 750cba7.

Re-applying the better threadpool size defaults from
#14090 with the fix from
#14838. This gives a 2-4x
speedup for many models and platforms (I measured 4x speedup on M1 with
MobileNet V3 + XNNPACK). On high core count server platforms (doing
evals, for example), this can give a 100x speedup out of the box.
digantdesai (Contributor) commented

> by building pthreadpool with -fvisibility=hidden

Clever!

GregoryComer (Member Author) commented

@pytorchbot cherry-pick --onto release/1.0 -c critical

pytorchbot pushed a commit that referenced this pull request Oct 9, 2025
(cherry picked from commit 1da530d)
pytorchbot (Collaborator) commented

Cherry picking #14838

The cherry-pick PR is at #14951. It is recommended to link a critical cherry-pick PR with an issue.
