-
Notifications
You must be signed in to change notification settings - Fork 30.7k
Fix GIL=0 segfault and Add GIL=0 compat for regex paths #41329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a really cool idea! I made one comment about locked
, but it makes a lot of sense otherwise.
One potential simplification, though, is that in a lot of cases we import regex
when really we only need re
; this is a leftover from a time when the built-in re
was significantly behind the third-party regex
lib. What's the thread-safety status of the built-in re
?
src/transformers/utils/safe.py
Outdated
@wraps(attr) | ||
def locked(*args, **kwargs): | ||
with self._lock: | ||
return attr(*args, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here - in fact it's probably more likely to collide with an object attribute than a top-level class or function in a module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the collison I also noticed the _callable_cache
dict is not lock protected so this safe
thread helper will also crash itself. Ooof. It just happens to be very fast at dict updates so there were no memory rw overlap during ci test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Rocketknight1 Please check if the uglified (lol) code changes fix this real collision issue. Is there a better way to avoid the verbose attribute names issue than what I am doing?
Another issue about From docs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey! Nice PR, we are removing all of these files in favor of just using tokenizers
as the backend, and having one sentencepiece_wrapper
file, from which we will use your safe import!
But happy to review / merge in the mean time! |
…for all properties/helpers
@Rocketknight1 @ArthurZucker Ready for re-review/review. Changes since last review:
|
[For maintainers] Suggested jobs to run (before merge) run-slow: bart, bertweet, blenderbot, blenderbot_small, clip, clvp, codegen, ctrl, deberta, deepseek_vl, deepseek_vl_hybrid, depth_pro, fastspeech2_conformer, got_ocr2, gpt2, gpt_oss |
What does this PR do?
Adds (Full) Regex and (Partial) Tokenization GIL=0 free-threading support. Tested up to Python 3.14T.
In simple terms, Transformers code that relies on regex will segfault under true concurrency. I have confirmed with the regex maintainer that the latest GIL=0–compatible regex package is not GIL=0 safe. This has been corroborated by the submitted unit test in this PR, which demonstrates the segfault. Regex caches and reuses compiled patterns internally and executes them, which makes it inherently unsafe without the GIL.
The core idea is not to make the entire Transformers library GIL=0 safe all at once, but rather to adapt it piece by piece until everything is as GIL=0 compatible as possible.
This PR wraps existing regex calls into a thread-locked, protected serial execution pipeline. Files that import regex only need to adjust their import path to
from util.safe import regex
to minimize migration pain.The current safe wrapper is modular and can be extended to cover additional modules in the future as needed or as issues are discovered.
Since Transformers evolves more rapidly than most packages, introducing a local wrapper may be a pragmatic step while waiting for an upstream fix, which may be delayed. Additionally, many if not most Python APIs are not inherently thread-safe, so applying local wrappers is often necessary beyond just external packages.”
Two new unit test files are included:
A non-crashing test suite that proves the code works under GIL=0 with thread load.
A crash-demonstration test that proves regex code paths segfault without the added protection under thread load.
utils/safe.py
tests/utils/test_safe.py
tests/utils/test_safe_crash.py
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@ArthurZucker @itazap @SunMarc @Cyrilvallez @gante @ydshieh @stevhliu