@Qubitium Qubitium commented Oct 3, 2025

What does this PR do?

Adds (Full) Regex and (Partial) Tokenization GIL=0 free-threading support. Tested up to Python 3.14T.

In simple terms, Transformers code that relies on regex will segfault under true concurrency. I have confirmed with the regex maintainer that the latest regex package, although it is published as GIL=0 compatible, is not GIL=0 safe. The unit test submitted in this PR corroborates this by demonstrating the segfault. Regex caches and reuses compiled patterns internally and executes them, which makes it inherently unsafe without the GIL.

The core idea is not to make the entire Transformers library GIL=0 safe all at once, but rather to adapt it piece by piece until everything is as GIL=0 compatible as possible.

This PR wraps existing regex calls in a lock-protected, serialized execution path. To minimize migration pain, files that import regex only need to change their import to from ...utils.safe import regex as re:

- import regex as re
+ from ...utils.safe import regex as re

The current safe wrapper is modular and can be extended to cover additional modules in the future as needed or as issues are discovered.

Since Transformers evolves more rapidly than most packages, introducing a local wrapper may be a pragmatic step while waiting for an upstream fix, which may be delayed. Additionally, many if not most Python APIs are not inherently thread-safe, so local wrappers are often necessary beyond just external packages.

Two new unit test files are included:

A non-crashing test suite that proves the code works under GIL=0 with thread load.

A crash-demonstration test that proves regex code paths segfault without the added protection under thread load.

utils/safe.py

tests/utils/test_safe.py

tests/utils/test_safe_crash.py
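For readers unfamiliar with thread-load tests, the sketch below shows the general shape of the non-crashing variant: many threads hammer a regex call that is serialized by a lock, and the results are checked afterwards. It is an assumption-laden illustration, not the contents of tests/utils/test_safe.py; the names tokenize and run_under_load are hypothetical, and the standard-library re module stands in for regex.

```python
import re  # stand-in; the PR targets the third-party `regex` package
import threading
from concurrent.futures import ThreadPoolExecutor

_lock = threading.Lock()  # hypothetical module-level lock
_pattern = re.compile(r"\w+")


def tokenize(text):
    # Serialize every regex call, mirroring the intent of the safe wrapper.
    with _lock:
        return _pattern.findall(text)


def run_under_load(n_threads=8, n_calls=2000):
    texts = [f"hello world {i}" for i in range(n_calls)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map preserves input order, so results line up with texts.
        return list(pool.map(tokenize, texts))


results = run_under_load()
print(len(results), results[0])
```

A crash-demonstration test would run the same load against the unwrapped package on a free-threaded build, typically in a subprocess so that a segfault does not take down the test runner.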

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker @itazap @SunMarc @Cyrilvallez @gante @ydshieh @stevhliu

Member

@Rocketknight1 Rocketknight1 left a comment

This seems like a really cool idea! I made one comment about locked, but it makes a lot of sense otherwise.

One potential simplification, though, is that in a lot of cases we import regex when really we only need re; this is a leftover from a time when the built-in re was significantly behind the third-party regex lib. What's the thread-safety status of the built-in re?

Comment on lines 85 to 88

    @wraps(attr)
    def locked(*args, **kwargs):
        with self._lock:
            return attr(*args, **kwargs)

Member

Same here - in fact it's probably more likely to collide with an object attribute than a top-level class or function in a module.

Contributor Author

In addition to the collision, I also noticed that the _callable_cache dict is not lock-protected, so this thread-safety helper can itself crash. Ooof. Dict updates just happen to be fast enough that no memory read/write overlap occurred during the CI test.

Contributor Author

@Qubitium Qubitium Oct 6, 2025

@Rocketknight1 Please check if the uglified (lol) code changes fix this real collision issue. Is there a better way to avoid the verbose attribute names issue than what I am doing?

Contributor Author

Qubitium commented Oct 6, 2025

> This seems like a really cool idea! I made one comment about locked, but it makes a lot of sense otherwise.
>
> One potential simplification, though, is that in a lot of cases we import regex when really we only need re; this is a leftover from a time when the built-in re was significantly behind the third-party regex lib. What's the thread-safety status of the built-in re?

Another issue with re is that its API is quite unstable. There are breaking changes as recent as 3.12, and that is from just a casual glance at the docs: https://docs.python.org/3/library/re.html.

According to the docs, re also caches compiled patterns, so I think we can assume, without looking at the implementation, that it may well be thread-unsafe too. This needs to be double-checked against the internal code.

> Note: The compiled versions of the most recent patterns passed to [re.compile()](https://docs.python.org/3/library/re.html#re.compile) and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
>
> Changed in version 3.12: In [bytes](https://docs.python.org/3/library/stdtypes.html#bytes) patterns, group name can only contain bytes in the ASCII range (b'\x00'-b'\x7f').
>
> Changed in version 3.12: Group id can only contain ASCII digits. In [bytes](https://docs.python.org/3/library/stdtypes.html#bytes) patterns, group name can only contain bytes in the ASCII range (b'\x00'-b'\x7f').
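The caching behavior quoted from the docs can be observed directly with a small check like the one below. This is only a sketch: the object identity it reports is a CPython implementation detail rather than a documented guarantee, and a shared pattern object does not by itself prove or disprove thread safety. The sys._is_gil_enabled() call exists on Python 3.13 and later, so the code falls back to assuming the GIL is on elsewhere.

```python
import re
import sys

# CPython's `re` keeps an internal cache of compiled patterns, so compiling
# the same pattern string twice typically hands back the same object.
first = re.compile(r"\d+")
second = re.compile(r"\d+")
shared = first is second  # implementation detail, not a guarantee

# sys._is_gil_enabled() is available on Python 3.13+; assume True otherwise.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()

print(f"cached pattern object shared: {shared}")
print(f"GIL enabled: {gil_enabled}")
```

When the object is shared, every thread that calls a module-level function with the same pattern string ends up using the same compiled pattern, which is the situation the comment above suggests double-checking.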

Collaborator

@ArthurZucker ArthurZucker left a comment

Hey! Nice PR, we are removing all of these files in favor of just using tokenizers as the backend, and having one sentencepiece_wrapper file, from which we will use your safe import!

Collaborator

ArthurZucker commented Oct 6, 2025

But happy to review / merge in the meantime!

Contributor Author

Qubitium commented Oct 6, 2025

@Rocketknight1 @ArthurZucker Ready for re-review/review. Changes since last review:

  1. Fixed: the _hf_safe_callable_cache dict was not lock-protected, so the safe wrapper was itself unsafe. Big oof. CI test added.
  2. Fixed: some metadata, such as __package__, does not always exist (I hit this in my usage with PyTorch). CI test added.
  3. Fixed the reentrant case, where a regex callable can end up calling itself. CI test added.
  4. Properties and helpers now carry an ugly _hf_safe_ prefix to minimize namespace collisions. It's ugly, but it works, I hope.

make fixup is failing on unrelated code, so I am not sure whether the HF repo formatting is applied automatically.

Contributor

github-actions bot commented Oct 6, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: bart, bertweet, blenderbot, blenderbot_small, clip, clvp, codegen, ctrl, deberta, deepseek_vl, deepseek_vl_hybrid, depth_pro, fastspeech2_conformer, got_ocr2, gpt2, gpt_oss
