Ensure thread-safety for word_tokenize() wrapper functions #1213
Conversation
Hello @Copilot, thank you for submitting a PR! We will respond as soon as possible.
Co-authored-by: bact <128572+bact@users.noreply.github.com>
@copilot It looks like newmm, mm, and deepcut use a cache for custom_dict, so they will no longer be thread-safe. What about the lazy load of the default dictionary for newmm?
Good catch! I've addressed both concerns in commit 6755885:

Default dictionary lazy loading: the lazy load of the default dictionary for newmm is now protected by a lock, so concurrent first calls are safe.

Custom dictionary modifications: you're correct that modifying a custom_dict while threads are using it would NOT be thread-safe. I've added a warning about this.

The key insight: as long as the Trie is only READ from (which is all the tokenizers do), concurrent access is safe. But users must not modify the Trie while tokenization is in progress.
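To make the read-only point concrete, here is a minimal, self-contained sketch. A plain `frozenset` stands in for PyThaiNLP's Trie (it is not the real class): many threads can safely query a shared, immutable structure concurrently without any locking on the structure itself.

```python
import threading

# Stand-in for a custom dictionary: frozen, so it cannot be mutated.
WORDS = frozenset({"ไป", "กิน", "ข้าว"})

results = []
results_lock = threading.Lock()

def worker():
    # Pure reads of the shared structure need no extra locking.
    found = sum(1 for w in ("ไป", "กิน", "ข้าว") if w in WORDS)
    with results_lock:
        results.append(found)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread sees the same, consistent view: results == [3] * 8
```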
@copilot So if we freeze the Trie, will the longest and attacut tokenizers automatically be thread-safe then?
Not quite. The locks in longest.py and attacut.py protect the tokenizer cache, not the Trie. The race condition is in the check-then-act pattern when adding entries to the cache:

```python
if custom_dict_ref_id not in _tokenizers:   # Thread A and B both check
    _tokenizers[custom_dict_ref_id] = ...   # Both might try to add
```

Even with a frozen/immutable Trie, we still need the lock because the cache dict itself is mutated. Freezing the Trie would help ensure users can't accidentally modify it during tokenization (addressing the warning in our docs), but wouldn't eliminate the need for locks on the tokenizer cache.
Pull request overview
This PR ensures thread-safety for word tokenization engines in PyThaiNLP by addressing race conditions and shared mutable state in three engines and adding comprehensive testing and documentation.
Changes:
- Added thread-safe locking mechanisms to the `longest` and `attacut` engines' tokenizer caches to prevent race conditions
- Replaced the global shared `BreakIterator` in the `pyicu` engine with thread-local storage
- Added a comprehensive thread-safety test suite covering concurrent access patterns and race conditions
- Created detailed documentation explaining the thread-safety implementation, usage patterns, and critical warnings about custom dictionary modifications
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pythainlp/tokenize/longest.py | Added threading.Lock() to protect tokenizer cache from race conditions |
| pythainlp/tokenize/pyicu.py | Replaced global BreakIterator with thread-local storage for per-thread isolation |
| pythainlp/tokenize/attacut.py | Added threading.Lock() to protect tokenizer cache from race conditions |
| pythainlp/tokenize/core.py | Added thread-safety documentation and warnings about custom dictionary modifications |
| tests/core/test_tokenize_thread_safety.py | Comprehensive thread-safety test suite for all core engines |
| docs/thread_safety.md | Detailed documentation on thread-safety implementation, usage patterns, and best practices |
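The thread-local pattern used for the pyicu change can be sketched as follows (a stand-in class replaces PyICU's real `BreakIterator` so the example is self-contained):

```python
import threading

_local = threading.local()

class FakeBreakIterator:
    """Stand-in for PyICU's BreakIterator, which holds mutable state."""
    pass

def get_break_iterator():
    # Each thread lazily creates and then reuses its own instance,
    # so no two threads ever share the iterator's mutable state.
    if not hasattr(_local, "bd"):
        _local.bd = FakeBreakIterator()
    return _local.bd

instances = []
instances_lock = threading.Lock()

def worker():
    bd = get_break_iterator()
    assert bd is get_break_iterator()   # reused within the same thread
    with instances_lock:
        instances.append(bd)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each of the four threads ends up with its own distinct iterator, while repeated calls within one thread return the same object.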
@copilot apply changes based on the comments in this thread
…ations, fix type hints and docs Co-authored-by: bact <128572+bact@users.noreply.github.com>
@copilot remove claims like "This function is thread-safe." from the docstrings/documentation of the tokenizer wrappers, as we can't guarantee the thread-safety of external tokenizers.
Clarified in commit d7f6a51. Changed docstrings for external wrappers (attacut, budoux, oskut, sefr_cut, wtsplit) to state that "The wrapper uses a lock to protect [operation]. However, thread-safety of the underlying [library] library itself is not guaranteed." Also updated docs/threadsafe.rst to distinguish between fully thread-safe internal tokenizers and external wrappers whose underlying libraries make no such guarantee.
@copilot after the line "However, thread-safety of the underlying budoux library itself is not guaranteed.", add a line telling the user to consult the external library documentation
Added in commit 8d7c80c. All external wrapper docstrings (attacut, budoux, oskut, sefr_cut, wtsplit) now include: "Please refer to the [library name] library documentation for its thread-safety guarantees." after the disclaimer about underlying library thread-safety not being guaranteed.
```python
t = threading.Thread(target=worker, args=(text, custom_dict))
threads.append(t)
t.start()
```
The safe example is incomplete: threads are started but never joined, so the example doesn't wait for completion. Add `for thread in threads: thread.join()` after starting all threads to make this a complete, runnable example.
```python
# Wait for all threads to finish
for t in threads:
    t.join()
```
docs/threadsafe.rst
- Concurrent tokenization with multiple threads
- Race condition testing with multiple dictionaries
- Verification of result consistency across threads
- Stress testing with 5000+ concurrent operations
The documentation claims "Stress testing with 5000+ concurrent operations" but there's no test in the test file that performs 5000+ operations. The test suite uses at most 20 threads with 10 iterations each (200 operations). Update the documentation to accurately reflect the actual test coverage or add the claimed stress test.
```diff
- - Stress testing with 5000+ concurrent operations
+ - Stress testing with up to 200 concurrent operations (20 threads × 10 iterations)
```
pythainlp/tokenize/core.py
```rst
- Built-in tokenizers (*longest*, *mm*, *newmm*, and *newmm-safe*) \
  are thread-safe.
- Wrappers of external tokenizer are designed to be thread-safe \
  but depends on the external tokenizer.
```
The documentation states "Wrappers of external tokenizer are designed to be thread-safe but depends on the external tokenizer" which is grammatically incorrect. Change "depends" to "depend" to match the plural subject "Wrappers".
```diff
- but depends on the external tokenizer.
+ but depend on the external tokenizer.
```
@copilot apply changes based on the comments in this thread |
… fix grammar Co-authored-by: bact <128572+bact@users.noreply.github.com>
Applied all code review suggestions in commit 71230c6: added `thread.join()` to the docs example, corrected the stress-testing claim, and fixed the grammar ("depends" to "depend").



Thread Safety Check for word_tokenize() Functions
Analysis
Thread Safety Issues Found & Fixed
Core/Compact Engines (Fully Thread-Safe):

- longest: `_tokenizers` dict - FIXED with `threading.Lock()` to protect cache access; removed unnecessary `global` declaration (code review)
- pyicu: global `bd` `BreakIterator` - FIXED with thread-local storage (compact dependency)

External Engines (Wrapper Code Thread-Safe):

- attacut: `_tokenizers` dict - FIXED with `threading.Lock()` to protect cache access; removed unnecessary `global` declaration (code review)
- oskut: `DEFAULT_ENGINE` - FIXED with `threading.Lock()` to protect model loading
- sefr_cut: `DEFAULT_ENGINE` - FIXED with `threading.Lock()` to protect model loading
- wtsplit: `_MODEL` and `_MODEL_NAME` - FIXED with `threading.Lock()` to protect model loading
- budoux: `_parser` - FIXED with `threading.Lock()` to protect lazy initialization

Implementation
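The model-loading fixes listed above follow one pattern: guard lazy initialization of a module-level singleton with a lock. A self-contained sketch of that pattern (a stub dict replaces a real oskut/sefr_cut/wtsplit model, so nothing here is PyThaiNLP's actual code):

```python
import threading

_MODEL = None
_MODEL_LOCK = threading.Lock()
load_count = 0

def _load_model():
    # Stand-in for an expensive model load (disk I/O, download, etc.).
    global load_count
    load_count += 1
    return {"name": "stub-model"}

def get_model():
    global _MODEL
    # The lock ensures only the first caller performs the load;
    # everyone else waits briefly and then reuses the same object.
    with _MODEL_LOCK:
        if _MODEL is None:
            _MODEL = _load_model()
        return _MODEL

threads = [threading.Thread(target=get_model) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight concurrent callers still trigger exactly one load.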
Testing
Documentation
Code Quality Improvements (Code Review)
- Removed unnecessary `global` declarations in longest.py and attacut.py
- Used `list` instead of `typing.List` (Python 3.9+)

External Engines Analysis & Fixes
Summary

Fully thread-safe (internal implementations): longest, mm, newmm, and newmm-safe.

Wrapper code is thread-safe (underlying library not guaranteed): attacut, budoux, oskut, sefr_cut, and wtsplit.

Thread safety is achieved through `threading.Lock()` around shared caches and lazy initialization, plus thread-local storage for the pyicu `BreakIterator`.

IMPORTANT: Do not modify a custom_dict (Trie) while tokenization is in progress. The tokenizers only read from the Trie, and concurrent reads are safe.
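A short illustration of why mutating a shared dictionary mid-use is dangerous (a plain `dict` stands in for the Trie here; CPython raises `RuntimeError` when a dict changes size during iteration):

```python
d = {"a": 1, "b": 2}
caught = False
try:
    for key in d:
        d["c"] = 3          # mutating while iterating over it
except RuntimeError:
    caught = True           # "dictionary changed size during iteration"
```

With threads the failure mode is the same idea but nondeterministic, which is exactly why the warning tells users not to modify a shared custom_dict during tokenization.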