
Ensure thread-safety for word_tokenize() wrapper functions#1213

Merged
bact merged 19 commits into dev from copilot/check-thread-safeness-word-tokenize on Jan 17, 2026

Conversation

Contributor

Copilot AI commented Jan 16, 2026

Thread Safety Check for word_tokenize() Functions

Analysis

  • Explore repository structure and tokenization modules
  • Identify thread safety issues in standard engines
  • Review core engines: newmm, newmm-safe, longest
  • Review compact engines: icu (pyicu)
  • Address concerns about custom_dict and default dictionary lazy loading
  • Apply code review suggestions
  • Convert documentation to RST format for Sphinx
  • Add thread safety to external engines (sefr_cut, oskut, wtsplit, budoux)
  • Clarify thread-safety guarantees for external wrappers
  • Add references to external library documentation
  • Fix documentation issues from final code review

Thread Safety Issues Found & Fixed

Core/Compact Engines (Fully Thread-Safe):

  • longest.py: Global _tokenizers dict - FIXED (see the lock sketch after this list)
    • Added threading.Lock() to protect cache access
    • Thread-safe check-then-act pattern implemented
    • Documented thread safety in function docstring
    • Removed unnecessary global declaration (code review)
  • pyicu.py: Global bd BreakIterator - FIXED (compact dependency)
    • Replaced global shared BreakIterator with thread-local storage
    • Each thread now has its own BreakIterator instance
    • Documented thread safety in function docstring
  • newmm.py: Confirmed thread-safe (stateless, no global mutable state)
  • multi_cut.py: Confirmed thread-safe (stateless, no global mutable state)
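
A minimal sketch of the lock-protected check-then-act pattern described above for longest.py; the helper name get_cached_tokenizer and the factory argument are illustrative, not the actual PyThaiNLP internals:

    import threading

    _tokenizers: dict = {}               # tokenizer cache, keyed by dictionary id
    _tokenizers_lock = threading.Lock()  # guards all access to the cache

    def get_cached_tokenizer(key, factory):
        # The membership test and the insert must happen under the same
        # lock; otherwise two threads can both miss the cache and both
        # construct a tokenizer for the same key.
        with _tokenizers_lock:
            if key not in _tokenizers:
                _tokenizers[key] = factory()
            return _tokenizers[key]

Holding the lock across the whole check-then-act serializes cache lookups, but that cost is negligible next to tokenizer construction.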

External Engines (Wrapper Code Thread-Safe):

  • attacut.py: Global _tokenizers dict - FIXED
    • Added threading.Lock() to protect cache access
    • Thread-safe check-then-act pattern implemented
    • Clarified wrapper thread-safety with reference to library documentation
    • Removed unnecessary global declaration (code review)
  • sefr_cut.py: Global DEFAULT_ENGINE - FIXED
    • Added threading.Lock() to protect model loading
    • Thread-safe engine switching
    • Clarified wrapper thread-safety with reference to library documentation
  • oskut.py: Global DEFAULT_ENGINE - FIXED
    • Added threading.Lock() to protect model loading
    • Thread-safe engine switching
    • Clarified wrapper thread-safety with reference to library documentation
  • wtsplit.py: Global _MODEL and _MODEL_NAME - FIXED
    • Added threading.Lock() to protect model loading
    • Thread-safe model switching
    • Clarified wrapper thread-safety with reference to library documentation
  • budoux.py: Global _parser - FIXED (see the lazy-init sketch after this list)
    • Added threading.Lock() to protect lazy initialization
    • Thread-safe parser initialization
    • Clarified wrapper thread-safety with reference to library documentation
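
A minimal sketch of the lock-protected lazy loading applied to the external engines above; _load_model is a stand-in, as each real wrapper loads its own library's model:

    import threading

    _model = None
    _model_lock = threading.Lock()

    def _load_model():
        # Stand-in for an expensive external-library load
        # (e.g. reading model weights from disk).
        return object()

    def get_model():
        global _model
        # Fast path skips the lock once the model exists; the slow path
        # re-checks under the lock so only one thread performs the load.
        if _model is None:
            with _model_lock:
                if _model is None:
                    _model = _load_model()
        return _model

Note this protects only the wrapper's initialization; whether concurrent calls into the loaded model are safe is up to the external library.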

Implementation

  • Add threading.Lock to protect global cache access in longest.py
  • Add thread-local storage for BreakIterator in pyicu.py (sketch after this list)
  • Add threading.Lock to protect global cache access in attacut.py
  • Add threading.Lock to protect model loading in sefr_cut.py
  • Add threading.Lock to protect model loading in oskut.py
  • Add threading.Lock to protect model loading in wtsplit.py
  • Add threading.Lock to protect lazy initialization in budoux.py
  • Create comprehensive thread-safety tests
  • Add attacut thread safety test (code review)
  • Run tests to verify fixes work correctly
  • Run Ruff linter to ensure code quality
  • Document thread-safety guarantees in docstrings
  • Clarify limitations for external library wrappers
  • Add references to external library documentation
  • Create comprehensive documentation
  • Convert documentation to RST format for Sphinx
  • Update documentation with disclaimers about external libraries
  • Add warnings about custom_dict modification
  • Document default dictionary lazy loading behavior
  • Fix documentation example to define texts variable (code review)
  • Update type hints to use built-in list instead of typing.List (code review)
  • Fix documentation issues from final review
    • Add thread.join() to safe example
    • Correct stress test claim to actual test coverage
    • Fix grammar error (depends → depend)
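
A minimal sketch of the thread-local BreakIterator pattern used in pyicu.py, assuming PyICU is installed; the helper name is illustrative:

    import threading

    _thread_local = threading.local()

    def _get_break_iterator():
        # ICU BreakIterator objects carry mutable iteration state and must
        # not be shared across threads, so each thread lazily creates and
        # reuses its own instance.
        bd = getattr(_thread_local, "break_iterator", None)
        if bd is None:
            from icu import BreakIterator, Locale

            bd = BreakIterator.createWordInstance(Locale("th"))
            _thread_local.break_iterator = bd
        return bd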

Testing

  • Created test_tokenize_thread_safety.py with 9 comprehensive tests
  • Test concurrent access with same and different dictionaries
  • Test all standard engines: newmm, newmm-safe, longest, mm
  • Test compact engine: icu (conditional on PyICU availability)
  • Test extra engine: attacut (conditional on attacut availability)
  • Test race conditions with multiple dictionaries
  • All tests passing (9 tests, with 2 skipped when optional dependencies are missing)
  • All Ruff linter checks passing
  • Stress tests: up to 200 concurrent operations (20 threads × 10 iterations; see the illustrative test after this list)
  • All existing tokenize tests still passing
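
An illustrative test in the same spirit as the suite described above (not the actual test file; it assumes PyThaiNLP is installed):

    import threading

    from pythainlp.tokenize import word_tokenize

    def test_concurrent_newmm():
        text = "ผมรักคุณ"
        expected = word_tokenize(text, engine="newmm")
        results = []
        results_lock = threading.Lock()

        def worker():
            for _ in range(10):  # 10 iterations per thread
                tokens = word_tokenize(text, engine="newmm")
                with results_lock:
                    results.append(tokens)

        threads = [threading.Thread(target=worker) for _ in range(20)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # 20 threads x 10 iterations = 200 operations; every result must
        # match the single-threaded baseline.
        assert len(results) == 200
        assert all(tokens == expected for tokens in results)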

Documentation

  • Updated docstrings for all modified functions
  • Clarified thread-safety guarantees and limitations
  • Added references to external library documentation
  • Created docs/threadsafe.rst (reStructuredText format) with:
    • Summary distinguishing internal implementations vs external wrappers
    • Implementation details for each engine
    • Clear disclaimers about external library thread-safety
    • Default dictionary lazy loading explanation
    • WARNING about custom_dict modifications
    • Safe and unsafe usage examples (complete and runnable)
    • Usage examples for multi-threaded applications
    • Performance considerations
    • Best practices for developers
    • Maintenance notes for future contributors
    • Accurate test coverage claims

Code Quality Improvements (Code Review)

  • Removed unnecessary global declarations in longest.py and attacut.py
  • Updated type hints to use built-in list instead of typing.List (Python 3.9+; example after this list)
  • Added missing attacut thread safety test
  • Fixed documentation examples to define all referenced variables
  • Converted documentation from Markdown to reStructuredText for Sphinx
  • Clarified thread-safety claims for external wrappers
  • Added references to external library documentation
  • Fixed documentation to include thread.join() in safe example
  • Corrected stress test claims to match actual test coverage
  • Fixed grammar errors in documentation
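
For example, the type-hint change; segment is shown as a representative signature, not the exact code:

    # Before: requires an import from typing
    from typing import List

    def segment(text: str) -> List[str]:
        return text.split()

    # After: built-in generic, no import needed (Python 3.9+)
    def segment(text: str) -> list[str]:  # redefines the above, for illustration
        return text.split()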

External Engines Analysis & Fixes

  • Analyzed thread-safety of all external engines
  • Fixed sefr_cut, oskut, wtsplit, and budoux (same pattern as core engines)
  • Clarified that wrapper code is thread-safe but underlying libraries are not guaranteed
  • Added clear references to consult external library documentation
  • Documented findings and fixes with appropriate disclaimers

Summary

Fully thread-safe (internal implementations):

  • Core engines: newmm, newmm-safe, longest, mm
  • Compact engines: icu (pyicu)

Wrapper code is thread-safe (underlying library not guaranteed):

  • Extra/external engines: attacut, budoux, deepcut, nercut, nlpo3, oskut, sefr_cut, tltk, wtsplit
  • Users should consult the respective external library documentation for their thread-safety guarantees

Thread safety is achieved through:

  • Thread-safe locks for cache access (longest, attacut)
  • Thread-safe locks for model loading (sefr_cut, oskut, wtsplit, budoux)
  • Thread-local storage for mutable state (pyicu/icu)
  • Stateless implementations (newmm, newmm-safe, mm)
  • Lazy loading of default dictionary is thread-safe via @lru_cache (sketch below)
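
A minimal sketch of the @lru_cache lazy load; _build_default_trie is a stand-in for building the real dictionary Trie:

    from functools import lru_cache

    def _build_default_trie():
        # Stand-in for loading the bundled word list into a Trie.
        return frozenset(["ผม", "รัก", "คุณ"])

    @lru_cache(maxsize=1)
    def word_dict_trie():
        # The first call triggers the build; lru_cache hands every later
        # caller the same instance. Tokenizers only READ from the result,
        # so sharing it across threads is safe.
        return _build_default_trie()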

IMPORTANT:

  • Custom Trie objects must not be modified during tokenization
  • For external library wrappers, while the wrapper code is thread-safe, users should refer to the external library documentation for its thread-safety guarantees
  • Test suite validates thread safety with up to 200 concurrent operations
Original prompt

Check thread-safeness of word_tokenize() functions. Particularly with standard engines that are included in the core and compact sets of dependencies: newmm, newmm-safe, longest, etc.



@bact added the "refactoring" label (a technical improvement which does not add any new features or change existing features) on Jan 16, 2026
@github-actions

Hello @Copilot, thank you for submitting a PR! We will respond as soon as possible.


Copilot AI and others added 4 commits on January 16, 2026, each co-authored by bact <128572+bact@users.noreply.github.com>.
Copilot AI changed the title from "[WIP] Check thread-safeness of word_tokenize() functions" to "Ensure thread-safety for word_tokenize() functions" on Jan 16, 2026
Copilot AI requested a review from bact January 16, 2026 22:04
@coveralls

coveralls commented Jan 17, 2026

Coverage Status

coverage: 53.501% (-0.08%) from 53.579% when pulling ae90a9a on copilot/check-thread-safeness-word-tokenize into 495874c on dev.

@bact
Member

bact commented Jan 17, 2026

@copilot Looks like newmm, mm, and deepcut use a cache for custom_dict, so they will no longer be thread-safe.

What about the lazy load of the default dictionary for newmm?

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 17, 2026

@copilot Looks like newmm, mm, and deepcut use a cache for custom_dict, so they will no longer be thread-safe.

What about the lazy load of the default dictionary for newmm?

Good catch! I've addressed both concerns in commit 6755885:

Default dictionary lazy loading: The word_dict_trie() function uses @lru_cache, which is thread-safe. The first thread triggers loading, and subsequent threads receive the cached Trie instance. This is safe because:

  1. The caching mechanism itself has internal locking
  2. Tokenizers only READ from the Trie (using .prefixes() and __contains__)
  3. The default Trie is never modified after creation

Custom dictionary modifications: You're correct that if someone modifies a custom_dict while threads are using it, that would NOT be thread-safe. The Trie's add() and remove() methods modify internal dictionaries without synchronization.

I've added:

  • Clear warnings in the word_tokenize() docstring
  • Safe and unsafe usage examples in docs/thread_safety.md
  • Documentation explaining that custom Trie objects must be created BEFORE threading and never modified during tokenization

The key insight: As long as the Trie is only READ from (which is all the tokenizers do), concurrent access is safe. But users must not call trie.add() or trie.remove() while tokenization is in progress.
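
A minimal sketch of that safe pattern, using PyThaiNLP's public Trie and Tokenizer API (the word list is illustrative):

    import threading

    from pythainlp.tokenize import Tokenizer
    from pythainlp.util import Trie

    # Safe: build the custom dictionary once, BEFORE any thread starts.
    custom_dict = Trie(["ผม", "รัก", "คุณ"])
    tokenizer = Tokenizer(custom_dict=custom_dict, engine="newmm")

    def worker(text):
        tokenizer.word_tokenize(text)  # threads only READ from the Trie

    threads = [
        threading.Thread(target=worker, args=("ผมรักคุณ",)) for _ in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Unsafe: calling custom_dict.add(...) or custom_dict.remove(...)
    # while the threads above are running would be a data race.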

@bact
Member

bact commented Jan 17, 2026

@copilot So if we freeze Trie, the longest and the attacut tokenizers will be automatically thread safe then?

Contributor Author

Copilot AI commented Jan 17, 2026

@copilot So if we freeze Trie, the longest and the attacut tokenizers will be automatically thread safe then?

Not quite. The locks in longest.py and attacut.py protect the _tokenizers cache (a global dictionary), not the Trie itself.

The race condition is in the check-then-act pattern when adding entries to the cache:

if custom_dict_ref_id not in _tokenizers:  # Thread A and B both check
    _tokenizers[custom_dict_ref_id] = ...   # Both might try to add

Even with a frozen/immutable Trie, we still need the lock because:

  1. Multiple threads could simultaneously try to register a new tokenizer for the same Trie
  2. Python dicts are not thread-safe for concurrent modifications (even inserts)
  3. The cache is an optimization that needs protection regardless of Trie mutability

Freezing the Trie would help ensure users can't accidentally modify it during tokenization (addressing the warning in our docs), but wouldn't eliminate the need for locks on the tokenizer cache.
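
A hypothetical demonstration of why the lock is still needed; without it, the factory work can run more than once for the same key under contention:

    import threading

    cache = {}
    factory_calls = []  # records every construction

    def get(key):
        if key not in cache:           # check
            factory_calls.append(key)  # ...two threads can both reach here
            cache[key] = object()      # then act
        return cache[key]

    threads = [threading.Thread(target=get, args=("same-trie",)) for _ in range(50)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Usually 1, but can be > 1: the check-then-act is not atomic.
    print(len(factory_calls))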

@bact marked this pull request as ready for review on January 17, 2026 08:26
@bact requested a review from Copilot on January 17, 2026 08:26
Contributor

Copilot AI left a comment


Pull request overview

This PR ensures thread-safety for word tokenization engines in PyThaiNLP by addressing race conditions and shared mutable state in three engines and adding comprehensive testing and documentation.

Changes:

  • Added thread-safe locking mechanisms to longest and attacut engines' tokenizer caches to prevent race conditions
  • Replaced global shared BreakIterator in pyicu engine with thread-local storage
  • Added comprehensive thread-safety test suite covering concurrent access patterns and race conditions
  • Created detailed documentation explaining thread-safety implementation, usage patterns, and critical warnings about custom dictionary modifications

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Summary per file:

  • pythainlp/tokenize/longest.py: Added threading.Lock() to protect tokenizer cache from race conditions
  • pythainlp/tokenize/pyicu.py: Replaced global BreakIterator with thread-local storage for per-thread isolation
  • pythainlp/tokenize/attacut.py: Added threading.Lock() to protect tokenizer cache from race conditions
  • pythainlp/tokenize/core.py: Added thread-safety documentation and warnings about custom dictionary modifications
  • tests/core/test_tokenize_thread_safety.py: Comprehensive thread-safety test suite for all core engines
  • docs/thread_safety.md: Detailed documentation on thread-safety implementation, usage patterns, and best practices

@bact
Member

bact commented Jan 17, 2026

@copilot apply changes based on the comments in this thread

…ations, fix type hints and docs

Co-authored-by: bact <128572+bact@users.noreply.github.com>
@bact
Member

bact commented Jan 17, 2026

@copilot remove the claim "This function is thread-safe." etc from docstring/documentation of the tokenizer wrappers, as we can't guarantee thread-safeness of external tokenizers.

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 17, 2026

@copilot remove the claim "This function is thread-safe." etc from docstring/documentation of the tokenizer wrappers, as we can't guarantee thread-safeness of external tokenizers.

Clarified in commit d7f6a51. Changed docstrings for external wrappers (attacut, budoux, oskut, sefr_cut, wtsplit) to state that "The wrapper uses a lock to protect [operation]. However, thread-safety of the underlying [library] library itself is not guaranteed."

Also updated docs/threadsafe.rst to distinguish between:

  • Internal implementations (fully thread-safe): mm, newmm, newmm-safe, longest, icu
  • External library wrappers (wrapper code is thread-safe): attacut, budoux, deepcut, nercut, nlpo3, oskut, sefr_cut, tltk, wtsplit - with explicit note that underlying library thread-safety is not guaranteed

Copilot AI changed the title from "Ensure thread-safety for word_tokenize() functions" to "Ensure thread-safety for word_tokenize() wrapper functions" on Jan 17, 2026
@bact
Member

bact commented Jan 17, 2026

@copilot after the line "However, thread-safety of the underlying budoux library itself is not guaranteed.", add a line telling the user to consult the external library documentation

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 17, 2026

@copilot after the line "However, thread-safety of the underlying budoux library itself is not guaranteed.", add a line telling the user to consult the external library documentation

Added in commit 8d7c80c. All external wrapper docstrings (attacut, budoux, oskut, sefr_cut, wtsplit) now include: "Please refer to the [library name] library documentation for its thread-safety guarantees." after the disclaimer about underlying library thread-safety not being guaranteed.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

    t = threading.Thread(target=worker, args=(text, custom_dict))
    threads.append(t)
    t.start()


Copilot AI Jan 17, 2026


The safe example is incomplete - threads are started but never joined, which means the example doesn't wait for completion. Add for thread in threads: thread.join() after starting all threads to make this a complete, runnable example.

Suggested change:

    # Wait for all threads to finish
    for t in threads:
        t.join()

- Concurrent tokenization with multiple threads
- Race condition testing with multiple dictionaries
- Verification of result consistency across threads
- Stress testing with 5000+ concurrent operations

Copilot AI Jan 17, 2026


The documentation claims "Stress testing with 5000+ concurrent operations" but there's no test in the test file that performs 5000+ operations. The test suite uses at most 20 threads with 10 iterations each (200 operations). Update the documentation to accurately reflect the actual test coverage or add the claimed stress test.

Suggested change: replace

    - Stress testing with 5000+ concurrent operations

with

    - Stress testing with up to 200 concurrent operations (20 threads × 10 iterations)

- Built-in tokenizers (*longest*, *mm*, *newmm*, and *newmm-safe*) \
are thread-safe.
- Wrappers of external tokenizer are designed to be thread-safe \
but depends on the external tokenizer.

Copilot AI Jan 17, 2026


The documentation states "Wrappers of external tokenizer are designed to be thread-safe but depends on the external tokenizer" which is grammatically incorrect. Change "depends" to "depend" to match the plural subject "Wrappers".

Suggested change: replace

    but depends on the external tokenizer.

with

    but depend on the external tokenizer.

@bact
Member

bact commented Jan 17, 2026

@copilot apply changes based on the comments in this thread

… fix grammar

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 17, 2026

@copilot apply changes based on the comments in this thread

Applied all code review suggestions in commit 71230c6:

  1. ✅ Added for t in threads: t.join() to safe example in docs/threadsafe.rst
  2. ✅ Corrected stress test claim from "5000+ concurrent operations" to accurate "up to 200 concurrent operations (20 threads × 10 iterations)"
  3. ✅ Fixed grammar: changed "depends" to "depend" in core.py docstring


@bact merged commit 7ff7488 into dev on Jan 17, 2026
39 of 42 checks passed
@bact deleted the copilot/check-thread-safeness-word-tokenize branch on January 17, 2026 21:05
@bact added this to the 5.3 milestone on Jan 30, 2026