
Commit 7ff7488

Merge pull request #1213 from PyThaiNLP/copilot/check-thread-safeness-word-tokenize
Ensure thread-safety for word_tokenize() wrapper functions
2 parents 11c8a9e + 71230c6

File tree

10 files changed (+629 additions, -35 deletions)
docs/threadsafe.rst

Lines changed: 185 additions & 0 deletions

Thread safety in PyThaiNLP word tokenization
==============================================

Summary
-------

PyThaiNLP's core word tokenization engines are designed with thread-safety
in mind. Internal implementations (``mm``, ``newmm``, ``newmm-safe``,
``longest``, ``icu``) are thread-safe.

For engines that wrap external libraries (``attacut``, ``budoux``, ``deepcut``,
``nercut``, ``nlpo3``, ``oskut``, ``sefr_cut``, ``tltk``, ``wtsplit``), the
wrapper code is thread-safe, but we cannot guarantee thread-safety of the
underlying external libraries themselves.

Thread safety implementation
-----------------------------

**Internal implementations (fully thread-safe):**

- ``mm``, ``newmm``, ``newmm-safe``: stateless implementations; all data
  is local
- ``longest``: uses a lock-protected check-then-act pattern to manage the
  global cache shared across threads
- ``icu``: each thread gets its own ``BreakIterator`` instance

**External library wrappers (wrapper code is thread-safe):**

- ``attacut``: uses a lock-protected check-then-act pattern to manage the
  global cache; thread-safety of the underlying library is not guaranteed
- ``budoux``: uses lock-protected lazy initialization of the parser;
  thread-safety of the underlying library is not guaranteed
- ``deepcut``, ``nercut``, ``nlpo3``, ``tltk``: stateless wrappers;
  thread-safety of the underlying libraries is not guaranteed
- ``oskut``, ``sefr_cut``, ``wtsplit``: use lock-protected model loading
  when switching models/engines; thread-safety of the underlying libraries
  is not guaranteed
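In isolation, the lock-protected check-then-act pattern used by ``longest``
and ``attacut`` looks roughly like this (a minimal sketch, not the actual
PyThaiNLP code; ``make_tokenizer`` is a hypothetical engine factory):

.. code-block:: python

    import threading

    _cache: dict[str, object] = {}
    _cache_lock = threading.Lock()

    def make_tokenizer(model: str) -> object:
        """Hypothetical stand-in for an engine factory such as a tokenizer class."""
        return object()

    def get_tokenizer(model: str) -> object:
        # The check and the insert must both happen while the lock is held;
        # otherwise two threads could miss the cache simultaneously and
        # create duplicate (possibly half-initialized) entries.
        with _cache_lock:
            if model not in _cache:
                _cache[model] = make_tokenizer(model)
            return _cache[model]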
Usage in multi-threaded applications
-------------------------------------

Using a tokenization engine safely in multi-threaded contexts:

.. code-block:: python

    import threading

    from pythainlp.tokenize import word_tokenize

    def tokenize_worker(text, results, index):
        # Thread-safe for all engines
        results[index] = word_tokenize(text, engine="longest")

    texts = ["ผมรักประเทศไทย", "วันนี้อากาศดี", "เขาไปโรงเรียน"]
    results = [None] * len(texts)
    threads = []

    for i, text in enumerate(texts):
        thread = threading.Thread(target=tokenize_worker, args=(text, results, i))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    # All results are correctly populated
    print(results)
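The same workload can also be run through a thread pool; a short sketch
using the standard library's ``concurrent.futures``:

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor

    from pythainlp.tokenize import word_tokenize

    texts = ["ผมรักประเทศไทย", "วันนี้อากาศดี", "เขาไปโรงเรียน"]

    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map() preserves input order, so results align with texts
        results = list(pool.map(lambda t: word_tokenize(t, engine="longest"), texts))

    print(results)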
Performance considerations
--------------------------

1. **Lock-based synchronization** (longest, attacut):

   - Minimal overhead for cache access
   - Cache lookups are very fast
   - Lock contention is minimal in typical usage

2. **Thread-local storage** (icu; see the sketch after this list):

   - Each thread maintains its own instance
   - No synchronization overhead after initialization
   - Slightly higher memory usage (one instance per thread)

3. **Stateless engines** (newmm, mm):

   - Zero synchronization overhead
   - Best performance in multi-threaded scenarios
   - Recommended for high-throughput applications
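The thread-local pattern in item 2 can be sketched as follows, assuming the
PyICU package and its ``BreakIterator`` API (a minimal illustration; the
actual ``icu`` engine may differ in detail):

.. code-block:: python

    import threading

    from icu import BreakIterator, Locale

    _thread_local = threading.local()

    def get_break_iterator() -> BreakIterator:
        # Each thread lazily creates, then reuses, its own instance,
        # so no lock is needed after a thread's first call.
        if not hasattr(_thread_local, "break_iterator"):
            _thread_local.break_iterator = BreakIterator.createWordInstance(
                Locale("th")
            )
        return _thread_local.break_iterator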
Best practices
--------------

1. **For high-throughput applications**: Consider using stateless engines like
   ``newmm`` or ``mm`` for optimal performance.

2. **For custom dictionaries**: The ``longest`` engine with custom dictionaries
   maintains a cache per dictionary object. Reuse dictionary objects across
   threads to maximize cache efficiency.

3. **For process pools**: All engines work correctly with multiprocessing, as
   each process has its own memory space (see the sketch after this list).

4. **IMPORTANT: Do not modify custom dictionaries during tokenization**:

   - Create your custom Trie/dictionary before starting threads
   - Never call ``trie.add()`` or ``trie.remove()`` while tokenization is
     in progress
   - If you need to update the dictionary, create a new Trie instance and
     pass it to subsequent tokenization calls
   - The Trie data structure itself is NOT thread-safe for concurrent
     modifications
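A brief sketch of item 3, using the standard library's ``multiprocessing``
(illustrative; any engine works the same way):

.. code-block:: python

    from multiprocessing import Pool

    from pythainlp.tokenize import word_tokenize

    def tokenize(text: str) -> list[str]:
        return word_tokenize(text, engine="newmm")

    if __name__ == "__main__":
        texts = ["ผมรักประเทศไทย", "วันนี้อากาศดี", "เขาไปโรงเรียน"]
        # Each worker process has its own copy of the engine state
        with Pool(processes=2) as pool:
            print(pool.map(tokenize, texts))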
Example of safe custom dictionary usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import threading

    from pythainlp.corpus.common import thai_words
    from pythainlp.tokenize import word_tokenize
    from pythainlp.util import dict_trie

    # SAFE: Create dictionary once before threading
    custom_words = set(thai_words())
    custom_words.add("คำใหม่")
    custom_dict = dict_trie(custom_words)

    texts = ["ผมรักประเทศไทย", "วันนี้อากาศดี", "เขาไปโรงเรียน"]

    def worker(text, custom_dict):
        # SAFE: Only reading from the dictionary
        return word_tokenize(text, engine="newmm", custom_dict=custom_dict)

    # All threads share the same dictionary (read-only)
    threads = []
    for text in texts:
        t = threading.Thread(target=worker, args=(text, custom_dict))
        threads.append(t)
        t.start()

    # Wait for all threads to finish
    for t in threads:
        t.join()
Example of UNSAFE usage (DO NOT DO THIS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # UNSAFE: Modifying dictionary while threads are using it
    custom_dict = dict_trie(thai_words())

    def unsafe_worker(text, custom_dict):
        result = word_tokenize(text, engine="newmm", custom_dict=custom_dict)
        # DANGER: Modifying the shared dictionary
        custom_dict.add("คำใหม่")  # This is NOT thread-safe!
        return result
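If the vocabulary must change at runtime, build a fresh Trie and pass it to
subsequent calls instead of mutating the shared one, as item 4 of the best
practices recommends (a minimal sketch):

.. code-block:: python

    from pythainlp.corpus.common import thai_words
    from pythainlp.tokenize import word_tokenize
    from pythainlp.util import dict_trie

    # SAFE alternative: a new Trie leaves the one already in use untouched
    updated_words = set(thai_words())
    updated_words.add("คำใหม่")
    updated_dict = dict_trie(updated_words)

    tokens = word_tokenize("ผมรักประเทศไทย", engine="newmm", custom_dict=updated_dict)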
Testing
-------

Comprehensive thread safety tests are available in:

- ``tests/core/test_tokenize_thread_safety.py``

The test suite includes:

- Concurrent tokenization with multiple threads
- Race condition testing with multiple dictionaries
- Verification of result consistency across threads
- Stress testing with up to 200 concurrent operations (20 threads × 10 iterations)
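A reduced, hypothetical version of such a consistency check (not the actual
test file) might look like:

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor

    from pythainlp.tokenize import word_tokenize

    def test_concurrent_tokenization_is_consistent():
        text = "ผมรักประเทศไทย"
        expected = word_tokenize(text, engine="newmm")
        # 20 threads x 10 iterations = 200 concurrent operations
        with ThreadPoolExecutor(max_workers=20) as pool:
            results = list(
                pool.map(lambda _: word_tokenize(text, engine="newmm"), range(200))
            )
        assert all(result == expected for result in results)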
Maintenance notes
-----------------

When adding new tokenization engines to PyThaiNLP:

1. **Avoid global mutable state** whenever possible
2. If caching is necessary, use thread-safe locks
3. If per-thread state is needed, use ``threading.local()``
4. Always add thread safety tests for new engines
5. Document thread safety guarantees in docstrings
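A minimal skeleton following guidelines 1, 2, and 5 for a hypothetical new
engine wrapper (``_StubModel`` stands in for an external library):

.. code-block:: python

    import threading

    class _StubModel:
        """Hypothetical stand-in for an external tokenizer model."""

        def tokenize(self, text: str) -> list[str]:
            return text.split()

    _model = None
    _model_lock = threading.Lock()

    def segment(text: str) -> list[str]:
        """Tokenize text.

        Thread-safe: lazy model initialization is protected by a lock
        (guideline 2), and there is no other global mutable state
        (guideline 1).
        """
        global _model
        if not text or not isinstance(text, str):
            return []
        with _model_lock:
            if _model is None:
                _model = _StubModel()
        return _model.tokenize(text)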
Related files
-------------

- Core implementation: ``pythainlp/tokenize/core.py``
- Engine implementations: ``pythainlp/tokenize/*.py``
- Tests: ``tests/core/test_tokenize_thread_safety.py``

pythainlp/tokenize/attacut.py

Lines changed: 15 additions & 4 deletions
@@ -9,6 +9,8 @@

 from __future__ import annotations

+import threading
+
 from attacut import Tokenizer


@@ -26,10 +28,17 @@ def tokenize(self, text: str) -> list[str]:


 _tokenizers: dict[str, AttacutTokenizer] = {}
+_tokenizers_lock = threading.Lock()


 def segment(text: str, model: str = "attacut-sc") -> list[str]:
     """Wrapper for AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai
+
+    The wrapper uses a lock to protect access to the internal tokenizer cache.
+    However, thread-safety of the underlying AttaCut library itself is not
+    guaranteed. Please refer to the AttaCut library documentation for its
+    thread-safety guarantees.
+
     :param str text: text to be tokenized to words
     :param str model: model of word tokenizer model
     :return: list of words, tokenized from the text
@@ -41,8 +50,10 @@ def segment(text: str, model: str = "attacut-sc") -> list[str]:
     if not text or not isinstance(text, str):
         return []

-    global _tokenizers
-    if model not in _tokenizers:
-        _tokenizers[model] = AttacutTokenizer(model)
+    # Thread-safe access to the tokenizers cache
+    with _tokenizers_lock:
+        if model not in _tokenizers:
+            _tokenizers[model] = AttacutTokenizer(model)
+        tokenizer = _tokenizers[model]

-    return _tokenizers[model].tokenize(text)
+    return tokenizer.tokenize(text)

pythainlp/tokenize/budoux.py

Lines changed: 14 additions & 5 deletions
@@ -12,7 +12,10 @@

 from __future__ import annotations

+import threading
+
 _parser = None
+_parser_lock = threading.Lock()


 def _init_parser():
@@ -34,17 +37,23 @@ def _init_parser():
 def segment(text: str) -> list[str]:
     """Segment `text` into tokens using budoux.

+    The wrapper uses a lock to protect lazy initialization of the parser.
+    However, thread-safety of the underlying budoux library itself is not
+    guaranteed. Please refer to the budoux library documentation for its
+    thread-safety guarantees.
+
     The function returns a list of strings. If `budoux` is not available
     the function raises ImportError with an installation hint.
     """
     if not text or not isinstance(text, str):
         return []

-    global _parser
-    if _parser is None:
-        _parser = _init_parser()
-
-    parser = _parser
+    # Thread-safe lazy initialization; the global declaration must come
+    # before any use of _parser in this scope
+    global _parser
+    with _parser_lock:
+        if _parser is None:
+            _parser = _init_parser()
+        parser = _parser

     result = parser.parse(text)
pythainlp/tokenize/core.py

Lines changed: 9 additions & 0 deletions
@@ -159,6 +159,15 @@ def word_tokenize(
     :Note:
         - The **custom_dict** parameter only works for \
           *deepcut*, *longest*, *newmm*, and *newmm-safe* engines.
+        - Built-in tokenizers (*longest*, *mm*, *newmm*, and *newmm-safe*) \
+          are thread-safe.
+        - Wrappers of external tokenizers are designed to be thread-safe \
+          but depend on the thread-safety of the external tokenizer itself.
+        - **WARNING**: When using custom_dict in multi-threaded environments, \
+          do NOT modify the Trie object (via add/remove methods) while \
+          tokenization is in progress. The Trie data structure is not \
+          thread-safe for concurrent modifications. Create your dictionary \
+          before starting threads and only read from it during tokenization.
     :Example:

     Tokenize text with different tokenizers::

pythainlp/tokenize/longest.py

Lines changed: 12 additions & 4 deletions
@@ -13,6 +13,7 @@
 from __future__ import annotations

 import re
+import threading

 from pythainlp import thai_tonemarks
 from pythainlp.tokenize import word_dict_trie
@@ -154,11 +155,15 @@ def tokenize(self, text: str) -> list[str]:


 _tokenizers: dict[int, LongestMatchTokenizer] = {}
+_tokenizers_lock = threading.Lock()


 def segment(text: str, custom_dict: Trie | None = None) -> list[str]:
     """Dictionary-based longest matching word segmentation.

+    This function is thread-safe. It uses a lock to protect access to the
+    internal tokenizer cache.
+
     :param str text: text to be tokenized into words
     :param pythainlp.util.Trie custom_dict: dictionary for tokenization
     :return: list of words, tokenized from the text
@@ -169,9 +174,12 @@ def segment(text: str, custom_dict: Trie | None = None) -> list[str]:
     if not custom_dict:
         custom_dict = word_dict_trie()

-    global _tokenizers
     custom_dict_ref_id = id(custom_dict)
-    if custom_dict_ref_id not in _tokenizers:
-        _tokenizers[custom_dict_ref_id] = LongestMatchTokenizer(custom_dict)

-    return _tokenizers[custom_dict_ref_id].tokenize(text)
+    # Thread-safe access to the tokenizers cache
+    with _tokenizers_lock:
+        if custom_dict_ref_id not in _tokenizers:
+            _tokenizers[custom_dict_ref_id] = LongestMatchTokenizer(custom_dict)
+        tokenizer = _tokenizers[custom_dict_ref_id]
+
+    return tokenizer.tokenize(text)

pythainlp/tokenize/oskut.py

Lines changed: 27 additions & 6 deletions
@@ -11,17 +11,38 @@

 from __future__ import annotations

+import threading
+
 import oskut

-DEFAULT_ENGINE = "ws"
-oskut.load_model(engine=DEFAULT_ENGINE)
+_DEFAULT_ENGINE = "ws"
+_engine_lock = threading.Lock()
+
+# Load default model at module initialization
+oskut.load_model(engine=_DEFAULT_ENGINE)


 def segment(text: str, engine: str = "ws") -> list[str]:
-    global DEFAULT_ENGINE
+    """Segment text using OSKut.
+
+    The wrapper uses a lock to protect model loading when switching engines.
+    However, thread-safety of the underlying OSKut library itself is not
+    guaranteed. Please refer to the OSKut library documentation for its
+    thread-safety guarantees.
+
+    :param str text: text to be tokenized
+    :param str engine: model engine to use
+    :return: list of tokens
+    """
     if not text or not isinstance(text, str):
         return []
-    if engine != DEFAULT_ENGINE:
-        DEFAULT_ENGINE = engine
-        oskut.load_model(engine=DEFAULT_ENGINE)
+
+    # Thread-safe model loading; the global declaration must come before
+    # any use of _DEFAULT_ENGINE in this scope
+    global _DEFAULT_ENGINE
+    with _engine_lock:
+        if engine != _DEFAULT_ENGINE:
+            # Need to update global state and reload model
+            _DEFAULT_ENGINE = engine
+            oskut.load_model(engine=_DEFAULT_ENGINE)

     return oskut.OSKut(text)
