OPENNLP-1816: Make ME classes thread-safe by eliminating shared mutable instance state #1003
krickert wants to merge 1 commit into apache:main
Conversation
Force-pushed 2a904dd to 729b9c1
There were 3 checkstyle violations; fixed those.

@krickert Thanks for the PR!

I've always been a fan of OpenNLP - what I love about finally contributing is that before this patch, I had to create pools of ME objects or create new ones every time. This gets rid of all that scaffolding. If you would like me to create any more tests, let me know. I think the new tests cover the concurrency and recall use cases well, and I think the speed tests show that there's no concern about performance. I was excited to see the > 1.5x speedup with POSTagger... it's the single reason why I decided to work on this.
Hi, thanks for the contribution! Overall, I like the idea of looking into built-in thread safety rather than relying on ThreadLocal-based wrappers, which have known issues in Jakarta EE and other long-lived thread environments. A few concerns I'd like to discuss before this can move forward (imho):

1. The benchmarks are hand-rolled `System.nanoTime()` + `ExecutorService` loops. Without JMH, the results are susceptible to JIT warmup, GC pauses, and profile pollution, i.e. there's no fork isolation, no warmup iterations, and no statistical variance reporting. For a change that removes multiple caches, that level of rigor matters.
2. Three layers of caching were removed as a shortcut to thread safety.
3. The regression benchmark reports "performance within noise," but without JMH-level statistical rigor that's hard to verify. More importantly, the benchmark uses a small set of short sentences: a benchmark against a real-world dataset (e.g., from the eval/test corpora: https://nightlies.apache.org/opennlp/) would be far more convincing, particularly for POS tagging where the feature generation cache had the most impact under larger workloads. A thread-safe alternative would be making the caches method-local rather than removing them entirely.
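For reference, a JMH-based replacement for such hand-rolled loops might look roughly like the sketch below. All names here are hypothetical (the `SharedTagger` stand-in takes the place of a shared `POSTaggerME` loaded from a model); the point is only to show where fork isolation, warmup, and variance reporting come from.

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Hypothetical sketch of a JMH benchmark replacing hand-rolled
// System.nanoTime() loops: forks isolate JIT profiles, warmup
// iterations absorb compilation, and JMH reports score error.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Fork(2)                       // fork isolation
@Warmup(iterations = 5)        // JIT warmup
@Measurement(iterations = 10)  // statistical variance reporting
@State(Scope.Benchmark)
public class PosTaggerBenchmark {

  private SharedTagger shared;   // stand-in for a shared POSTaggerME

  @Setup
  public void setup() {
    shared = new SharedTagger();
  }

  @Benchmark
  @Threads(32)                   // contended access from 32 threads
  public String[] sharedInstance() {
    return shared.tag(new String[] {"The", "quick", "brown", "fox"});
  }

  // Stand-in so the sketch is self-contained; in the PR this would be
  // opennlp.tools.postag.POSTaggerME loaded from a trained model.
  static final class SharedTagger {
    String[] tag(String[] tokens) {
      String[] tags = new String[tokens.length];
      java.util.Arrays.fill(tags, "NN");
      return tags;
    }
  }
}
```

Run via the JMH runner (e.g. `java -jar benchmarks.jar PosTaggerBenchmark`); JMH then prints a score with an error margin per configuration.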
Regardless of my comment, I am going to trigger an Eval build for this: https://ci-builds.apache.org/job/OpenNLP/job/eval-tests-configurable/39/

@rzo1 working on addressing all of your concerns right now - it'll be done in a moment. I'm restoring the caches and running tests with and without them, with proper benchmarks. All great points, and thanks for the feedback.
Force-pushed 729b9c1 to 94ca28d
I'm going to make the caches optional and configurable. This way we can run tests against all scenarios and come up with as many use cases as needed to measure the impact. The last commit was premature; I'm still working on this.
@krickert Thanks Kristian for tackling this complex topic with so much energy! Much appreciated! Happy to review this PR deeper, especially looking forward to the JMH analyses. Richard has already given deep feedback in the first round; I'll share my 2c later on code-stylistic nuances, seeking an optimal result from a dev's perspective. For the moment, completing the 3.0.0-M2 release process is on my list…

@mawiesne no problem... I've been thinking about this for a while now. @rzo1 you were right about CachedFeatureGenerator: the data shows it clearly, and it helps. That particular cache brings a 1.6x boost in both the old and new instances. Combined with the thread-safety work and instance reuse, it now shows over a 2x increase. Thanks for pointing that out. But don't trust what I say; I'll update the tests shortly to show it (I would love to see it on another machine too).
Force-pushed bfc8fdf to d31aaa6
Thanks for the detailed feedback. We've addressed all four points made by @rzo1. Here's a summary of what changed and the JMH data behind each decision.

1. Benchmarks (JMH)

Replaced all hand-rolled `System.nanoTime()` + `ExecutorService` loops with JMH benchmarks. Also fixed the existing JMH profile: the annotation processor wasn't wired into the compiler plugin, so the benchmarks couldn't actually run.

JMH results (32 threads, all cores): Tokenizer and SentenceDetector show all approaches within error bars (lightweight constructors).

2. Caches

We restored all caches as ThreadLocal (per-thread, not shared): same behavior as the originals in single-threaded use, safe under concurrency. We also added a JMH benchmark measuring cache impact (POSTagger, 32 threads). This told us which caches matter and which don't.

Regarding the BeamSearch cache specifically: we restored it as ThreadLocal with a per-thread `probs[]` buffer and `contextsCache`.

3. Thread-safety tests

Addressed all sub-points.

4. Missing ME classes

All 7 ME classes are now covered and annotated `@ThreadSafe`.

5. ThreadSafe*ME wrappers deprecated

Since the ME classes are now themselves thread-safe, the `ThreadSafe*ME` wrappers are deprecated. We also replaced all internal usages of `ThreadSafe*ME` with direct ME usage. No internal code uses the wrappers anymore.

Open item: real-world dataset benchmarks. Agreed - this would strengthen the perf claims. The JMH benchmarks currently use the project's test data. Do you have any real-world dataset tests around that we can run it against quickly? It's the only way I'd feel confident as well.
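For anyone following along, a per-thread cache of the kind described in point 2 can be sketched as follows. This is illustrative only (class and method names are made up, and the real OpenNLP implementation may differ): each thread gets its own small LRU map, so no synchronization is needed and single-threaded behavior matches a plain shared cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a per-thread feature cache: each thread owns a private
// LRU map, so concurrent lookups never contend or corrupt state.
public class PerThreadFeatureCache {

  private static final int MAX_ENTRIES = 100;

  // One cache instance per thread; never shared across threads.
  private final ThreadLocal<Map<String, String[]>> cache =
      ThreadLocal.withInitial(() -> new LinkedHashMap<>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, String[]> e) {
          return size() > MAX_ENTRIES;  // simple LRU eviction
        }
      });

  public String[] getContext(String key) {
    // computeIfAbsent only touches this thread's map.
    return cache.get().computeIfAbsent(key, this::computeFeatures);
  }

  // Stand-in for real feature generation.
  private String[] computeFeatures(String key) {
    return new String[] {"w=" + key, "len=" + key.length()};
  }
}
```

The tradeoff is the usual ThreadLocal one: in pooled-thread environments the per-thread maps live as long as the threads do unless explicitly removed.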
Summary since first review: made all 7 ME classes thread-safe by eliminating shared mutable instance state. Deprecated the ThreadSafe*ME wrappers.

Motivation

ME classes were documented as not thread-safe due to mutable instance fields that corrupt under concurrent access. The workarounds were creating a new ME instance per call (expensive) or using the ThreadSafe*ME wrappers.

Approach

Mutable state moved to method-local variables or per-thread caches (ThreadLocal) at every layer.
Files changed (30 total)

- Source (13 files): TokenizerME, SentenceDetectorME, POSTaggerME, LemmatizerME, ChunkerME, NameFinderME, LanguageDetectorME, BeamSearch, CachedFeatureGenerator, ConfigurablePOSContextGenerator, DefaultPOSContextGenerator, DefaultSDContextGenerator, SentenceContextGenerator (Thai)
- Deprecated (7 files): ThreadSafeTokenizerME, ThreadSafeSentenceDetectorME, ThreadSafePOSTaggerME, ThreadSafeLemmatizerME, ThreadSafeChunkerME, ThreadSafeNameFinderME, ThreadSafeLanguageDetectorME
- Internal usage swaps (3 files): Muc6NameSampleStreamFactory, TwentyNewsgroupSampleStreamFactory, POSTaggerMEIT - replaced ThreadSafe*ME usage with direct ME usage
- Tests/benchmarks (5 files): ThreadSafetyBenchmarkTest (8 JUnit tests), 3 JMH benchmarks, CachedFeatureGeneratorTest update
- Build (1 file): pom.xml - fixed JMH annotation processor wiring
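The "method-local processing, publish at the end" approach described above can be sketched like this. The class and field names below are illustrative stand-ins (not the actual OpenNLP code): all intermediate work happens in locals, and the only shared field is a volatile reference written once, so concurrent callers never observe a half-built result.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: intermediate state lives in method-local variables; the
// shared field is volatile and assigned a fully built immutable list
// in a single write (atomic reference swap, last-writer-wins).
public class LocalThenPublishTagger {

  // Shared state is one volatile reference, published atomically.
  private volatile List<Double> lastProbs = List.of();

  public String[] tag(String[] tokens) {
    // Method-local buffers: no cross-thread interference.
    String[] tags = new String[tokens.length];
    List<Double> probs = new ArrayList<>(tokens.length);
    for (int i = 0; i < tokens.length; i++) {
      tags[i] = "NN";        // stand-in for the real beam search
      probs.add(1.0);
    }
    // Single volatile write publishes a complete, immutable result.
    lastProbs = List.copyOf(probs);
    return tags;
  }

  // Backward-compatible accessor: last-writer-wins under concurrency,
  // exact in single-threaded use.
  public List<Double> probs() {
    return lastProbs;
  }
}
```

Under concurrency `probs()` then returns whichever call published last, which matches the documented last-writer-wins behavior.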
Force-pushed d31aaa6 to b02c2eb
@mawiesne - I did a push again to make the code match the style better. The problem I had was that your CI/CD failed linting and forced me to write 80-column code, which makes parts of the code look ugly outside my IDE. Can you ease up on the linting to allow 120 or 140 columns, or is that too much? I don't care either way, it's just a setting in my IDE - but the codebase has 3000+ violations, so I suspect it hasn't really been enforced for a long time.

Note: You can use the OpenNLP Formatting XML, which is provided as a download. In addition, you only have a few fixes:
Oh cool! Thanks. I'll fix those today |
OPENNLP-1816: Make ME classes thread-safe by eliminating shared mutable state

All 7 ME classes (TokenizerME, SentenceDetectorME, POSTaggerME, LemmatizerME, ChunkerME, NameFinderME, LanguageDetectorME) are now safe for concurrent use from multiple threads. The ThreadSafe*ME wrappers are deprecated - use the ME classes directly.

Thread-safety approach:
- ME instance fields (bestSequence, tokProbs, newTokens, sentProbs) changed to volatile with method-local processing, atomic swap at end
- BeamSearch: probs[] buffer and contextsCache moved to per-thread state via ThreadLocal
- CachedFeatureGenerator: cache moved to per-thread state via ThreadLocal (JMH confirms 1.62x benefit from this cache)
- ConfigurablePOSContextGenerator: cache moved to per-thread state via ThreadLocal
- DefaultSDContextGenerator: buf/collectFeats moved to method-local

JMH benchmark results (32 threads):
- POSTagger instancePerThread: 2.52x faster than newInstancePerCall
- POSTagger cache on vs off: no measurable difference for context generator cache; CachedFeatureGenerator provides 1.62x benefit
- Tokenizer/SentenceDetector: all approaches within error bars

API changes:
- All 7 ME classes annotated @ThreadSafe
- All 7 ThreadSafe*ME wrappers annotated @Deprecated(since = "3.0.0")
- POSTaggerME: added constructor with contextCacheSize parameter
- CachedFeatureGenerator: added DISABLE_CACHE_PROPERTY for benchmarking
- Internal usages of ThreadSafe*ME replaced with direct ME usage

Tests:
- ThreadSafetyBenchmarkTest: 8 JUnit tests with CyclicBarrier (all 7 ME classes + probs() concurrency test)
- JMH benchmarks for Tokenizer, SentenceDetector, POSTagger
- Fixed JMH annotation processor config in pom.xml
- All 680 runtime + 352 formats tests pass
Force-pushed b02c2eb to 178386f
Fixed. Let me know if there are more tests you'd like me to run. I think between the benchmarks, the passing tests, and the harness, it makes a strong case.
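Barrier-based concurrency tests of the kind used here follow a common shape; a self-contained sketch (stand-in function, not the actual OpenNLP test code) looks like this: N threads are released at the same instant against one shared instance, and every result must match the single-threaded baseline.

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.function.Function;

// Sketch of a CyclicBarrier concurrency check: release all workers
// simultaneously against one shared instance and compare every
// result against the single-threaded baseline.
public class SharedInstanceConcurrencyCheck {

  public static boolean allMatchBaseline(
      Function<String, String> shared, String input, int threads)
      throws Exception {
    String baseline = shared.apply(input);       // single-threaded result
    CyclicBarrier barrier = new CyclicBarrier(threads);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> results = new java.util.ArrayList<>();
      for (int i = 0; i < threads; i++) {
        results.add(pool.submit(() -> {
          barrier.await();                       // maximize contention
          return shared.apply(input);
        }));
      }
      for (Future<String> f : results) {
        if (!baseline.equals(f.get(30, TimeUnit.SECONDS))) {
          return false;                          // corrupted result
        }
      }
      return true;
    } finally {
      pool.shutdownNow();
    }
  }
}
```

With a thread-unsafe instance this check fails intermittently under load, which is exactly the corruption the JUnit tests guard against.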
Summary
Make ME classes (TokenizerME, SentenceDetectorME, POSTaggerME, LemmatizerME) safe for concurrent use by eliminating shared mutable instance state. This enables reusing ME instances across threads instead of allocating a new instance per call, reducing allocation overhead in high-throughput pipelines.
The old pattern (`new TokenizerME(model)` per call) continues to work identically: zero regressions in correctness or performance.

Motivation
ME classes were documented as not thread-safe due to mutable instance fields (`bestSequence`, `tokProbs`, `newTokens`, `sentProbs`) that corrupt under concurrent access. The recommended workaround was either creating a new ME instance per call (expensive for high-throughput pipelines processing thousands of sentences in parallel) or using the `ThreadSafe*ME` wrappers (which use `ThreadLocal` and leak in Jakarta EE / long-running thread environments).

The root cause was mutable state at four layers:

- the ME classes themselves (`bestSequence`, `tokProbs`, `newTokens`, `sentProbs`)
- context generators (`contextsCache`, `wordsKey`, `buf`, `collectFeats`)
- `CachedFeatureGenerator` with mutable `prevTokens` and cache
- `BeamSearch` with a reused `probs[]` output buffer and a `contextsCache` that stored references to the reused buffer (cached values were always stale)

Approach
Move mutable state to method-local variables at every layer. ME instance fields are preserved as `volatile` for backward-compatible `probs()` access (last-writer-wins under concurrency). Caches are removed entirely: they were small (size 3 typically), not thread-safe, and in BeamSearch's case, buggy.

Files changed (10 source, 5 test)
- BeamSearch.java - removed reused `probs[]` and buggy `contextsCache`; added `@ThreadSafe`
- DefaultSDContextGenerator.java - `buf`/`collectFeats` moved to method-local; `collectFeatures()` signature updated
- SentenceContextGenerator.java (Thai) - updated `collectFeatures()` signature
- DefaultPOSContextGenerator.java - removed `contextsCache` and `wordsKey`
- ConfigurablePOSContextGenerator.java - removed `contextsCache` and `wordsKey`
- CachedFeatureGenerator.java - removed `prevTokens`, `contextsCache`, counters; delegates directly
- TokenizerME.java - `newTokens`/`tokProbs` volatile; `tokenizePos()` uses local lists
- SentenceDetectorME.java - `sentProbs` volatile; `sentPosDetect()` uses local list
- POSTaggerME.java - `bestSequence` volatile; `tag()` uses local var; added null guard
- LemmatizerME.java - `bestSequence` volatile; `predictSES()` uses local var

Backward compatibility
- Old pattern (`new ME(model)` per call) is unchanged - verified by regression benchmark
- `probs()` methods preserved (deprecated behavior under concurrency, correct single-threaded)
- `cacheSize` params accepted but ignored, marked `@Deprecated(since = "3.0.0")`

Test plan
- Existing tests pass (`mvn test` on opennlp-runtime)
- ThreadSafetyBenchmarkTest - JUnit correctness test: shared ME instances produce identical results to single-threaded baseline across all CPU cores
- RegressionBenchmark - head-to-head stock vs patched, new-instance-per-call only: zero mismatches, zero errors, performance within noise on both builds
- ThreadSafetyBenchmark - three-way comparison (new-instance-per-call / instance-per-thread / shared-single-instance)
- CachedFeatureGeneratorTest - updated for removed cache behavior
- `mvn clean install` at root (checkstyle must be skipped - 9,446 pre-existing violations on main)

Regression benchmark results (32 threads, new-instance-per-call)
Proves zero regression: stock vs patched, same API pattern.
Speedup benchmark results (32 threads, three-way comparison)
Approaches
The benchmark compares three strategies for using ME classes in a multi-threaded environment. All three produce identical output for a given input — the difference is how ME instances are allocated and shared.
- New instance per call: `String[] tags = new POSTaggerME(model).tag(tokens);`
- Instance per thread: `POSTaggerME tagger = new POSTaggerME(model); for (String[] t : sentences) tagger.tag(t);`
- Shared single instance: `POSTaggerME shared = new POSTaggerME(model); // pass shared to all threads`

Benchmark results
POSTagger sees the largest gain because its constructor is the heaviest — it builds a BeamSearch, a ConfigurablePOSContextGenerator, and a full AdaptiveFeatureGenerator chain on every instantiation. Reusing one instance per thread eliminates that allocation on every call, yielding a 1.67x speedup with zero correctness impact.
Tokenizer and SentenceDetector constructors are lighter, so the per-call overhead is smaller and all three approaches perform similarly.
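The instance-per-thread strategy can be expressed with a `ThreadLocal` at the application level. The sketch below uses a hypothetical stand-in `Tagger` class with an expensive constructor (in the real case, a `POSTaggerME` building its feature-generator chain); with this PR applied, a single shared instance would also be safe.

```java
import java.util.function.Supplier;

// Sketch of the instance-per-thread strategy: each worker thread
// lazily builds its own tagger once and reuses it for every call,
// avoiding the heavy per-call constructor without sharing state.
public class PerThreadTaggerPool {

  private final ThreadLocal<Tagger> perThread;

  public PerThreadTaggerPool(Supplier<Tagger> factory) {
    this.perThread = ThreadLocal.withInitial(factory);  // one tagger per thread
  }

  public String[] tag(String[] tokens) {
    return perThread.get().tag(tokens);  // always this thread's instance
  }

  // Stand-in for a class with an expensive constructor.
  public static class Tagger {
    public String[] tag(String[] tokens) {
      String[] tags = new String[tokens.length];
      java.util.Arrays.fill(tags, "NN");
      return tags;
    }
  }
}
```

Note the thread-pool caveat raised earlier in this thread: in long-lived thread environments the per-thread instances persist until removed, which is part of why built-in thread safety is preferable to this workaround.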
See `opennlp-core/opennlp-runtime/BENCHMARKS.md` for full benchmark instructions.

Thank you for contributing to Apache OpenNLP.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
https://issues.apache.org/jira/browse/OPENNLP-1816
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.