Commit fe9da10

OPENNLP-1745 : SentenceDetector - Add Junit test for useTokenEnd = false
1 parent 4fae1f5 commit fe9da10

83 files changed: +2207 -294 lines changed

.asf.yaml

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ github:
  dismiss_stale_reviews: false
  require_code_owner_reviews: false
  required_approving_review_count: 1
+ opennlp-2.x: {}
  autolink_jira:
  - OPENNLP
  custom_subjects:

.github/workflows/maven.yml

Lines changed: 3 additions & 3 deletions
@@ -34,7 +34,7 @@ jobs:
  java: [ 17, 21 ]
  experimental: [false]
  include:
- - java: 24-ea
+ - java: 25-ea
  os: ubuntu-latest
  experimental: true

@@ -49,11 +49,11 @@ jobs:
  restore-keys: |
  ${{ runner.os }}-maven-
  - name: Set up JDK ${{ matrix.java }}
- uses: actions/setup-java@v4
+ uses: actions/setup-java@3a4f6e1af504cf6a31855fa899c6aa5355ba6c12 # v4.7.0
  with:
  distribution: temurin
  java-version: ${{ matrix.java }}
  - name: Build with Maven
- run: mvn -V clean test install --no-transfer-progress -Pjacoco -Pci
+ run: mvn -V clean test install --show-version --batch-mode --no-transfer-progress -Pjacoco -Pci
  - name: Jacoco
  run: mvn jacoco:report

NOTICE

Lines changed: 3 additions & 3 deletions
@@ -61,9 +61,9 @@ List of third-party dependencies grouped by their license type.

  MIT License

- * onnx-runtime (com.microsoft.onnxruntime:onnxruntime_gpu:1.20.0 - https://microsoft.github.io/onnxruntime/)
- * onnx-runtime (com.microsoft.onnxruntime:onnxruntime:1.20.0 - https://microsoft.github.io/onnxruntime/)
- * SLF4J API Module (org.slf4j:slf4j-api:2.0.16 - http://www.slf4j.org)
+ * onnx-runtime (com.microsoft.onnxruntime:onnxruntime_gpu:1.22.0 - https://microsoft.github.io/onnxruntime/)
+ * onnx-runtime (com.microsoft.onnxruntime:onnxruntime:1.22.0 - https://microsoft.github.io/onnxruntime/)
+ * SLF4J API Module (org.slf4j:slf4j-api:2.0.17 - http://www.slf4j.org)

  The MIT License (MIT)

README.md

Lines changed: 30 additions & 6 deletions
@@ -71,10 +71,10 @@ Currently, the library has different packages:
  * `opennlp-tools` : The core toolkit.
  * `opennlp-tools-models` : A set of classes to load [OpenNLP models](https://github.com/apache/opennlp-models) from the classpath.
  * `opennlp-uima` : A set of [Apache UIMA](https://uima.apache.org) annotators.
- * `opennlp-morfologik-addon` : An addon for Morfologik
- * `opennlp-dl` : OpenNLP interface implementations for ONNX models using the `onnxruntime` dependency.
+ * `opennlp-morfologik-addon` : An addon for _Morfologik_.
+ * `opennlp-dl` : OpenNLP interface implementations for [ONNX](https://onnx.ai) models using the `onnxruntime` dependency.
  * `opennlp-dl-gpu` : Replaces `onnxruntime` with the `onnxruntime_gpu` dependency to support GPU acceleration.
- * `opennlp-sandbox`: Other projects in progress are found in the [sandbox](https://github.com/apache/opennlp-sandbox)
+ * `opennlp-sandbox`: Other projects in progress reside in the [sandbox](https://github.com/apache/opennlp-sandbox).

  ## Getting Started

@@ -104,6 +104,28 @@ compile group: "org.apache.opennlp", name: "opennlp-tools", version: "${opennlp.

  For more details please check our [documentation](http://opennlp.apache.org/docs/)

+ ## Branches and Merging Strategy
+
+ To support ongoing development and stable maintenance of Apache OpenNLP, the project follows a dual-branch model:
+
+ ### Branch overview
+
+ - **`main`**: Development branch for version **3.0** and beyond. All feature development and 3.x releases occur here.
+ - **`opennlp-2.x`**: Maintains the stable **2.x** release line. This branch will receive selective updates and patch releases.
+
+ ### Workflow summary
+
+ - Feature development
+ - New features targeting versions 3.0+ are developed on feature branches _off_ `main` and merged _into_ `main`.
+ - Bug fixes and dependency updates
+ - Relevant fixes or dependency updates from `main` may be cherry-picked into `opennlp-2.x` as needed.
+ - Releases
+ - **3.x** releases are made from the `main` branch.
+ - **2.x** releases are made from the `opennlp-2.x` branch.
+ - Release tags
+ - Release tags are applied directly to the appropriate version branch (`main` for 3.x or `opennlp-2.x` for 2.x).
+ - The presence of a version branch does not affect the tagging or visibility of releases.
+
  ## Building OpenNLP

  At least JDK 17 and Maven 3.3.9 are required to build the library.
@@ -114,12 +136,14 @@ After cloning the repository go into the destination directory and run:
  mvn install
  ```

- ### Additional Developement Information
+ ### Additional Development Information

- - [Building and Integrating Snowball Stemmer for OpenNLP](dev/Snowball-Stemmer.md)
+ - Building and integrating [Snowball Stemmer](dev/Snowball-Stemmer.md) for OpenNLP.

  ## Contributing

- The Apache OpenNLP project is developed by volunteers and is always looking for new contributors to work on all parts of the project. Every contribution is welcome and needed to make it better. A contribution can be anything from a small documentation typo fix to a new component.
+ The Apache OpenNLP project is developed by volunteers and is always looking for new contributors to work on all parts of the project.
+ Every contribution is welcome and needed to make it better.
+ A contribution can be anything from a small documentation typo fix to a new component.

  If you would like to get involved please follow the instructions [here](https://github.com/apache/opennlp/blob/main/.github/CONTRIBUTING.md)

opennlp-distr/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@
  <parent>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp</artifactId>
- <version>2.5.4-SNAPSHOT</version>
+ <version>3.0.0-SNAPSHOT</version>
  <relativePath>../pom.xml</relativePath>
  </parent>

opennlp-dl-gpu/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@
  <parent>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp</artifactId>
- <version>2.5.4-SNAPSHOT</version>
+ <version>3.0.0-SNAPSHOT</version>
  <relativePath>../pom.xml</relativePath>
  </parent>
  <groupId>org.apache.opennlp</groupId>

opennlp-dl/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@
  <parent>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp</artifactId>
- <version>2.5.4-SNAPSHOT</version>
+ <version>3.0.0-SNAPSHOT</version>
  <relativePath>../pom.xml</relativePath>
  </parent>
  <groupId>org.apache.opennlp</groupId>

opennlp-docs/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@
  <parent>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp</artifactId>
- <version>2.5.4-SNAPSHOT</version>
+ <version>3.0.0-SNAPSHOT</version>
  <relativePath>../pom.xml</relativePath>
  </parent>

opennlp-docs/src/docbkx/model-loading.xml

Lines changed: 11 additions & 11 deletions
@@ -78,18 +78,18 @@ under the License.

  <programlisting language="java">
  <![CDATA[
- final ClassgraphModelFinder finder = new ClassgraphModelFinder(); // or use new SimpleClassPathModelFinder()
+ final ClassgraphModelFinder finder = new ClassgraphModelFinder(); // or use: new SimpleClassPathModelFinder()
  final ClassPathModelLoader loader = new ClassPathModelLoader();
  final Set<ClassPathModelEntry> models = finder.findModels(false);
  for(ClassPathModelEntry entry : models) {

  final ClassPathModel model = loader.load(entry);

  if(model != null) {
- System.out.println(model.getModelName());
- System.out.println(model.getModelSHA256());
- System.out.println(model.getModelVersion());
- System.out.println(model.getModeLanguage());
+ System.out.println(model.getModelName());
+ System.out.println(model.getModelSHA256());
+ System.out.println(model.getModelVersion());
+ System.out.println(model.getModelLanguage());
  // do something with the model by consuming the byte array
  }
  }]]>
@@ -120,23 +120,23 @@ model.language=${model.language}
  Make sure to replace the values accordingly and configure your build tool to include the binary model and the <emphasis>model.properties</emphasis>
  in the resulting JAR file.

- To load such a custom model, you may need to adjust the pattern for classpath scanning. For example, if you name your model "custom-opennlp-model",
+ To load such a custom model, you may need to adjust the pattern for classpath scanning. For example, if you name the model "custom-opennlp-model",
  you need the following code to successfully find and load it:

  <programlisting language="java">
  <![CDATA[
- final ClassgraphModelFinder finder = new ClassgraphModelFinder("custom-opennlp-model.jar"); // or use new SimpleClassPathModelFinder("custom-opennlp-model.jar")
+ final ClassgraphModelFinder finder = new ClassgraphModelFinder("custom-opennlp-model.jar"); // or use: new SimpleClassPathModelFinder("custom-opennlp-model.jar")
  final ClassPathModelLoader loader = new ClassPathModelLoader();
  final Set<ClassPathModelEntry> models = finder.findModels(false);
  for(ClassPathModelEntry entry : models) {

  final ClassPathModel model = loader.load(entry);

  if(model != null) {
- System.out.println(model.getModelName());
- System.out.println(model.getModelSHA256());
- System.out.println(model.getModelVersion());
- System.out.println(model.getModeLanguage());
+ System.out.println(model.getModelName());
+ System.out.println(model.getModelSHA256());
+ System.out.println(model.getModelVersion());
+ System.out.println(model.getModelLanguage());
  // do something with the model by consuming the byte array
  }
  }]]>

opennlp-docs/src/docbkx/postagger.xml

Lines changed: 107 additions & 4 deletions
@@ -237,11 +237,114 @@ try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(model
  </para>
  <para>
  The dictionary is defined in a xml format and can be created and stored with the POSDictionary class.
- Please for now checkout the javadoc and source code of that class.
+ Below is an example to train a custom model using a tag dictionary.
  </para>
- <para>Note: The format should be documented and sample code should show how to use the dictionary.
- Any contributions are very welcome. If you want to contribute please contact us on the mailing list
- or comment on the jira issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-287">OPENNLP-287</ulink>.
+ <para>
+ Sample POS Training material (file : en-custom-pos.train)
+ <screen>
+ <![CDATA[
+ It_PRON is_OTHER spring_PROPN season_NOUN. The_DET flowers_NOUN are_OTHER red_ADJ and_CCONJ yellow_ADJ ._PUNCT
+ Red_NOUN is_OTHER my_DET favourite_ADJ colour_NOUN ._PUNCT]]>
+ </screen>
+ </para>
+ <para>
+ Sample Tag Dictionary (file : dictionary.xml)
+ <programlisting language="xml">
+ <![CDATA[
+ <?xml version="1.0" encoding="UTF-8"?>
+ <dictionary case_sensitive="false">
+ <entry tags="PRON">
+ <token>It</token>
+ </entry>
+ <entry tags="OTHER">
+ <token>is</token>
+ </entry>
+ <entry tags="PROPN">
+ <token>Spring</token>
+ </entry>
+ <entry tags="NOUN">
+ <token>season</token>
+ </entry>
+ <entry tags="DET">
+ <token>the</token>
+ </entry>
+ <entry tags="NOUN">
+ <token>flowers</token>
+ </entry>
+ <entry tags="OTHER">
+ <token>are</token>
+ </entry>
+ <entry tags="NOUN">
+ <token>red</token>
+ </entry>
+ <entry tags="CCONJ">
+ <token>and</token>
+ </entry>
+ <entry tags="NOUN">
+ <token>yellow</token>
+ </entry>
+ <entry tags="PRON">
+ <token>my</token>
+ </entry>
+ <entry tags="ADJ">
+ <token>favourite</token>
+ </entry>
+ <entry tags="NOUN">
+ <token>colour</token>
+ </entry>
+ <entry tags="PUNCT">
+ <token>.</token>
+ </entry>
+ </dictionary>]]>
+ </programlisting>
+ </para>
+ <para>Sample code to train a model using above tag dictionary
+ <programlisting language="java">
+ <![CDATA[
+ POSModel model = null;
+ try {
+ ObjectStream<String> lineStream = new PlainTextByLineStream(
+ new MarkableFileInputStreamFactory(new File("en-custom-pos.train")), StandardCharsets.UTF_8);
+
+ ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
+
+ TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
+ params.put(TrainingParameters.CUTOFF_PARAM, 0);
+
+ POSTaggerFactory factory = new POSTaggerFactory();
+ TagDictionary dict = factory.createTagDictionary(new File("dictionary.xml"));
+ factory.setTagDictionary(dict);
+
+ model = POSTaggerME.train("eng", sampleStream, params, factory);
+
+ OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("en-custom-pos-maxent.bin"));
+ model.serialize(modelOut);
+
+ } catch (IOException e) {
+ e.printStackTrace();
+ }]]>
+ </programlisting>
+ </para>
+ <para>
+ The custom model is then used to tag a sequence.
+ <programlisting language="java">
+ <![CDATA[
+ String[] sent = new String[]{"Spring", "is", "my", "favourite", "season", "."};
+ String[] tags = tagger.tag(sent);
+ Arrays.stream(tags).forEach(k -> System.out.print(k + " "));]]>
+ </programlisting>
+ </para>
+ <para>
+ <literallayout>
+ Input
+ Sentence: Spring is my favourite season.
+
+ Output
+ POS Tags using the custom model (en-custom-pos-maxent.bin): PROPN OTHER PRON ADJ NOUN PUNCT
+
+ Output with the default model
+ POS Tags using the default model (opennlp-en-ud-ewt-pos-1.2-2.5.0.bin): NOUN AUX PRON ADJ NOUN PUNCT
+ </literallayout>
  </para>
  </section>
  </section>
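
Note on the training-data format in en-custom-pos.train: the `token_TAG` lines can be sanity-checked before training. The following is a minimal sketch, not OpenNLP's own parser. `WordTagLineDemo` and `parse` are hypothetical names, and the sketch assumes each whitespace-separated item is split at its last underscore. Under that assumption, an item such as `season_NOUN.` from the first sample line would yield the tag `NOUN.` rather than `NOUN`, so punctuation is best kept as its own `._PUNCT` item.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for the token_TAG line format; not part of OpenNLP. */
final class WordTagLineDemo {

  /**
   * Splits one training line into {token, tag} pairs.
   * Assumption: each item is split at its LAST underscore.
   */
  static List<String[]> parse(String line) {
    List<String[]> pairs = new ArrayList<>();
    for (String item : line.trim().split("\\s+")) {
      int split = item.lastIndexOf('_');
      if (split == -1) {
        throw new IllegalArgumentException("Missing '_' separator in: " + item);
      }
      pairs.add(new String[] {item.substring(0, split), item.substring(split + 1)});
    }
    return pairs;
  }

  public static void main(String[] args) {
    // Second sample line from en-custom-pos.train.
    for (String[] pair : parse("Red_NOUN is_OTHER my_DET favourite_ADJ colour_NOUN ._PUNCT")) {
      System.out.println(pair[0] + " -> " + pair[1]);
    }
    // A period glued to the tag stays part of the tag under this split rule.
    System.out.println(parse("season_NOUN.").get(0)[1]);
  }
}
```

Running a check like this over the training file makes unintended tags visible before they end up in the trained model.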
