Commit adc38fa

OPENNLP-1753: Switch to pre-trained Model binaries v1.3 (OpenNLP 2.x) (#810)
- provides and adapts PR #799 for OpenNLP 2.x maintenance branch
- adjusts version strings
- adjusts index.html template to latest v1.3 copy for DownloadParserTest
- adjusts examples in Dev manual
1 parent 5fb0530 commit adc38fa

23 files changed, +687 -627 lines changed

opennlp-docs/src/docbkx/lemmatizer.xml

Lines changed: 2 additions & 2 deletions
@@ -41,7 +41,7 @@
 <para>
 <screen>
 <![CDATA[
-$ opennlp LemmatizerME opennlp-en-ud-ewt-lemmas-1.2-2.5.0.bin < sentences]]>
+$ opennlp LemmatizerME opennlp-en-ud-ewt-lemmas-1.3-2.5.4.bin < sentences]]>
 </screen>
 The Lemmatizer now reads a pos tagged sentence(s) per line from
 standard input. For example, you can copy this sentence to the
@@ -89,7 +89,7 @@ signed VERB sign
 <programlisting language="java">
 <![CDATA[
 LemmatizerModel model = null;
-try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-lemmas-1.2-2.5.0.bin"))) {
+try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-lemmas-1.3-2.5.4.bin"))) {
 model = new LemmatizerModel(modelIn);
 }
 ]]>

opennlp-docs/src/docbkx/postagger.xml

Lines changed: 3 additions & 3 deletions
@@ -41,7 +41,7 @@ under the License.
 Download the English maxent pos model and start the POS Tagger Tool with this command:
 <screen>
 <![CDATA[
-$ opennlp POSTagger opennlp-en-ud-ewt-pos-1.2-2.5.0.bin]]>
+$ opennlp POSTagger opennlp-en-ud-ewt-pos-1.3-2.5.4.bin]]>
 </screen>
 The POS Tagger now reads a tokenized sentence per line from stdin.
 Copy these two sentences to the console:
@@ -69,7 +69,7 @@ Mr._PROPN Vinken_PROPN is_AUX chairman_NOUN of_ADP Elsevier_ADJ N.V._PROPN ,_PUN
 In the sample below it is loaded from disk.
 <programlisting language="java">
 <![CDATA[
-try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-pos-1.2-2.5.0.bin"){
+try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-pos-1.3-2.5.4.bin"){
 POSModel model = new POSModel(modelIn);
 }]]>
 </programlisting>
@@ -343,7 +343,7 @@ Arrays.stream(tags).forEach(k -> System.out.print(k + " "));]]>
 POS Tags using the custom model (en-custom-pos-maxent.bin): PROPN OTHER PRON ADJ NOUN PUNCT
 
 Output with the default model
-POS Tags using the default model (opennlp-en-ud-ewt-pos-1.2-2.5.0.bin): NOUN AUX PRON ADJ NOUN PUNCT
+POS Tags using the default model (opennlp-en-ud-ewt-pos-1.3-2.5.4.bin): NOUN AUX PRON ADJ NOUN PUNCT
 </literallayout>
 </para>
 </section>

opennlp-docs/src/docbkx/sentdetect.xml

Lines changed: 25 additions & 25 deletions
@@ -1,7 +1,7 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
-"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
-]>
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+]>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements. See the NOTICE file
@@ -57,31 +57,31 @@ Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC,
 </para>
 
 <section id="tools.sentdetect.detection.cmdline">
-<title>Sentence Detection Tool</title>
-<para>
-The easiest way to try out the Sentence Detector is the command line tool. The tool is only intended for demonstration and testing.
-Download the english sentence detector model and start the Sentence Detector Tool with this command:
-<screen>
-<![CDATA[
-$ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin]]>
-</screen>
-Just copy the sample text from above to the console. The Sentence Detector will read it and echo one sentence per line to the console.
-Usually the input is read from a file and the output is redirected to another file. This can be achieved with the following command.
-<screen>
-<![CDATA[
-$ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin < input.txt > output.txt]]>
-</screen>
-For the english sentence model from the website the input text should not be tokenized.
-</para>
+<title>Sentence Detection Tool</title>
+<para>
+The easiest way to try out the Sentence Detector is the command line tool. The tool is only intended for demonstration and testing.
+Download the english sentence detector model and start the Sentence Detector Tool with this command:
+<screen>
+<![CDATA[
+$ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.3-2.5.4.bin]]>
+</screen>
+Just copy the sample text from above to the console. The Sentence Detector will read it and echo one sentence per line to the console.
+Usually the input is read from a file and the output is redirected to another file. This can be achieved with the following command.
+<screen>
+<![CDATA[
+$ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.3-2.5.4.bin < input.txt > output.txt]]>
+</screen>
+For the english sentence model from the website the input text should not be tokenized.
+</para>
 </section>
 <section id="tools.sentdetect.detection.api">
-<title>Sentence Detection API</title>
-<para>
-The Sentence Detector can be easily integrated into an application via its API.
-To instantiate the Sentence Detector the sentence model must be loaded first.
-<programlisting language="java">
-<![CDATA[
-try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin")) {
+<title>Sentence Detection API</title>
+<para>
+The Sentence Detector can be easily integrated into an application via its API.
+To instantiate the Sentence Detector the sentence model must be loaded first.
+<programlisting language="java">
+<![CDATA[
+try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-sentence-1.3-2.5.4.bin")) {
 SentenceModel model = new SentenceModel(modelIn);
 }]]>
 </programlisting>

opennlp-docs/src/docbkx/tokenizer.xml

Lines changed: 3 additions & 3 deletions
@@ -97,7 +97,7 @@ $ opennlp SimpleTokenizer]]>
 our website.
 <screen>
 <![CDATA[
-$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin]]>
+$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.3-2.5.4.bin]]>
 </screen>
 To test the tokenizer copy the sample from above to the console. The
 whitespace separated tokens will be written back to the
@@ -107,7 +107,7 @@ $ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin]]>
 Usually the input is read from a file and written to a file.
 <screen>
 <![CDATA[
-$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin < article.txt > article-tokenized.txt]]>
+$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.3-2.5.4.bin < article.txt > article-tokenized.txt]]>
 </screen>
 It can be done in the same way for the Simple Tokenizer.
 </para>
@@ -151,7 +151,7 @@ London share prices were bolstered largely by continued gains on Wall Street and
 can be loaded.
 <programlisting language="java">
 <![CDATA[
-try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin")) {
+try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-tokens-1.3-2.5.4.bin")) {
 TokenizerModel model = new TokenizerModel(modelIn);
 }]]>
 </programlisting>

opennlp-tools/src/main/java/opennlp/tools/monitoring/StopCriteria.java

Lines changed: 1 addition & 5 deletions
@@ -19,18 +19,14 @@
 
 import java.util.function.Predicate;
 
-import opennlp.tools.ml.model.AbstractModel;
-
-
 /**
 * Stop criteria for model training. If the predicate is met, then the training is aborted.
 *
 * @see Predicate
-* @see AbstractModel
 */
 public interface StopCriteria<T extends Number> extends Predicate<T> {
 
-String FINISHED = "Training Finished after completing %s Iterations successfully.";
+String FINISHED = "Training finished after completing %s iterations successfully.";
 
 /**
 * @return A detailed message captured upon hitting the {@link StopCriteria} during model training.
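
Note on the StopCriteria change above: the reworded FINISHED constant is a format string with a single %s placeholder. The following is a minimal sketch of how such a constant might be consumed, assuming a local stand-in interface rather than the real OpenNLP type. StopCriteriaSketch, MaxIterations and finishedMessage are hypothetical names introduced only for illustration and are not part of this commit.

```java
import java.util.function.Predicate;

// Hypothetical stand-in that mirrors the shape of the interface shown in the diff.
// It is NOT the OpenNLP StopCriteria type.
interface StopCriteriaSketch<T extends Number> extends Predicate<T> {
  String FINISHED = "Training finished after completing %s iterations successfully.";
}

// Hypothetical criterion: the predicate is met once the iteration count reaches a limit.
final class MaxIterations implements StopCriteriaSketch<Integer> {
  private final int limit;

  MaxIterations(int limit) {
    this.limit = limit;
  }

  @Override
  public boolean test(Integer iteration) {
    return iteration >= limit;
  }

  // Fills the %s placeholder of FINISHED with the number of completed iterations.
  String finishedMessage(int iterations) {
    return String.format(FINISHED, iterations);
  }

  public static void main(String[] args) {
    MaxIterations criteria = new MaxIterations(100);
    int iteration = 0;
    while (!criteria.test(iteration)) {
      iteration++;
    }
    System.out.println(criteria.finishedMessage(iteration));
  }
}
```

Because the constant is declared in an interface, it is implicitly public, static and final, so implementing classes can reference it without qualification.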

opennlp-tools/src/main/java/opennlp/tools/util/DownloadUtil.java

Lines changed: 4 additions & 4 deletions
@@ -56,9 +56,9 @@ public class DownloadUtil {
 private static final Logger logger = LoggerFactory.getLogger(DownloadUtil.class);
 
 private static final String BASE_URL =
-System.getProperty("OPENNLP_DOWNLOAD_BASE_URL", "https://dlcdn.apache.org/opennlp/");
+System.getProperty("OPENNLP_DOWNLOAD_BASE_URL", "https://dlcdn.apache.org/opennlp/");
 private static final String MODEL_URI_PATH =
-System.getProperty("OPENNLP_DOWNLOAD_MODEL_PATH", "models/ud-models-1.2/");
+System.getProperty("OPENNLP_DOWNLOAD_MODEL_PATH", "models/ud-models-1.3/");
 private static final String OPENNLP_DOWNLOAD_HOME = "OPENNLP_DOWNLOAD_HOME";
 
 private static Map<String, Map<ModelType, String>> availableModels;
@@ -202,7 +202,7 @@ private static void validateModel(URL sha512, Path downloadedModel) throws IOExc
 final String actualChecksum = calculateSHA512(downloadedModel);
 if (!actualChecksum.equalsIgnoreCase(expectedChecksum)) {
 throw new IOException("SHA512 checksum validation failed for " + downloadedModel.getFileName() +
-". Expected: " + expectedChecksum + ", but got: " + actualChecksum);
+". Expected: " + expectedChecksum + ", but got: " + actualChecksum);
 }
 }
 
@@ -353,7 +353,7 @@ private void addModel(String locale, String link, Map<String, Map<ModelType, Str
 private String fetchPageIndex() {
 final StringBuilder html = new StringBuilder();
 try (BufferedReader br = new BufferedReader(
-new InputStreamReader(indexUrl.openStream(), StandardCharsets.UTF_8))) {
+new InputStreamReader(indexUrl.openStream(), StandardCharsets.UTF_8))) {
 String line;
 while ((line = br.readLine()) != null) {
 html.append(line);
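
Note on the validateModel hunk above: the body of calculateSHA512 is not part of this diff. The following sketch shows one way the checksum comparison could work, using only the JDK. ChecksumSketch, sha512Hex and validate are hypothetical names, and reading the whole file into memory is a simplification that the real implementation may well avoid.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not the DownloadUtil implementation.
final class ChecksumSketch {

  // Computes the SHA-512 digest of a file and returns it as a lowercase hex string.
  // Assumption: the file is small enough to be read into memory in one go.
  static String sha512Hex(Path file) throws IOException {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-512");
      byte[] hash = digest.digest(Files.readAllBytes(file));
      StringBuilder hex = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-512 is not available", e);
    }
  }

  // Compares the expected and the actual checksum case-insensitively, as in the diff.
  static void validate(String expectedChecksum, Path downloadedModel) throws IOException {
    String actualChecksum = sha512Hex(downloadedModel);
    if (!actualChecksum.equalsIgnoreCase(expectedChecksum)) {
      throw new IOException("SHA512 checksum validation failed for " + downloadedModel.getFileName()
          + ". Expected: " + expectedChecksum + ", but got: " + actualChecksum);
    }
  }

  public static void main(String[] args) throws IOException {
    // Demonstration on a temporary file instead of a real model binary.
    Path tmp = Files.createTempFile("model", ".bin");
    Files.write(tmp, new byte[] {1, 2, 3});
    String checksum = sha512Hex(tmp);
    validate(checksum.toUpperCase(), tmp); // passes: the comparison ignores case
    System.out.println("Checksum verified: " + checksum);
    Files.delete(tmp);
  }
}
```

The case-insensitive comparison matters because published .sha512 files may carry the hex digest in either upper or lower case.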

opennlp-tools/src/test/java/opennlp/tools/AbstractModelLoaderTest.java

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ public abstract class AbstractModelLoaderTest {
 private static final String BASE_URL_MODELS_V183 = "https://dlcdn.apache.org/opennlp/models/langdetect/1.8.3/";
 protected static final Path OPENNLP_DIR = Paths.get(System.getProperty("OPENNLP_DOWNLOAD_HOME",
 System.getProperty("user.home"))).resolve(".opennlp");
-protected static final String VER = "1.2-2.5.0";
+protected static final String VER = "1.3-2.5.4";
 protected static final String BIN = ".bin";
 protected static List<String> SUPPORTED_LANG_CODES = List.of(
 "en", "fr", "de", "it", "nl", "bg", "ca", "cs", "da", "el",
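
Note on the VER change above: the model file names used throughout this commit (for example opennlp-en-ud-ewt-tokens-1.3-2.5.4.bin) appear to follow the pattern opennlp-<language and treebank>-<type>-<VER><BIN>. The sketch below assembles such names and resolves them against the .opennlp directory. ModelPathSketch and its methods are hypothetical, and the assumption that models sit directly in that directory is not confirmed by the diff.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; the constants mirror AbstractModelLoaderTest after this commit.
final class ModelPathSketch {
  static final String VER = "1.3-2.5.4";
  static final String BIN = ".bin";

  // Builds a file name such as "opennlp-en-ud-ewt-tokens-1.3-2.5.4.bin".
  static String modelFileName(String langModel, String type) {
    return "opennlp-" + langModel + "-" + type + "-" + VER + BIN;
  }

  // Resolves the download directory the same way OPENNLP_DIR is resolved in the diff:
  // the OPENNLP_DOWNLOAD_HOME system property, falling back to user.home.
  static Path openNlpDir() {
    return Paths.get(System.getProperty("OPENNLP_DOWNLOAD_HOME",
        System.getProperty("user.home"))).resolve(".opennlp");
  }

  // Assumption: downloaded models are stored directly inside the .opennlp directory.
  static Path localModelPath(String langModel, String type) {
    return openNlpDir().resolve(modelFileName(langModel, type));
  }

  public static void main(String[] args) {
    System.out.println(localModelPath("en-ud-ewt", "tokens"));
  }
}
```

Keeping the version in a single constant means a model upgrade like this one only needs a one-line change in the test base class.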

opennlp-tools/src/test/java/opennlp/tools/cmdline/lemmatizer/LemmatizerModelLoaderIT.java

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ public void setup() {
 
 @ParameterizedTest(name = "Verify \"{0}\" tokenizer model loading")
 @ValueSource(strings = {"en-ud-ewt", "fr-ud-gsd", "de-ud-gsd", "it-ud-vit", "nl-ud-alpino",
-"bg-ud-btb", "ca-ud-ancora", "cs-ud-pdt", "da-ud-ddt", "el-ud-gdt", "es-ud-gsd", "et-ud-edt",
+"bg-ud-btb", "ca-ud-ancora", "cs-ud-pdtc", "da-ud-ddt", "el-ud-gdt", "es-ud-gsd", "et-ud-edt",
 "eu-ud-bdt", "fi-ud-tdt", "hr-ud-set", "hy-ud-bsut", "is-ud-icepahc", "ka-ud-glc", "kk-ud-ktb",
 "ko-ud-kaist", "lv-ud-lvtb", "no-ud-bokmaal", "pl-ud-pdb", "pt-ud-gsd", "ro-ud-rrt", "ru-ud-gsd",
 "sr-ud-set", "sk-ud-snk", "sl-ud-ssj", "sv-ud-talbanken", "tr-ud-boun", "uk-ud-iu"})

opennlp-tools/src/test/java/opennlp/tools/cmdline/postag/POSModelLoaderIT.java

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ public void setup() {
 
 @ParameterizedTest(name = "Verify \"{0}\" POS model loading")
 @ValueSource(strings = {"en-ud-ewt", "fr-ud-gsd", "de-ud-gsd", "it-ud-vit", "nl-ud-alpino",
-"bg-ud-btb", "ca-ud-ancora", "cs-ud-pdt", "da-ud-ddt", "el-ud-gdt", "es-ud-gsd", "et-ud-edt",
+"bg-ud-btb", "ca-ud-ancora", "cs-ud-pdtc", "da-ud-ddt", "el-ud-gdt", "es-ud-gsd", "et-ud-edt",
 "eu-ud-bdt", "fi-ud-tdt", "hr-ud-set", "hy-ud-bsut", "is-ud-icepahc", "ka-ud-glc", "kk-ud-ktb",
 "ko-ud-kaist", "lv-ud-lvtb", "no-ud-bokmaal", "pl-ud-pdb", "pt-ud-gsd", "ro-ud-rrt", "ru-ud-gsd",
 "sr-ud-set", "sk-ud-snk", "sl-ud-ssj", "sv-ud-talbanken", "tr-ud-boun", "uk-ud-iu"})

opennlp-tools/src/test/java/opennlp/tools/cmdline/sentdetect/SentenceModelLoaderIT.java

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ public void setup() {
 
 @ParameterizedTest(name = "Verify \"{0}\" sentence model loading")
 @ValueSource(strings = {"en-ud-ewt", "fr-ud-gsd", "de-ud-gsd", "it-ud-vit", "nl-ud-alpino",
-"bg-ud-btb", "ca-ud-ancora", "cs-ud-pdt", "da-ud-ddt", "el-ud-gdt", "es-ud-gsd", "et-ud-edt",
+"bg-ud-btb", "ca-ud-ancora", "cs-ud-pdtc", "da-ud-ddt", "el-ud-gdt", "es-ud-gsd", "et-ud-edt",
 "eu-ud-bdt", "fi-ud-tdt", "hr-ud-set", "hy-ud-bsut", "is-ud-icepahc", "ka-ud-glc", "kk-ud-ktb",
 "ko-ud-kaist", "lv-ud-lvtb", "no-ud-bokmaal", "pl-ud-pdb", "pt-ud-gsd", "ro-ud-rrt", "ru-ud-gsd",
 "sr-ud-set", "sk-ud-snk", "sl-ud-ssj", "sv-ud-talbanken", "tr-ud-boun", "uk-ud-iu"})
