Commit 7f3f7f8

OPENNLP-1753: Switch to pre-trained Model binaries v1.3
- provides and adapts PR #799 for OpenNLP 2.x maintenance branch
- adjusts version strings
- adjusts index.html template to latest v1.3 copy for DownloadParserTest
- adjusts examples in Dev manual
1 parent a96747c commit 7f3f7f8

23 files changed: +690 -630 lines changed

opennlp-docs/src/docbkx/lemmatizer.xml

Lines changed: 2 additions & 2 deletions
@@ -41,7 +41,7 @@
 <para>
 <screen>
 <![CDATA[
-$ opennlp LemmatizerME opennlp-en-ud-ewt-lemmas-1.2-2.5.0.bin < sentences]]>
+$ opennlp LemmatizerME opennlp-en-ud-ewt-lemmas-1.3-2.5.4.bin < sentences]]>
 </screen>
 The Lemmatizer now reads a pos tagged sentence(s) per line from
 standard input. For example, you can copy this sentence to the
@@ -89,7 +89,7 @@ signed VERB sign
 <programlisting language="java">
 <![CDATA[
 LemmatizerModel model = null;
-try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-lemmas-1.2-2.5.0.bin"))) {
+try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-lemmas-1.3-2.5.4.bin"))) {
   model = new LemmatizerModel(modelIn);
 }
 ]]>

opennlp-docs/src/docbkx/postagger.xml

Lines changed: 3 additions & 3 deletions
@@ -41,7 +41,7 @@ under the License.
 Download the English maxent pos model and start the POS Tagger Tool with this command:
 <screen>
 <![CDATA[
-$ opennlp POSTagger opennlp-en-ud-ewt-pos-1.2-2.5.0.bin]]>
+$ opennlp POSTagger opennlp-en-ud-ewt-pos-1.3-2.5.4.bin]]>
 </screen>
 The POS Tagger now reads a tokenized sentence per line from stdin.
 Copy these two sentences to the console:
@@ -69,7 +69,7 @@ Mr._PROPN Vinken_PROPN is_AUX chairman_NOUN of_ADP Elsevier_ADJ N.V._PROPN ,_PUN
 In the sample below it is loaded from disk.
 <programlisting language="java">
 <![CDATA[
-try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-pos-1.2-2.5.0.bin"){
+try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-pos-1.3-2.5.4.bin"){
   POSModel model = new POSModel(modelIn);
 }]]>
 </programlisting>
@@ -343,7 +343,7 @@ Arrays.stream(tags).forEach(k -> System.out.print(k + " "));]]>
 POS Tags using the custom model (en-custom-pos-maxent.bin): PROPN OTHER PRON ADJ NOUN PUNCT
 
 Output with the default model
-POS Tags using the default model (opennlp-en-ud-ewt-pos-1.2-2.5.0.bin): NOUN AUX PRON ADJ NOUN PUNCT
+POS Tags using the default model (opennlp-en-ud-ewt-pos-1.3-2.5.4.bin): NOUN AUX PRON ADJ NOUN PUNCT
 </literallayout>
 </para>
 </section>
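
Note on the model-loading snippet in postagger.xml: as committed, the try-with-resources header is missing its closing parenthesis (the lemmatizer.xml counterpart has one too many), and this commit changes only the file name on those lines. A balanced form of the same pattern might look like the sketch below. It uses only the JDK; `ModelStreamSketch` and `load` are hypothetical stand-ins for the OpenNLP model constructor, not part of the library.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ModelStreamSketch {

  /** Hypothetical stand-in for a model constructor: it only counts the bytes read. */
  static long load(InputStream in) throws IOException {
    long total = 0;
    byte[] buffer = new byte[4096];
    int n;
    while ((n = in.read(buffer)) != -1) {
      total += n;
    }
    return total;
  }

  /** Opens the file in a try-with-resources header with balanced parentheses. */
  static long loadFrom(String path) throws IOException {
    try (InputStream modelIn = new FileInputStream(path)) {
      return load(modelIn);
    }
  }
}
```

The stream is closed automatically when the block exits, whether or not `load` throws.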

opennlp-docs/src/docbkx/sentdetect.xml

Lines changed: 3 additions & 3 deletions
@@ -63,13 +63,13 @@ Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC,
 Download the english sentence detector model and start the Sentence Detector Tool with this command:
 <screen>
 <![CDATA[
-$ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin]]>
+$ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.3-2.5.4.bin]]>
 </screen>
 Just copy the sample text from above to the console. The Sentence Detector will read it and echo one sentence per line to the console.
 Usually the input is read from a file and the output is redirected to another file. This can be achieved with the following command.
 <screen>
 <![CDATA[
-$ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin < input.txt > output.txt]]>
+$ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.3-2.5.4.bin < input.txt > output.txt]]>
 </screen>
 For the english sentence model from the website the input text should not be tokenized.
 </para>
@@ -81,7 +81,7 @@ $ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin < input.txt
 To instantiate the Sentence Detector the sentence model must be loaded first.
 <programlisting language="java">
 <![CDATA[
-try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin")) {
+try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-sentence-1.3-2.5.4.bin")) {
   SentenceModel model = new SentenceModel(modelIn);
 }]]>
 </programlisting>

opennlp-docs/src/docbkx/tokenizer.xml

Lines changed: 3 additions & 3 deletions
@@ -97,7 +97,7 @@ $ opennlp SimpleTokenizer]]>
 our website.
 <screen>
 <![CDATA[
-$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin]]>
+$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.3-2.5.4.bin]]>
 </screen>
 To test the tokenizer copy the sample from above to the console. The
 whitespace separated tokens will be written back to the
@@ -107,7 +107,7 @@ $ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin]]>
 Usually the input is read from a file and written to a file.
 <screen>
 <![CDATA[
-$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin < article.txt > article-tokenized.txt]]>
+$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.3-2.5.4.bin < article.txt > article-tokenized.txt]]>
 </screen>
 It can be done in the same way for the Simple Tokenizer.
 </para>
@@ -151,7 +151,7 @@ London share prices were bolstered largely by continued gains on Wall Street and
 can be loaded.
 <programlisting language="java">
 <![CDATA[
-try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-tokens-1.2-2.5.0.bin")) {
+try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-tokens-1.3-2.5.4.bin")) {
   TokenizerModel model = new TokenizerModel(modelIn);
 }]]>
 </programlisting>

opennlp-tools/src/main/java/opennlp/tools/monitoring/StopCriteria.java

Lines changed: 1 addition & 5 deletions
@@ -19,18 +19,14 @@
 
 import java.util.function.Predicate;
 
-import opennlp.tools.ml.model.AbstractModel;
-
-
 /**
  * Stop criteria for model training. If the predicate is met, then the training is aborted.
  *
  * @see Predicate
- * @see AbstractModel
  */
 public interface StopCriteria<T extends Number> extends Predicate<T> {
 
-  String FINISHED = "Training Finished after completing %s Iterations successfully.";
+  String FINISHED = "Training finished after completing %s iterations successfully.";
 
   /**
    * @return A detailed message captured upon hitting the {@link StopCriteria} during model training.
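
To illustrate how the reworded `FINISHED` template and the `Predicate` contract fit together, here is a minimal sketch. `StopCriteriaSketch`, `IterationLimit`, and the method name `describe()` are hypothetical; the hunk does not show the name of the message accessor, so none of this should be read as the actual OpenNLP API.

```java
import java.util.function.Predicate;

/** Hypothetical stand-in that mirrors the shape of the interface in the hunk above. */
interface StopCriteriaSketch<T extends Number> extends Predicate<T> {

  String FINISHED = "Training finished after completing %s iterations successfully.";

  /** Assumed accessor name; the real method name is not visible in the diff. */
  String describe();
}

/** Hypothetical criterion: stop once the iteration count reaches a fixed limit. */
final class IterationLimit implements StopCriteriaSketch<Integer> {

  private final int limit;

  IterationLimit(int limit) {
    this.limit = limit;
  }

  @Override
  public boolean test(Integer iteration) {
    return iteration >= limit;
  }

  @Override
  public String describe() {
    // Fills the %s placeholder of the message template with the limit.
    return String.format(FINISHED, limit);
  }
}
```

A training loop could call `test(i)` after each iteration and log `describe()` once it returns true.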

opennlp-tools/src/main/java/opennlp/tools/util/DownloadUtil.java

Lines changed: 4 additions & 4 deletions
@@ -56,9 +56,9 @@ public class DownloadUtil {
   private static final Logger logger = LoggerFactory.getLogger(DownloadUtil.class);
 
   private static final String BASE_URL =
-      System.getProperty("OPENNLP_DOWNLOAD_BASE_URL", "https://dlcdn.apache.org/opennlp/");
+      System.getProperty("OPENNLP_DOWNLOAD_BASE_URL", "https://dlcdn.apache.org/opennlp/");
   private static final String MODEL_URI_PATH =
-      System.getProperty("OPENNLP_DOWNLOAD_MODEL_PATH", "models/ud-models-1.2/");
+      System.getProperty("OPENNLP_DOWNLOAD_MODEL_PATH", "models/ud-models-1.3/");
   private static final String OPENNLP_DOWNLOAD_HOME = "OPENNLP_DOWNLOAD_HOME";
 
   private static Map<String, Map<ModelType, String>> availableModels;
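
The two system properties above each fall back to a built-in default, so the download location can be redirected with `-D` flags at JVM start. The sketch below shows one way such values could be combined into a model URL. `ModelLocationSketch` and `resolve` are hypothetical, and how `DownloadUtil` actually joins the parts is not visible in this hunk.

```java
import java.net.URI;

final class ModelLocationSketch {

  // Same property names and defaults as in the hunk above.
  static final String BASE_URL =
      System.getProperty("OPENNLP_DOWNLOAD_BASE_URL", "https://dlcdn.apache.org/opennlp/");
  static final String MODEL_URI_PATH =
      System.getProperty("OPENNLP_DOWNLOAD_MODEL_PATH", "models/ud-models-1.3/");

  /**
   * Hypothetical helper: resolves a model file name against base URL and model path.
   * Both directory parts must end with '/' for URI resolution to append instead of replace.
   */
  static URI resolve(String baseUrl, String modelPath, String fileName) {
    return URI.create(baseUrl).resolve(modelPath).resolve(fileName);
  }

  public static void main(String[] args) {
    System.out.println(resolve(BASE_URL, MODEL_URI_PATH, "opennlp-en-ud-ewt-pos-1.3-2.5.4.bin"));
  }
}
```

Running with, for example, `-DOPENNLP_DOWNLOAD_BASE_URL=https://mirror.example/opennlp/` (a made-up mirror) would change only the first component.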
@@ -202,7 +202,7 @@ private static void validateModel(URL sha512, Path downloadedModel) throws IOExc
     final String actualChecksum = calculateSHA512(downloadedModel);
     if (!actualChecksum.equalsIgnoreCase(expectedChecksum)) {
       throw new IOException("SHA512 checksum validation failed for " + downloadedModel.getFileName() +
-          ". Expected: " + expectedChecksum + ", but got: " + actualChecksum);
+          ". Expected: " + expectedChecksum + ", but got: " + actualChecksum);
     }
   }
 
@@ -353,7 +353,7 @@ private void addModel(String locale, String link, Map<String, Map<ModelType, Str
   private String fetchPageIndex() {
     final StringBuilder html = new StringBuilder();
     try (BufferedReader br = new BufferedReader(
-        new InputStreamReader(indexUrl.openStream(), StandardCharsets.UTF_8))) {
+        new InputStreamReader(indexUrl.openStream(), StandardCharsets.UTF_8))) {
       String line;
       while ((line = br.readLine()) != null) {
         html.append(line);
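
The `validateModel` hunk above compares a computed SHA-512 digest with an expected one, ignoring case. The body of `calculateSHA512` is not part of this diff, so the following is only a sketch of how such a check could be written with the JDK; `ChecksumSketch` and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ChecksumSketch {

  /** Streams the file through a SHA-512 digest and returns it as lowercase hex. */
  static String sha512Hex(Path file) throws IOException {
    final MessageDigest digest;
    try {
      digest = MessageDigest.getInstance("SHA-512");
    } catch (NoSuchAlgorithmException e) {
      throw new IOException("SHA-512 is not available", e);
    }
    try (InputStream in = Files.newInputStream(file)) {
      byte[] buffer = new byte[8192];
      int n;
      while ((n = in.read(buffer)) != -1) {
        digest.update(buffer, 0, n);
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  /** Throws if the file's digest differs from the expected one (case-insensitive). */
  static void validate(String expectedChecksum, Path downloadedModel) throws IOException {
    String actualChecksum = sha512Hex(downloadedModel);
    if (!actualChecksum.equalsIgnoreCase(expectedChecksum)) {
      throw new IOException("SHA512 checksum validation failed for "
          + downloadedModel.getFileName()
          + ". Expected: " + expectedChecksum + ", but got: " + actualChecksum);
    }
  }
}
```

Reading in chunks keeps memory use constant even for large model binaries.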

opennlp-tools/src/test/java/opennlp/tools/AbstractModelLoaderTest.java

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ public abstract class AbstractModelLoaderTest {
   private static final String BASE_URL_MODELS_V183 = "https://dlcdn.apache.org/opennlp/models/langdetect/1.8.3/";
   protected static final Path OPENNLP_DIR = Paths.get(System.getProperty("OPENNLP_DOWNLOAD_HOME",
       System.getProperty("user.home"))).resolve(".opennlp");
-  protected static final String VER = "1.2-2.5.0";
+  protected static final String VER = "1.3-2.5.4";
   protected static final String BIN = ".bin";
   protected static List<String> SUPPORTED_LANG_CODES = List.of(
       "en", "fr", "de", "it", "nl", "bg", "ca", "cs", "da", "el",

opennlp-tools/src/test/java/opennlp/tools/cmdline/lemmatizer/LemmatizerModelLoaderIT.java

Lines changed: 4 additions & 4 deletions
@@ -56,10 +56,10 @@ public void setup() {
 
   @ParameterizedTest(name = "Verify \"{0}\" tokenizer model loading")
   @ValueSource(strings = {"en-ud-ewt", "fr-ud-gsd", "de-ud-gsd", "it-ud-vit", "nl-ud-alpino",
-      "bg-ud-btb", "ca-ud-ancora", "cs-ud-pdt", "da-ud-ddt", "el-ud-gdt", "es-ud-gsd", "et-ud-edt",
-      "eu-ud-bdt", "fi-ud-tdt", "hr-ud-set", "hy-ud-bsut", "is-ud-icepahc", "ka-ud-glc", "kk-ud-ktb",
-      "ko-ud-kaist", "lv-ud-lvtb", "no-ud-bokmaal", "pl-ud-pdb", "pt-ud-gsd", "ro-ud-rrt", "ru-ud-gsd",
-      "sr-ud-set", "sk-ud-snk", "sl-ud-ssj", "sv-ud-talbanken", "tr-ud-boun", "uk-ud-iu"})
+      "bg-ud-btb", "ca-ud-ancora", "cs-ud-pdtc", "da-ud-ddt", "el-ud-gdt", "es-ud-gsd", "et-ud-edt",
+      "eu-ud-bdt", "fi-ud-tdt", "hr-ud-set", "hy-ud-bsut", "is-ud-icepahc", "ka-ud-glc", "kk-ud-ktb",
+      "ko-ud-kaist", "lv-ud-lvtb", "no-ud-bokmaal", "pl-ud-pdb", "pt-ud-gsd", "ro-ud-rrt", "ru-ud-gsd",
+      "sr-ud-set", "sk-ud-snk", "sl-ud-ssj", "sv-ud-talbanken", "tr-ud-boun", "uk-ud-iu"})
   public void testLoadModelByLanguage(String langModel) throws IOException {
     String modelName = "opennlp-" + langModel + "-lemmas-" + VER + BIN;
     LemmatizerModel model = loader.loadModel(Files.newInputStream(OPENNLP_DIR.resolve(modelName)));

opennlp-tools/src/test/java/opennlp/tools/cmdline/postag/POSModelLoaderIT.java

Lines changed: 4 additions & 4 deletions
@@ -56,10 +56,10 @@ public void setup() {
 
   @ParameterizedTest(name = "Verify \"{0}\" POS model loading")
   @ValueSource(strings = {"en-ud-ewt", "fr-ud-gsd", "de-ud-gsd", "it-ud-vit", "nl-ud-alpino",
-      "bg-ud-btb", "ca-ud-ancora", "cs-ud-pdt", "da-ud-ddt", "el-ud-gdt", "es-ud-gsd", "et-ud-edt",
-      "eu-ud-bdt", "fi-ud-tdt", "hr-ud-set", "hy-ud-bsut", "is-ud-icepahc", "ka-ud-glc", "kk-ud-ktb",
-      "ko-ud-kaist", "lv-ud-lvtb", "no-ud-bokmaal", "pl-ud-pdb", "pt-ud-gsd", "ro-ud-rrt", "ru-ud-gsd",
-      "sr-ud-set", "sk-ud-snk", "sl-ud-ssj", "sv-ud-talbanken", "tr-ud-boun", "uk-ud-iu"})
+      "bg-ud-btb", "ca-ud-ancora", "cs-ud-pdtc", "da-ud-ddt", "el-ud-gdt", "es-ud-gsd", "et-ud-edt",
+      "eu-ud-bdt", "fi-ud-tdt", "hr-ud-set", "hy-ud-bsut", "is-ud-icepahc", "ka-ud-glc", "kk-ud-ktb",
+      "ko-ud-kaist", "lv-ud-lvtb", "no-ud-bokmaal", "pl-ud-pdb", "pt-ud-gsd", "ro-ud-rrt", "ru-ud-gsd",
+      "sr-ud-set", "sk-ud-snk", "sl-ud-ssj", "sv-ud-talbanken", "tr-ud-boun", "uk-ud-iu"})
   public void testLoadModelByLanguage(String langModel) throws IOException {
     String modelName = "opennlp-" + langModel + "-pos-" + VER + BIN;
     POSModel model = loader.loadModel(Files.newInputStream(OPENNLP_DIR.resolve(modelName)));

opennlp-tools/src/test/java/opennlp/tools/cmdline/sentdetect/SentenceModelLoaderIT.java

Lines changed: 4 additions & 4 deletions
@@ -56,10 +56,10 @@ public void setup() {
 
   @ParameterizedTest(name = "Verify \"{0}\" sentence model loading")
   @ValueSource(strings = {"en-ud-ewt", "fr-ud-gsd", "de-ud-gsd", "it-ud-vit", "nl-ud-alpino",
-      "bg-ud-btb", "ca-ud-ancora", "cs-ud-pdt", "da-ud-ddt", "el-ud-gdt", "es-ud-gsd", "et-ud-edt",
-      "eu-ud-bdt", "fi-ud-tdt", "hr-ud-set", "hy-ud-bsut", "is-ud-icepahc", "ka-ud-glc", "kk-ud-ktb",
-      "ko-ud-kaist", "lv-ud-lvtb", "no-ud-bokmaal", "pl-ud-pdb", "pt-ud-gsd", "ro-ud-rrt", "ru-ud-gsd",
-      "sr-ud-set", "sk-ud-snk", "sl-ud-ssj", "sv-ud-talbanken", "tr-ud-boun", "uk-ud-iu"})
+      "bg-ud-btb", "ca-ud-ancora", "cs-ud-pdtc", "da-ud-ddt", "el-ud-gdt", "es-ud-gsd", "et-ud-edt",
+      "eu-ud-bdt", "fi-ud-tdt", "hr-ud-set", "hy-ud-bsut", "is-ud-icepahc", "ka-ud-glc", "kk-ud-ktb",
+      "ko-ud-kaist", "lv-ud-lvtb", "no-ud-bokmaal", "pl-ud-pdb", "pt-ud-gsd", "ro-ud-rrt", "ru-ud-gsd",
+      "sr-ud-set", "sk-ud-snk", "sl-ud-ssj", "sv-ud-talbanken", "tr-ud-boun", "uk-ud-iu"})
   public void testLoadModelByLanguage(String langModel) throws IOException {
     String modelName = "opennlp-" + langModel + "-sentence-" + VER + BIN;
     SentenceModel model = loader.loadModel(Files.newInputStream(OPENNLP_DIR.resolve(modelName)));
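
The three loader tests above all derive the model file name from the language/treebank code, a component suffix, and the shared `VER` and `BIN` constants, then resolve it under the `.opennlp` directory. A small sketch of that naming scheme follows. `ModelPathSketch` and its methods are hypothetical helpers written for illustration, not code from the commit.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class ModelPathSketch {

  // Values taken from AbstractModelLoaderTest as changed in this commit.
  static final String VER = "1.3-2.5.4";
  static final String BIN = ".bin";

  /** Builds a file name such as opennlp-en-ud-ewt-pos-1.3-2.5.4.bin. */
  static String modelName(String langModel, String component) {
    return "opennlp-" + langModel + "-" + component + "-" + VER + BIN;
  }

  /** Mirrors the OPENNLP_DIR lookup: OPENNLP_DOWNLOAD_HOME if set, else user.home. */
  static Path modelDir() {
    return Paths.get(System.getProperty("OPENNLP_DOWNLOAD_HOME",
        System.getProperty("user.home"))).resolve(".opennlp");
  }

  public static void main(String[] args) {
    System.out.println(modelDir().resolve(modelName("cs-ud-pdtc", "lemmas")));
  }
}
```

With this scheme, the renamed Czech treebank code (`cs-ud-pdt` to `cs-ud-pdtc`) changes the expected file name for every component, which is why the same edit appears in all three tests.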
