Commit c52dea8

OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false
- adapts PR #792 for OpenNLP 2.x
1 parent 9dc1140 commit c52dea8

File tree

4 files changed

+173
-126
lines changed

opennlp-docs/src/docbkx/sentdetect.xml

Lines changed: 105 additions & 102 deletions
@@ -1,7 +1,7 @@
  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
- "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
- ]>
+ "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+ ]>
  <!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements. See the NOTICE file
@@ -28,99 +28,99 @@ under the License.
  <section id="tools.sentdetect.detection">
  <title>Sentence Detection</title>
  <para>
- The OpenNLP Sentence Detector can detect that a punctuation character
- marks the end of a sentence or not. In this sense a sentence is defined
- as the longest white space trimmed character sequence between two punctuation
- marks. The first and last sentence make an exception to this rule. The first
- non whitespace character is assumed to be the start of a sentence, and the
- last non whitespace character is assumed to be a sentence end.
- The sample text below should be segmented into its sentences.
- <screen>
+ The OpenNLP Sentence Detector can detect that a punctuation character
+ marks the end of a sentence or not. In this sense a sentence is defined
+ as the longest white space trimmed character sequence between two punctuation
+ marks. The first and last sentence make an exception to this rule. The first
+ non whitespace character is assumed to be the start of a sentence, and the
+ last non whitespace character is assumed to be a sentence end.
+ The sample text below should be segmented into its sentences.
+ <screen>
  <![CDATA[
  Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is
  chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years
  old and former chairman of Consolidated Gold Fields PLC, was named a director of this
  British industrial conglomerate.]]>
- </screen>
- After detecting the sentence boundaries each sentence is written in its own line.
- <screen>
+ </screen>
+ After detecting the sentence boundaries each sentence is written in its own line.
+ <screen>
  <![CDATA[
  Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
  Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
  Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC,
  was named a director of this British industrial conglomerate.]]>
- </screen>
- Usually Sentence Detection is done before the text is tokenized and that's the way the pre-trained models on the website are trained,
- but it is also possible to perform tokenization first and let the Sentence Detector process the already tokenized text.
- The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence in an article where the title is mistakenly identified to be the first part of the first sentence.
- Most components in OpenNLP expect input which is segmented into sentences.
+ </screen>
+ Usually Sentence Detection is done before the text is tokenized and that's the way the pre-trained models on the website are trained,
+ but it is also possible to perform tokenization first and let the Sentence Detector process the already tokenized text.
+ The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence in an article where the title is mistakenly identified to be the first part of the first sentence.
+ Most components in OpenNLP expect input which is segmented into sentences.
  </para>
-
+
  <section id="tools.sentdetect.detection.cmdline">
- <title>Sentence Detection Tool</title>
- <para>
- The easiest way to try out the Sentence Detector is the command line tool. The tool is only intended for demonstration and testing.
- Download the english sentence detector model and start the Sentence Detector Tool with this command:
- <screen>
- <![CDATA[
+ <title>Sentence Detection Tool</title>
+ <para>
+ The easiest way to try out the Sentence Detector is the command line tool. The tool is only intended for demonstration and testing.
+ Download the english sentence detector model and start the Sentence Detector Tool with this command:
+ <screen>
+ <![CDATA[
  $ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin]]>
- </screen>
- Just copy the sample text from above to the console. The Sentence Detector will read it and echo one sentence per line to the console.
- Usually the input is read from a file and the output is redirected to another file. This can be achieved with the following command.
- <screen>
- <![CDATA[
+ </screen>
+ Just copy the sample text from above to the console. The Sentence Detector will read it and echo one sentence per line to the console.
+ Usually the input is read from a file and the output is redirected to another file. This can be achieved with the following command.
+ <screen>
+ <![CDATA[
  $ opennlp SentenceDetector opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin < input.txt > output.txt]]>
- </screen>
- For the english sentence model from the website the input text should not be tokenized.
- </para>
+ </screen>
+ For the english sentence model from the website the input text should not be tokenized.
+ </para>
  </section>
  <section id="tools.sentdetect.detection.api">
- <title>Sentence Detection API</title>
- <para>
- The Sentence Detector can be easily integrated into an application via its API.
- To instantiate the Sentence Detector the sentence model must be loaded first.
- <programlisting language="java">
- <![CDATA[
+ <title>Sentence Detection API</title>
+ <para>
+ The Sentence Detector can be easily integrated into an application via its API.
+ To instantiate the Sentence Detector the sentence model must be loaded first.
+ <programlisting language="java">
+ <![CDATA[
  try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin")) {
  SentenceModel model = new SentenceModel(modelIn);
  }]]>
- </programlisting>
- After the model is loaded the SentenceDetectorME can be instantiated.
- <programlisting language="java">
- <![CDATA[
+ </programlisting>
+ After the model is loaded the SentenceDetectorME can be instantiated.
+ <programlisting language="java">
+ <![CDATA[
  SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);]]>
- </programlisting>
- The Sentence Detector can output an array of Strings, where each String is one sentence.
+ </programlisting>
+ The Sentence Detector can output an array of Strings, where each String is one sentence.
  <programlisting language="java">
- <![CDATA[
+ <![CDATA[
  String[] sentences = sentenceDetector.sentDetect(" First sentence. Second sentence. ");]]>
- </programlisting>
- The result array now contains two entries. The first String is "First sentence." and the
- second String is "Second sentence." The whitespace before, between and after the input String is removed.
- The API also offers a method which simply returns the span of the sentence in the input string.
- <programlisting language="java">
- <![CDATA[
+ </programlisting>
+ The result array now contains two entries. The first String is "First sentence." and the
+ second String is "Second sentence." The whitespace before, between and after the input String is removed.
+ The API also offers a method which simply returns the span of the sentence in the input string.
+ <programlisting language="java">
+ <![CDATA[
  Span[] sentences = sentenceDetector.sentPosDetect(" First sentence. Second sentence. ");]]>
- </programlisting>
- The result array again contains two entries. The first span beings at index 2 and ends at
- 17. The second span begins at 18 and ends at 34. The utility method Span.getCoveredText can be used to create a substring which only covers the chars in the span.
- </para>
+ </programlisting>
+ The result array again contains two entries. The first span beings at index 2 and ends at
+ 17. The second span begins at 18 and ends at 34. The utility method Span.getCoveredText can be used to create a substring which only covers the chars in the span.
+ </para>
  </section>
  </section>
  <section id="tools.sentdetect.training">
  <title>Sentence Detector Training</title>
  <para/>
  <section id="tools.sentdetect.training.tool">
- <title>Training Tool</title>
- <para>
- OpenNLP has a command line tool which is used to train the models available from the model
- download page on various corpora. The data must be converted to the OpenNLP Sentence Detector
- training format. Which is one sentence per line. An empty line indicates a document boundary.
- In case the document boundary is unknown, it's recommended to have an empty line every few ten
- sentences. Exactly like the output in the sample above.
- Usage of the tool:
- <screen>
- <![CDATA[
+ <title>Training Tool</title>
+ <para>
+ OpenNLP has a command line tool which is used to train the models available from the model
+ download page on various corpora. The data must be converted to the OpenNLP Sentence Detector
+ training format. Which is one sentence per line. An empty line indicates a document boundary.
+ In case the document boundary is unknown, it's recommended to have an empty line every few ten
+ sentences. Exactly like the output in the sample above.
+ Usage of the tool:
+ <screen>
+ <![CDATA[
  $ opennlp SentenceDetectorTrainer
  Usage: opennlp SentenceDetectorTrainer[.namefinder|.conllx|.pos] [-abbDict path] \
  [-params paramsFile] [-iterations num] [-cutoff num] -model modelFile \
@@ -142,17 +142,20 @@ Arguments description:
  -data sampleData
  data to be used, usually a file name.
  -encoding charsetName
- encoding for reading and writing text, if absent the system default is used.]]>
- </screen>
- To train an English sentence detector use the following command:
- <screen>
- <![CDATA[
+ encoding for reading and writing text, if absent the system default is used.
+ -useTokenEnd boolean flag
+ set to false when the next sentence in the test dataset doesn't start with a blank space post completion of
+ the previous sentence. If absent, it is defaulted to true.]]>
+ </screen>
+ To train an English sentence detector use the following command:
+ <screen>
+ <![CDATA[
  $ opennlp SentenceDetectorTrainer -model en-custom-sent.bin -lang en -data en-custom-sent.train -encoding UTF-8
  ]]>
- </screen>
- It should produce the following output:
- <screen>
- <![CDATA[
+ </screen>
+ It should produce the following output:
+ <screen>
+ <![CDATA[
  Indexing events using cutoff of 5

  Computing event counts... done. 4883 events
@@ -184,28 +187,28 @@ Performing 100 iterations.
  Wrote sentence detector model.
  Path: en-custom-sent.bin
  ]]>
- </screen>
- </para>
+ </screen>
+ </para>
  </section>
  <section id="tools.sentdetect.training.api">
- <title>Training API</title>
- <para>
- The Sentence Detector also offers an API to train a new sentence detection model.
- Basically three steps are necessary to train it:
- <itemizedlist>
- <listitem>
- <para>The application must open a sample data stream</para>
- </listitem>
- <listitem>
- <para>Call the SentenceDetectorME.train method</para>
- </listitem>
- <listitem>
- <para>Save the SentenceModel to a file or directly use it</para>
- </listitem>
- </itemizedlist>
- The following sample code illustrates these steps:
- <programlisting language="java">
- <![CDATA[
+ <title>Training API</title>
+ <para>
+ The Sentence Detector also offers an API to train a new sentence detection model.
+ Basically three steps are necessary to train it:
+ <itemizedlist>
+ <listitem>
+ <para>The application must open a sample data stream</para>
+ </listitem>
+ <listitem>
+ <para>Call the SentenceDetectorME.train method</para>
+ </listitem>
+ <listitem>
+ <para>Save the SentenceModel to a file or directly use it</para>
+ </listitem>
+ </itemizedlist>
+ The following sample code illustrates these steps:
+ <programlisting language="java">
+ <![CDATA[

  ObjectStream<String> lineStream =
  new PlainTextByLineStream(new MarkableFileInputStreamFactory(new File("en-custom-sent.train")), StandardCharsets.UTF_8);
@@ -220,8 +223,8 @@ try (ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineSt
  try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelFile))) {
  model.serialize(modelOut);
  }]]>
- </programlisting>
- </para>
+ </programlisting>
+ </para>
  </section>
  </section>
  <section id="tools.sentdetect.eval">
@@ -231,9 +234,9 @@ try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(model
  <section id="tools.sentdetect.eval.tool">
  <title>Evaluation Tool</title>
  <para>
- The command shows how the evaluator tool can be run:
- <screen>
- <![CDATA[
+ The command shows how the evaluator tool can be run:
+ <screen>
+ <![CDATA[
  $ opennlp SentenceDetectorEvaluator -model en-custom-sent.bin -data en-custom-sent.eval -encoding UTF-8

  Loading model ... done
@@ -242,8 +245,8 @@ Evaluating ... done
  Precision: 0.9465737514518002
  Recall: 0.9095982142857143
  F-Measure: 0.9277177006260672]]>
- </screen>
- The en-custom-sent.eval file has the same format as the training data.
+ </screen>
+ The en-custom-sent.eval file has the same format as the training data.
  </para>
  </section>
  </section>
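
Note on the span offsets quoted in the documentation hunk above: the values 2 to 17 and 18 to 34 only line up if the input string starts with two spaces and the end offset is exclusive. The whitespace in the listing may have been collapsed, so the two leading spaces are an assumption here. A minimal, dependency-free sketch of that arithmetic follows; `IntSpan` is a hypothetical stand-in for `opennlp.tools.util.Span`, not the library class.

```java
public class SpanOffsetsSketch {

  /** Hypothetical stand-in for a span: start is inclusive, end is exclusive. */
  static final class IntSpan {
    final int start;
    final int end;

    IntSpan(int start, int end) {
      this.start = start;
      this.end = end;
    }

    /** Returns only the characters covered by this span. */
    CharSequence getCoveredText(CharSequence text) {
      return text.subSequence(start, end);
    }
  }

  // Assumed input: two leading spaces and one trailing space.
  static final String TEXT = "  First sentence. Second sentence. ";

  /** Convenience helper: the text covered by [start, end) in TEXT. */
  static String covered(int start, int end) {
    return new IntSpan(start, end).getCoveredText(TEXT).toString();
  }

  public static void main(String[] args) {
    System.out.println("[" + covered(2, 17) + "]");
    System.out.println("[" + covered(18, 34) + "]");
  }
}
```

With these offsets the surrounding whitespace is excluded from both sentences, which matches the "whitespace is removed" statement in the documentation text.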

opennlp-tools/src/main/java/opennlp/tools/cmdline/sentdetect/SentenceDetectorTrainerTool.java

Lines changed: 4 additions & 4 deletions
@@ -38,7 +38,7 @@
  import opennlp.tools.util.model.ModelUtil;

  public final class SentenceDetectorTrainerTool
- extends AbstractTrainerTool<SentenceSample, TrainerToolParams> {
+ extends AbstractTrainerTool<SentenceSample, TrainerToolParams> {

  interface TrainerToolParams extends TrainingParams, TrainingToolParams {
  }
@@ -83,7 +83,7 @@ public void run(String format, String[] args) {
  char[] eos = null;
  if (params.getEosChars() != null) {
  String eosString = SentenceSampleStream.replaceNewLineEscapeTags(
- params.getEosChars());
+ params.getEosChars());
  eos = eosString.toCharArray();
  }

@@ -92,9 +92,9 @@ public void run(String format, String[] args) {
  try {
  Dictionary dict = loadDict(params.getAbbDict());
  SentenceDetectorFactory sdFactory = SentenceDetectorFactory.create(
- params.getFactory(), params.getLang(), true, dict, eos);
+ params.getFactory(), params.getLang(), params.getUseTokenEnd(), dict, eos);
  model = SentenceDetectorME.train(params.getLang(), sampleStream,
- sdFactory, mlParams);
+ sdFactory, mlParams);
  } catch (IOException e) {
  throw createTerminationIOException(e);
  }

opennlp-tools/src/main/java/opennlp/tools/cmdline/sentdetect/TrainingParams.java

Lines changed: 5 additions & 0 deletions
@@ -44,4 +44,9 @@ interface TrainingParams extends BasicTrainingParams {
  description = "A sub-class of SentenceDetectorFactory where to get implementation and resources.")
  @OptionalParameter
  String getFactory();
+
+ @ParameterDescription(valueName = "useTokenEnd",
+ description = "A boolean parameter to detect the start index of the next sentence in the test data.")
+ @OptionalParameter(defaultValue = "true")
+ Boolean getUseTokenEnd();
  }
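
Note on what the new parameter is meant to control: the documentation hunk says to set `useTokenEnd` to false when the next sentence does not start with a blank space after the previous one ends. The sketch below is one reading of that behaviour, not the OpenNLP implementation. It assumes that with `useTokenEnd = true` the detector first skips to the end of the current whitespace-delimited token before looking for the next sentence start, and that with `false` it starts looking immediately after the end-of-sentence character. The class and all helper names are hypothetical.

```java
public class UseTokenEndSketch {

  /** Index of the first whitespace at or after pos, or s.length() if there is none. */
  static int firstWhitespace(String s, int pos) {
    while (pos < s.length() && !Character.isWhitespace(s.charAt(pos))) {
      pos++;
    }
    return pos;
  }

  /** Index of the first non-whitespace at or after pos, or s.length() if there is none. */
  static int firstNonWhitespace(String s, int pos) {
    while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
      pos++;
    }
    return pos;
  }

  /**
   * Assumed rule: given the index of an accepted end-of-sentence character,
   * return the index where the next sentence would start.
   */
  static int nextSentenceStart(String s, int eosIndex, boolean useTokenEnd) {
    int from = eosIndex + 1;
    if (useTokenEnd) {
      // Skip the rest of the token that contains the end-of-sentence character.
      from = firstWhitespace(s, from);
    }
    return firstNonWhitespace(s, from);
  }

  public static void main(String[] args) {
    String spaced = "First sentence. Second sentence.";
    String unspaced = "First sentence.Second sentence.";
    int eos = spaced.indexOf('.'); // same index in both strings

    System.out.println(nextSentenceStart(spaced, eos, true));
    System.out.println(nextSentenceStart(spaced, eos, false));
    System.out.println(nextSentenceStart(unspaced, eos, true));
    System.out.println(nextSentenceStart(unspaced, eos, false));
  }
}
```

Under this assumption both settings agree when a space follows the period, while for the unspaced input only `useTokenEnd = false` places the next sentence start directly on "Second". With the change in SentenceDetectorTrainerTool, the value reaches `SentenceDetectorFactory.create` through `params.getUseTokenEnd()` rather than the previously hard-coded `true`.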
