Licensed to the Apache Software Foundation (ASF) under one
7
7
or more contributor license agreements. See the NOTICE file
@@ -28,99 +28,99 @@ under the License.
28
28
<section id="tools.sentdetect.detection">
29
29
<title>Sentence Detection</title>
30
30
<para>
31
-
The OpenNLP Sentence Detector can detect whether a punctuation character
32
-
marks the end of a sentence or not. In this sense a sentence is defined
33
-
as the longest white space trimmed character sequence between two punctuation
34
-
marks. The first and last sentence make an exception to this rule. The first
35
-
non whitespace character is assumed to be the start of a sentence, and the
36
-
last non whitespace character is assumed to be a sentence end.
37
-
The sample text below should be segmented into its sentences.
38
-
<screen>
31
+
The OpenNLP Sentence Detector can detect whether a punctuation character
32
+
marks the end of a sentence or not. In this sense a sentence is defined
33
+
as the longest white space trimmed character sequence between two punctuation
34
+
marks. The first and last sentence make an exception to this rule. The first
35
+
non whitespace character is assumed to be the start of a sentence, and the
36
+
last non whitespace character is assumed to be a sentence end.
37
+
The sample text below should be segmented into its sentences.
38
+
<screen>
39
39
<![CDATA[
40
40
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is
41
41
chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years
42
42
old and former chairman of Consolidated Gold Fields PLC, was named a director of this
43
43
British industrial conglomerate.]]>
44
-
</screen>
45
-
After detecting the sentence boundaries each sentence is written in its own line.
46
-
<screen>
44
+
</screen>
45
+
After detecting the sentence boundaries each sentence is written in its own line.
46
+
<screen>
47
47
<![CDATA[
48
48
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
49
49
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
50
50
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC,
51
51
was named a director of this British industrial conglomerate.]]>
52
-
</screen>
53
-
Usually Sentence Detection is done before the text is tokenized and that's the way the pre-trained models on the website are trained,
54
-
but it is also possible to perform tokenization first and let the Sentence Detector process the already tokenized text.
55
-
The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence in an article where the title is mistakenly identified to be the first part of the first sentence.
56
-
Most components in OpenNLP expect input which is segmented into sentences.
52
+
</screen>
53
+
Usually Sentence Detection is done before the text is tokenized and that's the way the pre-trained models on the website are trained,
54
+
but it is also possible to perform tokenization first and let the Sentence Detector process the already tokenized text.
55
+
The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence in an article where the title is mistakenly identified to be the first part of the first sentence.
56
+
Most components in OpenNLP expect input which is segmented into sentences.
57
57
</para>
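To illustrate why deciding whether a punctuation character ends a sentence needs more than a fixed rule, here is a minimal stdlib-only sketch. It is not how OpenNLP works: `NaiveSplitSketch` and `naiveSplit` are hypothetical names, and the rule (split after `.`, `!` or `?` when whitespace follows) is a deliberately naive stand-in for the trained model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the OpenNLP algorithm.
final class NaiveSplitSketch {

    /** Splits after '.', '!' or '?' whenever whitespace follows. */
    static List<String> naiveSplit(String text) {
        List<String> pieces = new ArrayList<>();
        // The lookbehind keeps the punctuation character attached to the left piece.
        for (String piece : text.trim().split("(?<=[.!?])\\s+")) {
            if (!piece.isEmpty()) {
                pieces.add(piece);
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        String text = "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.";
        // The abbreviation "Mr." is wrongly treated as a sentence of its own.
        for (String piece : naiveSplit(text)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

The naive rule breaks the single sample sentence after the abbreviation, which is the kind of decision the Sentence Detector model is trained to make correctly.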
58
-
58
+
59
59
<section id="tools.sentdetect.detection.cmdline">
60
-
<title>Sentence Detection Tool</title>
61
-
<para>
62
-
The easiest way to try out the Sentence Detector is the command line tool. The tool is only intended for demonstration and testing.
63
-
Download the English sentence detector model and start the Sentence Detector Tool with this command:
64
-
<screen>
65
-
<![CDATA[
60
+
<title>Sentence Detection Tool</title>
61
+
<para>
62
+
The easiest way to try out the Sentence Detector is the command line tool. The tool is only intended for demonstration and testing.
63
+
Download the English sentence detector model and start the Sentence Detector Tool with this command:
For the english sentence model from the website the input text should not be tokenized.
75
-
</para>
73
+
</screen>
74
+
For the English sentence model from the website the input text should not be tokenized.
75
+
</para>
76
76
</section>
77
77
<section id="tools.sentdetect.detection.api">
78
-
<title>Sentence Detection API</title>
79
-
<para>
80
-
The Sentence Detector can be easily integrated into an application via its API.
81
-
To instantiate the Sentence Detector the sentence model must be loaded first.
82
-
<programlisting language="java">
83
-
<![CDATA[
78
+
<title>Sentence Detection API</title>
79
+
<para>
80
+
The Sentence Detector can be easily integrated into an application via its API.
81
+
To instantiate the Sentence Detector the sentence model must be loaded first.
82
+
<programlisting language="java">
83
+
<![CDATA[
84
84
try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-sentence-1.2-2.5.0.bin")) {
85
85
SentenceModel model = new SentenceModel(modelIn);
86
86
}]]>
87
-
</programlisting>
88
-
After the model is loaded the SentenceDetectorME can be instantiated.
89
-
<programlisting language="java">
90
-
<![CDATA[
87
+
</programlisting>
88
+
After the model is loaded the SentenceDetectorME can be instantiated.
89
+
<programlisting language="java">
90
+
<![CDATA[
91
91
SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);]]>
92
-
</programlisting>
93
-
The Sentence Detector can output an array of Strings, where each String is one sentence.
92
+
</programlisting>
93
+
The Sentence Detector can output an array of Strings, where each String is one sentence.
94
94
<programlisting language="java">
95
-
<![CDATA[
95
+
<![CDATA[
96
96
String[] sentences = sentenceDetector.sentDetect("  First sentence. Second sentence. ");]]>
97
-
</programlisting>
98
-
The result array now contains two entries. The first String is "First sentence." and the
99
-
second String is "Second sentence." The whitespace before, between and after the input String is removed.
100
-
The API also offers a method which simply returns the span of the sentence in the input string.
101
-
<programlisting language="java">
102
-
<![CDATA[
97
+
</programlisting>
98
+
The result array now contains two entries. The first String is "First sentence." and the
99
+
second String is "Second sentence." The whitespace before, between and after the input String is removed.
100
+
The API also offers a method which simply returns the span of the sentence in the input string.
101
+
<programlisting language="java">
102
+
<![CDATA[
103
103
Span[] sentences = sentenceDetector.sentPosDetect("  First sentence. Second sentence. ");]]>
104
-
</programlisting>
105
-
The result array again contains two entries. The first span begins at index 2 and ends at
106
-
17. The second span begins at 18 and ends at 34. The utility method Span.getCoveredText can be used to create a substring which only covers the chars in the span.
107
-
</para>
104
+
</programlisting>
105
+
The result array again contains two entries. The first span begins at index 2 and ends at
106
+
17. The second span begins at 18 and ends at 34. The utility method Span.getCoveredText can be used to create a substring which only covers the chars in the span.
107
+
</para>
108
108
</section>
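The span offsets quoted above can be checked with a small stdlib-only sketch. It assumes the input string starts with two spaces (which is what the offsets 2 to 17 and 18 to 34 imply) and that the end offset is exclusive. `SimpleSpan` is a hypothetical stand-in for `opennlp.tools.util.Span`, so the sketch runs without the OpenNLP jar or a model file.

```java
// Hypothetical stand-in for opennlp.tools.util.Span; no OpenNLP dependency needed.
final class SpanOffsetsSketch {

    /** Minimal span: start is inclusive, end is exclusive (assumed convention). */
    static final class SimpleSpan {
        final int start;
        final int end;

        SimpleSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** Mirrors the idea of Span.getCoveredText: the substring the span covers. */
        CharSequence coveredText(CharSequence text) {
            return text.subSequence(start, end);
        }
    }

    public static void main(String[] args) {
        // Assumption: two leading spaces, one space between sentences, one trailing space.
        String input = "  First sentence. Second sentence. ";
        SimpleSpan first = new SimpleSpan(2, 17);
        SimpleSpan second = new SimpleSpan(18, 34);
        System.out.println("[" + first.coveredText(input) + "]");
        System.out.println("[" + second.coveredText(input) + "]");
    }
}
```

With the real API, the same substrings would come from calling `Span.getCoveredText` on each element returned by `sentPosDetect`.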
109
109
</section>
110
110
<section id="tools.sentdetect.training">
111
111
<title>Sentence Detector Training</title>
112
112
<para/>
113
113
<section id="tools.sentdetect.training.tool">
114
-
<title>Training Tool</title>
115
-
<para>
116
-
OpenNLP has a command line tool which is used to train the models available from the model
117
-
download page on various corpora. The data must be converted to the OpenNLP Sentence Detector
118
-
training format, which is one sentence per line. An empty line indicates a document boundary.
119
-
In case the document boundary is unknown, it's recommended to have an empty line every few ten
120
-
sentences. Exactly like the output in the sample above.
121
-
Usage of the tool:
122
-
<screen>
123
-
<![CDATA[
114
+
<title>Training Tool</title>
115
+
<para>
116
+
OpenNLP has a command line tool which is used to train the models available from the model
117
+
download page on various corpora. The data must be converted to the OpenNLP Sentence Detector
118
+
training format, which is one sentence per line. An empty line indicates a document boundary.
119
+
In case the document boundary is unknown, it's recommended to have an empty line every few ten
120
+
sentences. Exactly like the output in the sample above.
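The training format described above (one sentence per line, an empty line between documents) can be produced with a few lines of plain Java. This is only a sketch: `SentenceTrainingFormatSketch` and `toTrainingFormat` are hypothetical names, not part of OpenNLP, and the input is assumed to be already split into sentences and grouped by document.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; it only formats text and does not call OpenNLP.
final class SentenceTrainingFormatSketch {

    /** Writes one sentence per line and an empty line between documents. */
    static String toTrainingFormat(List<List<String>> documents) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < documents.size(); i++) {
            if (i > 0) {
                out.append('\n'); // empty line marks a document boundary
            }
            for (String sentence : documents.get(i)) {
                out.append(sentence.trim()).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
            Arrays.asList(
                "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.",
                "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group."),
            Arrays.asList(
                "Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, "
                    + "was named a director of this British industrial conglomerate."));
        System.out.print(toTrainingFormat(documents));
    }
}
```

Grouping the sample sentences into two documents here is purely for illustration; the source text does not say where the document boundaries in the sample lie.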