<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->

<chapter id="tools.postagger">
<title>Part-of-Speech Tagger</title>
	<section id="tools.postagger.tagging">
		<title>Tagging</title>
		<para>
		The Part-of-Speech Tagger marks each token with its word type, based on the
		token itself and on the context in which it appears. The same token can receive
		different pos tags in different contexts. The OpenNLP POS Tagger uses a probability
		model to predict the most likely pos tag from the tag set. A tag dictionary can be
		used to limit the possible tags for a token, which improves both the tagging accuracy
		and the runtime performance of the tagger.
		</para>
			<section id="tools.postagger.tagging.cmdline">
		<title>POS Tagger Tool</title>
		<para>
		The easiest way to try out the POS Tagger is the command line tool. The tool is
		only intended for demonstration and testing.
		Download the English maxent pos model and start the POS Tagger Tool with this command:
		<screen>
			<![CDATA[
$ opennlp POSTagger opennlp-en-ud-ewt-pos-1.3-2.5.4.bin]]>
		 </screen>
		The POS Tagger now reads a tokenized sentence per line from stdin.
		Copy these two sentences to the console:
		<screen>
			<![CDATA[
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .]]>
		 </screen>
		 The POS Tagger will now echo the sentences with pos tags to the console:
		<screen>
			<![CDATA[
Pierre_PROPN Vinken_PROPN ,_PUNCT 61_NUM years_NOUN old_ADJ ,_PUNCT will_AUX join_VERB the_DET board_NOUN as_ADP
		a_DET nonexecutive_ADJ director_NOUN Nov._PROPN 29_NUM ._PUNCT
Mr._PROPN Vinken_PROPN is_AUX chairman_NOUN of_ADP Elsevier_ADJ N.V._PROPN ,_PUNCT the_DET Dutch_PROPN publishing_VERB group_NOUN ._PUNCT]]>
		 </screen>
		 The tag set used by the English pos model is the <ulink url="https://universaldependencies.org/u/pos/">Universal Dependencies (UD) part-of-speech tag set</ulink>.
		</para>
      </section>

		<section id="tools.postagger.tagging.api">
		<title>POS Tagger API</title>
		<para>
		    The POS Tagger can be embedded into an application via its API.
			First the pos model must be loaded into memory from disk or another source.
			In the sample below it is loaded from disk.
			<programlisting language="java">
				<![CDATA[
// Declared outside the try block so the model stays in scope after loading.
POSModel model;
try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-pos-1.3-2.5.4.bin")) {
  model = new POSModel(modelIn);
}]]>
			</programlisting>
			After the model is loaded the POSTaggerME can be instantiated.
			<programlisting language="java">
				<![CDATA[
POSTaggerME tagger = new POSTaggerME(model);]]>
			</programlisting>
			The POS Tagger instance is now ready to tag data. It expects a tokenized sentence
			as input, represented as a String array in which each String object is
			one token.
	   </para>
	   <para>
	   The following code shows how to determine the most likely pos tag sequence for a sentence.
	   	<programlisting language="java">
		  <![CDATA[
String[] sent = new String[]{"Most", "large", "cities", "in", "the", "US", "had",
                             "morning", "and", "afternoon", "newspapers", "."};
String[] tags = tagger.tag(sent);]]>
			</programlisting>
			The tags array contains one part-of-speech tag for each token in the input array. The tag
			for a token is found at the same index as the token in the input array.
			The confidence scores for the returned tags can be retrieved from
			a POSTaggerME with the following method call:
				   	<programlisting language="java">
		  <![CDATA[
double[] probs = tagger.probs();]]>
			</programlisting>
			The call to probs is stateful and always returns the probabilities of the last
			tagged sentence. The probs method should only be called after the tag method
			has been called; otherwise the behavior is undefined.
			</para>
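			<para>
			The index alignment between tokens, tags and confidence scores can be sketched
			with plain arrays. The class and method names below are illustrative only and are
			not part of the OpenNLP API; the arrays in main are hand-written stand-ins for
			the results of tagger.tag(sent) and tagger.probs().
			<programlisting language="java">
		  <![CDATA[
```java
import java.util.Locale;
import java.util.StringJoiner;

/** Illustrative helper, not part of OpenNLP: formats parallel token/tag/probability arrays. */
final class TaggedSentenceFormatter {

  /** Joins tokens[i], tags[i] and probs[i] as token_TAG(prob), relying on the shared index. */
  static String format(String[] tokens, String[] tags, double[] probs) {
    if (tokens.length != tags.length || tokens.length != probs.length) {
      throw new IllegalArgumentException("tokens, tags and probs must have the same length");
    }
    StringJoiner joined = new StringJoiner(" ");
    for (int i = 0; i < tokens.length; i++) {
      joined.add(String.format(Locale.ROOT, "%s_%s(%.2f)", tokens[i], tags[i], probs[i]));
    }
    return joined.toString();
  }

  public static void main(String[] args) {
    // Assumed values; a real application would take them from the tagger.
    String[] sent = {"Most", "large", "cities"};
    String[] tags = {"ADJ", "ADJ", "NOUN"};
    double[] probs = {0.91, 0.98, 0.99};
    System.out.println(format(sent, tags, probs));
  }
}
```
]]>
			</programlisting>
			Because probs is stateful, an application would normally read the probabilities
			immediately after the matching tag call, before tagging the next sentence.
			</para>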
			<para>
			Some applications need the n-best pos tag sequences and not
			only the best sequence.
			The topKSequences method returns the top sequences.
			It is called in the same way as tag.
			<programlisting language="java">
		  <![CDATA[
Sequence[] topSequences = tagger.topKSequences(sent);]]>
			</programlisting>
			Each Sequence object contains one sequence. Sequence.getOutcomes() returns
			the tags of that sequence as a list, and Sequence.getProbs() returns the
			corresponding probability array.
	  		 </para>
	</section>
	</section>
		<section id="tools.postagger.training">
		<title>Training</title>
		<para>
			The POS Tagger can be trained on annotated training material. The training material
			is a collection of tokenized sentences where each token has the assigned part-of-speech tag.
			The native POS Tagger training material looks like this:
			<screen>
		  <![CDATA[
About_ADV 10_NUM Euro_PROPN ,_PUNCT I_PRON reckon_VERB ._PUNCT
That_PRON sounds_VERB good_ADJ ._PUNCT]]>
			</screen>
			Each sentence must be on one line. Each token is joined to its tag with "_",
			and the token/tag pairs are separated by whitespace. The data format does not
			define a document boundary. If a document boundary should be included in the
			training material, an empty line is the suggested marker.
		</para>
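		<para>
			The format described above can be read with a few lines of plain Java. The following
			is a minimal sketch, not the parser OpenNLP uses; the class name is hypothetical, and
			splitting at the last "_" is an assumption made here so that tokens which themselves
			contain an underscore still parse.
			<programlisting language="java">
		  <![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for one line of token_TAG training data; not part of OpenNLP. */
final class WordTagLineParser {

  /** Returns one {token, tag} array per whitespace-separated pair; an empty line yields no pairs. */
  static List<String[]> parse(String line) {
    List<String[]> pairs = new ArrayList<>();
    for (String pair : line.trim().split("\\s+")) {
      if (pair.isEmpty()) {
        continue; // empty line, for example a document boundary
      }
      // Assumption: the tag starts after the LAST underscore.
      int split = pair.lastIndexOf('_');
      if (split <= 0 || split == pair.length() - 1) {
        throw new IllegalArgumentException("Not a token_TAG pair: " + pair);
      }
      pairs.add(new String[] {pair.substring(0, split), pair.substring(split + 1)});
    }
    return pairs;
  }

  public static void main(String[] args) {
    for (String[] pair : parse("That_PRON sounds_VERB good_ADJ ._PUNCT")) {
      System.out.println(pair[0] + " -> " + pair[1]);
    }
  }
}
```
]]>
			</programlisting>
			Rejecting pairs without a token or without a tag makes formatting mistakes in the
			training material visible before training starts.
		</para>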
		<para>The Part-of-Speech Tagger can either be trained with a command line tool,
		or via a training API.
		</para>

		<section id="tools.postagger.training.tool">
		<title>Training Tool</title>
		<para>
			OpenNLP has a command line tool which is used to train the models available from the model
			download page on various corpora.
		</para>
		<para>
		    Usage of the tool:
            <screen>
				<![CDATA[
$ opennlp POSTaggerTrainer
Usage: opennlp POSTaggerTrainer[.conllx] [-type maxent|perceptron|perceptron_sequence] \
               [-dict dictionaryPath] [-ngram cutoff] [-params paramsFile] [-iterations num] \
               [-cutoff num] -model modelFile -lang language -data sampleData \
               [-encoding charsetName]

Arguments description:
        -type maxent|perceptron|perceptron_sequence
                The type of the token name finder model. One of maxent|perceptron|perceptron_sequence.
        -dict dictionaryPath
                The XML tag dictionary file
        -ngram cutoff
                NGram cutoff. If not specified will not create ngram dictionary.
        -params paramsFile
                training parameters file.
        -iterations num
                number of training iterations, ignored if -params is used.
        -cutoff num
                minimal number of times a feature must be seen, ignored if -params is used.
        -model modelFile
                output model file.
        -lang language
                language which is being processed.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.]]>
			 </screen>
		</para>
		<para>
		    The following command illustrates how an English part-of-speech model can be trained:
		    <screen>
		  <![CDATA[
$ opennlp POSTaggerTrainer -type maxent -model en-custom-pos-maxent.bin \
                           -lang en -data en-custom-pos.train -encoding UTF-8]]>
		    </screen>
		</para>
		</section>
		<section id="tools.postagger.training.api">
		<title>Training API</title>
		<para>
		The Part-of-Speech Tagger training API supports the training of a new pos model.
		Basically three steps are necessary to train it:
		<itemizedlist>
			<listitem>
				<para>Open a sample data stream</para>
			</listitem>
			<listitem>
				<para>Call the 'POSTaggerME.train' method</para>
			</listitem>
			<listitem>
				<para>Save the POSModel to a file</para>
			</listitem>
		</itemizedlist>
		The following code illustrates that:
		<programlisting language="java">
				<![CDATA[
POSModel model = null;

// The input is the training data file, not the model file that training produces.
// try-with-resources closes both streams when training finishes or fails.
try (ObjectStream<String> lineStream = new PlainTextByLineStream(
         new MarkableFileInputStreamFactory(new File("en-custom-pos.train")), StandardCharsets.UTF_8);
     ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream)) {

  model = POSTaggerME.train("eng", sampleStream, TrainingParameters.defaultParams(), new POSTaggerFactory());
} catch (IOException e) {
  e.printStackTrace();
}]]>
	</programlisting>
	The above code performs the first two steps, opening the data and training
	the model. The trained model must still be saved to an OutputStream. In
	the sample below it is written to a file.
	<programlisting language="java">
				<![CDATA[
try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelFile))){
  model.serialize(modelOut);
}]]>
		</programlisting>
		</para>
		</section>
		<section id="tools.postagger.training.tagdict">
		<title>Tag Dictionary</title>
		<para>
		The tag dictionary is a word dictionary which specifies which tags a specific token can have. Using a tag
		dictionary has two advantages: inappropriate tags cannot be assigned to tokens in the dictionary, and the
		beam search algorithm has to consider fewer possibilities and can therefore search faster.
		</para>
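		<para>
		The restricting effect of a tag dictionary can be sketched in isolation. The code below is a
		simplified illustration and not OpenNLP's implementation: it picks a single best tag instead of
		running a beam search, the class and method names are hypothetical, and the lower-cased lookup
		is an assumption that mimics a case-insensitive dictionary.
			<programlisting language="java">
				<![CDATA[
```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch of how a tag dictionary narrows the candidate tags; not part of OpenNLP. */
final class TagDictionarySketch {

  /**
   * Returns the most probable tag for the token. If the token is listed in the dictionary,
   * only the tags allowed there are considered; otherwise every candidate tag is considered.
   * Returns null if no candidate tag is allowed.
   */
  static String bestTag(String token, Map<String, Double> tagProbs,
                        Map<String, Set<String>> dictionary) {
    // Assumption: dictionary keys are stored in lower case (case-insensitive lookup).
    Set<String> allowed = dictionary.get(token.toLowerCase(Locale.ROOT));
    String best = null;
    double bestProb = -1.0;
    for (Map.Entry<String, Double> candidate : tagProbs.entrySet()) {
      if (allowed != null && !allowed.contains(candidate.getKey())) {
        continue; // tag ruled out by the dictionary, never scored
      }
      if (candidate.getValue() > bestProb) {
        best = candidate.getKey();
        bestProb = candidate.getValue();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    Map<String, Set<String>> dictionary = Map.of("is", Set.of("OTHER"));
    Map<String, Double> modelScores = Map.of("AUX", 0.7, "OTHER", 0.3); // made-up scores
    System.out.println(bestTag("is", modelScores, dictionary));  // listed token
    System.out.println(bestTag("was", modelScores, dictionary)); // unlisted token
  }
}
```
]]>
			</programlisting>
		Tokens that are not listed in the dictionary are left unrestricted in this sketch, so the
		dictionary only needs entries for tokens whose tags should be constrained.
		</para>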
		<para>
		The dictionary is defined in an XML format and can be created and stored with the POSDictionary class.
		The following example trains a custom model using a tag dictionary.
		</para>
		<para>
		Sample POS training material (file: en-custom-pos.train):
			<screen>
				<![CDATA[
It_PRON is_OTHER spring_PROPN season_NOUN ._PUNCT
The_DET flowers_NOUN are_OTHER red_ADJ and_CCONJ yellow_ADJ ._PUNCT
Red_NOUN is_OTHER my_DET favourite_ADJ colour_NOUN ._PUNCT]]>
			</screen>
		</para>
		<para>
		Sample tag dictionary (file: dictionary.xml):
			<programlisting language="xml">
				<![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
 <dictionary case_sensitive="false">
  <entry tags="PRON">
    <token>It</token>
  </entry>
  <entry tags="OTHER">
    <token>is</token>
  </entry>
  <entry tags="PROPN">
    <token>Spring</token>
  </entry>
  <entry tags="NOUN">
    <token>season</token>
  </entry>
  <entry tags="DET">
    <token>the</token>
  </entry>
  <entry tags="NOUN">
    <token>flowers</token>
  </entry>
  <entry tags="OTHER">
    <token>are</token>
  </entry>
  <entry tags="NOUN">
    <token>red</token>
  </entry>
  <entry tags="CCONJ">
    <token>and</token>
  </entry>
  <entry tags="NOUN">
    <token>yellow</token>
  </entry>
  <entry tags="PRON">
    <token>my</token>
  </entry>
  <entry tags="ADJ">
    <token>favourite</token>
  </entry>
  <entry tags="NOUN">
    <token>colour</token>
  </entry>
  <entry tags="PUNCT">
    <token>.</token>
  </entry>
</dictionary>]]>
			</programlisting>
		</para>
		<para>Sample code to train a model using the above tag dictionary:
			<programlisting language="java">
			<![CDATA[
POSModel model = null;

// try-with-resources closes the training data streams when training finishes or fails.
try (ObjectStream<String> lineStream = new PlainTextByLineStream(
         new MarkableFileInputStreamFactory(new File("en-custom-pos.train")), StandardCharsets.UTF_8);
     ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream)) {

  // A cutoff of 0 keeps features no matter how rarely they occur in this tiny training set.
  TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
  params.put(TrainingParameters.CUTOFF_PARAM, 0);

  // Load the tag dictionary and register it with the factory used for training.
  POSTaggerFactory factory = new POSTaggerFactory();
  TagDictionary dict = factory.createTagDictionary(new File("dictionary.xml"));
  factory.setTagDictionary(dict);

  model = POSTaggerME.train("eng", sampleStream, params, factory);

  // Closing the output stream flushes the buffer and releases the file handle.
  try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("en-custom-pos-maxent.bin"))) {
    model.serialize(modelOut);
  }
} catch (IOException e) {
  e.printStackTrace();
}]]>
			</programlisting>
		</para>
		<para>
		The custom model is then used to tag a sequence.
		<programlisting language="java">
			<![CDATA[
POSTaggerME tagger = new POSTaggerME(model);
String[] sent = new String[]{"Spring", "is", "my", "favourite", "season", "."};
String[] tags = tagger.tag(sent);
Arrays.stream(tags).forEach(k -> System.out.print(k + " "));]]>
		</programlisting>
		</para>
		<para>
			<literallayout>
				Input
				    Sentence:	Spring is my favourite season.

				Output
				    POS Tags using the custom model (en-custom-pos-maxent.bin): PROPN OTHER PRON ADJ NOUN PUNCT

				Output with the default model
				    POS Tags using the default model (opennlp-en-ud-ewt-pos-1.3-2.5.4.bin):	NOUN AUX PRON ADJ NOUN PUNCT
			</literallayout>
		</para>
		</section>
		</section>

		<section id="tools.postagger.eval">
		<title>Evaluation</title>
		<para>
		The built-in evaluation can measure the accuracy of the pos tagger.
		The accuracy can be measured on a test data set or via cross validation.
		</para>
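		<para>
		As a rough illustration of what such an accuracy score means, the sketch below computes the
		fraction of tokens whose predicted tag matches the reference tag. It assumes a per-token
		accuracy over all sentences and is not the code of the built-in evaluator; the class name
		and the sample tags are made up.
		<programlisting language="java">
				<![CDATA[
```java
/** Illustrative per-token accuracy computation; not part of OpenNLP. */
final class TagAccuracy {

  /** Returns correctly tagged tokens divided by all tokens, or 0.0 if there are no tokens. */
  static double accuracy(String[][] reference, String[][] predicted) {
    if (reference.length != predicted.length) {
      throw new IllegalArgumentException("different number of sentences");
    }
    long correct = 0;
    long total = 0;
    for (int s = 0; s < reference.length; s++) {
      if (reference[s].length != predicted[s].length) {
        throw new IllegalArgumentException("different number of tokens in sentence " + s);
      }
      for (int t = 0; t < reference[s].length; t++) {
        if (reference[s][t].equals(predicted[s][t])) {
          correct++;
        }
        total++;
      }
    }
    return total == 0 ? 0.0 : (double) correct / total;
  }

  public static void main(String[] args) {
    // Two short sentences; the second predicted tag differs from the reference.
    String[][] reference = {{"DET", "NOUN"}, {"VERB"}};
    String[][] predicted = {{"DET", "VERB"}, {"VERB"}};
    System.out.println("Accuracy: " + accuracy(reference, predicted));
  }
}
```
]]>
		</programlisting>
		Counting over tokens rather than sentences means that long sentences contribute more to the
		score than short ones.
		</para>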
		<section id="tools.postagger.eval.tool">
		<title>Evaluation Tool</title>
		<para>
		There is a command line tool to evaluate a given model on a test data set.
		The following command shows how the tool can be run:
		<screen>
				<![CDATA[
$ opennlp POSTaggerEvaluator -model pt.postagger.bin -data pt.postagger.test -encoding utf-8]]>
			 </screen>
			 This will display the resulting accuracy score, e.g.:
			 <screen>
				<![CDATA[
Loading model ... done
Evaluating ... done

Accuracy: 0.9659110277825124]]>
			 </screen>
		</para>
            <para>
            There is also a command line tool that runs cross-validation on a data set.
            The following command shows how the tool can be run:
            <screen>
                    <![CDATA[
$ opennlp POSTaggerCrossValidator -lang pt -data pt.postagger.test -encoding utf-8]]>
                 </screen>
                 This will display the resulting accuracy score, e.g.:
                 <screen>
                    <![CDATA[
Accuracy: 0.9659110277825124]]>
                 </screen>
            </para>
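            <para>
            The idea behind cross-validation can be sketched as follows: the data is split into
            folds, and each fold is held out once for testing while the remaining folds are used
            for training. The partitioning scheme below (every n-th sample goes to the test
            partition) is one common choice and is an assumption here, not a description of how
            the tool partitions its data; the class and method names are hypothetical.
            <programlisting language="java">
                    <![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative k-fold partitioning; not part of OpenNLP. */
final class CrossValidationFolds {

  /** Samples whose index modulo folds equals fold form the held-out test partition. */
  static <T> List<T> testPartition(List<T> samples, int folds, int fold) {
    List<T> test = new ArrayList<>();
    for (int i = 0; i < samples.size(); i++) {
      if (i % folds == fold) {
        test.add(samples.get(i));
      }
    }
    return test;
  }

  /** All remaining samples form the training partition for the same fold. */
  static <T> List<T> trainPartition(List<T> samples, int folds, int fold) {
    List<T> train = new ArrayList<>();
    for (int i = 0; i < samples.size(); i++) {
      if (i % folds != fold) {
        train.add(samples.get(i));
      }
    }
    return train;
  }

  public static void main(String[] args) {
    List<String> sentences = List.of("s0", "s1", "s2", "s3", "s4", "s5");
    int folds = 3;
    for (int fold = 0; fold < folds; fold++) {
      System.out.println("fold " + fold
          + ": test=" + testPartition(sentences, folds, fold)
          + " train=" + trainPartition(sentences, folds, fold));
    }
  }
}
```
]]>
            </programlisting>
            A model would be trained on each training partition and evaluated on the matching test
            partition, so that every sample is used for testing exactly once.
            </para>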

		</section>
		</section>
</chapter>