<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
]>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
	license agreements. See the NOTICE file distributed with this work for additional
	information regarding copyright ownership. The ASF licenses this file to
	you under the Apache License, Version 2.0 (the "License"); you may not use
	this file except in compliance with the License. You may obtain a copy of
	the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
	by applicable law or agreed to in writing, software distributed under the
	License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
	OF ANY KIND, either express or implied. See the License for the specific
	language governing permissions and limitations under the License. -->

<chapter id="tools.lemmatizer">
	<title>Lemmatizer</title>
		<para>
			The lemmatizer returns, for a given word form (token) and part-of-speech
			tag, the dictionary form of the word, which is usually referred to as its
			lemma. A token can be ambiguous, that is, derived from several base forms or
			dictionary words, which is why the postag of the word is required to find
			the lemma. For example, the form "show" may refer to either the verb
			"to show" or the noun "show".
			Currently, OpenNLP implements statistical and dictionary-based lemmatizers.
		</para>
		<section id="tools.lemmatizer.tagging.cmdline">
			<title>Lemmatizer Tool</title>
			<para>
				The easiest way to try out the Lemmatizer is the command line tool,
				which provides access to the statistical
				lemmatizer. Note that the tool is only intended for demonstration and testing.
			</para>
			<para>
				Once you have trained a lemmatizer model (see below for instructions),
				you can start the Lemmatizer Tool with this command:
			</para>
			<para>
				<screen>
		   <![CDATA[
$ opennlp LemmatizerME opennlp-en-ud-ewt-lemmas-1.3-2.5.4.bin < sentences]]>
		  </screen>
				The Lemmatizer now reads one POS-tagged sentence per line from
				standard input. For example, you can copy this sentence to the
				console:
				<screen>
		    <![CDATA[
Rockwell_PROPN International_ADJ Corp_NOUN 's_PUNCT Tulsa_PROPN unit_NOUN said_VERB it_PRON
signed_VERB a_DET tentative_NOUN agreement_NOUN extending_VERB its_PRON contract_NOUN
with_ADP Boeing_PROPN Co._NOUN to_PART provide_VERB structural_ADJ parts_NOUN for_ADP
Boeing_PROPN 's_PUNCT 747_NUM jetliners_NOUN ._PUNCT]]>
		  </screen>
				The Lemmatizer will now echo the lemma for each word-postag pair to
				the console:
				<screen>
		    <![CDATA[
Rockwell	PROPN	rockwell
International	ADJ	international
Corp	NOUN	corp
's	PUNCT	's
Tulsa	PROPN	tulsa
unit	NOUN	unit
said	VERB	say
it	PRON	it
signed	VERB	sign
...
]]>
		  </screen>
			</para>
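			<para>
				The input shown above attaches each POS tag to its token with an
				underscore. When preparing such input for the API, the line has to be
				split into a token array and a tag array. The following is a minimal
				sketch of that step in plain Java. The class TaggedSentenceParser is a
				hypothetical helper, not part of OpenNLP, and splitting on the last
				underscore is an assumption based on the sample sentence.
				<programlisting language="java">
				<![CDATA[
import java.util.Arrays;

final class TaggedSentenceParser {

  /** Splits "token_TAG" items on whitespace; result[0] holds tokens, result[1] holds tags. */
  static String[][] parse(String line) {
    String[] items = line.trim().split("\\s+");
    String[] tokens = new String[items.length];
    String[] tags = new String[items.length];
    for (int i = 0; i < items.length; i++) {
      // Assumption: the last underscore separates token and tag,
      // so tokens that contain '_' themselves stay intact.
      int sep = items[i].lastIndexOf('_');
      if (sep < 0) {
        throw new IllegalArgumentException("Missing POS tag in: " + items[i]);
      }
      tokens[i] = items[i].substring(0, sep);
      tags[i] = items[i].substring(sep + 1);
    }
    return new String[][] { tokens, tags };
  }

  public static void main(String[] args) {
    String[][] parsed = parse("Rockwell_PROPN said_VERB it_PRON signed_VERB ._PUNCT");
    System.out.println(Arrays.toString(parsed[0]));
    System.out.println(Arrays.toString(parsed[1]));
  }
}
]]>
			</programlisting>
				The two arrays have the same length and share indexes, which is the
				shape the lemmatizer API described in the next section expects.
			</para>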
		</section>
		<section id="tools.lemmatizer.tagging.api">
			<title>Lemmatizer API</title>
			<para>
				The Lemmatizer can be embedded into an application via its API.
				Currently, a statistical lemmatizer (LemmatizerME) and a
				DictionaryLemmatizer are available. The two approaches are
				complementary: the DictionaryLemmatizer can also be used to
				post-process the output of the statistical lemmatizer.
			</para>
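			<para>
				One way to picture the post-processing idea is to prefer the dictionary
				lemma wherever the dictionary knows the (word, postag) pair and to keep
				the statistical lemma otherwise. The following is a minimal sketch in
				plain Java that operates on two lemma arrays that have already been
				computed. The class LemmaPostProcessor is a hypothetical helper, and the
				sketch assumes that the dictionary lookup marks unknown pairs with the
				placeholder "O"; check the behaviour of your OpenNLP version before
				relying on that.
				<programlisting language="java">
				<![CDATA[
import java.util.Arrays;

final class LemmaPostProcessor {

  // Assumption: the dictionary lookup marks unknown (word, postag) pairs with "O".
  static final String UNKNOWN = "O";

  /** Prefers the dictionary lemma where one exists, else keeps the statistical lemma. */
  static String[] merge(String[] statisticalLemmas, String[] dictionaryLemmas) {
    if (statisticalLemmas.length != dictionaryLemmas.length) {
      throw new IllegalArgumentException("Both arrays must cover the same tokens");
    }
    String[] merged = new String[statisticalLemmas.length];
    for (int i = 0; i < merged.length; i++) {
      merged[i] = UNKNOWN.equals(dictionaryLemmas[i])
          ? statisticalLemmas[i]
          : dictionaryLemmas[i];
    }
    return merged;
  }

  public static void main(String[] args) {
    String[] statistical = { "rockwell", "say", "signe" };
    String[] dictionary = { "O", "say", "sign" };
    System.out.println(Arrays.toString(merge(statistical, dictionary)));
  }
}
]]>
			</programlisting>
				Whether the dictionary or the model should win on a disagreement depends
				on how much you trust each resource; the sketch simply lets the
				dictionary override.
			</para>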
			<para>
				The statistical lemmatizer requires that a trained model is loaded
				into memory from disk or from another source.
				In the example below it is loaded from disk:
				<programlisting language="java">
		<![CDATA[
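import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.lemmatizer.LemmatizerModel;

LemmatizerModel model = null;
try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-lemmas-1.3-2.5.4.bin")) {
  model = new LemmatizerModel(modelIn);
}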
]]>
			</programlisting>
				After the model is loaded a LemmatizerME can be instantiated.
				<programlisting language="java">
				<![CDATA[
LemmatizerME lemmatizer = new LemmatizerME(model);]]>
			</programlisting>
				The Lemmatizer instance is now ready to lemmatize data. It expects a
				tokenized sentence as input, represented as a String array in which
				each String is one token, together with a second String array that
				holds the POS tag of each token.
			</para>
			<para>
				The following code shows how to determine the most likely lemma for
				each token of a sentence.
				<programlisting language="java">
		  <![CDATA[
String[] tokens = new String[] { "Rockwell", "International", "Corp.", "'s",
    "Tulsa", "unit", "said", "it", "signed", "a", "tentative", "agreement",
    "extending", "its", "contract", "with", "Boeing", "Co.", "to",
    "provide", "structural", "parts", "for", "Boeing", "'s", "747",
    "jetliners", "." };

String[] postags = new String[] { "PROPN", "ADJ", "NOUN", "PUNCT", "PROPN", "NOUN",
    "VERB", "PRON", "VERB", "DET", "NOUN", "NOUN", "VERB", "PRON", "NOUN", "ADP",
    "PROPN", "NOUN", "PART", "VERB", "ADJ", "NOUN", "ADP", "PROPN", "PUNCT", "NUM", "NOUN",
    "PUNCT" };

String[] lemmas = lemmatizer.lemmatize(tokens, postags);]]>
		</programlisting>
				The lemmas array contains one lemma for each token in the
				input array. The corresponding
				tag and lemma can be found at the same index as the token has in the
				input array.
			</para>

			<para>
				The DictionaryLemmatizer is constructed
				by passing the InputStream of a lemmatizer dictionary. Such a dictionary
				is a text file in which each row contains a word, its postag and the
				corresponding lemma, with the columns separated by a tab character.
				<screen>
		<![CDATA[
show		NOUN	show
showcase	NOUN	showcase
showcases	NOUN	showcase
showdown	NOUN	showdown
showdowns	NOUN	showdown
shower		NOUN	shower
showers		NOUN	shower
showman		NOUN	showman
showmanship	NOUN	showmanship
showmen		NOUN	showman
showroom	NOUN	showroom
showrooms	NOUN	showroom
shows		NOUN	show
shrapnel	NOUN	shrapnel
		]]>
		</screen>
				Alternatively, if a (word, postag) pair can map to multiple lemmas, each
				row of the lemmatizer dictionary contains a word, its postag and the
				corresponding lemmas separated by "#":
				<screen>
		<![CDATA[
muestras	NOUN	muestra
cantaba		VERB	cantar
fue		VERB	ir#ser
entramos	VERB	entrar
		]]>
					</screen>
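				The two row layouts above can be read with plain Java collections. The
				following is a minimal sketch, not the DictionaryLemmatizer
				implementation: the class LemmaDictionarySketch and its parse method are
				hypothetical names, and accepting a run of several tabs between columns
				is an assumption made so that the aligned samples above parse as shown.
				<programlisting language="java">
				<![CDATA[
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LemmaDictionarySketch {

  /** Maps "word<TAB>postag" keys to the list of lemmas found on that row. */
  static Map<String, List<String>> parse(List<String> rows) {
    Map<String, List<String>> entries = new HashMap<>();
    for (String row : rows) {
      if (row.trim().isEmpty()) {
        continue; // skip blank rows
      }
      // Assumption: one or more tabs separate the three columns.
      String[] columns = row.trim().split("\t+");
      if (columns.length != 3) {
        throw new IllegalArgumentException("Expected word, postag and lemma: " + row);
      }
      // A "#" separates alternative lemmas, e.g. "ir#ser".
      List<String> lemmas = Arrays.asList(columns[2].split("#"));
      entries.put(columns[0] + "\t" + columns[1], lemmas);
    }
    return entries;
  }

  public static void main(String[] args) {
    Map<String, List<String>> entries =
        parse(List.of("shows\tNOUN\tshow", "fue\t\tVERB\tir#ser"));
    System.out.println(entries.get("shows\tNOUN"));
    System.out.println(entries.get("fue\tVERB"));
  }
}
]]>
			</programlisting>
				Storing a list per entry covers both layouts: a row without "#" simply
				yields a list with a single lemma.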
				First the dictionary must be loaded into memory from disk or another
				source.
				In the sample below it is loaded from disk.
				<programlisting language="java">
				<![CDATA[
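// The stream is open only inside the try block, so the DictionaryLemmatizer
// has to be created there.
try (InputStream dictLemmatizer = new FileInputStream("english-dict-lemmatizer.txt")) {
  // create the DictionaryLemmatizer here, see the next listing
}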
]]>
			</programlisting>
				After the dictionary is loaded the DictionaryLemmatizer can be
				instantiated.
				<programlisting language="java">
			  <![CDATA[
DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictLemmatizer);]]>
			</programlisting>
				The DictionaryLemmatizer instance is now ready. It expects two
				String arrays as input,
				one containing the tokens and another one their respective postags.
			</para>
			<para>
				The following code shows how to find a lemma using a
				DictionaryLemmatizer.
				<programlisting language="java">
		  <![CDATA[
String[] tokens = new String[]{"Most", "large", "cities", "in", "the", "US", "had",
                             "morning", "and", "afternoon", "newspapers", "."};
// tagger is a POS tagger created beforehand, for example a POSTaggerME
String[] tags = tagger.tag(tokens);
String[] lemmas = lemmatizer.lemmatize(tokens, tags);
]]>
			</programlisting>
				The lemmas array contains one lemma for each token in the
				input array. The tag and the lemma of a token can be found at the
				same index as the token has in the input array.
			</para>
		</section>
		<section id="tools.lemmatizer.training">
			<title>Lemmatizer Training</title>
			<para>
				The training data consists of three columns separated by tabs. Each
				word is on a separate line and there is an empty line after each
				sentence. The first column contains the current word, the second its
				part-of-speech tag and the third its lemma.
			</para>
			<para>
				Sample sentence of the training data:
				<screen>
		<![CDATA[
He        PRON  he
reckons   VERB  reckon
the       DET   the
current   ADJ   current
accounts  NOUN  account
deficit   NOUN   deficit
will      AUX   will
narrow    VERB   narrow
to        PART   to
only      ADV   only
#         #    #
1.8       NUM   1.8
millions  NOUN   million
in        ADP   in
September PROPN  september
.         PUNCT   O]]>
		</screen>
				The Universal Dependencies Treebank and the CoNLL 2009 datasets
				distribute training data for many languages.
			</para>
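			<para>
				Before training, it can be useful to check that a file follows this
				layout. The following is a minimal sketch in plain Java that groups rows
				into sentences at empty lines. The class LemmaTrainingDataSketch is a
				hypothetical helper, not an OpenNLP class. Splitting on any run of
				whitespace, so that both tab-separated files and the aligned sample
				above are accepted, is an assumption of the sketch and breaks for
				tokens that contain spaces.
				<programlisting language="java">
				<![CDATA[
import java.util.ArrayList;
import java.util.List;

final class LemmaTrainingDataSketch {

  /** Returns one list per sentence; each row is { word, postag, lemma }. */
  static List<List<String[]>> readSentences(List<String> lines) {
    List<List<String[]>> sentences = new ArrayList<>();
    List<String[]> current = new ArrayList<>();
    for (String line : lines) {
      if (line.trim().isEmpty()) {
        // An empty line closes the current sentence.
        if (!current.isEmpty()) {
          sentences.add(current);
          current = new ArrayList<>();
        }
        continue;
      }
      // Assumption: columns are separated by tabs or other whitespace.
      String[] columns = line.trim().split("\\s+");
      if (columns.length != 3) {
        throw new IllegalArgumentException("Expected three columns: " + line);
      }
      current.add(columns);
    }
    if (!current.isEmpty()) {
      sentences.add(current); // the last sentence may lack a trailing empty line
    }
    return sentences;
  }

  public static void main(String[] args) {
    List<List<String[]>> sentences = readSentences(
        List.of("He\tPRON\the", "reckons\tVERB\treckon", "", "in\tADP\tin"));
    System.out.println(sentences.size() + " sentences");
  }
}
]]>
			</programlisting>
			</para>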
			<section id="tools.lemmatizer.training.tool">
				<title>Training Tool</title>
				<para>
					OpenNLP has a command line tool which is used to train the models on
					various corpora.
				</para>
				<para>
					Usage of the tool:
					<screen>
		<![CDATA[
$ opennlp LemmatizerTrainerME
Usage: opennlp LemmatizerTrainerME [-factory factoryName] [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]

Arguments description:
	-factory factoryName
		A sub-class of LemmatizerFactory where to get implementation and resources.
	-params paramsFile
		training parameters file.
	-lang language
		language which is being processed.
	-model modelFile
		output model file.
	-data sampleData
		data to be used, usually a file name.
	-encoding charsetName
		encoding for reading and writing text, if absent the system default is used.
		]]>
		</screen>
					It is now assumed that the English lemmatizer model should be trained
					from a file called
					'en-custom-lemmatizer.train' which is encoded as UTF-8. The following command will train the
					lemmatizer and write the model to en-custom-lemmatizer.bin:
					<screen>
		<![CDATA[
$ opennlp LemmatizerTrainerME -model en-custom-lemmatizer.bin -params PerceptronTrainerParams.txt -lang en -data en-custom-lemmatizer.train -encoding UTF-8]]>
		</screen>
				</para>
			</section>
			<section id="tools.lemmatizer.training.api">
				<title>Training API</title>
				<para>
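					The Lemmatizer offers an API to train a new lemmatizer model. First
					the training parameters need to be loaded. In the listings of this
					section, params stands for the parsed command line arguments, as in
					the training tool: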
					<programlisting language="java">
                    <![CDATA[
 TrainingParameters mlParams = CmdLineUtil.loadTrainingParameters(params.getParams(), false);
 if (mlParams == null) {
   mlParams = ModelUtil.createDefaultTrainingParameters();
 }]]>
                </programlisting>
					Then we read the training data:
					<programlisting language="java">
                    <![CDATA[
InputStreamFactory inputStreamFactory = null;
try {
  inputStreamFactory = new MarkableFileInputStreamFactory(
      new File("en-custom-lemmatizer.train"));
} catch (FileNotFoundException e) {
  e.printStackTrace();
}
ObjectStream<String> lineStream = null;
LemmaSampleStream lemmaStream = null;
try {
  lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
  lemmaStream = new LemmaSampleStream(lineStream);
} catch (IOException e) {
  CmdLineUtil.handleCreateObjectStreamError(e);
}
]]>
                </programlisting>
					The following step proceeds to train the model:
					<programlisting language="java">
                    <![CDATA[
LemmatizerModel model;
try {
  LemmatizerFactory lemmatizerFactory = LemmatizerFactory
      .create(params.getFactory());
  model = LemmatizerME.train(params.getLang(), lemmaStream, mlParams,
      lemmatizerFactory);
} catch (IOException e) {
  throw new TerminateToolException(-1,
      "IO error while reading training data or indexing data: "
          + e.getMessage(),
      e);
} finally {
  try {
    lemmaStream.close();
  } catch (IOException e) {
    // ignored: a failure to close the stream does not affect the trained model
  }
}
]]>
                </programlisting>
				</para>
			</section>
			</section>
			<section id="tools.lemmatizer.evaluation">
				<title>Lemmatizer Evaluation</title>
				<para>
					The built-in evaluation measures the accuracy of the statistical
					lemmatizer on a test data set.
				</para>
				<para>
					There is a command line tool to evaluate a given model on a test
					data set.
					The following command shows how the tool can be run:
					<screen>
				<![CDATA[
$ opennlp LemmatizerEvaluator -model en-custom-lemmatizer.bin -data en-custom-lemmatizer.test -encoding utf-8]]>
			 </screen>
					This will display the resulting accuracy score, e.g.:
					<screen>
				<![CDATA[
Loading model ... done
Evaluating ... done

Accuracy: 0.9659110277825124]]>
			 </screen>
				</para>
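				<para>
					To make the reported number easier to interpret, the following minimal
					sketch computes accuracy as the share of tokens whose predicted lemma
					equals the reference lemma. This reading of the score is an assumption
					and the class LemmaAccuracySketch is a hypothetical helper, not the
					evaluator that ships with OpenNLP.
					<programlisting language="java">
					<![CDATA[
final class LemmaAccuracySketch {

  /** Fraction of positions at which the predicted lemma equals the reference lemma. */
  static double accuracy(String[] reference, String[] predicted) {
    if (reference.length == 0 || reference.length != predicted.length) {
      throw new IllegalArgumentException("Arrays must be non-empty and of equal length");
    }
    int correct = 0;
    for (int i = 0; i < reference.length; i++) {
      if (reference[i].equals(predicted[i])) {
        correct++;
      }
    }
    return (double) correct / reference.length;
  }

  public static void main(String[] args) {
    String[] reference = { "say", "sign", "it", "unit" };
    String[] predicted = { "say", "signe", "it", "unit" };
    System.out.println("Accuracy: " + accuracy(reference, predicted));
  }
}
]]>
				</programlisting>
					Under this reading, a score of about 0.966 would mean that roughly 34
					of every 1000 test tokens received a lemma that differs from the
					reference.
				</para>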
		</section>
</chapter>