<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
]>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for additional
information regarding copyright ownership. The ASF licenses this file to
you under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of
the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
OF ANY KIND, either express or implied. See the License for the specific
language governing permissions and limitations under the License. -->
<chapter id="tools.lemmatizer">
<title>Lemmatizer</title>
<para>
The lemmatizer returns, for a given word form (token) and its part-of-speech
tag, the dictionary form of the word, which is usually referred to as its
lemma. A token can be ambiguous between several base forms or dictionary
words, which is why the postag of the word is required to find the lemma.
For example, the form "show" may refer to either the verb "to show" or
the noun "show".
Currently, OpenNLP implements statistical and dictionary-based lemmatizers.
</para>
<section id="tools.lemmatizer.tagging.cmdline">
<title>Lemmatizer Tool</title>
<para>
The easiest way to try out the Lemmatizer is the command line tool,
which provides access to the statistical
lemmatizer. Note that the tool is only intended for demonstration and testing.
</para>
<para>
Once you have trained a lemmatizer model (see below for instructions),
you can start the Lemmatizer Tool with this command:
</para>
<para>
<screen>
<![CDATA[
$ opennlp LemmatizerME opennlp-en-ud-ewt-lemmas-1.3-2.5.4.bin < sentences]]>
</screen>
The Lemmatizer now reads one POS-tagged sentence per line from
standard input, with each token written as word_TAG. For example, you can
copy this sentence to the console (it is wrapped here for readability):
<screen>
<![CDATA[
Rockwell_PROPN International_ADJ Corp_NOUN 's_PUNCT Tulsa_PROPN unit_NOUN said_VERB it_PRON
signed_VERB a_DET tentative_NOUN agreement_NOUN extending_VERB its_PRON contract_NOUN
with_ADP Boeing_PROPN Co._NOUN to_PART provide_VERB structural_ADJ parts_NOUN for_ADP
Boeing_PROPN 's_PUNCT 747_NUM jetliners_NOUN ._PUNCT]]>
</screen>
The Lemmatizer will now echo the lemma for each word and postag pair to
the console:
<screen>
<![CDATA[
Rockwell PROPN rockwell
International ADJ international
Corp NOUN corp
's PUNCT 's
Tulsa PROPN tulsa
unit NOUN unit
said VERB say
it PRON it
signed VERB sign
...
]]>
</screen>
</para>
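<para>
The word_TAG input format shown above maps directly onto the two parallel
arrays used by the API. The following is a minimal sketch in plain Java of how
such a line could be split. It is not the tool's own implementation: the class
name WordTagLine and the decision to split at the last underscore are
assumptions made for illustration.
<programlisting language="java">
<![CDATA[
import java.util.Arrays;

/** Hypothetical helper: splits a line of word_TAG pairs into tokens and tags. */
final class WordTagLine {

  final String[] tokens;
  final String[] tags;

  private WordTagLine(String[] tokens, String[] tags) {
    this.tokens = tokens;
    this.tags = tags;
  }

  /** Parses one whitespace-separated line such as "said_VERB it_PRON". */
  static WordTagLine parse(String line) {
    String[] pairs = line.trim().split("\\s+");
    String[] tokens = new String[pairs.length];
    String[] tags = new String[pairs.length];
    for (int i = 0; i < pairs.length; i++) {
      // Assumption: split at the last underscore so that tokens may contain '_'.
      int split = pairs[i].lastIndexOf('_');
      if (split <= 0 || split == pairs[i].length() - 1) {
        throw new IllegalArgumentException("Not a word_TAG pair: " + pairs[i]);
      }
      tokens[i] = pairs[i].substring(0, split);
      tags[i] = pairs[i].substring(split + 1);
    }
    return new WordTagLine(tokens, tags);
  }

  public static void main(String[] args) {
    WordTagLine line = parse("Tulsa_PROPN unit_NOUN said_VERB it_PRON");
    System.out.println(Arrays.toString(line.tokens)); // prints [Tulsa, unit, said, it]
    System.out.println(Arrays.toString(line.tags));   // prints [PROPN, NOUN, VERB, PRON]
  }
}
]]>
</programlisting>
The resulting tokens and tags arrays have the same length and order, which is
the shape the API described in the next section expects.
</para>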
</section>
<section id="tools.lemmatizer.tagging.api">
<title>Lemmatizer API</title>
<para>
The Lemmatizer can be embedded into an application via its API.
Currently, a statistical lemmatizer (LemmatizerME) and a dictionary-based
lemmatizer (DictionaryLemmatizer) are available. Note that these two methods are
complementary: the DictionaryLemmatizer can also be used to post-process
the output of the statistical lemmatizer.
</para>
<para>
The statistical lemmatizer requires that a trained model is loaded
into memory from disk or from another source.
In the example below it is loaded from disk:
<programlisting language="java">
<![CDATA[
LemmatizerModel model = null;
try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-lemmas-1.3-2.5.4.bin")) {
  model = new LemmatizerModel(modelIn);
}
]]>
</programlisting>
After the model is loaded a LemmatizerME can be instantiated.
<programlisting language="java">
<![CDATA[
LemmatizerME lemmatizer = new LemmatizerME(model);]]>
</programlisting>
The Lemmatizer instance is now ready to lemmatize data. It expects two
String arrays of equal length as input: a tokenized sentence, in which
each String is one token, and the POS tags associated with those tokens.
</para>
<para>
The following code shows how to determine the most likely lemma for
each token of a sentence.
<programlisting language="java">
<![CDATA[
String[] tokens = new String[] { "Rockwell", "International", "Corp.", "'s",
    "Tulsa", "unit", "said", "it", "signed", "a", "tentative", "agreement",
    "extending", "its", "contract", "with", "Boeing", "Co.", "to",
    "provide", "structural", "parts", "for", "Boeing", "'s", "747",
    "jetliners", "." };
String[] postags = new String[] { "PROPN", "ADJ", "NOUN", "PUNCT", "PROPN", "NOUN",
    "VERB", "PRON", "VERB", "DET", "NOUN", "NOUN", "VERB", "PRON", "NOUN", "ADP",
    "PROPN", "NOUN", "PART", "VERB", "ADJ", "NOUN", "ADP", "PROPN", "PUNCT", "NUM", "NOUN",
    "PUNCT" };
String[] lemmas = lemmatizer.lemmatize(tokens, postags);]]>
</programlisting>
The lemmas array contains one lemma for each token in the
input array. The corresponding
tag and lemma can be found at the same index as the token has in the
input array.
</para>
<para>
The DictionaryLemmatizer is constructed
by passing the InputStream of a lemmatizer dictionary. Such a dictionary
is a text file in which each row contains a word, its postag and the
corresponding lemma, with the columns separated by a tab character.
<screen>
<![CDATA[
show NOUN show
showcase NOUN showcase
showcases NOUN showcase
showdown NOUN showdown
showdowns NOUN showdown
shower NOUN shower
showers NOUN shower
showman NOUN showman
showmanship NOUN showmanship
showmen NOUN showman
showroom NOUN showroom
showrooms NOUN showroom
shows NOUN show
shrapnel NOUN shrapnel
]]>
</screen>
Alternatively, if a (word, postag) pair can have multiple lemmas, each
row of the dictionary contains the word, its postag and the corresponding
lemmas separated by "#":
<screen>
<![CDATA[
muestras NOUN muestra
cantaba VERB cantar
fue VERB ir#ser
entramos VERB entrar
]]>
</screen>
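As an illustration of this file format, the following plain Java sketch reads
such rows into a map keyed by word and postag and splits multiple lemmas at "#".
It is not the DictionaryLemmatizer implementation: the class
LemmaDictionarySketch, its method names and the tab-joined lookup key are
assumptions made for this example.
<programlisting language="java">
<![CDATA[
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of reading a tab-separated lemma dictionary. */
final class LemmaDictionarySketch {

  private final Map<String, List<String>> entries = new HashMap<>();

  /** Builds the dictionary from rows of the form: word TAB postag TAB lemma[#lemma...]. */
  static LemmaDictionarySketch fromRows(List<String> rows) {
    LemmaDictionarySketch dict = new LemmaDictionarySketch();
    for (String row : rows) {
      if (row.trim().isEmpty()) {
        continue; // skip empty rows
      }
      String[] columns = row.split("\t");
      if (columns.length != 3) {
        throw new IllegalArgumentException("Expected three columns: " + row);
      }
      // Assumption: word and postag joined by a tab form the lookup key.
      dict.entries.put(columns[0] + "\t" + columns[1],
          Arrays.asList(columns[2].split("#")));
    }
    return dict;
  }

  /** Returns all lemmas for the pair, or an empty list if the pair is unknown. */
  List<String> lookup(String word, String postag) {
    return entries.getOrDefault(word + "\t" + postag, Collections.emptyList());
  }

  public static void main(String[] args) {
    LemmaDictionarySketch dict = fromRows(Arrays.asList(
        "fue\tVERB\tir#ser",
        "entramos\tVERB\tentrar"));
    System.out.println(dict.lookup("fue", "VERB")); // prints [ir, ser]
  }
}
]]>
</programlisting>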
First the dictionary must be loaded into memory from disk or another
source.
In the sample below it is loaded from disk.
<programlisting language="java">
<![CDATA[
// Keep the stream open until the DictionaryLemmatizer has been constructed, then close it.
InputStream dictLemmatizer = new FileInputStream("english-dict-lemmatizer.txt");
]]>
</programlisting>
After the dictionary is loaded the DictionaryLemmatizer can be
instantiated.
<programlisting language="java">
<![CDATA[
DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictLemmatizer);]]>
</programlisting>
The DictionaryLemmatizer instance is now ready. It expects two
String arrays as input,
one containing the tokens and another one their respective postags.
</para>
<para>
The following code shows how to find a lemma using a
DictionaryLemmatizer.
<programlisting language="java">
<![CDATA[
String[] tokens = new String[]{"Most", "large", "cities", "in", "the", "US", "had",
    "morning", "and", "afternoon", "newspapers", "."};
// "tagger" is assumed to be a POS tagger created beforehand, e.g. a POSTaggerME.
String[] postags = tagger.tag(tokens);
String[] lemmas = lemmatizer.lemmatize(tokens, postags);
]]>
</programlisting>
The lemmas array contains one lemma for each token in the
input array. The corresponding
tag and lemma can be found at the same index as the token has in the
input array.
</para>
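<para>
The post-processing use of a dictionary mentioned at the start of this section
can be sketched as follows. This is only one possible way to combine the two
outputs, written in plain Java with a Map standing in for the dictionary. The
class name LemmaPostProcessor, the method mergeLemmas and the rule "prefer the
dictionary entry when one exists" are assumptions, not behaviour defined by
OpenNLP.
<programlisting language="java">
<![CDATA[
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: overrides statistical lemmas with dictionary entries. */
final class LemmaPostProcessor {

  /**
   * Returns a copy of statisticalLemmas in which every position that has a
   * dictionary entry for its (token, postag) pair is replaced by that entry.
   */
  static String[] mergeLemmas(String[] tokens, String[] postags,
      String[] statisticalLemmas, Map<String, String> dictionary) {
    if (tokens.length != postags.length || tokens.length != statisticalLemmas.length) {
      throw new IllegalArgumentException("Arrays must have the same length");
    }
    String[] merged = statisticalLemmas.clone();
    for (int i = 0; i < tokens.length; i++) {
      // Assumption: dictionary keys are the word, a tab, then the postag.
      String fromDictionary = dictionary.get(tokens[i] + "\t" + postags[i]);
      if (fromDictionary != null) {
        merged[i] = fromDictionary;
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> dictionary = new HashMap<>();
    dictionary.put("cities\tNOUN", "city");
    String[] tokens = {"large", "cities"};
    String[] postags = {"ADJ", "NOUN"};
    // Made-up statistical output containing one error, for illustration only.
    String[] statistical = {"large", "citie"};
    System.out.println(Arrays.toString(
        mergeLemmas(tokens, postags, statistical, dictionary))); // prints [large, city]
  }
}
]]>
</programlisting>
With the real API, the statistical array would come from LemmatizerME and the
lookup from a DictionaryLemmatizer. Whether the dictionary should always win
depends on the quality of the dictionary at hand.
</para>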
</section>
<section id="tools.lemmatizer.training">
<title>Lemmatizer Training</title>
<para>
The training data consists of three columns separated by tabs. Each
word is on a separate line and there is an empty line after each sentence.
The first column contains the word, the second its part-of-speech tag and
the third its lemma.
</para>
<para>
Sample sentence of the training data:
<screen>
<![CDATA[
He PRON he
reckons VERB reckon
the DET the
current ADJ current
accounts NOUN account
deficit NOUN deficit
will AUX will
narrow VERB narrow
to PART to
only ADV only
# # #
1.8 NUM 1.8
millions NOUN million
in ADP in
September PROPN september
. PUNCT O]]>
</screen>
The Universal Dependencies Treebank and the CoNLL 2009 datasets
distribute training data for many languages.
</para>
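<para>
To make the layout of the training data concrete, the following plain Java
sketch groups tab-separated rows into sentences, treating an empty line as the
sentence boundary. It is an illustration of the file format only and not how
OpenNLP reads its samples; the class TrainingDataSketch and its nested Row type
are hypothetical names.
<programlisting language="java">
<![CDATA[
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of reading the three-column training format. */
final class TrainingDataSketch {

  /** One row of the training data: word, postag and lemma. */
  static final class Row {
    final String word;
    final String postag;
    final String lemma;

    Row(String word, String postag, String lemma) {
      this.word = word;
      this.postag = postag;
      this.lemma = lemma;
    }
  }

  /** Groups rows into sentences; an empty line ends the current sentence. */
  static List<List<Row>> readSentences(List<String> lines) {
    List<List<Row>> sentences = new ArrayList<>();
    List<Row> current = new ArrayList<>();
    for (String line : lines) {
      if (line.trim().isEmpty()) {
        if (!current.isEmpty()) {
          sentences.add(current);
          current = new ArrayList<>();
        }
        continue;
      }
      String[] columns = line.split("\t");
      if (columns.length != 3) {
        throw new IllegalArgumentException("Expected three columns: " + line);
      }
      current.add(new Row(columns[0], columns[1], columns[2]));
    }
    if (!current.isEmpty()) {
      sentences.add(current); // the last sentence may lack a trailing empty line
    }
    return sentences;
  }

  public static void main(String[] args) {
    List<List<Row>> sentences = readSentences(Arrays.asList(
        "He\tPRON\the", "reckons\tVERB\treckon", "", "accounts\tNOUN\taccount"));
    System.out.println(sentences.size()); // prints 2
  }
}
]]>
</programlisting>
</para>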
<section id="tools.lemmatizer.training.tool">
<title>Training Tool</title>
<para>
OpenNLP has a command line tool which is used to train the models on
various corpora.
</para>
<para>
Usage of the tool:
<screen>
<![CDATA[
$ opennlp LemmatizerTrainerME
Usage: opennlp LemmatizerTrainerME [-factory factoryName] [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of LemmatizerFactory where to get implementation and resources.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
It is now assumed that the English lemmatizer model should be trained
from a file called
'en-custom-lemmatizer.train' which is encoded as UTF-8. The following command will train the
lemmatizer and write the model to en-custom-lemmatizer.bin:
<screen>
<![CDATA[
$ opennlp LemmatizerTrainerME -model en-custom-lemmatizer.bin -params PerceptronTrainerParams.txt -lang en -data en-custom-lemmatizer.train -encoding UTF-8]]>
</screen>
</para>
</section>
<section id="tools.lemmatizer.training.api">
<title>Training API</title>
<para>
The Lemmatizer offers an API to train a new lemmatizer model. First,
the training parameters need to be loaded; if none are given, the
defaults are used:
<programlisting language="java">
<![CDATA[
TrainingParameters mlParams = CmdLineUtil.loadTrainingParameters(params.getParams(), false);
if (mlParams == null) {
  mlParams = ModelUtil.createDefaultTrainingParameters();
}]]>
</programlisting>
Then we read the training data:
<programlisting language="java">
<![CDATA[
InputStreamFactory inputStreamFactory = null;
try {
  inputStreamFactory = new MarkableFileInputStreamFactory(
      new File("en-custom-lemmatizer.train"));
} catch (FileNotFoundException e) {
  e.printStackTrace();
}
ObjectStream<String> lineStream = null;
LemmaSampleStream lemmaStream = null;
try {
  lineStream = new PlainTextByLineStream(
      inputStreamFactory, StandardCharsets.UTF_8);
  lemmaStream = new LemmaSampleStream(lineStream);
} catch (IOException e) {
  CmdLineUtil.handleCreateObjectStreamError(e);
}
]]>
</programlisting>
The following step proceeds to train the model:
<programlisting language="java">
<![CDATA[
LemmatizerModel model;
try {
  LemmatizerFactory lemmatizerFactory = LemmatizerFactory
      .create(params.getFactory());
  model = LemmatizerME.train(params.getLang(), lemmaStream, mlParams,
      lemmatizerFactory);
} catch (IOException e) {
  throw new TerminateToolException(-1,
      "IO error while reading training data or indexing data: "
      + e.getMessage(),
      e);
} finally {
  try {
    // Close the sample stream created in the previous step.
    lemmaStream.close();
  } catch (IOException e) {
    // Errors while closing the stream are ignored.
  }
}
]]>
</programlisting>
</para>
</section>
</section>
<section id="tools.lemmatizer.evaluation">
<title>Lemmatizer Evaluation</title>
<para>
The built-in evaluation can measure the accuracy of the statistical
lemmatizer on a test data set.
</para>
<para>
There is a command line tool to evaluate a given model on a test
data set.
The following command shows how the tool can be run:
<screen>
<![CDATA[
$ opennlp LemmatizerEvaluator -model en-custom-lemmatizer.bin -data en-custom-lemmatizer.test -encoding utf-8]]>
</screen>
This will display the resulting accuracy score, e.g.:
<screen>
<![CDATA[
Loading model ... done
Evaluating ... done
Accuracy: 0.9659110277825124]]>
</screen>
</para>
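<para>
The reported score can be read as the share of tokens for which the predicted
lemma equals the reference lemma. A minimal plain Java sketch of that
calculation follows. It is an assumption about how such a word-level accuracy
is computed rather than the evaluator's own code, and the class and method
names are made up for this example.
<programlisting language="java">
<![CDATA[
/** Hypothetical sketch of a word-level lemma accuracy. */
final class LemmaAccuracy {

  /** Returns the fraction of positions at which predicted and reference agree. */
  static double accuracy(String[] predicted, String[] reference) {
    if (predicted.length != reference.length) {
      throw new IllegalArgumentException("Arrays must have the same length");
    }
    if (predicted.length == 0) {
      throw new IllegalArgumentException("At least one token is required");
    }
    int correct = 0;
    for (int i = 0; i < predicted.length; i++) {
      if (predicted[i].equals(reference[i])) {
        correct++;
      }
    }
    return (double) correct / predicted.length;
  }

  public static void main(String[] args) {
    String[] predicted = {"say", "it", "signed", "a"};
    String[] reference = {"say", "it", "sign", "a"};
    System.out.println(accuracy(predicted, reference)); // prints 0.75
  }
}
]]>
</programlisting>
</para>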
</section>
</chapter>