<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
]>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for additional
information regarding copyright ownership. The ASF licenses this file to
you under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of
the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
OF ANY KIND, either express or implied. See the License for the specific
language governing permissions and limitations under the License. -->
<chapter id="tools.tokenizer">
<title>Tokenizer</title>
<section id="tools.tokenizer.introduction">
<title>Tokenization</title>
<para>
The OpenNLP Tokenizers segment an input character sequence into
tokens. Tokens are usually
words, punctuation, numbers, etc.
<screen>
<![CDATA[
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields
PLC, was named a nonexecutive director of this British industrial conglomerate.
]]>
</screen>
The following result shows the individual tokens in a whitespace-separated
representation.
<screen>
<![CDATA[
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC ,
was named a nonexecutive director of this British industrial conglomerate .
]]>
</screen>
OpenNLP offers multiple tokenizer implementations:
<itemizedlist>
<listitem>
<para>Whitespace Tokenizer - A whitespace tokenizer: maximal sequences of
non-whitespace characters are identified as tokens</para>
</listitem>
<listitem>
<para>Simple Tokenizer - A character class tokenizer: sequences of
characters of the same class are identified as tokens</para>
</listitem>
<listitem>
<para>Learnable Tokenizer - A maximum entropy tokenizer that detects
token boundaries based on a probability model</para>
</listitem>
</itemizedlist>
Most part-of-speech taggers, parsers, and so on work with text
tokenized in this manner. It is important to ensure that your
tokenizer produces tokens of the type expected by your downstream text
processing components.
</para>
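<para>
To make the distinction between the first two strategies concrete, the
following plain-Java sketch mimics them (illustrative code only, not the
OpenNLP implementation): the whitespace strategy keeps maximal runs of
non-whitespace characters, while the character-class strategy additionally
splits wherever the character class (letter, digit, other) changes.
<programlisting language="java">
<![CDATA[
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {

    // Whitespace strategy: maximal runs of non-whitespace characters become tokens.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Character-class strategy: a token ends where the character class changes.
    static List<String> simpleTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int prevClass = -1;
        for (char c : text.toCharArray()) {
            int cls = Character.isWhitespace(c) ? 0
                    : Character.isLetter(c) ? 1
                    : Character.isDigit(c) ? 2
                    : 3; // punctuation and everything else
            if (cls != prevClass && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != 0) current.append(c);
            prevClass = cls;
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        String s = "Mr. Vinken, 61";
        System.out.println(whitespaceTokenize(s)); // [Mr., Vinken,, 61]
        System.out.println(simpleTokenize(s));     // [Mr, ., Vinken, ,, 61]
    }
}
]]>
</programlisting>
The real OpenNLP tokenizers handle more cases, but the contrast matches the
descriptions above: the whitespace strategy keeps "Mr." as one token, while
the character-class strategy separates the period and the comma.
</para>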
<para>
With OpenNLP (as with many systems), tokenization is a two-stage
process:
first, sentence boundaries are identified, then tokens within
each sentence are identified.
</para>
<section id="tools.tokenizer.cmdline">
<title>Tokenizer Tools</title>
<para>The easiest way to try out the tokenizers are the command line
tools. The tools are only intended for demonstration and testing.
</para>
<para>There are two tools, one for the Simple Tokenizer and one for
the learnable tokenizer. A command line tool for the Whitespace
Tokenizer does not exist, because the whitespace-separated output
would be identical to the input.</para>
<para>
The following command shows how to use the Simple Tokenizer Tool.
<screen>
<![CDATA[
$ opennlp SimpleTokenizer]]>
</screen>
To use the learnable tokenizer, download the English token model from
our website.
<screen>
<![CDATA[
$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.3-2.5.4.bin]]>
</screen>
To test the tokenizer, copy the sample from above to the console. The
whitespace-separated tokens will be written back to the
console.
</para>
<para>
Usually the input is read from a file and written to a file.
<screen>
<![CDATA[
$ opennlp TokenizerME opennlp-en-ud-ewt-tokens-1.3-2.5.4.bin < article.txt > article-tokenized.txt]]>
</screen>
It can be done in the same way for the Simple Tokenizer.
</para>
<para>
Since most text comes truly raw, without sentence boundaries and the
like, it is possible to create a pipe which first performs sentence
boundary detection and then tokenization. The following sample illustrates
that.
<screen>
<![CDATA[
$ opennlp SentenceDetector sentdetect.model < article.txt | opennlp TokenizerME tokenize.model | more
Loading model ... Loading model ... done
done
Showa Shell gained 20 to 1,570 and Mitsubishi Oil rose 50 to 1,500.
Sumitomo Metal Mining fell five yen to 692 and Nippon Mining added 15 to 960 .
Among other winners Wednesday was Nippon Shokubai , which was up 80 at 2,410 .
Marubeni advanced 11 to 890 .
London share prices were bolstered largely by continued gains on Wall Street and technical
factors affecting demand for London 's blue-chip stocks .
...etc...]]>
</screen>
Of course this is all on the command line. Many people use the models
directly in their Java code by creating SentenceDetector and
Tokenizer objects and calling their methods as appropriate. The
following section explains how the Tokenizers can be used
directly from Java.
</para>
</section>
<section id="tools.tokenizer.api">
<title>Tokenizer API</title>
<para>
The Tokenizers can be integrated into an application via the defined
API.
The shared instance of the WhitespaceTokenizer can be retrieved from the
static field WhitespaceTokenizer.INSTANCE; the shared instance of the
SimpleTokenizer can be retrieved in the same way from
SimpleTokenizer.INSTANCE.
To instantiate the TokenizerME (the learnable tokenizer), a TokenizerModel
must be created first. The following code sample shows how a model
can be loaded.
<programlisting language="java">
<![CDATA[
try (InputStream modelIn = new FileInputStream("opennlp-en-ud-ewt-tokens-1.3-2.5.4.bin")) {
  TokenizerModel model = new TokenizerModel(modelIn);
  // use the model here, while the stream is still open
}]]>
</programlisting>
After the model is loaded the TokenizerME can be instantiated.
<programlisting language="java">
<![CDATA[
Tokenizer tokenizer = new TokenizerME(model);]]>
</programlisting>
The tokenizer offers two tokenize methods, both expecting an input
String object which contains the untokenized text. If possible it
should be a sentence, but depending on the training of the learnable
tokenizer this is not required. The first method returns an array of
Strings, where each String is one token.
<programlisting language="java">
<![CDATA[
String[] tokens = tokenizer.tokenize("An input sample sentence.");]]>
</programlisting>
The output will be an array with these tokens.
<programlisting>
<![CDATA[
"An", "input", "sample", "sentence", "."]]>
</programlisting>
The second method, tokenizePos, returns an array of Spans; each Span
contains the start and end character offsets of the token in the input
String.
<programlisting language="java">
<![CDATA[
Span[] tokenSpans = tokenizer.tokenizePos("An input sample sentence.");]]>
</programlisting>
The tokenSpans array now contains 5 elements. To get the text for one
span, call Span.getCoveredText, which takes the input text and returns
the characters covered by the span.
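The offset semantics can be sketched with a simplified stand-in for the
Span class (plain Java, not the actual opennlp.tools.util.Span; the
offsets below are the ones a tokenizer would produce for this input):
<programlisting language="java">
<![CDATA[
public class SpanSketch {

    // Simplified stand-in for a token span: start is inclusive, end is
    // exclusive, both are character offsets into the input text.
    record Span(int start, int end) {
        CharSequence getCoveredText(CharSequence text) {
            return text.subSequence(start, end);
        }
    }

    public static void main(String[] args) {
        String input = "An input sample sentence.";
        Span[] spans = {
            new Span(0, 2),   // "An"
            new Span(3, 8),   // "input"
            new Span(9, 15),  // "sample"
            new Span(16, 24), // "sentence"
            new Span(24, 25)  // "."
        };
        for (Span s : spans) {
            System.out.println(s.getCoveredText(input));
        }
    }
}
]]>
</programlisting>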
The TokenizerME is able to output the probabilities for the detected
tokens. The getTokenProbabilities method must be called directly
after one of the tokenize methods was called.
<programlisting language="java">
<![CDATA[
TokenizerME tokenizer = ...
String[] tokens = tokenizer.tokenize(...);
double[] tokenProbs = tokenizer.getTokenProbabilities();]]>
</programlisting>
The tokenProbs array now contains one double value per token; the
value is between 0 and 1, where 1 is the highest and 0 the lowest
possible probability.
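One possible use of these values is to filter out low-confidence tokens.
The following sketch (plain Java, with made-up probability values) keeps
only the tokens whose probability meets a threshold:
<programlisting language="java">
<![CDATA[
import java.util.ArrayList;
import java.util.List;

public class TokenConfidenceSketch {

    // The two arrays are parallel, as returned by tokenize() and
    // getTokenProbabilities(); keep only tokens at or above the threshold.
    static List<String> confidentTokens(String[] tokens, double[] probs, double threshold) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (probs[i] >= threshold) {
                kept.add(tokens[i]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] tokens = {"An", "input", "sample", "sentence", "."};
        double[] probs = {0.99, 0.97, 0.45, 0.98, 0.96}; // illustrative values
        System.out.println(confidentTokens(tokens, probs, 0.9)); // [An, input, sentence, .]
    }
}
]]>
</programlisting>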
</para>
</section>
</section>
<section id="tools.tokenizer.training">
<title>Tokenizer Training</title>
<section id="tools.tokenizer.training.tool">
<title>Training Tool</title>
<para>
OpenNLP has a command line tool which is used to train the models
available from the model download page on various corpora. The data
can be converted to the OpenNLP Tokenizer training format or used directly.
The OpenNLP format contains one sentence per line. Tokens are either separated by
whitespace or by a special &lt;SPLIT&gt; tag. Tokens are split automatically on
whitespace, and at least one &lt;SPLIT&gt; tag must be present in the training text.
The following sample shows the sample from above in the correct format.
<screen>
<![CDATA[
Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.
Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>.
Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated Gold Fields PLC<SPLIT>,
was named a nonexecutive director of this British industrial conglomerate<SPLIT>.]]>
</screen>
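The mapping from a line in this format to its token sequence can be
sketched as follows (plain Java; an illustration of the format, not the
actual TokenSampleStream implementation):
<programlisting language="java">
<![CDATA[
import java.util.ArrayList;
import java.util.List;

public class TokenSampleSketch {

    // One training line: tokens are separated either by whitespace or by
    // the special <SPLIT> tag, which marks a boundary without a space.
    static List<String> parseTrainingLine(String line) {
        List<String> tokens = new ArrayList<>();
        for (String part : line.trim().split("\\s+")) {
            for (String tok : part.split("<SPLIT>")) {
                if (!tok.isEmpty()) {
                    tokens.add(tok);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "Mr. Vinken is chairman of Elsevier N.V.<SPLIT>,";
        System.out.println(parseTrainingLine(line));
        // [Mr., Vinken, is, chairman, of, Elsevier, N.V., ,]
    }
}
]]>
</programlisting>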
Usage of the tool:
<screen>
<![CDATA[
$ opennlp TokenizerTrainer
Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] \
[-alphaNumOpt isAlphaNumOpt] [-params paramsFile] [-iterations num] \
[-cutoff num] -model modelFile -lang language -data sampleData \
[-encoding charsetName]
Arguments description:
-abbDict path
abbreviation dictionary in XML format.
-alphaNumOpt isAlphaNumOpt
Optimization flag to skip alpha numeric tokens for further tokenization
-params paramsFile
training parameters file.
-iterations num
number of training iterations, ignored if -params is used.
-cutoff num
minimal number of times a feature must be seen, ignored if -params is used.
-model modelFile
output model file.
-lang language
language which is being processed.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.]]>
</screen>
To train the English tokenizer use the following command:
<screen>
<![CDATA[
$ opennlp TokenizerTrainer -model en-custom-token.bin -alphaNumOpt true -lang en -data en-custom-token.train -encoding UTF-8
Indexing events with TwoPass using cutoff of 5
Computing event counts... done. 45 events
Indexing... done.
Sorting and merging events... done. Reduced 45 events to 25.
Done indexing in 0,09 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 25
Number of Outcomes: 2
Number of Predicates: 18
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-31.191623125197527 0.8222222222222222
2: ... loglikelihood=-21.036561339080343 0.8666666666666667
3: ... loglikelihood=-16.397882721809086 0.9333333333333333
4: ... loglikelihood=-13.624159882595462 0.9333333333333333
5: ... loglikelihood=-11.762067054883842 0.9777777777777777
...<skipping a bunch of iterations>...
95: ... loglikelihood=-2.0234942537226366 1.0
96: ... loglikelihood=-2.0107265117555935 1.0
97: ... loglikelihood=-1.998139365828305 1.0
98: ... loglikelihood=-1.9857283791639697 1.0
99: ... loglikelihood=-1.9734892753591327 1.0
100: ... loglikelihood=-1.9614179307958106 1.0
Writing tokenizer model ... done (0,044s)
Wrote tokenizer model to
Path: en-custom-token.bin]]>
</screen>
</para>
</section>
<section id="tools.tokenizer.training.api">
<title>Training API</title>
<para>
The Tokenizer offers an API to train a new tokenization model. Three steps
are necessary to train it:
<itemizedlist>
<listitem>
<para>The application must open a sample data stream</para>
</listitem>
<listitem>
<para>Call the TokenizerME.train method</para>
</listitem>
<listitem>
<para>Save the TokenizerModel to a file or directly use it</para>
</listitem>
</itemizedlist>
The following sample code illustrates these steps:
<programlisting language="java">
<![CDATA[
ObjectStream<String> lineStream = new PlainTextByLineStream(
    new MarkableFileInputStreamFactory(new File("en-custom-sent.train")), StandardCharsets.UTF_8);

TokenizerModel model;
try (ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream)) {
  model = TokenizerME.train(sampleStream,
      TokenizerFactory.create(null, "eng", null, true, null), TrainingParameters.defaultParams());
}

try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelFile))) {
  model.serialize(modelOut);
}]]>
</programlisting>
</para>
</section>
</section>
<section id="tools.tokenizer.detokenizing">
<title>Detokenizing</title>
<para>
Detokenizing is simply the opposite of tokenization: the original untokenized
string is reconstructed from a token sequence. The OpenNLP implementation was
created to undo the tokenization of training data for the tokenizer. It can also
be used to undo the tokenization produced by such a trained tokenizer. The
implementation is strictly rule based and defines how tokens should be attached
to a sentence-wise character sequence.
</para>
<para>
The rule dictionary assigns to every token an operation which describes how it
should be attached to one continuous character sequence.
</para>
<para>
The following rules can be assigned to a token:
<itemizedlist>
<listitem>
<para>MERGE_TO_LEFT - Merges the token to the left side.</para>
</listitem>
<listitem>
<para>MERGE_TO_RIGHT - Merges the token to the right side.</para>
</listitem>
<listitem>
<para>RIGHT_LEFT_MATCHING - Merges the token to the right side on first occurrence
and to the left side on second occurrence.</para>
</listitem>
<listitem>
<para>MERGE_BOTH - Merges the token to both sides, i.e. it is attached
without a space to the preceding and the following token.</para>
</listitem>
</itemizedlist>
The following sample illustrates the detokenizer with a small
rule dictionary (illustration format, not the XML data format):
<programlisting>
<![CDATA[
. MERGE_TO_LEFT
" RIGHT_LEFT_MATCHING]]>
</programlisting>
The dictionary should be used to detokenize the following whitespace-tokenized sentence:
<programlisting>
<![CDATA[
He said " This is a test " .]]>
</programlisting>
The tokens would get these tags based on the dictionary:
<programlisting>
<![CDATA[
He -> NO_OPERATION
said -> NO_OPERATION
" -> MERGE_TO_RIGHT
This -> NO_OPERATION
is -> NO_OPERATION
a -> NO_OPERATION
test -> NO_OPERATION
" -> MERGE_TO_LEFT
. -> MERGE_TO_LEFT]]>
</programlisting>
That will result in the following character sequence:
<programlisting>
<![CDATA[
He said "This is a test".]]>
</programlisting>
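The rule semantics can be reproduced with a short sketch (plain Java;
this illustrates the operations described above, not the actual
DictionaryDetokenizer implementation):
<programlisting language="java">
<![CDATA[
import java.util.HashMap;
import java.util.Map;

public class DetokenizerRuleSketch {

    // Applies the rules: MERGE_TO_LEFT suppresses the space before a token,
    // MERGE_TO_RIGHT suppresses the space after it, and RIGHT_LEFT_MATCHING
    // alternates between the two on successive occurrences of the token.
    static String detokenize(String[] tokens, Map<String, String> rules) {
        StringBuilder sb = new StringBuilder();
        Map<String, Boolean> seen = new HashMap<>(); // RIGHT_LEFT_MATCHING state
        boolean noSpaceBeforeNext = false;
        for (String tok : tokens) {
            String rule = rules.getOrDefault(tok, "NO_OPERATION");
            if (rule.equals("RIGHT_LEFT_MATCHING")) {
                boolean second = seen.getOrDefault(tok, false);
                rule = second ? "MERGE_TO_LEFT" : "MERGE_TO_RIGHT";
                seen.put(tok, !second);
            }
            if (sb.length() > 0 && !rule.equals("MERGE_TO_LEFT") && !noSpaceBeforeNext) {
                sb.append(' ');
            }
            sb.append(tok);
            noSpaceBeforeNext = rule.equals("MERGE_TO_RIGHT");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> rules = Map.of(".", "MERGE_TO_LEFT", "\"", "RIGHT_LEFT_MATCHING");
        String[] tokens = {"He", "said", "\"", "This", "is", "a", "test", "\"", "."};
        System.out.println(detokenize(tokens, rules)); // He said "This is a test".
    }
}
]]>
</programlisting>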
</para>
<section id="tools.tokenizer.detokenizing.api">
<title>Detokenizing API</title>
<para>
The Detokenizer can be used to detokenize tokens back into a String.
To instantiate the Detokenizer (a rule based detokenizer),
a DetokenizationDictionary (the rule dictionary) must be created first.
The following code sample shows how a rule dictionary can be loaded.
<programlisting language="java">
<![CDATA[
try (InputStream dictIn = new FileInputStream("latin-detokenizer.xml")) {
  DetokenizationDictionary dict = new DetokenizationDictionary(dictIn);
}]]>
</programlisting>
After the rule dictionary is loaded the DictionaryDetokenizer can be instantiated.
<programlisting language="java">
<![CDATA[
Detokenizer detokenizer = new DictionaryDetokenizer(dict);]]>
</programlisting>
The detokenizer offers two detokenize methods; the first detokenizes the input tokens into a String.
<programlisting language="java">
<![CDATA[
String[] tokens = new String[]{"A", "co", "-", "worker", "helped", "."};
String sentence = detokenizer.detokenize(tokens, null);
Assert.assertEquals("A co-worker helped.", sentence);]]>
</programlisting>
Tokens which are connected without a space in between can be separated by a split marker.
<programlisting language="java">
<![CDATA[
String sentence = detokenizer.detokenize(tokens, "<SPLIT>");
Assert.assertEquals("A co<SPLIT>-<SPLIT>worker helped<SPLIT>.", sentence);]]>
</programlisting>
The API also offers a method which simply returns the detokenization operations for the tokens in the input array.
<programlisting language="java">
<![CDATA[
DetokenizationOperation[] operations = detokenizer.detokenize(tokens);
for (DetokenizationOperation operation : operations) {
System.out.println(operation);
}]]>
</programlisting>
Output:
<programlisting>
<![CDATA[
NO_OPERATION
NO_OPERATION
MERGE_BOTH
NO_OPERATION
NO_OPERATION
MERGE_TO_LEFT]]>
</programlisting>
</para>
</section>
<section id="tools.tokenizer.detokenizing.dict">
<title>Detokenizer Dictionary</title>
<para>
The DetokenizationDictionary holds the detokenization rules. It is
constructed from two parallel arrays: tokens, the tokens that should be
detokenized, and operations, the operation to apply to the corresponding
token.
The following code sample shows how a rule dictionary can be created.
<programlisting language="java">
<![CDATA[
String[] tokens = new String[]{".", "!", "(", ")", "\"", "-"};
Operation[] operations = new Operation[]{
Operation.MERGE_TO_LEFT,
Operation.MERGE_TO_LEFT,
Operation.MERGE_TO_RIGHT,
Operation.MERGE_TO_LEFT,
Operation.RIGHT_LEFT_MATCHING,
Operation.MERGE_BOTH};
DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations);]]>
</programlisting>
</para>
</section>
</section>
</chapter>