Commit 3012e7f

OPENNLP-287 Extend POS Tagger documentation with more information about the tag dictionary
1 parent d7e097d commit 3012e7f

1 file changed: +107 -4 lines changed

opennlp-docs/src/docbkx/postagger.xml

Lines changed: 107 additions & 4 deletions
@@ -237,11 +237,114 @@ try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(model
237237
</para>
238238
<para>
239239
The dictionary is defined in an XML format and can be created and stored with the POSDictionary class.
240-
Please for now checkout the javadoc and source code of that class.
240+
Below is an example of training a custom model using a tag dictionary.
241241
</para>
242-
<para>Note: The format should be documented and sample code should show how to use the dictionary.
243-
Any contributions are very welcome. If you want to contribute please contact us on the mailing list
244-
or comment on the jira issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-287">OPENNLP-287</ulink>.
242+
<para>
243+
Sample POS training material (file: en-custom-pos.train)
244+
<screen>
245+
<![CDATA[
246+
It_PRON is_OTHER spring_PROPN season_NOUN ._PUNCT The_DET flowers_NOUN are_OTHER red_ADJ and_CCONJ yellow_ADJ ._PUNCT
247+
Red_NOUN is_OTHER my_DET favourite_ADJ colour_NOUN ._PUNCT]]>
248+
</screen>
249+
</para>
250+
<para>
251+
Sample tag dictionary (file: dictionary.xml)
252+
<programlisting language="xml">
253+
<![CDATA[
254+
<?xml version="1.0" encoding="UTF-8"?>
255+
<dictionary case_sensitive="false">
256+
<entry tags="PRON">
257+
<token>It</token>
258+
</entry>
259+
<entry tags="OTHER">
260+
<token>is</token>
261+
</entry>
262+
<entry tags="PROPN">
263+
<token>Spring</token>
264+
</entry>
265+
<entry tags="NOUN">
266+
<token>season</token>
267+
</entry>
268+
<entry tags="DET">
269+
<token>the</token>
270+
</entry>
271+
<entry tags="NOUN">
272+
<token>flowers</token>
273+
</entry>
274+
<entry tags="OTHER">
275+
<token>are</token>
276+
</entry>
277+
<entry tags="NOUN">
278+
<token>red</token>
279+
</entry>
280+
<entry tags="CCONJ">
281+
<token>and</token>
282+
</entry>
283+
<entry tags="NOUN">
284+
<token>yellow</token>
285+
</entry>
286+
<entry tags="PRON">
287+
<token>my</token>
288+
</entry>
289+
<entry tags="ADJ">
290+
<token>favourite</token>
291+
</entry>
292+
<entry tags="NOUN">
293+
<token>colour</token>
294+
</entry>
295+
<entry tags="PUNCT">
296+
<token>.</token>
297+
</entry>
298+
</dictionary>]]>
299+
</programlisting>
300+
</para>
301+
<para>Sample code to train a model using the above tag dictionary:
302+
<programlisting language="java">
303+
<![CDATA[
304+
POSModel model = null;
305+
try {
306+
ObjectStream<String> lineStream = new PlainTextByLineStream(
307+
new MarkableFileInputStreamFactory(new File("en-custom-pos.train")), StandardCharsets.UTF_8);
308+
309+
ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
310+
311+
TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
312+
params.put(TrainingParameters.CUTOFF_PARAM, 0);
313+
314+
POSTaggerFactory factory = new POSTaggerFactory();
315+
TagDictionary dict = factory.createTagDictionary(new File("dictionary.xml"));
316+
factory.setTagDictionary(dict);
317+
318+
model = POSTaggerME.train("eng", sampleStream, params, factory);
319+
320+
try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("en-custom-pos-maxent.bin"))) {
321+
model.serialize(modelOut);
322+
}
323+
} catch (IOException e) {
324+
e.printStackTrace();
325+
}]]>
326+
</programlisting>
327+
</para>
328+
<para>
329+
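The custom model is then used to tag a sequence. Here, tagger is a POSTaggerME created from the trained model, for example with new POSTaggerME(model).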
330+
<programlisting language="java">
331+
<![CDATA[
332+
String[] sent = new String[]{"Spring", "is", "my", "favourite", "season", "."};
333+
String[] tags = tagger.tag(sent);
334+
Arrays.stream(tags).forEach(k -> System.out.print(k + " "));]]>
335+
</programlisting>
336+
</para>
337+
<para>
338+
<literallayout>
339+
Input
340+
Sentence: Spring is my favourite season.
341+
342+
Output
343+
POS Tags using the custom model (en-custom-pos-maxent.bin): PROPN OTHER PRON ADJ NOUN PUNCT
344+
345+
Output with the default model
346+
POS Tags using the default model (opennlp-en-ud-ewt-pos-1.2-2.5.0.bin): NOUN AUX PRON ADJ NOUN PUNCT
347+
</literallayout>
245348
</para>
246349
</section>
247350
</section>
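
Reviewer note: in the samples above, the training data and the tag dictionary do not always agree. For example, the training file has my_DET while the dictionary lists my with the single tag PRON. Whether that is intended is not stated in the commit. The following is a minimal sketch of how such mismatches could be listed before training, using only the JDK. It is not part of OpenNLP: the class and method names are hypothetical, it assumes the tag is everything after the last underscore of each whitespace-separated item, that multiple tags in the tags attribute are separated by spaces, and it always compares tokens case-insensitively (matching case_sensitive="false" in the sample).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not an OpenNLP class.
public class TagDictionaryCheck {

    // Reads <entry tags="..."><token>...</token></entry> elements into a map
    // from lower-cased token to the set of tags listed for it.
    static Map<String, Set<String>> readDictionary(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, Set<String>> allowed = new HashMap<>();
        NodeList entries = doc.getElementsByTagName("entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            String token = entry.getElementsByTagName("token").item(0).getTextContent().trim();
            // Assumption: several tags are separated by whitespace.
            Set<String> tags = new HashSet<>(
                    Arrays.asList(entry.getAttribute("tags").trim().split("\\s+")));
            allowed.computeIfAbsent(token.toLowerCase(Locale.ROOT), k -> new HashSet<>())
                    .addAll(tags);
        }
        return allowed;
    }

    // Returns every token_tag item of a training line whose token is in the
    // dictionary but whose tag is not among the tags listed for that token.
    static List<String> findConflicts(String line, Map<String, Set<String>> allowed) {
        List<String> conflicts = new ArrayList<>();
        if (line.trim().isEmpty()) {
            return conflicts;
        }
        for (String item : line.trim().split("\\s+")) {
            // Assumption: the tag is everything after the last underscore.
            int split = item.lastIndexOf('_');
            if (split == -1) {
                throw new IllegalArgumentException("No '_' in item: " + item);
            }
            String token = item.substring(0, split);
            String tag = item.substring(split + 1);
            Set<String> tags = allowed.get(token.toLowerCase(Locale.ROOT));
            if (tags != null && !tags.contains(tag)) {
                conflicts.add(token + "_" + tag);
            }
        }
        return conflicts;
    }

    public static void main(String[] args) throws Exception {
        // A three-entry excerpt of the sample dictionary.
        String xml = "<dictionary case_sensitive=\"false\">"
                + "<entry tags=\"NOUN\"><token>red</token></entry>"
                + "<entry tags=\"PRON\"><token>my</token></entry>"
                + "<entry tags=\"ADJ\"><token>favourite</token></entry>"
                + "</dictionary>";
        String line = "Red_NOUN is_OTHER my_DET favourite_ADJ colour_NOUN ._PUNCT";
        System.out.println(findConflicts(line, readDictionary(xml))); // prints [my_DET]
    }
}
```

Tokens that are absent from the dictionary are skipped, since the dictionary only restricts the tags of tokens it contains.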
