@@ -237,11 +237,114 @@ try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(model
237237 </para >
238238 <para >
239239 The dictionary is defined in an XML format and can be created and stored with the POSDictionary class.
240- Please for now checkout the javadoc and source code of that class .
240+ The example below trains a custom model using a tag dictionary.
241241 </para >
242- <para >Note: The format should be documented and sample code should show how to use the dictionary.
243- Any contributions are very welcome. If you want to contribute please contact us on the mailing list
244- or comment on the jira issue <ulink url =" https://issues.apache.org/jira/browse/OPENNLP-287" >OPENNLP-287</ulink >.
242+ <para >
243+ Sample POS training material (file: en-custom-pos.train)
244+ <screen >
245+ <![CDATA[
246+ It_PRON is_OTHER spring_PROPN season_NOUN ._PUNCT The_DET flowers_NOUN are_OTHER red_ADJ and_CCONJ yellow_ADJ ._PUNCT
247+ Red_NOUN is_OTHER my_DET favourite_ADJ colour_NOUN ._PUNCT]]>
248+ </screen >
249+ </para >
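To make the training format concrete, here is a minimal sketch of how a line of word_TAG tokens can be split into words and tags. It uses only the standard library; the class `WordTagLineSketch` is hypothetical and not part of OpenNLP, and it assumes the tag is whatever follows the last underscore of each whitespace-separated token.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an OpenNLP class): splits word_TAG tokens. */
final class WordTagLineSketch {

    /** Returns one {word, tag} pair per whitespace-separated token. */
    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            // Assumption: the tag starts after the LAST underscore.
            int split = item.lastIndexOf('_');
            if (split <= 0 || split == item.length() - 1) {
                throw new IllegalArgumentException("Not in word_TAG form: " + item);
            }
            pairs.add(new String[] {item.substring(0, split), item.substring(split + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "Red_NOUN is_OTHER my_DET favourite_ADJ colour_NOUN ._PUNCT";
        for (String[] pair : parse(line)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Under this splitting rule, punctuation must be its own token: an item such as `season_NOUN.` would produce the tag `NOUN.` rather than `NOUN` followed by a separate `._PUNCT` token.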
250+ <para >
251+ Sample tag dictionary (file: dictionary.xml)
252+ <programlisting language =" xml" >
253+ <![CDATA[
254+ <?xml version="1.0" encoding="UTF-8"?>
255+ <dictionary case_sensitive="false">
256+ <entry tags="PRON">
257+ <token>It</token>
258+ </entry>
259+ <entry tags="OTHER">
260+ <token>is</token>
261+ </entry>
262+ <entry tags="PROPN">
263+ <token>Spring</token>
264+ </entry>
265+ <entry tags="NOUN">
266+ <token>season</token>
267+ </entry>
268+ <entry tags="DET">
269+ <token>the</token>
270+ </entry>
271+ <entry tags="NOUN">
272+ <token>flowers</token>
273+ </entry>
274+ <entry tags="OTHER">
275+ <token>are</token>
276+ </entry>
277+ <entry tags="NOUN">
278+ <token>red</token>
279+ </entry>
280+ <entry tags="CCONJ">
281+ <token>and</token>
282+ </entry>
283+ <entry tags="NOUN">
284+ <token>yellow</token>
285+ </entry>
286+ <entry tags="PRON">
287+ <token>my</token>
288+ </entry>
289+ <entry tags="ADJ">
290+ <token>favourite</token>
291+ </entry>
292+ <entry tags="NOUN">
293+ <token>colour</token>
294+ </entry>
295+ <entry tags="PUNCT">
296+ <token>.</token>
297+ </entry>
298+ </dictionary>]]>
299+ </programlisting >
300+ </para >
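As an illustration of what the dictionary file encodes, the sketch below reads the same XML layout with the JDK's DOM parser and answers lookups. It is not how OpenNLP loads the file (use POSDictionary or POSTaggerFactory for that); the class `TagDictionarySketch` is hypothetical, and it assumes that multiple tags in the `tags` attribute are separated by spaces and that `case_sensitive="false"` means tokens are matched regardless of case.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical stand-in for a tag dictionary; not the OpenNLP implementation. */
final class TagDictionarySketch {
    private final Map<String, List<String>> entries = new HashMap<>();
    private final boolean caseSensitive;

    TagDictionarySketch(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            caseSensitive = Boolean.parseBoolean(root.getAttribute("case_sensitive"));
            NodeList nodes = root.getElementsByTagName("entry");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element entry = (Element) nodes.item(i);
                String token = entry.getElementsByTagName("token").item(0).getTextContent().trim();
                // Assumption: several tags are separated by whitespace.
                List<String> tags = Arrays.asList(entry.getAttribute("tags").trim().split("\\s+"));
                entries.put(normalize(token), tags);
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid dictionary XML", e);
        }
    }

    private String normalize(String token) {
        return caseSensitive ? token : token.toLowerCase(Locale.ROOT);
    }

    /** Returns the allowed tags for a word, or null if the word is not listed. */
    List<String> getTags(String word) {
        return entries.get(normalize(word));
    }

    public static void main(String[] args) {
        String xml = "<dictionary case_sensitive=\"false\">"
            + "<entry tags=\"PROPN\"><token>Spring</token></entry>"
            + "<entry tags=\"PRON\"><token>my</token></entry>"
            + "</dictionary>";
        TagDictionarySketch dict = new TagDictionarySketch(xml);
        System.out.println(dict.getTags("spring")); // prints [PROPN]
        System.out.println(dict.getTags("season")); // prints null
    }
}
```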
301+ <para >Sample code to train a model using the tag dictionary above:
302+ <programlisting language =" java" >
303+ <![CDATA[
304+ POSModel model = null;
305+ try {
306+ ObjectStream<String> lineStream = new PlainTextByLineStream(
307+ new MarkableFileInputStreamFactory(new File("en-custom-pos.train")), StandardCharsets.UTF_8);
308+
309+ ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
310+
311+ TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
312+ params.put(TrainingParameters.CUTOFF_PARAM, 0);
313+
314+ POSTaggerFactory factory = new POSTaggerFactory();
315+ TagDictionary dict = factory.createTagDictionary(new File("dictionary.xml"));
316+ factory.setTagDictionary(dict);
317+
318+ model = POSTaggerME.train("eng", sampleStream, params, factory);
319+
320+ try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("en-custom-pos-maxent.bin"))) {
321+ model.serialize(modelOut);
322+ }
323+ } catch (IOException e) {
324+ e.printStackTrace();
325+ }]]>
326+ </programlisting >
327+ </para >
328+ <para >
329+ The custom model is then used to tag a sequence. Here, tagger is a POSTaggerME created from the trained model, e.g. new POSTaggerME(model).
330+ <programlisting language =" java" >
331+ <![CDATA[
332+ String[] sent = new String[]{"Spring", "is", "my", "favourite", "season", "."};
333+ String[] tags = tagger.tag(sent);
334+ Arrays.stream(tags).forEach(k -> System.out.print(k + " "));]]>
335+ </programlisting >
336+ </para >
337+ <para >
338+ <literallayout >
339+ Input
340+ Sentence: Spring is my favourite season.
341+
342+ Output
343+ POS Tags using the custom model (en-custom-pos-maxent.bin): PROPN OTHER PRON ADJ NOUN PUNCT
344+
345+ Output with the default model
346+ POS Tags using the default model (opennlp-en-ud-ewt-pos-1.2-2.5.0.bin): NOUN AUX PRON ADJ NOUN PUNCT
347+ </literallayout >
245348 </para >
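The difference between the two outputs can be read as the effect of the tag dictionary restricting which tags a known token may receive. The sketch below illustrates that idea in isolation: it picks the highest-scoring tag among those the dictionary allows. The class `DictionaryConstraintSketch` and the scores are made up for illustration; the real tagger decodes whole sequences and is more involved than this per-token selection.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of dictionary-constrained tag selection. */
final class DictionaryConstraintSketch {

    /**
     * Returns the best-scoring tag that the dictionary permits.
     * A null 'allowed' list means the token is not in the dictionary,
     * so every tag is permitted.
     */
    static String bestAllowedTag(Map<String, Double> scores, List<String> allowed) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            boolean permitted = allowed == null || allowed.contains(e.getKey());
            if (permitted && e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up scores for the token "my"; not produced by any real model.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("DET", 0.7);
        scores.put("PRON", 0.3);

        System.out.println(bestAllowedTag(scores, List.of("PRON"))); // prints PRON
        System.out.println(bestAllowedTag(scores, null));            // prints DET
    }
}
```

In the sample data, "my" is tagged DET in the training file but listed only as PRON in the dictionary, which is consistent with the custom model emitting PRON for it.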
246349 </section >
247350 </section >