Commit 5149dd3

OPENNLP-855: Update docs
1 parent 12e0d01 commit 5149dd3

3 files changed, 191 additions and 3 deletions

opennlp-docs/src/docbkx/introduction.xml

Lines changed: 4 additions & 3 deletions
@@ -28,7 +28,8 @@ under the License.
 <para>
 The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
 It supports the most common NLP tasks, such as tokenization, sentence segmentation,
-part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.
+part-of-speech tagging, named entity extraction, chunking, parsing, coreference resolution,
+and sentiment analysis.
 These tasks are usually required to build more advanced text processing services.
 OpenNLP also includes maximum entropy and perceptron based machine learning.
 </para>
@@ -45,8 +46,8 @@ under the License.
 <para>The Apache OpenNLP library contains several components, enabling one to build
 a full natural language processing pipeline. These components
 include: sentence detector, tokenizer,
-name finder, document categorizer, part-of-speech tagger, chunker, parser,
-coreference resolution. Components contain parts which enable one to execute the
+name finder, document categorizer, sentiment analyzer, part-of-speech tagger,
+chunker, parser, coreference resolution. Components contain parts which enable one to execute the
 respective natural language processing task, to train a model and often also to evaluate a
 model. Each of these facilities is accessible via its application program
 interface (API). In addition, a command line interface (CLI) is provided for convenience

opennlp-docs/src/docbkx/opennlp.xml

Lines changed: 1 addition & 0 deletions
@@ -103,6 +103,7 @@ under the License.
 <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./tokenizer.xml" />
 <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./namefinder.xml" />
 <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./doccat.xml" />
+<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./sentiment.xml" />
 <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./postagger.xml" />
 <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./lemmatizer.xml" />
 <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./chunker.xml" />
Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V5.0//EN"
+"https://cdn.docbook.org/schema/5.0/dtd/docbook.dtd"[
+]>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+<chapter xml:id="tools.sentiment" xmlns:xlink="http://www.w3.org/1999/xlink">
+<title>Sentiment Analysis</title>
+
+<section xml:id="tools.sentiment.classifying">
+<title>Classifying</title>
+<para>
+The OpenNLP Sentiment Analyzer can classify text into sentiment categories such as
+"positive" or "negative". It is based on the maximum entropy framework.
+For example, the text below could be classified as <emphasis>positive</emphasis>:
+<screen>
+<![CDATA[I love this product it is absolutely wonderful and amazing]]>
+</screen>
+and the text below could be classified as <emphasis>negative</emphasis>:
+<screen>
+<![CDATA[Terrible experience the worst customer service I have ever had]]>
+</screen>
+To be able to classify text, the sentiment analyzer needs a model. The sentiment
+categories are requirements-specific and defined by the training data. There are no
+pre-built models for sentiment analysis under the OpenNLP project.
+</para>
+
+<section xml:id="tools.sentiment.classifying.cmdline">
+<title>Sentiment Analysis Tool</title>
+<para>
+The easiest way to try out the sentiment analyzer is the command line tool.
+The tool is only intended for demonstration and testing. The following command
+shows how to use the sentiment analysis tool:
+<screen>
+<![CDATA[$ opennlp Sentiment model]]>
+</screen>
+The input is read from standard input and the predicted sentiment is written
+to standard output.
+</para>
+</section>
+
+<section xml:id="tools.sentiment.classifying.api">
+<title>Sentiment Analysis API</title>
+<para>
+To perform sentiment classification you will need a model encapsulated in the
+<code>SentimentModel</code> class. First, load the model from an
+<code>InputStream</code>:
+<programlisting language="java">
+<![CDATA[InputStream is = ...
+SentimentModel model = new SentimentModel(is);]]>
+</programlisting>
+With the <code>SentimentModel</code> in hand, create a
+<code>SentimentME</code> instance and predict sentiments:
+<programlisting language="java">
+<![CDATA[SentimentME sentiment = new SentimentME(model);
+
+// Predict from a raw sentence string (tokenized internally)
+String result = sentiment.predict("I love this product");
+
+// Or predict from pre-tokenized input
+String[] tokens = new String[]{"I", "love", "this", "product"};
+String result2 = sentiment.predict(tokens);
+
+// Access the probability distribution over sentiment categories
+double[] probabilities = sentiment.probabilities(tokens);
+String bestSentiment = sentiment.getBestSentiment(probabilities);]]>
+</programlisting>
+</para>
+</section>
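The API section above ends by turning a probability array into a single label. As a rough illustration of that step only, the sketch below returns the label with the highest probability. `BestCategory` and `pick` are hypothetical names, not OpenNLP API, and the sketch assumes the caller supplies a label array in the same order as the probabilities; how `SentimentME` maps indices to categories internally is not shown in this commit.

```java
import java.util.Objects;

/** Hypothetical helper (not part of OpenNLP): maps a probability array to a label. */
final class BestCategory {

    private BestCategory() {}

    /**
     * Returns the label at the index of the largest probability.
     * Assumption: labels[i] names the category that probabilities[i] refers to.
     */
    static String pick(String[] labels, double[] probabilities) {
        Objects.requireNonNull(labels, "labels");
        Objects.requireNonNull(probabilities, "probabilities");
        if (labels.length == 0 || labels.length != probabilities.length) {
            throw new IllegalArgumentException(
                "labels and probabilities must be non-empty and of equal length");
        }
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            // Strict comparison: on ties the first maximum wins.
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return labels[best];
    }
}
```

Calling `BestCategory.pick(new String[]{"negative", "positive"}, new double[]{0.2, 0.8})` returns `"positive"`.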
+</section>
+
+<section xml:id="tools.sentiment.training">
+<title>Training</title>
+<para>
+The Sentiment Analyzer can be trained on annotated training material. The data
+format is one sample per line, containing the sentiment category followed by the
+text tokens, all separated by whitespace. The following sample shows the required format:
+<screen>
+<![CDATA[positive I love this movie it is absolutely wonderful and amazing
+positive This product is great and I am very happy with it
+negative I hate this product it broke after one day of use
+negative Terrible experience the worst customer service I have ever had]]>
+</screen>
+</para>
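The training format described above (category first, then whitespace-separated tokens) can be sketched as a small parser. This is a minimal illustration using only the JDK; `SentimentLine` is a hypothetical class, and the commit's `SentimentSampleStream` may handle edge cases such as blank lines differently.

```java
import java.util.Arrays;

/** Hypothetical holder for one parsed training line (not an OpenNLP class). */
final class SentimentLine {
    final String category;
    final String[] tokens;

    private SentimentLine(String category, String[] tokens) {
        this.category = category;
        this.tokens = tokens;
    }

    /** Splits "category tok1 tok2 ..." on runs of whitespace. */
    static SentimentLine parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("empty line");
        }
        String[] parts = trimmed.split("\\s+");
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                "expected a category followed by at least one token: " + line);
        }
        // First field is the category, the rest are the text tokens.
        return new SentimentLine(parts[0], Arrays.copyOfRange(parts, 1, parts.length));
    }
}
```

For the sample line `negative I hate this product it broke after one day of use`, `parse` yields the category `negative` and eleven tokens.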
+
+<section xml:id="tools.sentiment.training.tool">
+<title>Training Tool</title>
+<para>
+The following command will train the sentiment analyzer and write the model
+to <code>en-sentiment.bin</code>:
+<screen>
+<![CDATA[$ opennlp SentimentTrainer -model en-sentiment.bin -lang en -data en-sentiment.train -encoding UTF-8]]>
+</screen>
+</para>
+</section>
+
+<section xml:id="tools.sentiment.training.api">
+<title>Training API</title>
+<para>
+To train a sentiment model programmatically, prepare an
+<code>ObjectStream</code> of <code>SentimentSample</code> objects and
+call the <code>SentimentME.train()</code> method:
+<programlisting language="java">
+<![CDATA[SentimentModel model;
+
+InputStreamFactory dataIn = new MarkableFileInputStreamFactory(
+new File("en-sentiment.train"));
+
+ObjectStream<String> lineStream =
+new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);
+ObjectStream<SentimentSample> sampleStream =
+new SentimentSampleStream(lineStream);
+
+model = SentimentME.train("eng", sampleStream,
+TrainingParameters.defaultParams(), new SentimentFactory());]]>
+</programlisting>
+Once trained, the model can be serialized for later use:
+<programlisting language="java">
+<![CDATA[try (OutputStream modelOut = new BufferedOutputStream(
+new FileOutputStream("en-sentiment.bin"))) {
+model.serialize(modelOut);
+}]]>
+</programlisting>
+</para>
+</section>
+</section>
+
+<section xml:id="tools.sentiment.evaluation">
+<title>Evaluation</title>
+
+<section xml:id="tools.sentiment.evaluation.tool">
+<title>Evaluation Tool</title>
+<para>
+The sentiment analyzer can be evaluated against test data using the command line tool:
+<screen>
+<![CDATA[$ opennlp SentimentEvaluator -model en-sentiment.bin -data en-sentiment.test -encoding UTF-8]]>
+</screen>
+This will output precision, recall, and F-measure statistics.
+</para>
+</section>
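For readers unfamiliar with the three statistics named above, the sketch below shows their standard definitions in terms of true positives, false positives, and false negatives for one category. It is an assumption that the evaluator uses these textbook formulas; the commit does not show how the counts are gathered or aggregated across categories. `FMeasureSketch` is a hypothetical name.

```java
/** Textbook precision/recall/F1 from raw counts (hypothetical helper, not OpenNLP API). */
final class FMeasureSketch {

    private FMeasureSketch() {}

    /** Fraction of predicted positives that were correct; 0 when nothing was predicted. */
    static double precision(int truePositives, int falsePositives) {
        int predicted = truePositives + falsePositives;
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    /** Fraction of actual positives that were found; 0 when there were none. */
    static double recall(int truePositives, int falseNegatives) {
        int actual = truePositives + falseNegatives;
        return actual == 0 ? 0.0 : (double) truePositives / actual;
    }

    /** Harmonic mean of precision and recall; 0 when both are 0. */
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }
}
```

With 8 true positives, 2 false positives, and 8 false negatives, precision is 0.8, recall is 0.5, and F1 is 8/13 (about 0.615).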
+
+<section xml:id="tools.sentiment.evaluation.crossvalidation">
+<title>Cross Validation</title>
+<para>
+K-fold cross validation can be performed to evaluate the model without a
+separate test set:
+<screen>
+<![CDATA[$ opennlp SentimentCrossValidator -lang en -data en-sentiment.train -encoding UTF-8 -folds 10]]>
+</screen>
+</para>
+</section>
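The idea behind k-fold cross validation can be sketched as follows: the data is split into k parts, and each part in turn is held out for testing while the rest is used for training. The round-robin assignment used here (sample i goes to fold i mod k) is an assumption chosen for simplicity; the partitioning used by the command line tool is not shown in this commit. `KFoldSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

/** Round-robin k-fold split (hypothetical helper, not OpenNLP API). */
final class KFoldSketch {

    private KFoldSketch() {}

    /** Samples held out for testing in the given fold: indices i with i % k == fold. */
    static <T> List<T> testPart(List<T> samples, int k, int fold) {
        checkArgs(k, fold);
        List<T> out = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            if (i % k == fold) {
                out.add(samples.get(i));
            }
        }
        return out;
    }

    /** Samples used for training in the given fold: everything not held out. */
    static <T> List<T> trainPart(List<T> samples, int k, int fold) {
        checkArgs(k, fold);
        List<T> out = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            if (i % k != fold) {
                out.add(samples.get(i));
            }
        }
        return out;
    }

    private static void checkArgs(int k, int fold) {
        if (k < 2) {
            throw new IllegalArgumentException("k must be at least 2");
        }
        if (fold < 0 || fold >= k) {
            throw new IllegalArgumentException("fold must be in [0, k)");
        }
    }
}
```

A full run would loop `fold` from 0 to k-1, train on `trainPart`, evaluate on `testPart`, and average the scores.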
+
+<section xml:id="tools.sentiment.evaluation.api">
+<title>Evaluation API</title>
+<para>
+The evaluation API allows programmatic evaluation against a set of
+<code>SentimentSample</code> references:
+<programlisting language="java">
+<![CDATA[SentimentME sentiment = new SentimentME(model);
+SentimentEvaluator evaluator = new SentimentEvaluator(sentiment);
+evaluator.evaluate(testSampleStream);
+
+System.out.println(evaluator.getFMeasure());]]>
+</programlisting>
+</para>
+</section>
+</section>
+
+</chapter>
