| 1 | +<?xml version="1.0" encoding="UTF-8"?> |
| 2 | +<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V5.0//EN" |
| 3 | +"https://cdn.docbook.org/schema/5.0/dtd/docbook.dtd"[ |
| 4 | +]> |
| 5 | +<!-- |
| 6 | +Licensed to the Apache Software Foundation (ASF) under one |
| 7 | +or more contributor license agreements. See the NOTICE file |
| 8 | +distributed with this work for additional information |
| 9 | +regarding copyright ownership. The ASF licenses this file |
| 10 | +to you under the Apache License, Version 2.0 (the |
| 11 | +"License"); you may not use this file except in compliance |
| 12 | +with the License. You may obtain a copy of the License at |
| 13 | + |
| 14 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 15 | + |
| 16 | +Unless required by applicable law or agreed to in writing, |
| 17 | +software distributed under the License is distributed on an |
| 18 | +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 19 | +KIND, either express or implied. See the License for the |
| 20 | +specific language governing permissions and limitations |
| 21 | +under the License. |
| 22 | +--> |
| 23 | + |
| 24 | +<chapter xml:id="tools.sentiment" xmlns:xlink="http://www.w3.org/1999/xlink"> |
| 25 | +<title>Sentiment Analysis</title> |
| 26 | + |
| 27 | + <section xml:id="tools.sentiment.classifying"> |
| 28 | + <title>Classifying</title> |
| 29 | + <para> |
| 30 | + The OpenNLP Sentiment Analyzer can classify text into sentiment categories such as |
| 31 | + "positive" or "negative". It is based on the maximum entropy framework. |
| 32 | + For example, the text below could be classified as <emphasis>positive</emphasis>: |
| 33 | + <screen> |
| 34 | +<![CDATA[I love this product it is absolutely wonderful and amazing]]> |
| 35 | + </screen> |
| 36 | + and the text below could be classified as <emphasis>negative</emphasis>: |
| 37 | + <screen> |
| 38 | +<![CDATA[Terrible experience the worst customer service I have ever had]]> |
| 39 | + </screen> |
| 40 | + To classify text, the sentiment analyzer needs a model. The sentiment |
| 41 | + categories depend on your requirements and are defined by the training data. The |
| 42 | + OpenNLP project does not provide pre-built models for sentiment analysis. |
| 43 | + </para> |
| 44 | + |
| 45 | + <section xml:id="tools.sentiment.classifying.cmdline"> |
| 46 | + <title>Sentiment Analysis Tool</title> |
| 47 | + <para> |
| 48 | + The easiest way to try out the sentiment analyzer is the command line tool. |
| 49 | + The tool is intended only for demonstration and testing. The following command |
| 50 | + starts the tool, where <code>model</code> is the path to a trained model file: |
| 51 | + <screen> |
| 52 | +<![CDATA[$ opennlp Sentiment model]]> |
| 53 | + </screen> |
| 54 | + The input is read from standard input and the predicted sentiment is written |
| 55 | + to standard output. |
| 56 | + </para> |
| 57 | + </section> |
| 58 | + |
| 59 | + <section xml:id="tools.sentiment.classifying.api"> |
| 60 | + <title>Sentiment Analysis API</title> |
| 61 | + <para> |
| 62 | + To perform sentiment classification you will need a model encapsulated in the |
| 63 | + <code>SentimentModel</code> class. First, load the model from an |
| 64 | + <code>InputStream</code>: |
| 65 | + <programlisting language="java"> |
| 66 | +<![CDATA[InputStream is = ... |
| 67 | +SentimentModel model = new SentimentModel(is);]]> |
| 68 | + </programlisting> |
| 69 | + With the <code>SentimentModel</code> in hand, create a |
| 70 | + <code>SentimentME</code> instance and predict sentiments: |
| 71 | + <programlisting language="java"> |
| 72 | +<![CDATA[SentimentME sentiment = new SentimentME(model); |
| 73 | + |
| 74 | +// Predict from a raw sentence string (tokenized internally) |
| 75 | +String result = sentiment.predict("I love this product"); |
| 76 | + |
| 77 | +// Or predict from pre-tokenized input |
| 78 | +String[] tokens = new String[]{"I", "love", "this", "product"}; |
| 79 | +String result2 = sentiment.predict(tokens); |
| 80 | + |
| 81 | +// Access the probability distribution over sentiment categories |
| 82 | +double[] probabilities = sentiment.probabilities(tokens); |
| 83 | +String bestSentiment = sentiment.getBestSentiment(probabilities);]]> |
| 84 | + </programlisting> |
| 85 | + </para> |
| 86 | + </section> |
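The last two calls in the listing above pair a probability array with a lookup of the best category. As a rough, library-independent illustration of that step, the sketch below returns the label with the highest probability. The class `BestCategory`, the `labels` array, and the sample values are hypothetical and not part of the OpenNLP API; it assumes that `labels[i]` corresponds to `probabilities[i]`, and it does not claim to show how `getBestSentiment` is implemented.

```java
// Hypothetical helper, not part of OpenNLP: picks the label whose
// probability is highest, assuming labels[i] belongs to probabilities[i].
final class BestCategory {

    static String best(String[] labels, double[] probabilities) {
        if (labels.length == 0 || labels.length != probabilities.length) {
            throw new IllegalArgumentException(
                "labels and probabilities must be non-empty and of equal length");
        }
        int bestIndex = 0;
        for (int i = 1; i < probabilities.length; i++) {
            // Keep the first maximum when two categories tie.
            if (probabilities[i] > probabilities[bestIndex]) {
                bestIndex = i;
            }
        }
        return labels[bestIndex];
    }

    public static void main(String[] args) {
        String[] labels = {"negative", "positive"};  // illustrative category order
        double[] probabilities = {0.2, 0.8};         // illustrative scores
        System.out.println(best(labels, probabilities));
    }
}
```

Inspecting the full distribution rather than only the best label is useful when low-confidence predictions should be handled separately, for example by comparing the top probability against a threshold of your choosing.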
| 87 | + </section> |
| 88 | + |
| 89 | + <section xml:id="tools.sentiment.training"> |
| 90 | + <title>Training</title> |
| 91 | + <para> |
| 92 | + The Sentiment Analyzer can be trained on annotated training material. The data |
| 93 | + format is one sample per line, containing the sentiment category followed by the |
| 94 | + text tokens, all separated by whitespace. The following sample shows the required format: |
| 95 | + <screen> |
| 96 | +<![CDATA[positive I love this movie it is absolutely wonderful and amazing |
| 97 | +positive This product is great and I am very happy with it |
| 98 | +negative I hate this product it broke after one day of use |
| 99 | +negative Terrible experience the worst customer service I have ever had]]> |
| 100 | + </screen> |
| 101 | + </para> |
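To make the line format concrete, the following minimal sketch splits one training line into its category and tokens. The class `TrainingLine` is hypothetical and written only for illustration; it is not the parser OpenNLP uses, and it assumes that the first whitespace-separated field is the category and that at least one token follows it.

```java
import java.util.Arrays;

// Hypothetical parser for illustration only, not the OpenNLP implementation:
// the first whitespace-separated field is the category, the rest are tokens.
final class TrainingLine {
    final String category;
    final String[] tokens;

    private TrainingLine(String category, String[] tokens) {
        this.category = category;
        this.tokens = tokens;
    }

    static TrainingLine parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                "expected a category followed by at least one token: '" + line + "'");
        }
        return new TrainingLine(parts[0], Arrays.copyOfRange(parts, 1, parts.length));
    }

    public static void main(String[] args) {
        TrainingLine sample = parse("negative I hate this product it broke after one day of use");
        System.out.println(sample.category + " -> " + String.join(" ", sample.tokens));
    }
}
```

A check like this can be run over a training file before training to catch empty lines or lines that lack a category.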
| 102 | + |
| 103 | + <section xml:id="tools.sentiment.training.tool"> |
| 104 | + <title>Training Tool</title> |
| 105 | + <para> |
| 106 | + The following command will train the sentiment analyzer and write the model |
| 107 | + to <code>en-sentiment.bin</code>: |
| 108 | + <screen> |
| 109 | +<![CDATA[$ opennlp SentimentTrainer -model en-sentiment.bin -lang en -data en-sentiment.train -encoding UTF-8]]> |
| 110 | + </screen> |
| 111 | + </para> |
| 112 | + </section> |
| 113 | + |
| 114 | + <section xml:id="tools.sentiment.training.api"> |
| 115 | + <title>Training API</title> |
| 116 | + <para> |
| 117 | + To train a sentiment model programmatically, prepare an |
| 118 | + <code>ObjectStream</code> of <code>SentimentSample</code> objects and |
| 119 | + call the <code>SentimentME.train()</code> method: |
| 120 | + <programlisting language="java"> |
| 121 | +<![CDATA[SentimentModel model; |
| 122 | +
|
| 123 | +InputStreamFactory dataIn = new MarkableFileInputStreamFactory( |
| 124 | + new File("en-sentiment.train")); |
| 125 | + |
| 126 | +ObjectStream<String> lineStream = |
| 127 | + new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8); |
| 128 | +ObjectStream<SentimentSample> sampleStream = |
| 129 | + new SentimentSampleStream(lineStream); |
| 130 | +
|
| 131 | +model = SentimentME.train("eng", sampleStream, |
| 132 | + TrainingParameters.defaultParams(), new SentimentFactory());]]> |
| 133 | + </programlisting> |
| 134 | + Once trained, the model can be serialized for later use: |
| 135 | + <programlisting language="java"> |
| 136 | +<![CDATA[try (OutputStream modelOut = new BufferedOutputStream( |
| 137 | + new FileOutputStream("en-sentiment.bin"))) { |
| 138 | + model.serialize(modelOut); |
| 139 | +}]]> |
| 140 | + </programlisting> |
| 141 | + </para> |
| 142 | + </section> |
| 143 | + </section> |
| 144 | + |
| 145 | + <section xml:id="tools.sentiment.evaluation"> |
| 146 | + <title>Evaluation</title> |
| 147 | + |
| 148 | + <section xml:id="tools.sentiment.evaluation.tool"> |
| 149 | + <title>Evaluation Tool</title> |
| 150 | + <para> |
| 151 | + The sentiment analyzer can be evaluated against test data using the command line tool: |
| 152 | + <screen> |
| 153 | +<![CDATA[$ opennlp SentimentEvaluator -model en-sentiment.bin -data en-sentiment.test -encoding UTF-8]]> |
| 154 | + </screen> |
| 155 | + This will output precision, recall, and F-measure statistics. |
| 156 | + </para> |
| 157 | + </section> |
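For readers unfamiliar with the reported statistics, the sketch below computes precision, recall, and F-measure for a single category from parallel arrays of reference and predicted labels. The class `CategoryScores` and the sample labels are hypothetical; how the OpenNLP evaluator aggregates scores across categories is not stated here, so treat this as the textbook per-category definition rather than the tool's exact computation.

```java
// Hypothetical illustration of the textbook definitions, not the OpenNLP evaluator.
final class CategoryScores {

    // Returns {precision, recall, fMeasure} for one category.
    static double[] score(String category, String[] reference, String[] predicted) {
        if (reference.length != predicted.length) {
            throw new IllegalArgumentException("reference and predicted must have equal length");
        }
        int truePositives = 0;
        int falsePositives = 0;
        int falseNegatives = 0;
        for (int i = 0; i < reference.length; i++) {
            boolean isReference = category.equals(reference[i]);
            boolean isPredicted = category.equals(predicted[i]);
            if (isReference && isPredicted) {
                truePositives++;
            } else if (isPredicted) {
                falsePositives++;
            } else if (isReference) {
                falseNegatives++;
            }
        }
        // Define each ratio as 0 when its denominator is 0.
        int predictedCount = truePositives + falsePositives;
        int referenceCount = truePositives + falseNegatives;
        double precision = predictedCount == 0 ? 0.0 : (double) truePositives / predictedCount;
        double recall = referenceCount == 0 ? 0.0 : (double) truePositives / referenceCount;
        double sum = precision + recall;
        double fMeasure = sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
        return new double[] {precision, recall, fMeasure};
    }

    public static void main(String[] args) {
        String[] reference = {"positive", "positive", "negative", "negative"};
        String[] predicted = {"positive", "negative", "negative", "negative"};
        double[] scores = score("positive", reference, predicted);
        System.out.printf("precision=%.3f recall=%.3f f=%.3f%n", scores[0], scores[1], scores[2]);
    }
}
```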
| 158 | + |
| 159 | + <section xml:id="tools.sentiment.evaluation.crossvalidation"> |
| 160 | + <title>Cross Validation</title> |
| 161 | + <para> |
| 162 | + K-fold cross-validation evaluates the analyzer without a separate test set by |
| 163 | + holding out each fold in turn and training on the remaining folds: |
| 164 | + <screen> |
| 165 | +<![CDATA[$ opennlp SentimentCrossValidator -lang en -data en-sentiment.train -encoding UTF-8 -folds 10]]> |
| 166 | + </screen> |
| 167 | + </para> |
| 168 | + </section> |
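The idea behind the `-folds` option can be sketched without the library: each sample is assigned to one fold, and every fold serves once as the test partition while the other folds form the training partition. The class `KFoldSplit` and the modulo-based assignment below are assumptions made for illustration; the exact partitioning strategy of the command line tool is not described in this chapter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of k-fold partitioning, not the OpenNLP implementation.
// Sample i is assigned to fold (i % folds).
final class KFoldSplit {

    // Samples held out for testing in the given fold.
    static <T> List<T> heldOut(List<T> samples, int folds, int fold) {
        List<T> result = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            if (i % folds == fold) {
                result.add(samples.get(i));
            }
        }
        return result;
    }

    // Samples used for training in the given fold (everything not held out).
    static <T> List<T> training(List<T> samples, int folds, int fold) {
        List<T> result = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            if (i % folds != fold) {
                result.add(samples.get(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("s0", "s1", "s2", "s3", "s4", "s5");
        int folds = 3;
        for (int fold = 0; fold < folds; fold++) {
            System.out.println("fold " + fold
                + ": test=" + heldOut(samples, folds, fold)
                + " train=" + training(samples, folds, fold));
        }
    }
}
```

Every sample appears in exactly one test partition, so the averaged scores cover the whole data set even though no separate test file exists.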
| 169 | + |
| 170 | + <section xml:id="tools.sentiment.evaluation.api"> |
| 171 | + <title>Evaluation API</title> |
| 172 | + <para> |
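| 173 | + The evaluation API scores a model programmatically against reference |
| 174 | + <code>SentimentSample</code> objects. Here <code>model</code> is a loaded <code>SentimentModel</code> and <code>testSampleStream</code> is a stream of held-out samples: |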
| 175 | + <programlisting language="java"> |
| 176 | +<![CDATA[SentimentME sentiment = new SentimentME(model); |
| 177 | +SentimentEvaluator evaluator = new SentimentEvaluator(sentiment); |
| 178 | +evaluator.evaluate(testSampleStream); |
| 179 | + |
| 180 | +System.out.println(evaluator.getFMeasure());]]> |
| 181 | + </programlisting> |
| 182 | + </para> |
| 183 | + </section> |
| 184 | + </section> |
| 185 | + |
| 186 | +</chapter> |