Commit 9bfd807

OPENNLP-1808: Add SVM-based document categorization via zlibsvm (#981)
* OPENNLP-1808: Add SVM-based document categorization via zlibsvm

Introduces opennlp-ml-libsvm, a new ML module providing SVM-based text classification through the zlibsvm library. The module implements the DocumentCategorizer interface and includes:

- Configurable term weighting (binary, TF, TF-IDF, log-normalized TF)
- Feature selection (information gain, chi-square, TF, DF)
- Feature scaling with configurable range
- Full SVM parameter control (kernel, cost, gamma, etc.)
- Model serialization/deserialization
- CLI tools (DoccatSVM, DoccatSVMTrainer, DoccatSVMEvaluator)
- 86 unit tests across 8 test classes
- Documentation in doccat.xml, project-structure.xml, and README
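The commit message names four term-weighting schemes without defining them. As a rough illustration only, the sketch below shows one common way each weight can be computed from token counts. The class and method names are hypothetical, and the formulas (`1 + ln(tf)` for log-normalized TF, `tf * ln(N / df)` for TF-IDF) are assumed textbook variants; the formulas actually used by opennlp-ml-libsvm are not visible in this part of the diff.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative term-weighting helpers; NOT the opennlp-ml-libsvm implementation. */
class TermWeightingSketch {

  /** Counts how often each token occurs in a single tokenized document. */
  static Map<String, Integer> termFrequencies(String[] tokens) {
    Map<String, Integer> tf = new HashMap<>();
    for (String token : tokens) {
      tf.merge(token, 1, Integer::sum);
    }
    return tf;
  }

  /** Counts, for each token, the number of documents that contain it at least once. */
  static Map<String, Integer> documentFrequencies(List<String[]> corpus) {
    Map<String, Integer> df = new HashMap<>();
    for (String[] doc : corpus) {
      for (String token : termFrequencies(doc).keySet()) {
        df.merge(token, 1, Integer::sum);
      }
    }
    return df;
  }

  /** Binary weighting: 1 if the term occurs at all, otherwise 0. */
  static double binary(int tf) {
    return tf > 0 ? 1.0 : 0.0;
  }

  /** Log-normalized TF, assumed variant: 1 + ln(tf) for tf > 0. */
  static double logNormalizedTf(int tf) {
    return tf > 0 ? 1.0 + Math.log(tf) : 0.0;
  }

  /** TF-IDF, assumed variant: tf * ln(numDocs / df); 0 if the term is unseen. */
  static double tfIdf(int tf, int df, int numDocs) {
    if (df <= 0 || numDocs <= 0) {
      return 0.0;
    }
    return tf * Math.log((double) numDocs / df);
  }

  public static void main(String[] args) {
    List<String[]> corpus = List.of(
        new String[] {"a", "b", "a"},
        new String[] {"b", "c"});
    Map<String, Integer> df = documentFrequencies(corpus);
    // Print every weighting for each distinct term of the first document.
    termFrequencies(corpus.get(0)).forEach((term, tf) ->
        System.out.println(term + " binary=" + binary(tf)
            + " logTf=" + logNormalizedTf(tf)
            + " tfIdf=" + tfIdf(tf, df.get(term), corpus.size())));
  }
}
```

With this variant, a term that appears in every document gets a TF-IDF weight of zero, which is why TF-IDF tends to suppress very common words.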
1 parent dfe0a5a commit 9bfd807

31 files changed: +3464 -4 lines changed

LICENSE

Lines changed: 32 additions & 0 deletions
@@ -278,6 +278,38 @@ The following license applies to the ONNX Runtime:
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
 
+The following license applies to libsvm (used via zlibsvm in opennlp-ml-libsvm):
+
+Copyright (c) 2000-2023 Chih-Chung Chang and Chih-Jen Lin
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions
+are met:
+
+1. Redistributions of source code must retain the above copyright
+notice, this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright
+notice, this list of conditions and the following disclaimer in the
+documentation and/or other materials provided with the distribution.
+
+3. Neither name of copyright holders nor the names of its contributors
+may be used to endorse or promote products derived from this software
+without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR
+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
 The following license applies to the SLF4J API:
 
 MIT license

NOTICE

Lines changed: 13 additions & 0 deletions
@@ -39,6 +39,17 @@ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
 
+============================================================================
+
+The SVM-based document categorizer in opennlp-ml-libsvm uses zlibsvm
+(https://github.com/rzo1/zlibsvm), an object-oriented Java binding for
+LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/).
+
+zlibsvm is licensed under the Apache License, Version 2.0.
+LIBSVM is licensed under the BSD 3-Clause License.
+
+Copyright (c) 2000-2023 Chih-Chung Chang and Chih-Jen Lin
+
 ============================================================================
 List of third-party dependencies grouped by their license type.

@@ -51,6 +62,8 @@ List of third-party dependencies grouped by their license type.
 * HPPC Collections (com.carrotsearch:hppc:0.7.2 - http://labs.carrotsearch.com/hppc.html/hppc)
 * jcommander (com.beust:jcommander:1.78 - https://jcommander.org)
 * SLF4J 2 Provider for Log4j API (org.apache.logging.log4j:log4j-slf4j2-impl:2.25.3 - https://logging.apache.org/log4j/2.x/)
+* zlibsvm API (de.hs-heilbronn.mi:zlibsvm-api:2.1.2 - https://github.com/rzo1/zlibsvm)
+* zlibsvm Core (de.hs-heilbronn.mi:zlibsvm-core:2.1.2 - https://github.com/rzo1/zlibsvm)
 
 BSD License

README.md

Lines changed: 2 additions & 1 deletion
@@ -39,7 +39,7 @@ The goal of the OpenNLP project is to be a mature toolkit for the above mentione
 An additional goal is to provide a large number of pre-built models for a variety of languages, as
 well as the annotated text resources that those models are derived from.
 
-Presently, OpenNLP includes common classifiers such as Maximum Entropy, Perceptron and Naive Bayes.
+Presently, OpenNLP includes common classifiers such as Maximum Entropy, Perceptron, Naive Bayes and Support Vector Machines (SVM).
 
 OpenNLP can be used both programmatically through its Java API or from a terminal through its CLI.
 OpenNLP API can be easily plugged into distributed streaming data pipelines like Apache Flink, Apache NiFi, Apache Spark.
@@ -74,6 +74,7 @@ Currently, the library has different modules:
 * `opennlp-ml-maxent` : Maximum Entropy (MaxEnt) machine learning implementation.
 * `opennlp-ml-bayes` : Naive Bayes machine learning implementation.
 * `opennlp-ml-perceptron` : Perceptron-based machine learning implementation.
+* `opennlp-ml-libsvm` : Support Vector Machine (SVM) based text classification via [zlibsvm](https://github.com/rzo1/zlibsvm).
 * `opennlp-dl` : Apache OpenNLP adapter for [ONNX](https://onnx.ai) models using the `onnxruntime` dependency.
 * `opennlp-dl-gpu` : Replaces `onnxruntime` with the `onnxruntime_gpu` dependency to support GPU acceleration.
 * `opennlp-model-resolver` : Classes for discovering and loading Apache OpenNLP models from the classpath.

opennlp-core/opennlp-cli/pom.xml

Lines changed: 5 additions & 0 deletions
@@ -51,6 +51,11 @@
       <scope>runtime</scope>
     </dependency>
 
+    <dependency>
+      <artifactId>opennlp-ml-libsvm</artifactId>
+      <groupId>${project.groupId}</groupId>
+    </dependency>
+
     <!-- External dependencies -->
     <dependency>
       <groupId>org.slf4j</groupId>

opennlp-core/opennlp-cli/src/main/java/opennlp/tools/cmdline/CLI.java

Lines changed: 8 additions & 0 deletions
@@ -37,6 +37,9 @@
 import opennlp.tools.cmdline.doccat.DoccatConverterTool;
 import opennlp.tools.cmdline.doccat.DoccatCrossValidatorTool;
 import opennlp.tools.cmdline.doccat.DoccatEvaluatorTool;
+import opennlp.tools.cmdline.doccat.DoccatSVMEvaluatorTool;
+import opennlp.tools.cmdline.doccat.DoccatSVMTool;
+import opennlp.tools.cmdline.doccat.DoccatSVMTrainerTool;
 import opennlp.tools.cmdline.doccat.DoccatTool;
 import opennlp.tools.cmdline.doccat.DoccatTrainerTool;
 import opennlp.tools.cmdline.entitylinker.EntityLinkerTool;
@@ -100,6 +103,11 @@ public final class CLI {
     tools.add(new DoccatCrossValidatorTool());
     tools.add(new DoccatConverterTool());
 
+    // Document Categorizer (SVM)
+    tools.add(new DoccatSVMTool());
+    tools.add(new DoccatSVMTrainerTool());
+    tools.add(new DoccatSVMEvaluatorTool());
+
     // Language Detector
     tools.add(new LanguageDetectorTool());
     tools.add(new LanguageDetectorTrainerTool());

DoccatSVMEvaluatorTool.java

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.cmdline.doccat;
+
+import java.io.FileInputStream;
+import java.io.IOException;
+import java.util.LinkedList;
+import java.util.List;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import opennlp.tools.cmdline.AbstractEvaluatorTool;
+import opennlp.tools.cmdline.PerformanceMonitor;
+import opennlp.tools.cmdline.TerminateToolException;
+import opennlp.tools.cmdline.doccat.DoccatSVMEvaluatorTool.EvalToolParams;
+import opennlp.tools.cmdline.params.EvaluatorParams;
+import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
+import opennlp.tools.doccat.DoccatEvaluationMonitor;
+import opennlp.tools.doccat.DocumentCategorizerEvaluator;
+import opennlp.tools.doccat.DocumentSample;
+import opennlp.tools.ml.libsvm.doccat.SvmDoccatModel;
+import opennlp.tools.util.ObjectStream;
+import opennlp.tools.util.eval.EvaluationMonitor;
+
+/**
+ * CLI tool for evaluating an SVM-based document categorization model.
+ * <p>
+ * Usage: {@code opennlp DoccatSVMEvaluator -model model -data testData}
+ */
+public final class DoccatSVMEvaluatorTool extends
+    AbstractEvaluatorTool<DocumentSample, EvalToolParams> {
+
+  interface EvalToolParams extends EvaluatorParams {
+  }
+
+  private static final Logger logger = LoggerFactory.getLogger(DoccatSVMEvaluatorTool.class);
+
+  public DoccatSVMEvaluatorTool() {
+    super(DocumentSample.class, EvalToolParams.class);
+  }
+
+  @Override
+  public String getShortDescription() {
+    return "Measures the performance of the SVM Doccat model with the reference data";
+  }
+
+  @Override
+  public void run(String format, String[] args) {
+    super.run(format, args);
+
+    SvmDoccatModel model;
+    try (FileInputStream in = new FileInputStream(params.getModel())) {
+      model = SvmDoccatModel.deserialize(in);
+    } catch (IOException | ClassNotFoundException e) {
+      throw new TerminateToolException(-1,
+          "Failed to load SVM Doccat model: " + e.getMessage(), e);
+    }
+
+    List<EvaluationMonitor<DocumentSample>> listeners = new LinkedList<>();
+    if (params.getMisclassified()) {
+      listeners.add(new DoccatEvaluationErrorListener());
+    }
+
+    opennlp.tools.ml.libsvm.doccat.DocumentCategorizerSVM categorizer =
+        new opennlp.tools.ml.libsvm.doccat.DocumentCategorizerSVM(
+            model, new BagOfWordsFeatureGenerator());
+
+    DocumentCategorizerEvaluator evaluator = new DocumentCategorizerEvaluator(
+        categorizer, listeners.toArray(new DoccatEvaluationMonitor[0]));
+
+    final PerformanceMonitor monitor = new PerformanceMonitor("doc");
+
+    try (ObjectStream<DocumentSample> measuredSampleStream = new ObjectStream<>() {
+      @Override
+      public DocumentSample read() throws IOException {
+        monitor.incrementCounter();
+        return sampleStream.read();
+      }
+
+      @Override
+      public void reset() throws IOException {
+        sampleStream.reset();
+      }
+
+      @Override
+      public void close() throws IOException {
+        sampleStream.close();
+      }
+    }) {
+      monitor.startAndPrintThroughput();
+      evaluator.evaluate(measuredSampleStream);
+    } catch (IOException e) {
+      throw new TerminateToolException(-1,
+          "IO error while reading test data: " + e.getMessage(), e);
+    }
+
+    monitor.stopAndPrintFinalResult();
+
+    logger.info(evaluator.toString());
+  }
+}

DoccatSVMTool.java

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.cmdline.doccat;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.IOException;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import opennlp.tools.cmdline.BasicCmdLineTool;
+import opennlp.tools.cmdline.CLI;
+import opennlp.tools.cmdline.CmdLineUtil;
+import opennlp.tools.cmdline.PerformanceMonitor;
+import opennlp.tools.cmdline.SystemInputStreamFactory;
+import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
+import opennlp.tools.doccat.DocumentSample;
+import opennlp.tools.ml.libsvm.doccat.SvmDoccatModel;
+import opennlp.tools.tokenize.WhitespaceTokenizer;
+import opennlp.tools.util.ObjectStream;
+import opennlp.tools.util.ParagraphStream;
+import opennlp.tools.util.PlainTextByLineStream;
+
+/**
+ * CLI tool for classifying documents using an SVM-based document categorization model.
+ * <p>
+ * Usage: {@code opennlp DoccatSVM model < documents}
+ */
+public class DoccatSVMTool extends BasicCmdLineTool {
+
+  private static final Logger logger = LoggerFactory.getLogger(DoccatSVMTool.class);
+
+  @Override
+  public String getShortDescription() {
+    return "SVM-based document categorizer";
+  }
+
+  @Override
+  public String getHelp() {
+    return "Usage: " + CLI.CMD + " " + getName() + " model < documents";
+  }
+
+  @Override
+  public void run(String[] args) {
+
+    if (0 == args.length) {
+      logger.info(getHelp());
+    } else {
+
+      File modelFile = new File(args[0]);
+      CmdLineUtil.checkInputFile("SVM Doccat model", modelFile);
+
+      SvmDoccatModel model;
+      try (FileInputStream in = new FileInputStream(modelFile)) {
+        model = SvmDoccatModel.deserialize(in);
+      } catch (IOException | ClassNotFoundException e) {
+        throw new RuntimeException("Failed to load SVM Doccat model: " + e.getMessage(), e);
+      }
+
+      opennlp.tools.ml.libsvm.doccat.DocumentCategorizerSVM categorizer =
+          new opennlp.tools.ml.libsvm.doccat.DocumentCategorizerSVM(
+              model, new BagOfWordsFeatureGenerator());
+
+      ObjectStream<String> documentStream;
+
+      PerformanceMonitor perfMon = new PerformanceMonitor("doc");
+      perfMon.start();
+
+      try {
+        documentStream = new ParagraphStream(new PlainTextByLineStream(
+            new SystemInputStreamFactory(), SystemInputStreamFactory.encoding()));
+        String document;
+        while ((document = documentStream.read()) != null) {
+          String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(document);
+
+          double[] prob = categorizer.categorize(tokens);
+          String category = categorizer.getBestCategory(prob);
+
+          DocumentSample sample = new DocumentSample(category, tokens);
+          logger.info(sample.toString());
+
+          perfMon.incrementCounter();
+        }
+      } catch (IOException e) {
+        CmdLineUtil.handleStdinIoError(e);
+      }
+
+      perfMon.stopAndPrintFinalResult();
+    }
+  }
+}
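
The classification loop in `DoccatSVMTool` passes the score array from `categorize` to `getBestCategory` to obtain a label. The body of `getBestCategory` is not part of the files shown here, so the following is only a minimal sketch of the usual approach, assuming the best category is simply the one with the highest score. `BestCategorySketch` and its `bestCategory` method are hypothetical names introduced for illustration.

```java
/** Illustrative argmax over category scores; NOT the DocumentCategorizerSVM code. */
class BestCategorySketch {

  /**
   * Returns the category whose score is highest.
   * Assumption: categories[i] corresponds to scores[i]; ties go to the first index.
   */
  static String bestCategory(String[] categories, double[] scores) {
    if (scores.length == 0 || categories.length != scores.length) {
      throw new IllegalArgumentException("categories and scores must be non-empty and equal in length");
    }
    int best = 0;
    for (int i = 1; i < scores.length; i++) {
      if (scores[i] > scores[best]) {
        best = i;
      }
    }
    return categories[best];
  }

  public static void main(String[] args) {
    String[] categories = {"sports", "politics", "tech"};
    double[] scores = {0.2, 0.7, 0.1};
    System.out.println(bestCategory(categories, scores));
  }
}
```

Keeping the scores and the label lookup as separate steps, as the tool does, lets a caller inspect the full score array before committing to a single category.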
