| 1 | +<?xml version="1.0" encoding="UTF-8"?> |
| 2 | +<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V5.0//EN" |
| 3 | +"https://cdn.docbook.org/schema/5.0/dtd/docbook.dtd"[ |
| 4 | +]> |
| 5 | +<!-- |
| 6 | +Licensed to the Apache Software Foundation (ASF) under one |
| 7 | +or more contributor license agreements. See the NOTICE file |
| 8 | +distributed with this work for additional information |
| 9 | +regarding copyright ownership. The ASF licenses this file |
| 10 | +to you under the Apache License, Version 2.0 (the |
| 11 | +"License"); you may not use this file except in compliance |
| 12 | +with the License. You may obtain a copy of the License at |
| 13 | + |
| 14 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 15 | + |
| 16 | +Unless required by applicable law or agreed to in writing, |
| 17 | +software distributed under the License is distributed on an |
| 18 | +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 19 | +KIND, either express or implied. See the License for the |
| 20 | +specific language governing permissions and limitations |
| 21 | +under the License. |
| 22 | +--> |
| 23 | + |
| 24 | +<chapter xml:id="tools.project.structure" xmlns:xlink="http://www.w3.org/1999/xlink"> |
| 25 | +<title>Project Structure</title> |
| 26 | + |
| 27 | + <section xml:id="tools.project.structure.overview"> |
| 28 | + <title>Overview</title> |
| 29 | + <para> |
| 30 | + Starting with version 3.0, Apache OpenNLP has been reorganized from a single monolithic |
| 31 | + <code>opennlp-tools</code> artifact into a set of fine-grained modules. This modularization |
| 32 | + allows users to depend only on the components they actually need, resulting in a smaller |
| 33 | + dependency footprint. At the same time, the public API remains stable, with no known |
| 34 | + breaking changes relative to the 2.x releases. |
| 35 | + </para> |
| 36 | + <para> |
| 37 | + The following sections describe each module, its purpose, and when to include it as a dependency. |
| 38 | + </para> |
| 39 | + </section> |
| 40 | + |
| 41 | + <section xml:id="tools.project.structure.api"> |
| 42 | + <title>API Module</title> |
| 43 | + <para> |
| 44 | + The <code>opennlp-api</code> module defines the public interfaces and abstractions |
| 45 | + that form the contract between OpenNLP and its users. It contains the core interfaces |
| 46 | + such as <code>Tokenizer</code>, <code>SentenceDetector</code>, <code>POSTagger</code>, |
| 47 | + <code>TokenNameFinder</code>, <code>Chunker</code>, <code>Parser</code>, |
| 48 | + <code>LanguageDetector</code>, <code>Lemmatizer</code>, and <code>DocumentCategorizer</code>. |
| 49 | + </para> |
| 50 | + <para> |
| 51 | + This module also provides shared base classes such as <code>BaseModel</code>, |
| 52 | + the <code>ObjectStream</code> abstraction for data processing, the command-line |
| 53 | + argument parsing framework, and common utility types. It is a transitive dependency |
| 54 | + of <code>opennlp-runtime</code> and typically does not need to be declared explicitly. |
| 55 | + </para> |
| 56 | + |
| 57 | + <programlisting language="xml"> |
| 58 | +<![CDATA[<dependency> |
| 59 | + <groupId>org.apache.opennlp</groupId> |
| 60 | + <artifactId>opennlp-api</artifactId> |
| 61 | + <version>CURRENT_OPENNLP_VERSION</version> |
| 62 | +</dependency>]]> |
| 63 | + </programlisting> |
| 64 | + </section> |
| 65 | + |
| 66 | + <section xml:id="tools.project.structure.runtime"> |
| 67 | + <title>Runtime Module</title> |
| 68 | + <para> |
| 69 | + The <code>opennlp-runtime</code> module is the primary dependency for most users. It |
| 70 | + contains the core NLP tool implementations including sentence detection, tokenization, |
| 71 | + part-of-speech tagging, named entity recognition, chunking, parsing, language detection, |
| 72 | + lemmatization, and document categorization. |
| 73 | + </para> |
| 74 | + <para> |
| 75 | + By default, <code>opennlp-runtime</code> pulls in the Maximum Entropy machine |
| 76 | + learning implementation (<code>opennlp-ml-maxent</code>) transitively. If you need other |
| 77 | + ML algorithms, add the corresponding ML module as described in <xref linkend="tools.project.structure.ml"/>. |
| 78 | + </para> |
| 79 | + |
| 80 | + <programlisting language="xml"> |
| 81 | +<![CDATA[<dependency> |
| 82 | + <groupId>org.apache.opennlp</groupId> |
| 83 | + <artifactId>opennlp-runtime</artifactId> |
| 84 | + <version>CURRENT_OPENNLP_VERSION</version> |
| 85 | +</dependency>]]> |
| 86 | + </programlisting> |
| 87 | + </section> |
| 88 | + |
| 89 | + <section xml:id="tools.project.structure.ml"> |
| 90 | + <title>Machine Learning Modules</title> |
| 91 | + <para> |
| 92 | + The machine learning implementations have been separated into individual modules so that |
| 93 | + applications can include only the algorithms they use. Each module provides a specific |
| 94 | + ML algorithm and is loaded at runtime via the <code>ExtensionLoader</code> service |
| 95 | + discovery mechanism. |
| 96 | + </para> |
| 97 | + |
| 98 | + <itemizedlist> |
| 99 | + <listitem> |
| 100 | + <para> |
| 101 | + <code>opennlp-ml-commons</code> — Shared ML utilities and base classes used |
| 102 | + by all ML algorithm modules. This is a transitive dependency of each ML module |
| 103 | + and does not need to be declared explicitly. |
| 104 | + </para> |
| 105 | + </listitem> |
| 106 | + <listitem> |
| 107 | + <para> |
| 108 | + <code>opennlp-ml-maxent</code> — Maximum Entropy classifier. This is the default |
| 109 | + algorithm and is included transitively via <code>opennlp-runtime</code>. |
| 110 | + </para> |
| 111 | + </listitem> |
| 112 | + <listitem> |
| 113 | + <para> |
| 114 | + <code>opennlp-ml-perceptron</code> — Perceptron-based learning algorithm. |
| 115 | + Add this dependency if your models use the Perceptron or Perceptron Sequence trainer. |
| 116 | + </para> |
| 117 | + </listitem> |
| 118 | + <listitem> |
| 119 | + <para> |
| 120 | + <code>opennlp-ml-bayes</code> — Naive Bayes classifier. |
| 121 | + Add this dependency if your models use the Naive Bayes trainer. |
| 122 | + </para> |
| 123 | + </listitem> |
| 124 | + </itemizedlist> |
| 125 | + |
| 126 | + <para> |
| 127 | + For example, to use the Perceptron trainer alongside the default Maximum Entropy implementation, add: |
| 128 | + </para> |
| 129 | + |
| 130 | + <programlisting language="xml"> |
| 131 | +<![CDATA[<dependency> |
| 132 | + <groupId>org.apache.opennlp</groupId> |
| 133 | + <artifactId>opennlp-ml-perceptron</artifactId> |
| 134 | + <version>CURRENT_OPENNLP_VERSION</version> |
| 135 | +</dependency>]]> |
| 136 | + </programlisting> |
| 137 | + </section> |
| 138 | + |
| 139 | + <section xml:id="tools.project.structure.models"> |
| 140 | + <title>Models Module</title> |
| 141 | + <para> |
| 142 | + The <code>opennlp-models</code> module provides classpath-based model discovery and |
| 143 | + loading. It enables applications to bundle pre-trained OpenNLP models as JAR files and |
| 144 | + load them at runtime without explicit file path references. |
| 145 | + See <xref linkend="tools.model"/> for details on classpath model loading. |
| 146 | + </para> |
| 147 | + |
| 148 | + <programlisting language="xml"> |
| 149 | +<![CDATA[<dependency> |
| 150 | + <groupId>org.apache.opennlp</groupId> |
| 151 | + <artifactId>opennlp-models</artifactId> |
| 152 | + <version>CURRENT_OPENNLP_VERSION</version> |
| 153 | +</dependency>]]> |
| 154 | + </programlisting> |
| 155 | + </section> |
| 156 | + |
| 157 | + <section xml:id="tools.project.structure.formats"> |
| 158 | + <title>Formats Module</title> |
| 159 | + <para> |
| 160 | + The <code>opennlp-formats</code> module supports reading and writing various NLP |
| 161 | + training and evaluation data formats, including CoNLL, BioNLP, BRAT, AD (Floresta), |
| 162 | + Leipzig, and others. Include this module if you need to train or evaluate models with |
| 163 | + data that is not in OpenNLP's native formats. |
| 164 | + </para> |
| 165 | + |
| 166 | + <programlisting language="xml"> |
| 167 | +<![CDATA[<dependency> |
| 168 | + <groupId>org.apache.opennlp</groupId> |
| 169 | + <artifactId>opennlp-formats</artifactId> |
| 170 | + <version>CURRENT_OPENNLP_VERSION</version> |
| 171 | +</dependency>]]> |
| 172 | + </programlisting> |
| 173 | + </section> |
| 174 | + |
| 175 | + <section xml:id="tools.project.structure.dl"> |
| 176 | + <title>Deep Learning Modules</title> |
| 177 | + <para> |
| 178 | + OpenNLP provides optional support for ONNX-based neural models via two modules: |
| 179 | + </para> |
| 180 | + |
| 181 | + <itemizedlist> |
| 182 | + <listitem> |
| 183 | + <para> |
| 184 | + <code>opennlp-dl</code> — Integrates the ONNX Runtime for CPU-based inference. |
| 185 | + This module enables the use of models trained by external frameworks such as |
| 186 | + PyTorch or TensorFlow, exported in the ONNX format. |
| 187 | + </para> |
| 188 | + </listitem> |
| 189 | + <listitem> |
| 190 | + <para> |
| 191 | + <code>opennlp-dl-gpu</code> — Replaces the CPU ONNX Runtime with the |
| 192 | + GPU-accelerated variant for systems with supported GPU hardware. |
| 193 | + Use this module instead of <code>opennlp-dl</code> when GPU acceleration |
| 194 | + is available and desired. |
| 195 | + </para> |
| 196 | + </listitem> |
| 197 | + </itemizedlist> |
| 198 | + |
| 199 | + <programlisting language="xml"> |
| 200 | +<![CDATA[<!-- CPU variant --> |
| 201 | +<dependency> |
| 202 | + <groupId>org.apache.opennlp</groupId> |
| 203 | + <artifactId>opennlp-dl</artifactId> |
| 204 | + <version>CURRENT_OPENNLP_VERSION</version> |
| 205 | +</dependency> |
| 206 | + |
| 207 | +<!-- OR GPU variant (do not include both) --> |
| 208 | +<dependency> |
| 209 | + <groupId>org.apache.opennlp</groupId> |
| 210 | + <artifactId>opennlp-dl-gpu</artifactId> |
| 211 | + <version>CURRENT_OPENNLP_VERSION</version> |
| 212 | +</dependency>]]> |
| 213 | + </programlisting> |
| 214 | + </section> |
| 215 | + |
| 216 | + <section xml:id="tools.project.structure.cli"> |
| 217 | + <title>CLI Module</title> |
| 218 | + <para> |
| 219 | + The <code>opennlp-cli</code> module provides the command-line tools for training, |
| 220 | + evaluating, and running OpenNLP models from a terminal. It is included in the binary |
| 221 | + distribution and is not typically needed as a library dependency. |
| 222 | + See <xref linkend="tools.cli"/> for details on available CLI commands. |
| 223 | + </para> |
| 224 | + </section> |
| 225 | + |
| 226 | + <section xml:id="tools.project.structure.tools"> |
| 227 | + <title>Tools Module (Aggregated JAR)</title> |
| 228 | + <para> |
| 229 | + The <code>opennlp-tools</code> module is an aggregated artifact that bundles |
| 230 | + all core modules (<code>opennlp-api</code>, <code>opennlp-runtime</code>, |
| 231 | + all ML modules, <code>opennlp-models</code>, <code>opennlp-formats</code>, |
| 232 | + and <code>opennlp-cli</code>) into a single JAR. It is provided for backwards |
| 233 | + compatibility with 2.x and for the binary distribution. |
| 234 | + </para> |
| 235 | + <para> |
| 236 | + For new projects, we recommend depending on <code>opennlp-runtime</code> |
| 237 | + plus only the specific additional modules you need, rather than pulling in |
| 238 | + the full <code>opennlp-tools</code> artifact. |
| 239 | + </para> |
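| | + |
| | + <para> |
| | + Projects that stay on the aggregated artifact would presumably keep the same declaration as in 2.x |
| | + and change only the version. The following is a sketch, assuming <code>opennlp-tools</code> is |
| | + published under the same <code>groupId</code> as the other modules; |
| | + <code>CURRENT_OPENNLP_VERSION</code> is a placeholder for the release you use. |
| | + </para> |
| | + |
| | + <programlisting language="xml"> |
| | +<![CDATA[<!-- Assumed coordinates: aggregated artifact, kept mainly for 2.x compatibility --> |
| | +<dependency> |
| | + <groupId>org.apache.opennlp</groupId> |
| | + <artifactId>opennlp-tools</artifactId> |
| | + <version>CURRENT_OPENNLP_VERSION</version> |
| | +</dependency>]]> |
| | + </programlisting> |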
| 240 | + </section> |
| 241 | + |
| 242 | + <section xml:id="tools.project.structure.extensions"> |
| 243 | + <title>Extension Modules</title> |
| 244 | + <para> |
| 245 | + OpenNLP provides optional extension modules for integration with external libraries and frameworks: |
| 246 | + </para> |
| 247 | + |
| 248 | + <itemizedlist> |
| 249 | + <listitem> |
| 250 | + <para> |
| 251 | + <code>opennlp-morfologik</code> — Integrates the |
| 252 | + <link xlink:href="https://github.com/morfologik">Morfologik</link> |
| 253 | + library for dictionary-based stemming and lemmatization. |
| 254 | + See <xref linkend="tools.morfologik"/> for usage details. |
| 255 | + </para> |
| 256 | + </listitem> |
| 257 | + <listitem> |
| 258 | + <para> |
| 259 | + <code>opennlp-uima</code> — Provides a set of |
| 260 | + <link xlink:href="https://uima.apache.org">Apache UIMA</link> |
| 261 | + annotators that wrap OpenNLP components for use in UIMA pipelines. |
| 262 | + See <xref linkend="tools.uima"/> for integration details. |
| 263 | + </para> |
| 264 | + </listitem> |
| 265 | + </itemizedlist> |
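| | + |
| | + <para> |
| | + Following the pattern of the other modules, an extension could be declared as sketched here. |
| | + The sketch assumes that each <code>artifactId</code> matches the module name given in the list and that |
| | + the extensions share the <code>org.apache.opennlp</code> <code>groupId</code>; verify both against the |
| | + published artifacts before relying on them. Include only the extension you actually use. |
| | + </para> |
| | + |
| | + <programlisting language="xml"> |
| | +<![CDATA[<!-- Assumed coordinates: Morfologik integration --> |
| | +<dependency> |
| | + <groupId>org.apache.opennlp</groupId> |
| | + <artifactId>opennlp-morfologik</artifactId> |
| | + <version>CURRENT_OPENNLP_VERSION</version> |
| | +</dependency> |
| | + |
| | +<!-- Assumed coordinates: UIMA annotators --> |
| | +<dependency> |
| | + <groupId>org.apache.opennlp</groupId> |
| | + <artifactId>opennlp-uima</artifactId> |
| | + <version>CURRENT_OPENNLP_VERSION</version> |
| | +</dependency>]]> |
| | + </programlisting> |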
| 266 | + </section> |
| 267 | + |
| 268 | + <section xml:id="tools.project.structure.migration"> |
| 269 | + <title>Migrating from 2.x to 3.x</title> |
| 270 | + <para> |
| 271 | + The 3.x release introduces no known breaking API changes. Existing code using the |
| 272 | + <code>opennlp-tools</code> artifact will continue to work without modification. |
| 273 | + However, we strongly recommend migrating to the modular dependency structure for a |
| 274 | + smaller footprint. |
| 275 | + </para> |
| 276 | + <para> |
| 277 | + A minimal migration replaces: |
| 278 | + </para> |
| 279 | + |
| 280 | + <programlisting language="xml"> |
| 281 | +<![CDATA[<!-- 2.x: single monolithic dependency --> |
| 282 | +<dependency> |
| 283 | + <groupId>org.apache.opennlp</groupId> |
| 284 | + <artifactId>opennlp-tools</artifactId> |
| 285 | + <version>2.x.y</version> |
| 286 | +</dependency>]]> |
| 287 | + </programlisting> |
| 288 | + |
| 289 | + <para> |
| 290 | + with: |
| 291 | + </para> |
| 292 | + |
| 293 | + <programlisting language="xml"> |
| 294 | +<![CDATA[<!-- 3.x: modular dependencies — add only what you need --> |
| 295 | +<dependency> |
| 296 | + <groupId>org.apache.opennlp</groupId> |
| 297 | + <artifactId>opennlp-runtime</artifactId> |
| 298 | + <version>CURRENT_OPENNLP_VERSION</version> |
| 299 | +</dependency> |
| 300 | +<!-- Add opennlp-models, opennlp-ml-perceptron, opennlp-dl, etc. as needed -->]]> |
| 301 | + </programlisting> |
| 302 | + |
| 303 | + <note> |
| 304 | + <para> |
| 305 | + The <code>opennlp-runtime</code> module includes the Maximum Entropy ML |
| 306 | + implementation by default. If your models were trained with the Perceptron |
| 307 | + or Naive Bayes algorithm, add the corresponding <code>opennlp-ml-perceptron</code> |
| 308 | + or <code>opennlp-ml-bayes</code> dependency. |
| 309 | + </para> |
| 310 | + </note> |
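| | + |
| | + <para> |
| | + As an illustration of the note, an application whose models were trained with Naive Bayes |
| | + might end up with the pair of dependencies sketched here. The artifact names are taken from the module |
| | + list in this chapter, and <code>CURRENT_OPENNLP_VERSION</code> is a placeholder; for Perceptron models, |
| | + swap in <code>opennlp-ml-perceptron</code> instead. |
| | + </para> |
| | + |
| | + <programlisting language="xml"> |
| | +<![CDATA[<!-- Core tools; brings in the default Maximum Entropy implementation --> |
| | +<dependency> |
| | + <groupId>org.apache.opennlp</groupId> |
| | + <artifactId>opennlp-runtime</artifactId> |
| | + <version>CURRENT_OPENNLP_VERSION</version> |
| | +</dependency> |
| | + |
| | +<!-- Needed only because the models in this example use Naive Bayes --> |
| | +<dependency> |
| | + <groupId>org.apache.opennlp</groupId> |
| | + <artifactId>opennlp-ml-bayes</artifactId> |
| | + <version>CURRENT_OPENNLP_VERSION</version> |
| | +</dependency>]]> |
| | + </programlisting> |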
| 311 | + </section> |
| 312 | + |
| 313 | +</chapter> |