Thai Natural Language Processing library for Java
JThaiNLP is a Java library for Thai text processing with a PyThaiNLP-compatible API. It provides word tokenization using the newmm (New Maximum Matching) algorithm.
- Word Tokenization: Dictionary-based word segmentation using the newmm algorithm
- Thai Character Cluster (TCC): Proper handling of Thai character boundaries
- PyThaiNLP-compatible API: Familiar interface for users of PyThaiNLP
- Custom Dictionary Support: Use your own word dictionary
- Maven Support: Easy integration into Java projects
Add this dependency to your pom.xml:
<dependency>
<groupId>com.github.wannaphong</groupId>
<artifactId>jthainlp</artifactId>
<version>1.0.0</version>
</dependency>

Or build from source:
git clone https://github.com/wannaphong/JThaiNLP.git
cd JThaiNLP
mvn clean install

import com.github.wannaphong.jthainlp.tokenize.Tokenize;
import java.util.List;
// Basic word tokenization
String text = "ไทยและภาษา";
List<String> tokens = Tokenize.wordTokenize(text);
System.out.println(tokens); // Output: [ไทย, และ, ภาษา]

import com.github.wannaphong.jthainlp.tokenize.Tokenize;
import com.github.wannaphong.jthainlp.util.Trie;
import java.util.Arrays;
import java.util.List;
// Use custom dictionary
Trie customDict = new Trie(Arrays.asList("ไทย", "และ", "ภาษา", "ภาษาไทย"));
String text = "ไทยและภาษาไทย";
List<String> tokens = Tokenize.wordTokenize(text, customDict);
System.out.println(tokens); // Output: [ไทย, และ, ภาษาไทย]

import com.github.wannaphong.jthainlp.tokenize.Tokenize;
// Similar to PyThaiNLP's word_tokenize function
List<String> tokens1 = Tokenize.wordTokenize(text);
List<String> tokens2 = Tokenize.wordTokenize(text, customDict);
List<String> tokens3 = Tokenize.wordTokenize(text, customDict, "newmm");
List<String> tokens4 = Tokenize.wordTokenize(text, customDict, "newmm", true);

To run the bundled example:

mvn compile exec:java -Dexec.mainClass="com.github.wannaphong.jthainlp.example.TokenizerExample"

Tokenize.wordTokenize is the main tokenization function, with a PyThaiNLP-compatible interface.
Parameters:
- text (String): Text to tokenize
- customDict (Trie, optional): Custom dictionary (null to use the default)
- engine (String, optional): Tokenization engine ("newmm" is the default and currently the only supported engine)
- keepWhitespace (boolean, optional): Whether to keep whitespace tokens in the output (default: true)
Returns:
List<String>: List of word tokens
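The effect of keepWhitespace can be illustrated with a small standalone sketch. This does not call the library itself; applyKeepWhitespace is a hypothetical post-processing step, and the token list is what a newmm-style tokenizer might plausibly produce for space-separated Thai text:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class KeepWhitespaceDemo {
    // Hypothetical post-processing step: when keepWhitespace is false,
    // whitespace-only tokens are dropped; otherwise the list is unchanged.
    static List<String> applyKeepWhitespace(List<String> tokens, boolean keepWhitespace) {
        if (keepWhitespace) {
            return tokens;
        }
        return tokens.stream()
                .filter(t -> !t.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Plausible tokenizer output for "ไทย และ ภาษา" (spaces kept as tokens)
        List<String> tokens = Arrays.asList("ไทย", " ", "และ", " ", "ภาษา");
        System.out.println(applyKeepWhitespace(tokens, true));  // whitespace tokens kept
        System.out.println(applyKeepWhitespace(tokens, false)); // [ไทย, และ, ภาษา]
    }
}
```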
# Build the project
mvn clean compile
# Run tests
mvn test
# Package as JAR
mvn package

The newmm (New Maximum Matching) tokenizer implements:
- Dictionary-based maximal matching word segmentation
- Constrained by Thai Character Cluster (TCC) boundaries
- Graph-based approach with configurable cutoff to avoid exponential time
- Handling of non-Thai text (English, numbers, punctuation)
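The core longest-match idea can be sketched in a few lines. This is a simplified greedy version, not the library's implementation: the real newmm builds a graph of all dictionary matches (constrained by TCC boundaries) and selects a best path, while this sketch simply takes the longest dictionary match at each position:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaximalMatchingSketch {
    // Greedy longest-match segmentation over a word set.
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLen);
            String match = null;
            for (int j = end; j > i; j--) {      // try the longest candidate first
                String cand = text.substring(i, j);
                if (dict.contains(cand)) {
                    match = cand;
                    break;
                }
            }
            if (match == null) {                 // unknown character: emit as-is
                match = text.substring(i, i + 1);
            }
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("ไทย", "และ", "ภาษา", "ภาษาไทย"));
        // "ภาษาไทย" wins over "ภาษา" because it is the longer match
        System.out.println(segment("ไทยและภาษาไทย", dict, 10)); // [ไทย, และ, ภาษาไทย]
    }
}
```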
Thai text has special rules for where words can be broken. The TCC implementation respects Thai vowel and tone mark placement rules.
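One such rule can be shown with a self-contained sketch (an illustration, not the library's TCC code): Thai dependent vowels and tone marks combine with the preceding consonant, so a word boundary may never fall directly before one of them. The ranges below are the Thai combining marks from the Unicode Thai block:

```java
public class TccBoundarySketch {
    // Thai combining marks (Unicode category Mn) attach to the preceding
    // consonant and therefore cannot start a character cluster:
    // U+0E31, U+0E34–U+0E3A, U+0E47–U+0E4E.
    static boolean canStartCluster(char c) {
        if (c == '\u0E31') return false;                  // mai han akat
        if (c >= '\u0E34' && c <= '\u0E3A') return false; // above/below vowels
        if (c >= '\u0E47' && c <= '\u0E4E') return false; // tone marks and friends
        return true;
    }

    public static void main(String[] args) {
        System.out.println(canStartCluster('ท'));      // true: a consonant can start a cluster
        System.out.println(canStartCluster('\u0E49')); // false: mai tho (tone mark) cannot
    }
}
```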
Apache License 2.0 - See LICENSE file for details.
- PyThaiNLP - Thai NLP library for Python
- Original newmm implementation by Korakot Chaovavanich
Contributions are welcome! Please feel free to submit a Pull Request.