Thai NLP in .NET
- newmm - Dictionary-based maximal matching word segmentation constrained by Thai Character Cluster (TCC) boundaries
- API similar to PyThaiNLP for easy migration from Python
- TCC (Thai Character Cluster) tokenization for breaking text into character clusters
- NumToThaiWord - Convert numbers to Thai text representation
- BahtText - Convert numbers to Thai currency format (Baht and Satang)
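
To make the dictionary-based matching idea behind `newmm` more concrete, here is a minimal sketch of greedy longest-match segmentation. It is a simplification for illustration only, not this library's implementation: it ignores TCC boundaries entirely, and the class `LongestMatchSketch`, its `Segment` method, and the `maxWordLength` parameter are hypothetical names that do not exist in ThaiNLP.NET.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of ThaiNLP.NET.
public static class LongestMatchSketch
{
    // Greedy longest-match: at each position, take the longest dictionary word
    // that starts there. Assumption: no TCC constraint is applied here.
    public static List<string> Segment(string text, ISet<string> dictionary, int maxWordLength)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int matchLength = 0;
            int longest = Math.Min(maxWordLength, text.Length - pos);

            // Try the longest candidate first and shrink until a dictionary word is found.
            for (int len = longest; len >= 1; len--)
            {
                if (dictionary.Contains(text.Substring(pos, len)))
                {
                    matchLength = len;
                    break;
                }
            }

            // Unknown text: emit a single UTF-16 code unit and move on.
            if (matchLength == 0) matchLength = 1;

            tokens.Add(text.Substring(pos, matchLength));
            pos += matchLength;
        }
        return tokens;
    }

    public static void Main()
    {
        var dictionary = new HashSet<string> { "ประเทศ", "ไทย", "มี", "อากาศ", "ดี" };
        var tokens = Segment("ประเทศไทยมีอากาศดี", dictionary, maxWordLength: 6);
        Console.WriteLine(string.Join("|", tokens));
        // prints: ประเทศ|ไทย|มี|อากาศ|ดี
    }
}
```

The fallback that emits a single code unit for unknown text can split a consonant from its vowel or tone mark. Constraining cuts to TCC boundaries, as the `newmm` engine is described as doing, is one way to avoid that.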
Install via the .NET CLI:

    dotnet add package ThaiNLP.NET

Or via Package Manager:

    Install-Package ThaiNLP.NET

Build the project:

    dotnet build

Basic usage:

    using Thainlp;

    // Simple tokenization
    var tokens = WordTokenizer.Tokenize("ประเทศไทยมีอากาศดี");
    // Output: ["ประเทศ", "ไทย", "มี", "อากาศ", "ดี"]

    // With more options (distinct variable name so both calls compile in one scope)
    var tokensWithOptions = WordTokenizer.WordTokenize(
        text: "โอเคบ่พวกเรารักภาษาบ้านเกิด",
        engine: "newmm",
        keepWhitespace: true
    );
    // Output: ["โอเค", "บ่", "พวกเรา", "รัก", "ภาษา", "บ้านเกิด"]

Custom dictionary:

    using Thainlp;
    using System.Collections.Generic;

    // Create custom dictionary
    var customWords = new List<string> { "ชินโซ", "อาเบะ" };
    var customDict = new Trie(customWords);

    // Use with tokenizer
    var tokens = WordTokenizer.WordTokenize(
        "ชินโซ อาเบะ เกิด 21 กันยายน",
        customDict: customDict
    );

TCC tokenization:

    using Thainlp;

    // Tokenize into character clusters
    var clusters = TCC.Segment("ประเทศไทย");
    // Output: ["ป", "ระ", "เท", "ศ", "ไ", "ท", "ย"]

    // Get cluster positions
    var positions = TCC.GetPositions("ประเทศไทย");

TCC through the `Subword` class:

    using Thainlp;

    // Original TCC implementation
    var clusters = Subword.tcc("ประเทศไทย");
    var positions = Subword.tcc_pos("ประเทศไทย");

Number conversion:

    using Thainlp;

    // Convert number to Thai words
    string text = NumToWord.NumToThaiWord(112);
    // Output: หนึ่งร้อยสิบสอง

    string negative = NumToWord.NumToThaiWord(-273);
    // Output: ลบสองร้อยเจ็ดสิบสาม

    // Convert to Thai Baht currency format
    string baht = NumToWord.BahtText(5611116.50);
    // Output: ห้าล้านหกแสนหนึ่งหมื่นหนึ่งพันหนึ่งร้อยสิบหกบาทห้าสิบสตางค์

    string simple = NumToWord.BahtText(116);
    // Output: หนึ่งร้อยสิบหกบาทถ้วน

This library provides an API similar to PyThaiNLP:

| PyThaiNLP | thainlp.net |
|---|---|
| `word_tokenize(text)` | `WordTokenizer.WordTokenize(text)` |
| `word_tokenize(text, engine="newmm")` | `WordTokenizer.WordTokenize(text, engine: "newmm")` |
| `word_tokenize(text, custom_dict=trie)` | `WordTokenizer.WordTokenize(text, customDict: trie)` |
| `word_tokenize(text, keep_whitespace=False)` | `WordTokenizer.WordTokenize(text, keepWhitespace: false)` |
| `num_to_thaiword(number)` | `NumToWord.NumToThaiWord(number)` |
| `bahttext(number)` | `NumToWord.BahtText(number)` |
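
For readers curious what a `num_to_thaiword`-style conversion involves, the following is a rough sketch of digit-by-digit conversion for non-negative integers up to 999,999. It is an illustration under stated assumptions, not the library's `NumToWord` code: the class `ThaiNumberSketch` and its `ToThaiWords` method are hypothetical, and negative numbers, millions, and fractions are deliberately left out.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration; not the library's NumToWord implementation.
public static class ThaiNumberSketch
{
    private static readonly string[] Digits =
        { "ศูนย์", "หนึ่ง", "สอง", "สาม", "สี่", "ห้า", "หก", "เจ็ด", "แปด", "เก้า" };

    // Place names indexed by power of ten: units, tens, hundreds, ..., hundred-thousands.
    private static readonly string[] Places = { "", "สิบ", "ร้อย", "พัน", "หมื่น", "แสน" };

    public static string ToThaiWords(int number)
    {
        // Assumption: this sketch only covers 0..999999.
        if (number < 0 || number > 999999)
            throw new ArgumentOutOfRangeException(nameof(number), "Sketch supports 0..999999 only.");
        if (number == 0) return Digits[0];

        string s = number.ToString(CultureInfo.InvariantCulture);
        var sb = new StringBuilder();
        for (int i = 0; i < s.Length; i++)
        {
            int digit = s[i] - '0';
            int place = s.Length - 1 - i;
            if (digit == 0) continue; // zeros are silent inside a number

            if (place == 1 && digit == 1)
                sb.Append("สิบ");                 // 1 in the tens place reads "สิบ"
            else if (place == 1 && digit == 2)
                sb.Append("ยี่สิบ");               // 2 in the tens place reads "ยี่สิบ"
            else if (place == 0 && digit == 1 && s.Length > 1)
                sb.Append("เอ็ด");                // trailing 1 in a multi-digit number reads "เอ็ด"
            else
                sb.Append(Digits[digit]).Append(Places[place]);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ToThaiWords(112)); // prints: หนึ่งร้อยสิบสอง
        Console.WriteLine(ToThaiWords(21));  // prints: ยี่สิบเอ็ด
    }
}
```

The three special cases (สิบ, ยี่สิบ, เอ็ด) are the irregular readings; every other digit is simply the digit word followed by its place name.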

Run the test suite:

    dotnet test

The project is configured to automatically create GitHub releases and publish to NuGet when a version tag is pushed.

- Create a NuGet API key at nuget.org
- Add the API key as a secret in your GitHub repository settings:
  - Go to Settings → Secrets and variables → Actions
  - Add a new repository secret named `NUGET_API_KEY`
  - Paste your NuGet API key as the value

To release a new version:

- Update the version in `thainlp/Thainlp.csproj`: `<Version>0.1.0</Version>`
- Commit and push your changes: `git commit -am "Bump version to 0.1.0"`, then `git push`
- Create and push a version tag: `git tag v0.1.0`, then `git push origin v0.1.0`

The GitHub Actions workflow will automatically:
- Build the project
- Run tests
- Create the NuGet package
- Create a GitHub release with the package attached
- Publish to NuGet

Every push to any branch triggers the CI workflow, which:
- Builds the project
- Runs tests
- Creates the NuGet package as an artifact (not published)
See LICENSE file for details.