Kagome C++ - Japanese Morphological Analyzer

Sponsored by TwoFive Inc. - Japan's leading email security company, specializing in messaging systems and security solutions.


A modern C++ implementation of the Japanese morphological analyzer kagome, originally written in Go by ikawaha. This implementation uses C++23 features and modern libraries for high-performance Japanese text tokenization and integrates seamlessly with the Rspamd mail processing system.

Features

  • Modern C++23: Uses the latest C++ standard features, including concepts, ranges, and the format library
  • High Performance: Efficient hash tables from unordered_dense
  • Unicode Support: Proper UTF-8/Unicode handling with libicu
  • Multiple Tokenization Modes: Normal, Search, and Extended modes
  • Lattice-based Analysis: Uses Viterbi algorithm for optimal path selection
  • Dictionary Support: System and user dictionary support
  • Memory Efficient: Object pooling for optimal memory usage

Dependencies

  • C++23 compatible compiler (GCC 12+, Clang 15+, or MSVC 2022+)
  • CMake 3.25+
  • libfmt: For advanced string formatting
  • libicu: For Unicode and UTF-8 handling
  • unordered_dense: Modern hash table implementation (automatically fetched)

Ubuntu/Debian Installation

sudo apt update
sudo apt install build-essential cmake pkg-config
sudo apt install libfmt-dev libicu-dev

macOS Installation

brew install cmake fmt icu4c pkg-config
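
On macOS, Homebrew's icu4c is keg-only, so pkg-config and CMake may not find it by default. Exporting its prefix before configuring usually resolves this (paths assume a standard Homebrew layout):

export PKG_CONFIG_PATH="$(brew --prefix icu4c)/lib/pkgconfig:$PKG_CONFIG_PATH"
export CMAKE_PREFIX_PATH="$(brew --prefix icu4c):$CMAKE_PREFIX_PATH"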

Building

# Clone the repository
git clone <repository-url>
cd kagome-cxx

# Create build directory
mkdir build && cd build

# Configure with CMake
cmake ..

# Build
make -j$(nproc)

# Run tests
ctest

# Or run tests directly
./kagome_tests

Usage

Basic Tokenization

#include "kagome/tokenizer/tokenizer.hpp"
#include "kagome/dict/dict.hpp"

int main() {
    // Create dictionary
    auto dict = kagome::dict::factory::create_ipa_dict();
    
    // Create tokenizer
    kagome::tokenizer::TokenizerConfig config;
    config.omit_bos_eos = true;
    kagome::tokenizer::Tokenizer tokenizer(dict, config);
    
    // Tokenize text
    auto tokens = tokenizer.tokenize("すもももももももものうち");
    
    for (const auto& token : tokens) {
        auto features = token.features();
        std::string features_str = features.empty() ? "" : 
            std::accumulate(features.begin() + 1, features.end(), features[0],
                           [](const std::string& a, const std::string& b) {
                               return a + "," + b;
                           });
        
        std::cout << kagome::format("{}\t{}\n", token.surface(), features_str);
    }
    
    return 0;
}
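
To compile this example against an installed copy of the library, a command along these lines should work. The library name kagome and the exact link flags are assumptions; adjust them to match your build (older GCC may need -std=c++2b instead of -std=c++23):

g++ -std=c++23 example.cpp -lkagome -lfmt -licuuc -o example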

Command Line Interface

# Interactive mode
./kagome_main

# Tokenize specific text
./kagome_main "すもももももももものうち"

# Different modes
./kagome_main -m search "関西国際空港"
./kagome_main -m extended "デジカメを買った"

# Wakati mode (surface forms only)
./kagome_main -w "すもももももももものうち"

# JSON output
./kagome_main -j ""

API Examples

Different Tokenization Modes

// Normal mode - regular segmentation
auto tokens = tokenizer.analyze(text, kagome::tokenizer::TokenizeMode::Normal);

// Search mode - additional segmentation for search
auto tokens = tokenizer.analyze(text, kagome::tokenizer::TokenizeMode::Search);

// Extended mode - unigram unknown words
auto tokens = tokenizer.analyze(text, kagome::tokenizer::TokenizeMode::Extended);

Wakati Tokenization

// Get only surface strings
auto wakati_tokens = tokenizer.wakati("すもももももももものうち");
for (const auto& token : wakati_tokens) {
    std::cout << token.surface() << " ";
}

Token Information

for (const auto& token : tokens) {
    std::cout << "Surface: " << token.surface() << "\n";
    std::cout << "POS: ";
    for (const auto& pos : token.pos()) {
        std::cout << pos << " ";
    }
    std::cout << "\n";
    
    if (auto reading = token.reading()) {
        std::cout << "Reading: " << *reading << "\n";
    }
    
    if (auto pronunciation = token.pronunciation()) {
        std::cout << "Pronunciation: " << *pronunciation << "\n";
    }
}

Lattice Visualization

std::ofstream dot_file("lattice.dot");
auto tokens = tokenizer.analyze_graph(dot_file, "私は猫", 
                                     kagome::tokenizer::TokenizeMode::Normal);

// Convert to PNG using Graphviz
// dot -Tpng lattice.dot -o lattice.png

Architecture

Core Components

  1. Tokenizer: Main interface for morphological analysis
  2. Lattice: Viterbi algorithm implementation for optimal path finding (see the sketch after this list)
  3. Dictionary: System and user dictionary management
  4. Token: Morphological unit with features and metadata
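
The lattice is where the analysis happens: candidate morphemes become nodes over the input, and the Viterbi algorithm selects the path with the lowest total cost. Below is a minimal, self-contained sketch of that idea using a hand-built lattice and a flat connection cost; the node and cost names are illustrative and do not reflect the library's internal classes.

// Conceptual Viterbi over a word lattice: nodes carry a word cost, edges carry a
// connection cost, and we pick the cheapest BOS-to-EOS path. Names are illustrative.
#include <cstdint>
#include <iostream>
#include <limits>
#include <string>
#include <vector>

struct Node {
    std::string surface;
    int32_t word_cost;
    std::vector<int> prev;   // indices of nodes that can precede this one
};

int main() {
    // Tiny hand-built lattice: BOS -> { すもも | す + もも } -> EOS
    std::vector<Node> lattice = {
        {"BOS",    0,   {}},
        {"すもも", 100, {0}},
        {"す",     400, {0}},
        {"もも",   150, {2}},
        {"EOS",    0,   {1, 3}},
    };
    const int32_t connection_cost = 10;  // flat connection cost for this sketch

    const int32_t INF = std::numeric_limits<int32_t>::max() / 2;
    std::vector<int32_t> best(lattice.size(), INF);
    std::vector<int> back(lattice.size(), -1);
    best[0] = 0;

    // Nodes are already in topological order, so one forward pass suffices.
    for (std::size_t i = 1; i < lattice.size(); ++i) {
        for (int p : lattice[i].prev) {
            int32_t cost = best[p] + connection_cost + lattice[i].word_cost;
            if (cost < best[i]) { best[i] = cost; back[i] = p; }
        }
    }

    // Recover the cheapest path by following back-pointers from EOS.
    std::vector<std::string> path;
    for (int i = static_cast<int>(lattice.size()) - 1; i != -1; i = back[i]) {
        path.push_back(lattice[i].surface);
    }
    for (auto it = path.rbegin(); it != path.rend(); ++it) {
        std::cout << *it << " ";
    }
    std::cout << "\n";  // prints: BOS すもも EOS
}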

Modern C++ Features Used

  • Concepts: Type constraints for template parameters
  • Ranges: STL ranges for efficient iteration
  • fmt formatting: Type-safe string formatting via libfmt
  • std::optional: Null-safe optional values
  • Smart Pointers: RAII memory management
  • Move Semantics: Efficient resource management
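
A short, standalone snippet in the same style (concepts, ranges, libfmt, std::optional); it is illustrative only and is not taken from the library:

// Joins surface forms with a separator; empty input yields std::nullopt.
#include <concepts>
#include <optional>
#include <ranges>
#include <string>
#include <string_view>
#include <vector>
#include <fmt/core.h>

// Concept constraining elements to string-like types.
template <typename T>
concept SurfaceLike = std::convertible_to<T, std::string_view>;

template <std::ranges::input_range R>
    requires SurfaceLike<std::ranges::range_value_t<R>>
std::optional<std::string> join_surfaces(const R& surfaces, std::string_view sep = " ") {
    std::string out;
    for (const auto& s : surfaces) {
        if (!out.empty()) out += sep;
        out += std::string_view{s};
    }
    return out.empty() ? std::nullopt : std::optional<std::string>{std::move(out)};
}

int main() {
    std::vector<std::string> surfaces{"すもも", "も", "もも", "も", "もも", "の", "うち"};
    if (auto joined = join_surfaces(surfaces)) {
        fmt::print("{}\n", *joined);  // type-safe formatting via libfmt
    }
}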

Hash Tables

Uses unordered_dense, which offers:

  • Lookups up to roughly 4x faster than std::unordered_map
  • Lower memory usage
  • Better cache performance
  • The same API as the standard containers
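
Because the API matches the standard containers, usage is a drop-in swap; a minimal sketch (the identifiers here are illustrative, not the library's own):

#include <cstdint>
#include <iostream>
#include <string>
#include <ankerl/unordered_dense.h>

int main() {
    // Same interface as std::unordered_map, but a dense, cache-friendly layout.
    ankerl::unordered_dense::map<std::string, std::uint32_t> word_ids;
    word_ids.emplace("すもも", 1);
    word_ids.emplace("もも", 2);

    if (auto it = word_ids.find("もも"); it != word_ids.end()) {
        std::cout << "id: " << it->second << "\n";
    }
}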

Unicode Support

Leverages libicu for:

  • Proper UTF-8 handling
  • Character classification (Hiragana, Katakana, Kanji, etc.)
  • Script detection
  • Unicode normalization
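
As a rough sketch of the kind of classification ICU enables, the following standalone snippet detects the script (Hiragana, Katakana, Han, etc.) of each code point in a UTF-8 string. This is plain ICU usage, not the library's internal API.

// Per-code-point script detection with ICU (link with ICU's common library, e.g. -licuuc).
#include <iostream>
#include <string>
#include <unicode/unistr.h>
#include <unicode/uscript.h>

int main() {
    icu::UnicodeString text = icu::UnicodeString::fromUTF8("漢字とカタカナとひらがな");

    for (int32_t i = 0; i < text.length(); i = text.moveIndex32(i, 1)) {
        UChar32 c = text.char32At(i);
        UErrorCode status = U_ZERO_ERROR;
        UScriptCode script = uscript_getScript(c, &status);
        if (U_SUCCESS(status)) {
            std::string utf8;
            icu::UnicodeString(c).toUTF8String(utf8);
            std::cout << utf8 << " -> " << uscript_getName(script) << "\n";
        }
    }
}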

Rspamd Integration

This library provides a shared library plugin for the Rspamd mail processing system, enabling Japanese text tokenization for spam detection and email classification.

Quick Setup

# Build the plugin
mkdir build && cd build
cmake ..
make -j4

# Install dictionary (place next to the library)
cp ../data/ipa/ipa.dict ./ipa.dict

# Configure Rspamd
echo 'custom_tokenizers {
    kagome {
        enabled = true;
        path = "/path/to/kagome_rspamd_tokenizer.so";
        priority = 60.0;
    }
}' >> /etc/rspamd/local.d/options.inc

# Restart Rspamd
sudo systemctl restart rspamd

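Assuming a standard Rspamd installation, you can verify that the configuration still parses cleanly after the change:

rspamadm configtest
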
For detailed integration instructions, see RSPAMD_INTEGRATION.md.

Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.

Project Sponsors: See SPONSORS.md for our amazing sponsors who make this project possible.

Quick Start for Contributors:

  1. Follow modern C++ best practices (C++23 features encouraged)
  2. Maintain API compatibility with the original Go version
  3. Add comprehensive tests for new features
  4. Document public interfaces with examples
  5. Ensure CI passes (see GITHUB_ACTIONS.md for CI details)

Development Workflow:

# Fork and clone the repository
git clone https://github.com/YOUR_USERNAME/kagome-cxx.git
cd kagome-cxx

# Create feature branch
git checkout -b feature/your-feature-name

# Build and test
mkdir build && cd build
cmake ..
make -j$(nproc)
ctest

# Submit pull request

License

Licensed under the Apache License, Version 2.0. See LICENSE file for details.

This project is based on the original kagome project by ikawaha.

Sponsors

This project is proudly sponsored by:

TwoFive Inc. - Japan's leading email security company. TwoFive specializes in messaging systems, email security solutions, and infrastructure consulting. As experts in large-scale email systems and Japanese text processing, they recognize the importance of advanced Japanese tokenization for email security and spam detection.

Acknowledgments

  • Sponsorship: TwoFive Inc. for supporting Japanese email security innovation
  • Original Implementation: kagome project by ikawaha
  • Performance Libraries: unordered_dense by martinus
  • Formatting: fmtlib formatting library
  • Unicode Support: ICU Unicode library
