Sponsored by TwoFive Inc. - Japan's leading email security company, specializing in messaging systems and security solutions.
A modern C++ implementation of the Japanese morphological analyzer kagome, originally written in Go by ikawaha. This implementation uses C++23 features and modern libraries for high-performance Japanese text tokenization and provides seamless integration with Rspamd mail processing system.
- Modern C++23: Uses the latest C++ standard features including concepts, ranges, and format library
- High Performance: Efficient hash tables from unordered_dense
- Unicode Support: Proper UTF-8/Unicode handling with libicu
- Multiple Tokenization Modes: Normal, Search, and Extended modes
- Lattice-based Analysis: Uses Viterbi algorithm for optimal path selection
- Dictionary Support: System and user dictionary support
- Memory Efficient: Object pooling for optimal memory usage
- C++23 compatible compiler (GCC 12+, Clang 15+, or MSVC 2022+)
- CMake 3.25+
- libfmt: For advanced string formatting
- libicu: For Unicode and UTF-8 handling
- unordered_dense: Modern hash table implementation (automatically fetched)
sudo apt update
sudo apt install build-essential cmake pkg-config
sudo apt install libfmt-dev libicu-dev
brew install cmake fmt icu4c pkg-config
# Clone the repository
git clone <repository-url>
cd kagome-cxx
# Create build directory
mkdir build && cd build
# Configure with CMake
cmake ..
# Build
make -j$(nproc)
# Run tests
ctest
# Or run tests directly
./kagome_tests
#include "kagome/tokenizer/tokenizer.hpp"
#include "kagome/dict/dict.hpp"
int main() {
// Create dictionary
auto dict = kagome::dict::factory::create_ipa_dict();
// Create tokenizer
kagome::tokenizer::TokenizerConfig config;
config.omit_bos_eos = true;
kagome::tokenizer::Tokenizer tokenizer(dict, config);
// Tokenize text
auto tokens = tokenizer.tokenize("すもももももももものうち");
for (const auto& token : tokens) {
auto features = token.features();
std::string features_str = features.empty() ? "" :
std::accumulate(features.begin() + 1, features.end(), features[0],
[](const std::string& a, const std::string& b) {
return a + "," + b;
});
std::cout << kagome::format("{}\t{}\n", token.surface(), features_str);
}
return 0;
}
# Interactive mode
./kagome_main
# Tokenize specific text
./kagome_main "すもももももももものうち"
# Different modes
./kagome_main -m search "関西国際空港"
./kagome_main -m extended "デジカメを買った"
# Wakati mode (surface forms only)
./kagome_main -w "すもももももももものうち"
# JSON output
./kagome_main -j "猫"
// Normal mode - regular segmentation
auto tokens = tokenizer.analyze(text, kagome::tokenizer::TokenizeMode::Normal);
// Search mode - additional segmentation for search
auto tokens = tokenizer.analyze(text, kagome::tokenizer::TokenizeMode::Search);
// Extended mode - unigram unknown words
auto tokens = tokenizer.analyze(text, kagome::tokenizer::TokenizeMode::Extended);
// Get only surface strings
auto wakati_tokens = tokenizer.wakati("すもももももももものうち");
for (const auto& token : wakati_tokens) {
std::cout << token.surface() << " ";
}
for (const auto& token : tokens) {
std::cout << "Surface: " << token.surface() << "\n";
std::cout << "POS: ";
for (const auto& pos : token.pos()) {
std::cout << pos << " ";
}
std::cout << "\n";
if (auto reading = token.reading()) {
std::cout << "Reading: " << *reading << "\n";
}
if (auto pronunciation = token.pronunciation()) {
std::cout << "Pronunciation: " << *pronunciation << "\n";
}
}
std::ofstream dot_file("lattice.dot");
auto tokens = tokenizer.analyze_graph(dot_file, "私は猫",
kagome::tokenizer::TokenizeMode::Normal);
// Convert to PNG using Graphviz
// dot -Tpng lattice.dot -o lattice.png
- Tokenizer: Main interface for morphological analysis
- Lattice: Viterbi algorithm implementation for optimal path finding
- Dictionary: System and user dictionary management
- Token: Morphological unit with features and metadata
- Concepts: Type constraints for template parameters
- Ranges: STL ranges for efficient iteration
- fmt formatting: Type-safe string formatting via libfmt
- std::optional: Null-safe optional values
- Smart Pointers: RAII memory management
- Move Semantics: Efficient resource management
Uses unordered_dense for:
- 4x faster than std::unordered_map
- Lower memory usage
- Better cache performance
- Same API as standard containers
Leverages libicu for:
- Proper UTF-8 handling
- Character classification (Hiragana, Katakana, Kanji, etc.)
- Script detection
- Unicode normalization
This library provides a shared library plugin for Rspamd mail processing system, enabling Japanese text tokenization for spam detection and email classification.
# Build the plugin
mkdir build && cd build
cmake ..
make -j4
# Install dictionary (place next to the library)
cp ../data/ipa/ipa.dict ./ipa.dict
# Configure Rspamd
echo 'custom_tokenizers {
kagome {
enabled = true;
path = "/path/to/kagome_rspamd_tokenizer.so";
priority = 60.0;
}
}' >> /etc/rspamd/local.d/options.inc
# Restart Rspamd
sudo systemctl restart rspamd
For detailed integration instructions, see RSPAMD_INTEGRATION.md.
We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines.
Project Sponsors: See SPONSORS.md for our amazing sponsors who make this project possible.
Quick Start for Contributors:
- Follow modern C++ best practices (C++23 features encouraged)
- Maintain API compatibility with the original Go version
- Add comprehensive tests for new features
- Document public interfaces with examples
- Ensure CI passes (see GITHUB_ACTIONS.md for CI details)
Development Workflow:
# Fork and clone the repository
git clone https://github.com/YOUR_USERNAME/kagome-cxx.git
cd kagome-cxx
# Create feature branch
git checkout -b feature/your-feature-name
# Build and test
mkdir build && cd build
cmake ..
make -j$(nproc)
ctest
# Submit pull request
Licensed under the Apache License, Version 2.0. See LICENSE file for details.
This project is based on the original kagome project by ikawaha.
This project is proudly sponsored by:
Japan's leading email security company - TwoFive specializes in messaging systems, email security solutions, and infrastructure consulting. As experts in large-scale email systems and Japanese text processing, they recognize the importance of advanced Japanese tokenization for email security and spam detection.
- Sponsorship: TwoFive Inc. for supporting Japanese email security innovation
- Original Implementation: kagome project by ikawaha
- Performance Libraries: unordered_dense by martinus
- Formatting: fmtlib formatting library
- Unicode Support: ICU Unicode library