Durak is a high-performance Turkish NLP toolkit built on a "Rust Core, Python Interface" architecture. Heavy lifting (normalization, tokenization, lemmatization) runs in compiled Rust, releasing the GIL for true parallelism, while providing a flexible, PyTorch-like API for Python researchers.
- Rust-Powered: Blazing fast text processing with zero-overhead resource embedding
- True Parallelism: GIL-released operations for multi-core batch processing
- Zero-Dependency Distribution: Resources compiled directly into binary
- Research-Ready: Type-safe, reproducible, easy to integrate
┌──────────────────────────────────────┐
│ Python Interface (python/durak/) │ ← Your code here
│ Pipeline • StopwordManager • API │
├──────────────────────────────────────┤
│ Rust Core (src/lib.rs) │ ← Performance critical
│ Tokenization • Normalization │
│ Embedded Resources (include_str!) │
└──────────────────────────────────────┘
See ARCHITECTURE.md for detailed design documentation.
pip install durak-nlpfrom durak import process_text
entries = [
"Türkiye'de NLP zor. Durak kolaylaştırır.",
"Ankara ' da kaldım.",
]
tokens = [
process_text(
entry,
remove_stopwords=True,
rejoin_suffixes=True, # glue detached suffixes before filtering
)
for entry in entries
]
print(tokens[0])
# ["türkiye'de", "nlp", "zor", ".", "durak", "kolaylaştırır", "."]
print(tokens[1])
# ["ankara'da", "kaldım", "."]The pipeline executes: clean → tokenize → rejoin detached suffixes → remove stopwords
from durak import (
StopwordManager,
attach_detached_suffixes,
clean_text,
remove_stopwords,
tokenize,
)
text = "İstanbul ' a vapurla geçtik."
cleaned = clean_text(text)
tokens = tokenize(cleaned)
tokens = attach_detached_suffixes(tokens)
# Custom stopword management
manager = StopwordManager(additions=["vapurla"], keep=["istanbul'a"])
filtered = remove_stopwords(tokens, manager=manager)
print(filtered)
# ["istanbul'a", "geçtik", "."]from durak import _durak_core
# High-performance functions (5-10x faster than Python)
normalized = _durak_core.fast_normalize("İSTANBUL") # "istanbul"
tokens = _durak_core.tokenize_with_offsets("Merhaba dünya!")
# Embedded resources (no file I/O!)
stopwords = _durak_core.get_stopwords_base() # 100-1000x faster loading
suffixes = _durak_core.get_detached_suffixes()- Unicode-aware cleaning: Turkish-specific normalization (İ/ı, I/i handling)
- Configurable stopword management: Keep-lists, custom additions, domain-specific sets
- Regex-based tokenizer: Preserves Turkish morphology (clitics, suffixes, apostrophes)
- Offset tracking: Character-accurate positions for NER and span tasks
- Embedded resources: Zero file I/O, compiled directly into binary
- Type-safe: Complete
.pyistubs for IDE support and static analysis - Tiered lemmatization: Dictionary lookup + heuristic fallback with performance metrics
Durak provides a high-performance lemmatizer with three strategies:
from durak import Lemmatizer
# Strategy options: "lookup", "heuristic", "hybrid"
lemmatizer = Lemmatizer(strategy="hybrid")
lemmatizer("kitaplar") # "kitap"
lemmatizer("geliyorum") # "gel"
lemmatizer("evlerimde") # "ev"lookup: Dictionary-only (fastest, high precision, fails on OOV words)heuristic: Suffix stripping (handles OOV, may over-strip)hybrid: Try lookup first, fallback to heuristic (recommended)
Enable metrics to analyze lemmatization behavior:
lemmatizer = Lemmatizer(strategy="hybrid", collect_metrics=True)
for word in large_corpus:
lemma = lemmatizer(word)
print(lemmatizer.get_metrics())
# Lemmatizer Metrics:
# Total Calls: 10,000
# Lookup Hits: 7,234 (72.3%)
# Heuristic Fallbacks: 2,766
# Avg Call Time: 0.042ms
# Lookup Time: 0.038s
# Heuristic Time: 0.004sMetrics include:
- Call counts (lookup hits/misses, heuristic fallbacks)
- Timing breakdown (per-strategy latency)
- Cache hit rate
- Average call time
Use cases:
- Compare strategies on your corpus
- Debug lemmatization issues
- Optimize production pipelines
- Report research performance
See examples/lemmatizer_metrics.py for strategy comparison examples.
Control heuristic quality with root validation:
lemmatizer = Lemmatizer(
strategy="hybrid",
validate_roots=True, # Enable validation
strict_validation=True, # Require roots in dictionary
min_root_length=3, # Minimum 3 characters
)
lemmatizer("kitaplardan") # Only strips if root ≥3 chars and valid# Clone the repository
git clone https://github.com/fbkaragoz/durak.git
cd durak
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Build and install with Rust extension
pip install maturin
maturin develop # or: maturin develop --release for optimized build
# Install dev dependencies
pip install -e .[dev]# Run all tests
pytest
# With coverage
pytest --cov=durak --cov-report=html
# Type checking
mypy python
# Linting
ruff check .durak/
├── src/ # Rust source (engine)
│ └── lib.rs
├── python/ # Python source (interface)
│ └── durak/
├── resources/ # Static data files
│ └── tr/ # Turkish resources
├── tests/ # Integration tests
├── benchmarks/ # Performance benchmarks
├── examples/ # Usage examples
└── docs/ # Documentation
- Architecture Guide: Design principles and component architecture
- Examples: Basic and advanced usage demonstrations
- Benchmarks: Performance comparison and optimization tips
- API Design Docs: Detailed component specifications
- Changelog: Version history and migration guides
- Roadmap: Future enhancements and planned features
- Code of Conduct: CODE_OF_CONDUCT.md
- Contributing: CONTRIBUTING.md
- Security: SECURITY.md
- Citation: CITATION.cff
- Issues: GitHub Issues
Topics: turkish-nlp, nlp, rust, pyo3, maturin, tokenization, lemmatization, text-processing
Rust-accelerated functions provide significant speedups:
- Normalization: 5-10x faster than pure Python
- Tokenization: 3-5x faster with offset tracking
- Resource Loading: 100-1000x faster (embedded, no file I/O)
- Full Pipeline: 2-4x overall speedup
Run python benchmarks/benchmark_rust_vs_python.py to measure on your system.
Durak is distributed under the Durak License v1.2. Commercial or institutional use requires explicit written permission from the author.
Homepage: karagoz.io Repository: github.com/fbkaragoz/durak