
# DAT Building: One-Time vs Every-Time - Detailed Explanation

## Overview

DAT (Double-Array Trie) Building is the process of converting a text-based vocabulary (JSON/list) into an optimized binary format that enables ultra-fast tokenization.


## The Building Process

### What Happens During DAT Building?

1. **Trie Construction (Step 1)**
   - Converts each vocabulary token into a tree structure
   - Each character/byte becomes a node in the tree
   - Common prefixes share the same path (e.g., "apple" and "apply" share "appl")

2. **Array Packing (Step 2 - The Expensive Part)**
   - Uses a "first-fit" algorithm that scans for the first array offset where all of a node's children fit without collisions (see the sketch after this list)
   - Compresses the tree into 3 parallel arrays: `base`, `check`, `values`
   - This is computationally expensive: O(n×m), where n is the vocabulary size and m is the average token length

3. **Binary Serialization (Step 3)**
   - Writes the arrays to a `.dat` binary file
   - Format: `[MAGIC|VERSION|SIZE|BASE_ARRAY|CHECK_ARRAY|VALUES_ARRAY]`
   - Enables memory-mapping for instant zero-copy loading
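
For intuition about why the build is expensive while lookups stay fast, here is a minimal, self-contained sketch of steps 1 and 2: the general double-array trie technique with first-fit packing. It is illustrative only, not the actual `DATBuilder` implementation; `build_trie`, `pack`, and the fixed array size are inventions of this example.

```python
# Illustrative double-array trie build + first-fit packing (NOT the real
# DATBuilder; function names and the fixed array size are assumptions).

def build_trie(tokens):
    """Step 1: dict-of-dicts trie. Node 0 is the root."""
    children = [{}]        # node_id -> {byte: child_node_id}
    terminals = {}         # node_id -> token_id for tokens ending there
    for tok_id, tok in enumerate(tokens):
        node = 0
        for b in tok.encode("utf-8"):
            if b not in children[node]:
                children.append({})
                children[node][b] = len(children) - 1
            node = children[node][b]
        terminals[node] = tok_id
    return children, terminals

def pack(children, terminals, size=1 << 16):
    """Step 2: first-fit packing into parallel base/check/values arrays."""
    base = [0] * size
    check = [-1] * size            # -1 marks a free slot
    values = [-1] * size
    slot = {0: 0}                  # trie node -> array index; root at 0
    check[0] = 0
    # Place parents before children (BFS order).
    queue = [0]
    for node in queue:
        queue.extend(children[node].values())
    for node in queue:
        if node in terminals:
            values[slot[node]] = terminals[node]
        kids = sorted(children[node])
        if not kids:
            continue
        b = 1                      # first-fit: smallest base with no collision
        while any(check[b + c] != -1 for c in kids):
            b += 1
        base[slot[node]] = b
        for c in kids:
            check[b + c] = slot[node]      # check points back at the parent
            slot[children[node][c]] = b + c
    return base, check, values

def lookup(base, check, values, token):
    """Follow base/check transitions: one array hop per byte, no hashing."""
    node = 0
    for b in token.encode("utf-8"):
        nxt = base[node] + b
        if check[nxt] != node:
            return -1              # no such transition -> unknown token
        node = nxt
    return values[node]            # -1 if the path exists but isn't a token

tokens = ["app", "apple", "apply"]
base, check, values = pack(*build_trie(tokens))
assert lookup(base, check, values, "apply") == 2
assert lookup(base, check, values, "appl") == -1
```

The first-fit scan is what drives the O(n×m) build cost: every node's children must be probed against already-occupied slots, while a lookup only follows one `base`/`check` hop per byte.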

## Performance Cost

| Vocabulary Size | Build Time | DAT File Size |
|-----------------|------------|---------------|
| 367 tokens      | ~38 ms     | 5 KB          |
| 5,000 tokens    | ~26 s      | 143 KB        |
| 50,000 tokens   | ~5-10 min  | ~1.5 MB       |

## One-Time vs Every-Time

### ✅ CORRECT APPROACH: One-Time Build + Cache

**Build Once:**

- Run `compile_profiles.py` during:
  - Package development
  - First-time user setup
  - The CI/CD pipeline

**Cache Forever:**

- Save `.dat` files to `~/.cache/xerv/crayon/profiles/`
- OR distribute pre-built `.dat` files with the package
- Users never rebuild unless the vocabulary changes

**Runtime:**

```python
# This should be INSTANT (just mmap)
vocab = CrayonVocab.load_profile("code")  # <1 ms to load the .dat
tokens = vocab.tokenize(text)             # 10M+ tokens/sec
```
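
To make "just mmap" concrete, here is a rough sketch of writing and loading the `[MAGIC|VERSION|SIZE|BASE_ARRAY|CHECK_ARRAY|VALUES_ARRAY]` layout described above. The magic bytes, field widths, and int32 element type are illustrative assumptions, not the actual `.dat` format.

```python
# Hypothetical .dat serializer/loader (magic value, header layout, and
# dtype are assumptions; the real Crayon format may differ).
import mmap
import struct

import numpy as np

MAGIC, VERSION = b"CDAT", 1

def save_dat(path, base, check, values):
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(struct.pack("<II", VERSION, len(base)))
        for arr in (base, check, values):
            f.write(np.asarray(arr, dtype=np.int32).tobytes())

def load_dat(path):
    # Zero-copy: each array is a view into the mapped file. Nothing is
    # parsed or copied, which is why loading stays well under 1 ms.
    with open(path, "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    assert buf[:4] == MAGIC
    _version, size = struct.unpack_from("<II", buf, 4)
    offset, arrays = 12, []
    for _ in range(3):
        arrays.append(np.frombuffer(buf, dtype=np.int32, count=size, offset=offset))
        offset += 4 * size
    return tuple(arrays)  # (base, check, values)
```

Note that the load cost is independent of vocabulary size, which is what makes the one-time build pay off.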

### ❌ INCORRECT APPROACH: Build Every Time

```python
# BAD: building from JSON on every import
builder = DATBuilder()
builder.build(vocab)  # takes ~26 seconds for a 5k vocab!
```

This would make the library unusable: per the table above, every import would stall for seconds to minutes depending on vocabulary size.


## Current Implementation Status

### What Works ✅

1. **DATBuilder** (`src/crayon/c_ext/dat_builder.py`)
   - ✅ Compiles vocab to DAT format
   - ✅ Saves binary files

2. **CrayonVocab.load_profile()** (`src/crayon/core/vocabulary.py`)
   - ✅ Checks for a cached `.dat` file first
   - ✅ Falls back to `.json` if no `.dat` is found
   - ✅ Calls `build_and_cache_profile()` if neither exists (see the sketch after this list)

3. **C++ Engine** (`src/crayon/c_ext/engine.cpp`)
   - ✅ Memory-maps `.dat` files via the Python buffer protocol
   - ✅ Zero-copy instant loading (<1 ms)
   - ✅ AVX2 SIMD tokenization (10M+ tok/sec)
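
The lookup order in item 2 boils down to the following cache-first logic. This is a sketch only: `_load_dat` and `_build_from_json` are hypothetical helper names, not real Crayon APIs, and the real method may differ.

```python
# Sketch of the cache-first fallback described above; _load_dat and
# _build_from_json are hypothetical helpers, not part of the real codebase.
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "xerv" / "crayon" / "profiles"

def load_profile(name: str):
    dat_path = CACHE_DIR / f"{name}.dat"
    if dat_path.exists():
        return _load_dat(dat_path)          # fast path: mmap, <1 ms
    json_path = CACHE_DIR / f"{name}.json"
    if json_path.exists():
        return _build_from_json(json_path)  # slow path: full DAT build
    return build_and_cache_profile(name)    # build once, cache for next time
```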

### What's Missing ⚠️

1. **Pre-built `.dat` files are not distributed**
   - Currently, `.dat` files must be built manually via `compile_profiles.py`
   - They should be included in the package or built during `pip install`

2. **Vocabulary files are not in the cache**
   - `trained_vocab_*.json` files exist in the project root
   - They are not automatically copied to `~/.cache/xerv/crayon/profiles/`
   - `build_and_cache_profile()` should handle this

3. **`decode()` method is missing**
   - README examples show `vocab.decode(tokens)`
   - The method doesn't exist in the `CrayonVocab` class (a minimal version is sketched below)
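
A minimal `decode()` would just invert tokenization by mapping ids back to token strings. The sketch below assumes an id-to-token list on the vocab object; that attribute is an assumption, not existing `CrayonVocab` API.

```python
# Hypothetical decode(); assumes self.id_to_token (an id -> string list),
# which does NOT exist in the current CrayonVocab class.
def decode(self, token_ids: list[int]) -> str:
    return "".join(self.id_to_token[i] for i in token_ids)
```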

## Recommended Workflow

### For Package Developers

```bash
# 1. Train vocabularies (already done - trained_vocab_*.json exist)
python train_vocab.py

# 2. Compile to DAT format
python compile_profiles.py

# 3. Distribute .dat files with the package
#    - Include them in MANIFEST.in
#    - Copy them to the package installation directory
```
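
Step 3 could be wired up with setuptools roughly like this (a sketch; the actual package layout and profile directory are assumptions):

```python
# Hypothetical packaging config for shipping pre-built .dat files; the
# real layout may differ. MANIFEST.in would need a matching line, e.g.:
#   include src/crayon/profiles/*.dat
from setuptools import find_packages, setup

setup(
    name="crayon",
    packages=find_packages(where="src"),
    package_dir={"": "src"},
    package_data={"crayon": ["profiles/*.dat"]},  # bundle the binaries
    include_package_data=True,
)
```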

### For End Users

```python
# Should just work (instant load from the cached .dat)
from crayon import CrayonVocab

vocab = CrayonVocab.load_profile("code")  # <1 ms
```

## Summary

| Aspect                   | Answer                              |
|--------------------------|-------------------------------------|
| One-time or every-time?  | **One-time** per vocabulary version |
| Who builds?              | Developer OR first-time user setup  |
| Build frequency?         | Only when the vocabulary changes    |
| Runtime cost?            | <1 ms (just mmap, no rebuild)       |
| User experience?         | Instant, zero compilation delay     |

The DAT file is like a compiled binary: you compile your source code once, then distribute or cache the binary for instant execution.