How to Troubleshoot Tokenizer Discovery Issues

This guide provides step-by-step solutions for common tokenizer discovery and download problems in BitNet-rs neural network inference.

Quick Diagnostic Commands

Before diving into specific issues, use these commands to gather diagnostic information:

# Check model compatibility and metadata
cargo run -p bitnet-cli -- compat-check model.gguf

# Verify xtask functionality
cargo run -p xtask -- verify --model model.gguf --allow-mock

# Test inference with verbose logging
RUST_LOG=bitnet_tokenizers=debug cargo run -p xtask -- infer \
    --model model.gguf \
    --prompt "Test" \
    --auto-download

Common Issues and Solutions

1. "No compatible tokenizer found" Error

Symptoms:

Error: No compatible tokenizer found for llama model with vocab_size 128256 (strict mode)

Cause: Model architecture or vocabulary size doesn't match known patterns, and strict mode prevents fallback.

Solutions:

Option A: Disable Strict Mode (Development Only)

# Allow mock tokenizer fallback
unset BITNET_STRICT_TOKENIZERS
cargo run -p xtask -- infer --model model.gguf --prompt "Test" --auto-download

Option B: Provide Explicit Tokenizer

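# Download compatible tokenizer manually
# --fail stops curl from saving an HTTP error page as tokenizer.json.
# meta-llama repositories are gated, so the request may also need an access token header.
curl -L --fail "https://huggingface.co/meta-llama/Meta-Llama-3-8B/resolve/main/tokenizer.json" \
    -o tokenizer.json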

# Use explicit tokenizer path
cargo run -p xtask -- infer \
    --model model.gguf \
    --tokenizer tokenizer.json \
    --prompt "Test"

Option C: Fix Model Metadata

# Check what metadata the model has
cargo run -p bitnet-cli -- inspect-gguf model.gguf

# Look for these metadata keys:
# - general.architecture
# - tokenizer.ggml.vocab_size
# - llama.vocab_size
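
Before digging into metadata keys, it can help to rule out a truncated or mislabeled file. The sketch below checks only the four-byte GGUF magic at the start of the file. The helper name is_gguf is hypothetical and not part of BitNet-rs, and passing this check says nothing about whether the keys listed above are present.

```bash
# Hypothetical helper (not part of BitNet-rs): succeed if the file starts
# with the 4-byte "GGUF" magic, fail otherwise.
is_gguf() {
    [ -r "$1" ] || return 1
    # Read exactly the first four bytes; strip NULs so the shell can compare them.
    [ "$(head -c 4 "$1" | tr -d '\0')" = "GGUF" ]
}
```

For example, `is_gguf model.gguf || echo "model.gguf does not look like a GGUF file"` flags an obviously wrong file before any cargo command runs.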

2. Download Failures

Symptoms:

Error: HTTP error 404: https://huggingface.co/nonexistent/repo/resolve/main/tokenizer.json
Error: HTTP request failed for https://huggingface.co/...: timeout

Solutions:

Network Issues

# Test network connectivity
curl -I "https://huggingface.co"

# Use offline mode with cached tokenizers
export BITNET_OFFLINE=1
cargo run -p xtask -- infer --model model.gguf --prompt "Test"

Repository Issues

# Verify the repository exists
curl -I "https://huggingface.co/meta-llama/Llama-2-7b-hf"

# Check available files
curl -s "https://huggingface.co/api/models/meta-llama/Llama-2-7b-hf" | grep -o '"[^"]*tokenizer[^"]*"'

# Use alternative repository
cargo run -p xtask -- infer \
    --model model.gguf \
    --tokenizer "https://huggingface.co/alternative-repo/resolve/main/tokenizer.json" \
    --prompt "Test"
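
When a download fails, the HTTP status code usually narrows down the cause faster than the error text. The following is a minimal sketch with two hypothetical helpers, http_status and explain_status, neither of which is part of BitNet-rs. The mapping from status code to cause is a rule of thumb, not a description of how BitNet-rs classifies errors.

```bash
# Hypothetical triage helpers; not part of BitNet-rs.

# Print the final HTTP status code for a URL, following redirects.
# curl prints 000 when it gets no HTTP response at all.
http_status() {
    curl -s -L -o /dev/null -w '%{http_code}' "$1"
}

# Map a status code to a likely cause (rule of thumb only).
explain_status() {
    case "$1" in
        200)     echo "ok: file is reachable" ;;
        401|403) echo "auth: repository may be gated or private" ;;
        404)     echo "missing: wrong repository name, branch, or file path" ;;
        429)     echo "rate-limited: wait and retry" ;;
        5??)     echo "server error: retry later" ;;
        000)     echo "no response: DNS, proxy, or connectivity problem" ;;
        *)       echo "unexpected status: $1" ;;
    esac
}
```

A typical call chains the two, for example `explain_status "$(http_status "https://huggingface.co/meta-llama/Llama-2-7b-hf/resolve/main/tokenizer.json")"`.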

Timeout Issues

# Increase timeout and retry
export BITNET_DOWNLOAD_TIMEOUT=600  # 10 minutes
cargo run -p xtask -- infer \
    --model model.gguf \
    --prompt "Test" \
    --auto-download
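
If timeouts are intermittent, wrapping the command in a retry loop is sometimes enough. This is a generic sketch of retry with exponential backoff. The retry function is hypothetical, it is not a BitNet-rs feature, and it retries on any non-zero exit status, not only on network errors.

```bash
# Hypothetical helper: run a command up to MAX_ATTEMPTS times, doubling the
# delay (in seconds) after each failure. Returns the last exit status.
retry() {
    local max_attempts="$1" delay="$2" attempt=1 status
    shift 2
    while true; do
        status=0
        "$@" || status=$?
        [ "$status" -eq 0 ] && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return "$status"
        fi
        echo "attempt $attempt failed (exit $status); retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}
```

For example, `retry 3 5 cargo run -p xtask -- infer --model model.gguf --prompt "Test" --auto-download` makes up to three attempts, waiting 5 and then 10 seconds between them.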

3. Cache Corruption Issues

Symptoms:

Error: Invalid JSON in downloaded tokenizer: expected `,` or `}` at line 1 column 42
Error: Downloaded tokenizer file is empty

Solutions:

Clear and Re-download

# Clear specific tokenizer cache
rm -rf ~/.cache/bitnet/tokenizers/llama2-32k/

# Clear all tokenizer caches
rm -rf ~/.cache/bitnet/tokenizers/

# Re-download
cargo run -p xtask -- infer \
    --model model.gguf \
    --prompt "Test" \
    --auto-download

Verify Cache Integrity

# Check cached files
find ~/.cache/bitnet/tokenizers/ -name "*.json" -exec sh -c '
    echo "Checking: $1"
    jq empty "$1" 2>/dev/null && echo "✓ Valid JSON" || echo "✗ Invalid JSON"
' _ {} \;

# Manually validate specific tokenizer
jq empty ~/.cache/bitnet/tokenizers/llama2-32k/tokenizer.json
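
One caveat with the checks above: jq treats an empty file as zero JSON values and exits successfully, so an empty download can pass as valid JSON. The sketch below tests for a missing or empty file before running jq. The helper name check_tokenizer_file and its output labels are hypothetical.

```bash
# Hypothetical helper: classify a cached tokenizer file.
# Prints one of: missing, empty, jq-not-found, invalid-json, ok.
check_tokenizer_file() {
    if [ ! -f "$1" ]; then
        echo "missing"; return 1
    fi
    # -s is true only for files with a size greater than zero.
    if [ ! -s "$1" ]; then
        echo "empty"; return 1
    fi
    if ! command -v jq >/dev/null 2>&1; then
        echo "jq-not-found"; return 2
    fi
    if ! jq empty "$1" >/dev/null 2>&1; then
        echo "invalid-json"; return 1
    fi
    echo "ok"
}
```

For example, `check_tokenizer_file ~/.cache/bitnet/tokenizers/llama2-32k/tokenizer.json` prints a single label that tells you whether to clear the cache entry or look elsewhere.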

Resume Interrupted Downloads

# BitNet-rs automatically resumes partial downloads
# If resume fails, clear the partial file:
rm -f ~/.cache/bitnet/tokenizers/*/tokenizer.json.partial

# Then retry download
cargo run -p xtask -- infer \
    --model model.gguf \
    --prompt "Test" \
    --auto-download

4. Vocabulary Size Mismatches

Symptoms:

Warning: Vocabulary size mismatch: expected 32000, got 50257
Model: 128256, Tokenizer: 50257

Causes and Solutions:

Incorrect Model Type Detection

# Check detected model type
RUST_LOG=bitnet_tokenizers::discovery=debug cargo run -p xtask -- infer \
    --model model.gguf \
    --prompt "Test" \
    --auto-download 2>&1 | grep "model type"

# If incorrectly detected as "gpt2" instead of "llama":
# This is a model metadata issue - check GGUF general.architecture

Wrong Repository Mapping

# Check what repository is being used
RUST_LOG=bitnet_tokenizers::discovery=debug cargo run -p xtask -- infer \
    --model model.gguf \
    --prompt "Test" \
    --auto-download 2>&1 | grep "download compatible tokenizer"

# Use explicit tokenizer for correct vocabulary size
wget "https://huggingface.co/meta-llama/Meta-Llama-3-8B/resolve/main/tokenizer.json"
cargo run -p xtask -- infer \
    --model model.gguf \
    --tokenizer tokenizer.json \
    --prompt "Test"
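
The two numbers in a mismatch warning often identify the tokenizer families involved. The lookup below is an illustrative heuristic based on the well-known vocabulary sizes of Llama 2, Llama 3, and GPT-2. It is not the table BitNet-rs uses for discovery, the function name is hypothetical, and other models can share these sizes.

```bash
# Illustrative heuristic only; not BitNet-rs's actual discovery mapping.
# Other model families may reuse the same vocabulary sizes.
guess_tokenizer_family() {
    case "$1" in
        32000)  echo "llama-2 style (SentencePiece, 32k)" ;;
        128256) echo "llama-3 style (BPE, 128k)" ;;
        50257)  echo "gpt2 style (BPE, 50k)" ;;
        *)      echo "unknown" ;;
    esac
}
```

For the warning "Model: 128256, Tokenizer: 50257", `guess_tokenizer_family 128256` and `guess_tokenizer_family 50257` suggest a Llama 3 model paired with a GPT-2 tokenizer. That pattern points at wrong model type detection or repository mapping, not at a corrupted file.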

5. Permission and File System Issues

Symptoms:

Error: Permission denied (os error 13)
Error: No space left on device (os error 28)

Solutions:

Permission Issues

# Check cache directory permissions
ls -la ~/.cache/bitnet/

# Fix permissions
mkdir -p ~/.cache/bitnet/tokenizers
chmod 755 ~/.cache/bitnet/tokenizers

# Use custom cache directory
export BITNET_CACHE_DIR="/tmp/bitnet-cache"
mkdir -p "$BITNET_CACHE_DIR"
cargo run -p xtask -- infer \
    --model model.gguf \
    --prompt "Test" \
    --auto-download

Disk Space Issues

# Check available space
df -h ~/.cache/bitnet/

# Clear old caches
du -sh ~/.cache/bitnet/tokenizers/*
rm -rf ~/.cache/bitnet/tokenizers/old-unused-*

# Use temporary cache location
export BITNET_CACHE_DIR="/tmp/bitnet-temp-cache"
cargo run -p xtask -- infer \
    --model model.gguf \
    --prompt "Test" \
    --auto-download
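
Both failure modes in this section can be caught before a download starts. The following sketch checks that a cache directory is writable and that its filesystem has a minimum amount of free space. The function name cache_dir_ok and the threshold in the example are assumptions for illustration, not BitNet-rs requirements.

```bash
# Hypothetical pre-flight check: is DIR writable, and does its filesystem
# have at least MIN_KB kilobytes free?
cache_dir_ok() {
    local dir="$1" min_kb="$2" free_kb
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        echo "not writable: $dir" >&2
        return 1
    fi
    # -P keeps each filesystem on one line; column 4 is available space in KB.
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    if [ "$free_kb" -lt "$min_kb" ]; then
        echo "low disk space: ${free_kb} KB free, ${min_kb} KB required" >&2
        return 1
    fi
    return 0
}
```

For example, `cache_dir_ok "$HOME/.cache/bitnet/tokenizers" 102400 || export BITNET_CACHE_DIR="/tmp/bitnet-cache"` switches to the custom cache location when the default has less than roughly 100 MB free.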

6. GPU vs CPU Performance Issues

Symptoms:

Warning: Large vocabulary (128256) running on CPU - consider GPU acceleration
GPU operations: 0, CPU operations: 1000, GPU efficiency: 0.0%

Solutions:

Enable GPU Features

# Build with GPU support
cargo build --no-default-features --features gpu

# Run inference with GPU
cargo run -p xtask -- infer \
    --model model.gguf \
    --prompt "Test" \
    --auto-download \
    --no-default-features --features gpu

Verify CUDA Setup

# Check CUDA installation
nvcc --version
nvidia-smi

# Test CUDA functionality
cargo test -p bitnet-kernels --no-default-features --features gpu gpu_smoke_test

Fallback to CPU Gracefully

# Force CPU-only inference for large vocabularies
BITNET_FORCE_CPU=1 cargo run -p xtask -- infer \
    --model model.gguf \
    --prompt "Test" \
    --auto-download \
    --no-default-features --features cpu

Advanced Troubleshooting

Environment Variable Configuration

Create a troubleshooting environment:

# Create debug configuration
cat > debug-tokenizer.env << 'EOF'
RUST_LOG=bitnet_tokenizers=debug,bitnet_models=info
BITNET_DETERMINISTIC=1
BITNET_SEED=42
BITNET_STRICT_TOKENIZERS=0
BITNET_OFFLINE=0
EOF

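# Export every variable the file defines, then test
# (plain `source` sets them in the shell without passing them on to cargo)
set -a
source debug-tokenizer.env
set +a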
cargo run -p xtask -- infer --model model.gguf --prompt "Debug test" --auto-download

Cross-Validation Testing

# Test tokenizer compatibility with cross-validation
export BITNET_GGUF="path/to/model.gguf"
cargo run -p xtask -- crossval --tokenizer-only

# Compare tokenizer outputs
cargo test -p bitnet-tokenizers --no-default-features --features cpu test_tokenizer_contract

Performance Profiling

# Profile tokenizer discovery performance
cargo build --release --no-default-features --features cpu
time cargo run --release -p xtask -- infer \
    --model model.gguf \
    --prompt "Performance test" \
    --auto-download

# Monitor memory usage
/usr/bin/time -v cargo run -p xtask -- infer \
    --model model.gguf \
    --prompt "Memory test" \
    --auto-download

Manual Tokenizer Testing

# Test tokenizer functionality independently
cargo test -p bitnet-tokenizers --no-default-features --features cpu

# Test specific tokenizer types
cargo test -p bitnet-tokenizers --no-default-features --features "cpu,spm" test_sentencepiece_tokenizer_contract

# Test with real fixtures
SPM_MODEL=tests/fixtures/spm/tiny.model \
cargo test -p bitnet-tokenizers --no-default-features --features "cpu,spm,integration-tests"

Prevention Strategies

1. Pre-deployment Validation

# Validate all models in your deployment
for model in models/*.gguf; do
    echo "Validating: $model"
    cargo run -p xtask -- verify --model "$model" --allow-mock || echo "❌ Failed: $model"
done

# Pre-download all required tokenizers
cargo run -p xtask -- pre-download-tokenizers --model-dir models/

2. Production Monitoring

# Health check script
cat > tokenizer-health-check.sh << 'EOF'
#!/bin/bash
set -e

export BITNET_STRICT_TOKENIZERS=1
export BITNET_DETERMINISTIC=1

for model in models/*.gguf; do
    if ! cargo run -p xtask -- infer \
        --model "$model" \
        --prompt "Health check" \
        --max-tokens 1 >/dev/null 2>&1; then
        echo "❌ Health check failed: $model"
        exit 1
    fi
done

echo "✅ All models healthy"
EOF

chmod +x tokenizer-health-check.sh
./tokenizer-health-check.sh

3. Cache Management

# Automated cache cleanup
cat > cleanup-tokenizer-cache.sh << 'EOF'
#!/bin/bash

CACHE_DIR="$HOME/.cache/bitnet/tokenizers"
MAX_AGE_DAYS=30

# Nothing to clean if the cache has never been created
[ -d "$CACHE_DIR" ] || exit 0

# Remove caches older than MAX_AGE_DAYS
find "$CACHE_DIR" -type f -mtime +$MAX_AGE_DAYS -delete

# Keep only the most recent 10 cached tokenizers
ls -t "$CACHE_DIR" | tail -n +11 | xargs -I {} rm -rf "$CACHE_DIR/{}"

echo "Cache cleanup complete"
EOF

chmod +x cleanup-tokenizer-cache.sh

# Run weekly
(crontab -l 2>/dev/null; echo "0 2 * * 0 /path/to/cleanup-tokenizer-cache.sh") | crontab -

Getting Help

If you're still experiencing issues:

Check GitHub Issues: BitNet-rs Issues
Enable Debug Logging: RUST_LOG=debug for comprehensive logs

Create Minimal Reproduction:

# Share this information in bug reports
cargo --version
rustc --version
echo "Model: $(file model.gguf)"
echo "Environment: $(env | grep BITNET_)"

Community Support: Join the discussion in GitHub Discussions

Remember: most tokenizer issues come down to configuration or environment, so start with the environment variables, cache state, and model metadata covered above.
