feat: add fasttokens benchmarks and tokenizer backend docs

biswapanda · biswapanda · commit e29c7c5dc5e2 · 2026-03-14T19:44:54.000-07:00
Benchmarks:
- Rename benches/tokenizer.rs to benches/tokenizer_simple.rs and add
  Criterion benchmarks comparing fasttokens with HF on encode and
  batch encode
- Add benches/tokenizer_dataset.rs: dataset-driven benchmark using
  LongBench-v2 (503 real-world samples), sequential and batched modes
  with correctness verification (~24x sequential, ~27x batched
  speedup over HF)

Docs:
- docs/components/frontend/tokenizer-backends.md: user guide with
  configuration, compatibility notes, and benchmark results
- docs/components/frontend/configuration.md: add Tokenizer section
- docs/index.yml: add Tokenizer Backends page under Frontend
diff --git a/docs/components/frontend/configuration.md b/docs/components/frontend/configuration.md
@@ -91,6 +91,12 @@ See the [Frontend Guide](frontend-guide.md) for KServe message formats and integ
 | `--metrics-prefix` | `DYN_METRICS_PREFIX` | `dynamo_frontend` | Prefix for frontend Prometheus metrics |
 | `--dump-config-to` | `DYN_DUMP_CONFIG_TO` | — | Dump resolved config to file path |
 
+## Tokenizer
+
+| CLI Argument | Env Var | Default | Description |
+|-------------|---------|---------|-------------|
+| `--dyn-tokenizer-backend` | `DYN_TOKENIZER_BACKEND` | `default` | Tokenizer backend: `default` (HuggingFace) or `fasttokens` (the `fastokens` crate, for high-performance BPE encoding). See [Tokenizer Backends](tokenizer-backends.md). |
+
 ## Experimental
 
 | CLI Argument | Env Var | Default | Description |
diff --git a/docs/components/frontend/tokenizer-backends.md b/docs/components/frontend/tokenizer-backends.md
@@ -0,0 +1,55 @@
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Tokenizer Backends
+---
+
+The Dynamo Frontend supports multiple tokenizer backends for BPE-based models. The backend controls how input text is tokenized before being sent to the inference engine.
+
+## Tokenizer Backends
+
+#### `default` HuggingFace Tokenizers
+
+The default backend uses the Rust [HuggingFace `tokenizers`](https://github.com/huggingface/tokenizers) library.
+It supports the features found in `tokenizer.json` files, including normalizers, pre-tokenizers, post-processors, decoders, added tokens with special-token flags, and byte-fallback.
+
+#### `fasttokens` High-Performance BPE Encoding
+
+The `fasttokens` backend uses the [`fastokens`](https://github.com/Atero-ai/fastokens) crate, a purpose-built BPE encoder optimized for throughput.
+It is a _hybrid_ backend: encoding uses `fastokens`, while decoding is delegated to HuggingFace so that incremental detokenization, byte-fallback, and special-token handling work correctly.
+
+Use this backend when tokenization is a measurable bottleneck, for example on high-concurrency prefill-heavy workloads.
+
+#### Compatibility notes
+
+- Works with standard BPE `tokenizer.json` files (Qwen, LLaMA, GPT-family, Mistral, DeepSeek, etc.).
+- If `fastokens` cannot load a particular tokenizer file, the frontend logs a warning and transparently falls back to HuggingFace; requests are never dropped.
+- Has no effect on TikToken-format tokenizers (`.model` / `.tiktoken` files), which always use the TikToken backend.
+
+## Configuration
+
+Set the backend with a CLI flag or environment variable. The CLI flag takes precedence.
+
+| CLI Argument | Env Var | Valid values | Default |
+|---|---|---|---|
+| `--dyn-tokenizer-backend` | `DYN_TOKENIZER_BACKEND` | `default`, `fasttokens` | `default` |
+
+**Examples:**
+
+```bash
+# CLI flag
+python -m dynamo.frontend --dyn-tokenizer-backend fasttokens
+
+# Environment variable
+export DYN_TOKENIZER_BACKEND=fasttokens
+python -m dynamo.frontend
+```
+
+## Dynamo Frontend Behavior
+
+When `DYN_TOKENIZER_BACKEND=fasttokens` is set:
+
+1. The frontend passes the environment variable to the Rust runtime.
+2. When building the tokenizer for a model, `ModelDeploymentCard::tokenizer()` attempts to load `fastokens::Tokenizer` from the same `tokenizer.json` file.
+3. If loading succeeds, a hybrid `FastTokenizer` is created that encodes with `fastokens` and decodes with HuggingFace.
+4. If loading fails (unsupported tokenizer features, missing file, etc.), the frontend logs a warning and falls back to the standard HuggingFace backend; no operator intervention is needed.
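
Reviewer note: the precedence rule and the fallback steps documented above can be sketched in a few lines of Rust. This is only an illustration under stated assumptions, not Dynamo's actual code: `Backend`, `resolve_backend`, `Tokenizer`, `build_tokenizer`, and the `load_fast` closure are hypothetical names, and treating an unrecognized value as `default` is an assumption the docs do not confirm.

```rust
/// Backend values accepted by `--dyn-tokenizer-backend` / `DYN_TOKENIZER_BACKEND`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Default,
    FastTokens,
}

/// Hypothetical helper: the CLI value takes precedence over the env value.
/// Assumption: a missing or unrecognized value resolves to `Backend::Default`.
fn resolve_backend(cli: Option<&str>, env: Option<&str>) -> Backend {
    match cli.or(env) {
        Some("fasttokens") => Backend::FastTokens,
        _ => Backend::Default,
    }
}

/// Which tokenizer the frontend ends up with for a model.
#[derive(Debug, PartialEq, Eq)]
enum Tokenizer {
    /// Encode and decode with HuggingFace.
    HuggingFace,
    /// Encode with fastokens, decode with HuggingFace.
    Hybrid,
}

/// `load_fast` stands in for loading `fastokens::Tokenizer` from `tokenizer.json`.
/// On failure, log a warning and fall back instead of returning an error.
fn build_tokenizer<F>(backend: Backend, load_fast: F) -> Tokenizer
where
    F: FnOnce() -> Result<(), String>,
{
    if backend == Backend::FastTokens {
        match load_fast() {
            Ok(()) => return Tokenizer::Hybrid,
            Err(e) => eprintln!(
                "warning: fastokens load failed ({}); falling back to HuggingFace",
                e
            ),
        }
    }
    Tokenizer::HuggingFace
}
```

In this sketch, `resolve_backend(None, Some("fasttokens"))` selects the fast backend, and `build_tokenizer` still returns the HuggingFace tokenizer when `load_fast` fails, which mirrors the "no operator intervention" behavior in step 4.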
diff --git a/docs/index.yml b/docs/index.yml
@@ -200,6 +200,8 @@ navigation:
         contents:
           - page: Frontend Guide
             path: components/frontend/frontend-guide.md
+          - page: Tokenizer Backends
+            path: components/frontend/tokenizer-backends.md
       - section: Router
         path: components/router/README.md
         contents:
diff --git a/lib/llm/Cargo.toml b/lib/llm/Cargo.toml
@@ -30,7 +30,11 @@ bench = ["dynamo-kv-router/bench"]
 kv-router-stress = ["dep:clap", "dep:indicatif", "bench"]
 
 [[bench]]
-name = "tokenizer"
+name = "tokenizer_simple"
+harness = false
+
+[[bench]]
+name = "tokenizer_dataset"
 harness = false
 
 [[bench]]
diff --git a/lib/llm/benches/tokenizer_dataset.rs b/lib/llm/benches/tokenizer_dataset.rs
diff --git a/lib/llm/benches/tokenizer_simple.rs b/lib/llm/benches/tokenizer_simple.rs