feat: integrate fastokens BPE tokenizer backend #7387
biswapanda wants to merge 15 commits into `main`
Conversation
Add the fastokens crate (v0.1.0 from github.com/Atero-ai/fastokens) as an always-on workspace dependency for high-performance BPE encoding.

Core integration:
- lib/llm/src/tokenizers/fast.rs: hybrid FastTokenizer that encodes with fastokens and decodes with HuggingFace, with 4 unit tests
- lib/llm/src/model_card.rs: tokenizer() checks the DYN_TOKENIZER_BACKEND=fasttokens env var and falls back to HuggingFace on load failure

Frontend CLI:
- --dyn-tokenizer-backend flag / DYN_TOKENIZER_BACKEND env var with values "default" (HuggingFace) or "fasttokens"
Walkthrough
This pull request introduces support for a new "fasttokens" tokenizer backend. Changes include: adding the fastokens workspace dependency from a Git repository, exposing CLI configuration for backend selection, implementing a hybrid FastTokenizer struct that uses fastokens for encoding and HuggingFaceTokenizer for decoding, and providing test data for validation.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 6
🧹 Nitpick comments (1)
lib/llm/src/tokenizers/fast.rs (1)
125-143: Make the decode-stream test assert emitted text, not just absence of errors. Right now this only proves `step()` doesn't fail. Empty or duplicated chunks would still pass. Please accumulate the returned chunks and compare them with a reference decode for the continuation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/llm/src/tokenizers/fast.rs` around lines 125 - 143, The test test_fast_with_decode_stream currently only checks that stream.step() doesn't error; change it to accumulate the emitted chunks from wrapper.decode_stream(&prompt_ids, true) by collecting each step(...) return (concatenate non-empty chunks) into a single string, then obtain the expected text by decoding the continuation (e.g. via wrapper.decode(cont_ids) or wrapper.decode(continuation)) and assert equality (or assert that the accumulated string contains the expected continuation); update references in the test around FastTokenizer::from_file, TokenizerWrapper::from, wrapper.decode_stream, and stream.step to perform this accumulation and comparison instead of a no-op loop.
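The accumulate-and-compare shape the prompt asks for can be sketched generically. This is a hypothetical stand-in, not the PR's actual test: the real version would feed the results of `stream.step(...)` into the accumulator and compare against a reference `wrapper.decode(...)` of the continuation.

```rust
/// Concatenate the non-empty chunks a decode stream emits so a test can
/// compare the accumulated text with a reference decode, instead of only
/// checking that each step succeeded.
fn accumulate_chunks<I: IntoIterator<Item = Option<String>>>(chunks: I) -> String {
    chunks
        .into_iter()
        .flatten() // drop `None` steps (no text emitted yet)
        .filter(|c| !c.is_empty())
        .collect()
}

fn main() {
    // Stand-in for successive stream.step() results on the continuation tokens.
    let steps = vec![
        None,
        Some("Hello".to_string()),
        Some(", world".to_string()),
        None,
    ];
    let accumulated = accumulate_chunks(steps);
    // Reference decode of the same continuation (hypothetical value).
    assert_eq!(accumulated, "Hello, world");
    println!("{accumulated}");
}
```

An empty or duplicated-chunk regression now fails the equality assertion rather than passing silently.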
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Cargo.toml`:
- Line 49: The fastokens Git dependency is floating and needs an immutable ref;
update the Cargo.toml dependency line for fastokens (the entry fastokens = { git
= "https://github.com/Atero-ai/fastokens", version = "0.1.0" }) to include a rev
(or tag) field with the specific commit hash or tag you want pinned (e.g., rev =
"COMMIT_SHA") so the manifest is reproducible; ensure you use the exact commit
SHA or tag from the fastokens repo and run cargo update -p fastokens if needed
to regenerate the lockfile.
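A pinned manifest entry would look roughly like the following (sketch only; "COMMIT_SHA" is a placeholder for the actual commit hash or tag chosen from the fastokens repo):

```toml
[workspace.dependencies]
fastokens = { git = "https://github.com/Atero-ai/fastokens", rev = "COMMIT_SHA", version = "0.1.0" }
```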
In `@components/src/dynamo/frontend/frontend_args.py`:
- Around line 429-440: The CLI currently allows any string for the
--dyn-tokenizer-backend argument; update the add_argument call that defines
flag_name="--dyn-tokenizer-backend" (dest="tokenizer_backend") to restrict
values to the supported set by adding an explicit choices constraint (e.g.,
["default","fasttokens"]) so parsing fails fast for invalid CLI/env values and
documents the accepted options in the help text.
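As a minimal sketch of the suggested constraint, using plain argparse (the real `add_argument` helper in frontend_args.py takes a group and a `flag_name` keyword, so the call site differs):

```python
import argparse

parser = argparse.ArgumentParser()
# `choices` makes argparse reject anything outside the supported set,
# so a typo such as "fastokens" vs "fasttokens" fails at parse time
# instead of silently selecting the default backend.
parser.add_argument(
    "--dyn-tokenizer-backend",
    dest="tokenizer_backend",
    choices=["default", "fasttokens"],
    default="default",
    help="Tokenizer backend: 'default' (HuggingFace) or 'fasttokens'.",
)

args = parser.parse_args(["--dyn-tokenizer-backend", "fasttokens"])
print(args.tokenizer_backend)  # → fasttokens
```

An invalid value exits with an "invalid choice" error, which also documents the accepted options in `--help` output.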
In `@components/src/dynamo/frontend/main.py`:
- Around line 168-169: The current logic only sets
os.environ["DYN_TOKENIZER_BACKEND"] = "fasttokens" when config.tokenizer_backend
== "fasttokens" and never clears it; update the branch around
config.tokenizer_backend in main.py so that when config.tokenizer_backend ==
"default" you remove/unset DYN_TOKENIZER_BACKEND (e.g., pop or del from
os.environ if present), keep setting it for "fasttokens" as before, and ensure
no stale environment value remains when the default backend is chosen.
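The set-or-clear behavior described above can be sketched as follows (the helper name is illustrative, not the actual main.py code):

```python
import os

def apply_tokenizer_backend(backend: str) -> None:
    """Set DYN_TOKENIZER_BACKEND for 'fasttokens'; otherwise remove it
    so no stale value from a previous run leaks into the default path."""
    if backend == "fasttokens":
        os.environ["DYN_TOKENIZER_BACKEND"] = "fasttokens"
    else:
        # pop with a default is a no-op when the variable is absent
        os.environ.pop("DYN_TOKENIZER_BACKEND", None)

# Simulate a stale value left over from a prior invocation.
os.environ["DYN_TOKENIZER_BACKEND"] = "fasttokens"
apply_tokenizer_backend("default")
print("DYN_TOKENIZER_BACKEND" in os.environ)  # → False
```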
In `@lib/bindings/python/pyproject.toml`:
- Around line 25-26: Uncomment and replace the outdated dict license entries in
pyproject.toml by adding explicit SPDX and license-files fields: remove the
commented lines and add license = "Apache-2.0" and license-files = ["LICENSE"]
so the project uses the PEP 639-compatible string expression and explicit
license file declaration (update the existing commented/old keys related to
license and license-files).
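The PEP 639-style declaration would look roughly like this fragment (sketch; assumes the license file is named LICENSE at the project root):

```toml
[project]
license = "Apache-2.0"
license-files = ["LICENSE"]
```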
In `@lib/llm/src/model_card.rs`:
- Around line 384-386: The current logic silently treats any non-"fasttokens"
DYN_TOKENIZER_BACKEND as the default behavior; change the handling to explicitly
match allowed values: if the env var is "fasttokens" set use_fast = true, if it
is "default" (or an explicit "slow"/"rust" token value you support) set use_fast
= false, and for any other value emit a clear warning or return an error so
misconfiguration is visible; update the code that reads DYN_TOKENIZER_BACKEND
(the use_fast assignment) to perform a match on the string and log/propagate an
error on unsupported values rather than silently falling back.
- Around line 394-399: When attempting the fast tokenizer, don't call
p.to_str().ok_or_else(...) which returns early on non-UTF-8 paths and prevents
the HuggingFace fallback; instead check p.to_str() with an if let/ match and
only call crate::tokenizers::FastTokenizer::from_file(path_str) when to_str()
returns Some. If to_str() is None, skip the fast-tokenizer attempt (do not
return an error) so the existing HF loader/fallback logic can run; also when
FastTokenizer::from_file fails, allow the code to continue to the HuggingFace
fallback rather than short-circuiting.
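The explicit-match idea from the first model_card.rs finding can be sketched with a small helper; the function name and error type are illustrative, not the actual model_card.rs code:

```rust
/// Decide whether to try the fast tokenizer backend, failing loudly on
/// unsupported values instead of silently falling back to the default.
fn use_fast_backend(value: Option<&str>) -> Result<bool, String> {
    match value {
        // Unset or explicitly "default": use the HuggingFace path.
        None | Some("default") => Ok(false),
        Some("fasttokens") => Ok(true),
        // Any other value is a misconfiguration; surface it.
        Some(other) => Err(format!(
            "unsupported DYN_TOKENIZER_BACKEND value: {other:?} \
             (expected \"default\" or \"fasttokens\")"
        )),
    }
}

fn main() {
    assert_eq!(use_fast_backend(None), Ok(false));
    assert_eq!(use_fast_backend(Some("fasttokens")), Ok(true));
    assert!(use_fast_backend(Some("typo")).is_err());
    println!("ok");
}
```

The caller would read `std::env::var("DYN_TOKENIZER_BACKEND")` and pass the result in; a non-UTF-8 tokenizer path would simply skip the fast attempt (as the second finding suggests) rather than returning early.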
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 14f36a69-69af-4531-bb2b-0dfeeb99d059
⛔ Files ignored due to path filters (2)
`Cargo.lock` is excluded by `!**/*.lock`
`lib/bindings/python/Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (9)
Cargo.toml
components/src/dynamo/frontend/frontend_args.py
components/src/dynamo/frontend/main.py
lib/bindings/python/pyproject.toml
lib/llm/Cargo.toml
lib/llm/src/model_card.rs
lib/llm/src/tokenizers.rs
lib/llm/src/tokenizers/fast.rs
lib/llm/tests/data/sample-models/minimal-bpe/tokenizer.json
add_argument(
    g,
    flag_name="--dyn-tokenizer-backend",
Could this be just `tokenizer`, or is `backend` needed?
Also @jh-nv, I see we mix the `dyn` prefix and no `dyn` prefix. Did we decide on a common naming?
Agreed, `backend` can be dropped.
@jh-nv The DYN_ prefix appears consistently on the env var side, but the --dyn- prefix is missing from most CLI flags. OK to drop the dyn- prefix as well to keep it shorter: `--tokenizer`.
My initial thought was to use --dyn for CLI args and DYN_ in env vars for consistency.
Fixed: updated the arg to `--tokenizer` and the env var to `DYN_TOKENIZER`.
Can you update the docs PR to reflect this?
Sounds good. I'll rebase the docs PR; I was waiting for this PR to stabilize.
There are cargo-deny-related CI failures due to upstream transitive dependencies, and a PR for fastokens is being reviewed by the Crusoe team: Atero-ai/fastokens#5.

Issue: the fastokens crate (v0.1.0) declares hf-hub = "0.4.3" with default features, which pulls in native-tls and openssl-sys. Dynamo's deny.toml explicitly bans both crates (lines 63-64), causing cargo-deny to fail in CI.

The dependency chain is: fastokens -> hf-hub (default features) -> ureq -> native-tls -> openssl-sys
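For reference, the deny.toml entries that trigger this failure would look roughly like the following (a sketch of cargo-deny's `[bans]` syntax; the actual file and line numbers 63-64 are as described above):

```toml
[bans]
deny = [
    # banned because they pull in a native OpenSSL build
    { name = "native-tls" },
    { name = "openssl-sys" },
]
```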
Cargo.toml (outdated)
dynamo-kv-router = { path = "lib/kv-router", version = "1.0.0", features = ["metrics"] }
dynamo-async-openai = { path = "lib/async-openai", version = "1.0.0", features = ["byot"] }
dynamo-parsers = { path = "lib/parsers", version = "1.0.0" }
fastokens = { git = "https://github.com/biswapanda/fastokens", rev = "aed8b9f3ba024c689c72985d62506dfa79daec25", version = "0.1.0" }
Shall we create a fork under ai-dynamo? cc @saturley-hall @dmitry-tokarev-nv
Yes, let's create a fork/mirror of upstream https://github.com/Atero-ai/fastokens.
There is a cargo-deny issue with a fastokens transitive dependency: #7387 (comment).
I have a PR open for the Crusoe team to upstream this change: Atero-ai/fastokens#5.
I've pointed the dependency at my repo with the fix in the meantime as a workaround; the fastokens team will review the PR by IST morning.
Summary by CodeRabbit

New Features
- `--dyn-tokenizer-backend` command-line option to select different tokenizer backends.

Tests