feat: Normalizer::normalize_str — skip NormalizedString allocation by ArthurZucker · Pull Request #2020 · huggingface/tokenizers

ArthurZucker · 2026-04-10T14:16:08Z

Summary

Adds a normalize_str(&str) -> Result<String> method to the Normalizer trait that produces the normalized output without allocating a full NormalizedString.

Problem

NormalizedString::from(s) allocates:

- original: String (clone of input)
- normalized: String (another clone)
- alignments: Vec<(usize, usize)> (one entry per byte)

For callers that only need the normalized string (like add_tokens building the normalized cache, or Python's normalize_str), this is pure overhead.
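To make the cost concrete, here is a minimal sketch of the three allocations listed above. It uses a simplified stand-in type, not the crate's real NormalizedString: the field names follow this description, while the alignment layout and the heap_bytes helper are assumptions made for illustration.

```rust
/// Simplified stand-in for the crate's `NormalizedString`.
/// Assumption: field names follow the PR description; the rest is illustrative.
pub struct NormalizedString {
    pub original: String,
    pub normalized: String,
    pub alignments: Vec<(usize, usize)>,
}

impl From<&str> for NormalizedString {
    fn from(s: &str) -> Self {
        // One (start, end) pair per input byte, pointing at the char that
        // the byte belongs to.
        let alignments: Vec<(usize, usize)> = s
            .char_indices()
            .flat_map(|(start, c)| {
                let len = c.len_utf8();
                (0..len).map(move |_| (start, start + len))
            })
            .collect();
        NormalizedString {
            original: s.to_owned(),   // allocation 1: clone of the input
            normalized: s.to_owned(), // allocation 2: second clone
            alignments,               // allocation 3: one entry per byte
        }
    }
}

/// Hypothetical helper: payload bytes held on the heap by the stand-in.
pub fn heap_bytes(n: &NormalizedString) -> usize {
    n.original.len()
        + n.normalized.len()
        + n.alignments.len() * std::mem::size_of::<(usize, usize)>()
}
```

On a 64-bit target each alignment entry is 16 bytes, so under these assumptions the alignment vector alone is 16 times the size of the input string.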

Solution

- Default normalize_str on the trait: falls back to NormalizedString for normalizers that haven't opted in yet
- Lowercase: s.to_lowercase(), zero intermediate allocations
- ByteLevel: direct byte→char mapping into a pre-allocated String
- Sequence: chains normalize_str calls without intermediate NormalizedString
- NormalizerWrapper: forwards to the concrete normalizer's fast path
- Python binding: normalize_str now calls the trait method directly
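The shape of the change can be sketched as follows. This is a self-contained model, not the crate's code: NormalizedString is reduced to two fields, the Result alias is a local assumption, and Trim is a hypothetical normalizer that exists only to exercise the default fallback.

```rust
type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

/// Reduced stand-in for the crate's `NormalizedString` (no alignments here).
pub struct NormalizedString {
    pub original: String,
    pub normalized: String,
}

pub trait Normalizer {
    fn normalize(&self, normalized: &mut NormalizedString) -> Result<()>;

    /// Default: build a full NormalizedString, normalize it, keep only the string.
    fn normalize_str(&self, s: &str) -> Result<String> {
        let mut n = NormalizedString {
            original: s.to_owned(),
            normalized: s.to_owned(),
        };
        self.normalize(&mut n)?;
        Ok(n.normalized)
    }
}

pub struct Lowercase;

impl Normalizer for Lowercase {
    fn normalize(&self, n: &mut NormalizedString) -> Result<()> {
        n.normalized = n.normalized.to_lowercase();
        Ok(())
    }

    /// Fast path: no NormalizedString is built.
    fn normalize_str(&self, s: &str) -> Result<String> {
        Ok(s.to_lowercase())
    }
}

/// Hypothetical normalizer that has not opted in; it uses the default fallback.
pub struct Trim;

impl Normalizer for Trim {
    fn normalize(&self, n: &mut NormalizedString) -> Result<()> {
        n.normalized = n.normalized.trim().to_owned();
        Ok(())
    }
}

pub struct Sequence {
    pub normalizers: Vec<Box<dyn Normalizer>>,
}

impl Normalizer for Sequence {
    fn normalize(&self, n: &mut NormalizedString) -> Result<()> {
        for step in &self.normalizers {
            step.normalize(n)?;
        }
        Ok(())
    }

    /// Chains the string-only path of each step.
    fn normalize_str(&self, s: &str) -> Result<String> {
        let mut current = s.to_owned();
        for step in &self.normalizers {
            current = step.normalize_str(&current)?;
        }
        Ok(current)
    }
}
```

In this model a Sequence of Trim and Lowercase still works end to end: Trim goes through the fallback, Lowercase through its override, and the caller only ever sees a String.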

Follow-ups

Other normalizers (NFC, NFD, NFKC, NFKD, BertNormalizer, etc.) still fall back to the default. They can be optimized individually — the trait method is ready.

…ization

Add a default normalize_str(&str) -> Result<String> method to the Normalizer trait that produces the normalized output without allocating a full NormalizedString (which carries original + normalized + alignment vectors: 3 allocations + O(n) alignment entries per call).

Specialized fast paths:
- Lowercase: direct s.to_lowercase(), no NormalizedString
- ByteLevel: direct byte→char mapping into a pre-allocated String
- Sequence: chains normalize_str calls without intermediate NormalizedString
- NormalizerWrapper: forwards to the concrete normalizer's fast path

Python binding updated to use normalize_str directly.

All other normalizers fall back to the default implementation, which still allocates a NormalizedString. They can be optimized individually in follow-ups.

HuggingFaceDocBuilderDev · 2026-04-10T14:19:10Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Fast paths (no NormalizedString allocation):
- Lowercase: s.to_lowercase()
- ByteLevel: pre-encoded UTF-8 table lookup
- NFD/NFKD/NFC/NFKC: direct unicode_normalization iterator
- Nmt: inline filter + map over chars
- Strip: trim_start/trim_end
- StripAccents: filter combining marks
- Prepend: format!
- Sequence: chain normalize_str calls

Still using default fallback (NormalizedString):
- BertNormalizer (complex multi-step logic)
- Replace (regex-based, needs NormalizedString::transform)
- Precompiled (sentencepiece precompiled charsmap)
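Two of the simpler fast paths listed in this commit message can be sketched with the standard library alone. The function names, the strip_left/strip_right parameters, and the choice to leave empty input untouched in the Prepend case are assumptions for illustration, not the PR's actual code.

```rust
/// Sketch of a Strip fast path: trim whitespace on the requested sides only.
/// `strip_left` / `strip_right` are assumed flag names.
pub fn strip_str(s: &str, strip_left: bool, strip_right: bool) -> String {
    let s = if strip_left { s.trim_start() } else { s };
    let s = if strip_right { s.trim_end() } else { s };
    s.to_owned()
}

/// Sketch of a Prepend fast path: a single format! call.
/// Assumption: empty input is returned unchanged rather than prefixed.
pub fn prepend_str(prefix: &str, s: &str) -> String {
    if s.is_empty() {
        String::new()
    } else {
        format!("{}{}", prefix, s)
    }
}
```

Both helpers produce the output string in one allocation and never touch alignment data, which is the property the fast paths rely on.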

…llocation

Both `add_tokens` and `refresh_normalized_tokens` were building a full NormalizedString (with alignment vectors) just to extract the final string. Call `Normalizer::normalize_str` instead.

Measured on added_vocab_deserialize vs main:
- non-special 100k + nfkc: 275.3ms -> 246.6ms (-10.4%)
- non-special 400k + nfkc: 1128.7ms -> 991.4ms (-12.2%)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…_fast

When encode_fast is called (OffsetType::None), the normalization step now uses normalize_str + set_normalized instead of the full NormalizedString::normalize, which builds per-byte alignment vectors.

Changes:
- NormalizedString::set_normalized(): replace normalized content with trivial 1:1 alignments (enough for splitting, no real offset mapping)
- AddedVocabulary::extract_and_normalize_fast(): uses normalize_str for the normalization step, avoiding O(n) alignment allocations
- encode_single_sequence: automatically picks the fast path when offsets_type is None (i.e. encode_fast)
- Normalizer::normalize_str trait method added (default falls back to NormalizedString)

ArthurZucker added 2 commits April 10, 2026 16:24

perf: pre-encoded UTF-8 lookup table for ByteLevel::normalize_str

f94abca

Instead of HashMap lookup + per-char push, use a flat [Utf8Entry; 256] array with pre-encoded UTF-8 bytes. Each input byte maps to 1-2 output bytes via extend_from_slice — no HashMap, no char encoding, no branching.
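A minimal sketch of the lookup-table idea follows. The Utf8Entry name comes from the commit message; its fields, the build_table and byte_level_normalize_str functions, and the use of the GPT-2 style byte-to-unicode mapping are assumptions for illustration. Every mapped code point is at most U+0143, so each entry fits in two UTF-8 bytes.

```rust
/// One table entry: the pre-encoded UTF-8 bytes for a mapped input byte.
/// Field layout is an assumption; only the name comes from the commit message.
#[derive(Clone, Copy)]
pub struct Utf8Entry {
    pub bytes: [u8; 2],
    pub len: u8,
}

/// Build the 256-entry table from the GPT-2 style byte-to-unicode mapping:
/// printable bytes map to themselves, the rest to U+0100 onwards in order.
pub fn build_table() -> [Utf8Entry; 256] {
    let mut table = [Utf8Entry { bytes: [0; 2], len: 0 }; 256];
    let mut next = 0u32; // count of remapped (non-printable) bytes seen so far
    for b in 0..=255u8 {
        let printable = matches!(b, b'!'..=b'~' | 0xA1..=0xAC | 0xAE..=0xFF);
        let code_point = if printable {
            b as u32
        } else {
            let cp = 256 + next;
            next += 1;
            cp
        };
        let c = char::from_u32(code_point).expect("code point is below U+0800");
        let mut buf = [0u8; 4];
        let encoded = c.encode_utf8(&mut buf);
        let mut bytes = [0u8; 2];
        bytes[..encoded.len()].copy_from_slice(encoded.as_bytes());
        table[b as usize] = Utf8Entry { bytes, len: encoded.len() as u8 };
    }
    table
}

/// Map every input byte through the table, appending 1-2 bytes per input byte.
pub fn byte_level_normalize_str(table: &[Utf8Entry; 256], s: &str) -> String {
    let mut out: Vec<u8> = Vec::with_capacity(s.len() * 2);
    for &b in s.as_bytes() {
        let entry = &table[b as usize];
        out.extend_from_slice(&entry.bytes[..entry.len as usize]);
    }
    String::from_utf8(out).expect("table entries are valid UTF-8")
}
```

With this mapping the space byte 0x20 becomes U+0120 (Ġ), so "a b" normalizes to "aĠb". In a real implementation the table would presumably be built once and shared rather than rebuilt per call.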

docs: add concrete byte-level examples to Utf8Entry lookup table

cc51c2e

ArthurZucker requested a review from McPatate April 10, 2026 14:55

ArthurZucker mentioned this pull request Apr 10, 2026

perf: skip alignment tracking in encode_fast normalization #2022

Open

ArthurZucker and others added 3 commits April 23, 2026 06:02

Merge branch 'main' into fast-normalize-str

9690a27

ArthurZucker wants to merge 7 commits into main from fast-normalize-str
