feat: Normalizer::normalize_str — skip NormalizedString allocation#2020

Open
ArthurZucker wants to merge 7 commits into main from fast-normalize-str

@ArthurZucker
Collaborator

Summary

Adds a normalize_str(&str) -> Result<String> method to the Normalizer trait that produces the normalized output without allocating a full NormalizedString.

Problem

NormalizedString::from(s) allocates:

  1. original: String — clone of input
  2. normalized: String — another clone
  3. alignments: Vec<(usize, usize)> — one entry per byte

For callers that only need the normalized string (like add_tokens building the normalized cache, or Python's normalize_str), this is pure overhead.
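
To make the cost concrete, here is a minimal sketch of what constructing such a value involves. `NormalizedSketch` is a hypothetical stand-in, not the crate's type; only the three field names come from the list above, and the exact alignment layout (one `(start, end)` byte range per byte) is an assumption.

```rust
/// Hypothetical, simplified stand-in for `NormalizedString`.
/// Field names follow the PR description; everything else is assumed.
pub struct NormalizedSketch {
    pub original: String,
    pub normalized: String,
    /// One entry per byte of `normalized`, pointing at the byte range
    /// of the source character in `original`.
    pub alignments: Vec<(usize, usize)>,
}

impl NormalizedSketch {
    pub fn new(s: &str) -> Self {
        // Allocation 3: one (start, end) pair per input byte.
        let alignments = s
            .char_indices()
            .flat_map(|(start, c)| {
                let len = c.len_utf8();
                std::iter::repeat((start, start + len)).take(len)
            })
            .collect();
        Self {
            original: s.to_owned(),   // allocation 1
            normalized: s.to_owned(), // allocation 2
            alignments,
        }
    }
}
```

With 16-byte `(usize, usize)` pairs on a 64-bit target, the alignment vector alone takes roughly 16 times the input size in this sketch.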

Solution

  • Default normalize_str on the trait — falls back to NormalizedString for normalizers that haven't opted in yet
  • Lowercase: s.to_lowercase() — zero intermediate allocations
  • ByteLevel: direct byte→char mapping into a pre-allocated String
  • Sequence: chains normalize_str calls without intermediate NormalizedString
  • NormalizerWrapper: forwards to the concrete normalizer's fast path
  • Python binding: normalize_str now calls the trait method directly
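
The shape of the first four bullets can be sketched roughly as follows. This is not the PR's code: `Normalized` is a stripped-down, hypothetical stand-in for `NormalizedString` (alignment tracking omitted), `Trim` is a made-up normalizer that has not opted in, and the `Result` alias is assumed.

```rust
pub type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

/// Hypothetical stand-in for `NormalizedString`; alignment bookkeeping omitted.
pub struct Normalized {
    pub original: String,
    pub normalized: String,
}

impl From<&str> for Normalized {
    fn from(s: &str) -> Self {
        Self { original: s.to_owned(), normalized: s.to_owned() }
    }
}

pub trait Normalizer {
    fn normalize(&self, normalized: &mut Normalized) -> Result<()>;

    /// Default: go through the full structure, then keep only the string.
    fn normalize_str(&self, s: &str) -> Result<String> {
        let mut n = Normalized::from(s);
        self.normalize(&mut n)?;
        Ok(n.normalized)
    }
}

/// Opts in: the fast path never builds a `Normalized`.
pub struct Lowercase;

impl Normalizer for Lowercase {
    fn normalize(&self, n: &mut Normalized) -> Result<()> {
        n.normalized = n.normalized.to_lowercase();
        Ok(())
    }

    fn normalize_str(&self, s: &str) -> Result<String> {
        Ok(s.to_lowercase())
    }
}

/// Hypothetical normalizer that has not opted in; it uses the default.
pub struct Trim;

impl Normalizer for Trim {
    fn normalize(&self, n: &mut Normalized) -> Result<()> {
        n.normalized = n.normalized.trim().to_owned();
        Ok(())
    }
}

/// Chains `normalize_str`, so each step picks its own fast path or fallback.
pub struct Sequence(pub Vec<Box<dyn Normalizer>>);

impl Normalizer for Sequence {
    fn normalize(&self, n: &mut Normalized) -> Result<()> {
        for step in &self.0 {
            step.normalize(n)?;
        }
        Ok(())
    }

    fn normalize_str(&self, s: &str) -> Result<String> {
        let mut out = s.to_owned();
        for step in &self.0 {
            out = step.normalize_str(&out)?;
        }
        Ok(out)
    }
}
```

In this sketch a `Sequence` of `Trim` then `Lowercase` still builds one `Normalized` for the `Trim` step, which mirrors the "falls back for normalizers that haven't opted in yet" behaviour described above.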

Follow-ups

Other normalizers (NFC, NFD, NFKC, NFKD, BertNormalizer, etc.) still fall back to the default. They can be optimized individually — the trait method is ready.

…ization

Add a default normalize_str(&str) -> Result<String> method to the
Normalizer trait that produces the normalized output without allocating
a full NormalizedString (which carries original + normalized + alignment
vectors — 3 allocations + O(n) alignment entries per call).

Specialized fast paths:
- Lowercase: direct s.to_lowercase(), no NormalizedString
- ByteLevel: direct byte→char mapping into a pre-allocated String
- Sequence: chains normalize_str calls without intermediate NormalizedString
- NormalizerWrapper: forwards to the concrete normalizer's fast path

Python binding updated to use normalize_str directly.

All other normalizers fall back to the default implementation which
still allocates a NormalizedString. They can be optimized individually
in follow-ups.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Instead of HashMap lookup + per-char push, use a flat [Utf8Entry; 256]
array with pre-encoded UTF-8 bytes. Each input byte maps to 1-2 output
bytes via extend_from_slice — no HashMap, no char encoding, no branching.
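
A rough sketch of the table idea described in this commit message, not the PR's actual code. The layout of `Utf8Entry` and both function names are assumptions; the byte-to-character mapping used to fill the table is the widely used GPT-2 byte-level scheme (printable bytes map to themselves, the remaining 68 bytes map to U+0100 onward), which is assumed to be what ByteLevel applies.

```rust
/// Assumed layout: up to two pre-encoded UTF-8 bytes plus their count.
#[derive(Clone, Copy)]
pub struct Utf8Entry {
    bytes: [u8; 2],
    len: u8,
}

/// Build the 256-entry table once; lookups afterwards do no char encoding.
pub fn build_table() -> [Utf8Entry; 256] {
    let mut table = [Utf8Entry { bytes: [0; 2], len: 0 }; 256];
    let mut remapped = 0u32; // how many non-printable bytes seen so far
    for b in 0u32..256 {
        let printable = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let code_point = if printable {
            b
        } else {
            let cp = 0x100 + remapped;
            remapped += 1;
            cp
        };
        let c = char::from_u32(code_point).expect("always a valid scalar value");
        let mut buf = [0u8; 4];
        // Highest code point is U+0143, so the encoding is 1 or 2 bytes.
        let len = c.encode_utf8(&mut buf).len();
        let mut bytes = [0u8; 2];
        bytes[..len].copy_from_slice(&buf[..len]);
        table[b as usize] = Utf8Entry { bytes, len: len as u8 };
    }
    table
}

/// Map every input byte through the table into a pre-sized buffer.
pub fn byte_level_normalize_str(table: &[Utf8Entry; 256], s: &str) -> String {
    let mut out: Vec<u8> = Vec::with_capacity(s.len() * 2);
    for &b in s.as_bytes() {
        let entry = &table[b as usize];
        out.extend_from_slice(&entry.bytes[..entry.len as usize]);
    }
    // The sketch validates here; every entry is valid UTF-8 by construction.
    String::from_utf8(out).expect("table entries are valid UTF-8")
}
```

The sketch rebuilds nothing per call: the table would typically be built once and shared, and the final `String::from_utf8` check is kept only to stay in safe Rust.
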

@ArthurZucker ArthurZucker requested a review from McPatate April 10, 2026 14:55

Fast paths (no NormalizedString allocation):
- Lowercase: s.to_lowercase()
- ByteLevel: pre-encoded UTF-8 table lookup
- NFD/NFKD/NFC/NFKC: direct unicode_normalization iterator
- Nmt: inline filter + map over chars
- Strip: trim_start/trim_end
- StripAccents: filter combining marks
- Prepend: format!
- Sequence: chain normalize_str calls

Still using default fallback (NormalizedString):
- BertNormalizer (complex multi-step logic)
- Replace (regex-based, needs NormalizedString::transform)
- Precompiled (sentencepiece precompiled charsmap)
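
Two of the fast paths listed above need only the standard library, so they are easy to sketch. The struct and field names below (`strip_left`, `strip_right`, `prepend`) are assumptions for illustration, as is the choice to leave an empty input untouched in `Prepend`; the PR text only names `trim_start`/`trim_end` and `format!`.

```rust
/// Hypothetical mirror of a Strip normalizer with per-side flags.
pub struct Strip {
    pub strip_left: bool,
    pub strip_right: bool,
}

impl Strip {
    /// Trims borrowed slices first, so the only allocation is the result.
    pub fn normalize_str(&self, s: &str) -> String {
        let s = if self.strip_left { s.trim_start() } else { s };
        let s = if self.strip_right { s.trim_end() } else { s };
        s.to_owned()
    }
}

/// Hypothetical mirror of a Prepend normalizer.
pub struct Prepend {
    pub prepend: String,
}

impl Prepend {
    /// Assumption: an empty input stays empty rather than becoming the prefix.
    pub fn normalize_str(&self, s: &str) -> String {
        if s.is_empty() {
            String::new()
        } else {
            format!("{}{}", self.prepend, s)
        }
    }
}
```

The Unicode-based paths (NFD/NFKD/NFC/NFKC, StripAccents) depend on an external crate and are left out of this sketch.
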

ArthurZucker and others added 3 commits April 23, 2026 06:02

…llocation

Both `add_tokens` and `refresh_normalized_tokens` were building a full
NormalizedString (with alignment vectors) just to extract the final string.
Call `Normalizer::normalize_str` instead.

Measured on added_vocab_deserialize vs main:
  non-special 100k + nfkc: 275.3ms -> 246.6ms (-10.4%)
  non-special 400k + nfkc: 1128.7ms -> 991.4ms (-12.2%)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_fast

When encode_fast is called (OffsetType::None), the normalization step
now uses normalize_str + set_normalized instead of the full
NormalizedString::normalize which builds per-byte alignment vectors.

Changes:
- NormalizedString::set_normalized(): replace normalized content with
  trivial 1:1 alignments (enough for splitting, no real offset mapping)
- AddedVocabulary::extract_and_normalize_fast(): uses normalize_str
  for the normalization step, avoiding O(n) alignment allocations
- encode_single_sequence: automatically picks the fast path when
  offsets_type is None (i.e. encode_fast)
- Normalizer::normalize_str trait method added (default falls back to
  NormalizedString)
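
One possible reading of `set_normalized`, sketched on a hypothetical `FastNormalized` type. The commit message does not spell out what "trivial 1:1 alignments" look like; mapping byte `i` to the range `(i, i + 1)` is purely an assumption, chosen because it keeps one entry per byte without tracking real offsets.

```rust
/// Hypothetical stand-in for `NormalizedString`; not the crate's type.
pub struct FastNormalized {
    pub original: String,
    pub normalized: String,
    pub alignments: Vec<(usize, usize)>,
}

impl FastNormalized {
    pub fn new(s: &str) -> Self {
        Self {
            original: s.to_owned(),
            normalized: s.to_owned(),
            alignments: (0..s.len()).map(|i| (i, i + 1)).collect(),
        }
    }

    /// Swap in an already-normalized string and reset the alignments to
    /// placeholder entries (assumed here to be byte i -> (i, i + 1)).
    /// The result is usable for splitting but carries no real offset mapping.
    pub fn set_normalized(&mut self, normalized: String) {
        self.alignments.clear();
        self.alignments
            .extend((0..normalized.len()).map(|i| (i, i + 1)));
        self.normalized = normalized;
    }
}
```

Under this reading, a caller on the `OffsetType::None` path would compute the string with `normalize_str` and hand it to `set_normalized`, leaving `original` unchanged.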