Run today's most-used tokenizers directly in your browser or Node.js application. No heavy dependencies, no server required. Just fast, client-side tokenization compatible with thousands of models on the Hugging Face Hub. These tokenizers are also used in 🤗 Transformers.js.
- Lightweight (~ 8.3kB gzip)
- Zero dependencies
- Works in browsers and Node.js
npm install @huggingface/tokenizers

Alternatively, you can use it via a CDN as follows:
<script type="module">
import { Tokenizer } from "https://cdn.jsdelivr.net/npm/@huggingface/tokenizers";
</script>

import { Tokenizer } from "@huggingface/tokenizers";
// Load files from the Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`).then((res) => res.json());
const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`).then((res) => res.json());
// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
// Tokenize text
const tokens = tokenizer.tokenize("Hello World"); // ['Hello', 'ĠWorld']
const encoded = tokenizer.encode("Hello World"); // { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], attention_mask: [1, 1] }
const decoded = tokenizer.decode(encoded.ids); // 'Hello World'

This library expects two files from Hugging Face models:
- tokenizer.json - Contains the tokenizer configuration
- tokenizer_config.json - Contains additional metadata
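Loading the two files can be wrapped in a small helper. The following is a minimal sketch, not part of the library: the names `tokenizerFileUrls`, `fetchJson`, and `loadTokenizerFiles` are hypothetical, the `revision` parameter assumes the Hub's usual `/resolve/<revision>/` URL layout, and the injectable `fetchImpl` argument is only there to make the helper easy to test.

```javascript
// Build the Hub URLs for the two files the library expects.
// `revision` defaults to "main", matching the URLs used in the example above.
function tokenizerFileUrls(modelId, revision = "main") {
  const base = `https://huggingface.co/${modelId}/resolve/${revision}`;
  return {
    tokenizerJson: `${base}/tokenizer.json`,
    tokenizerConfig: `${base}/tokenizer_config.json`,
  };
}

// Fetch one JSON file and fail loudly on HTTP errors (for example a 404
// for a repository that has no tokenizer_config.json).
async function fetchJson(url, fetchImpl = globalThis.fetch) {
  const res = await fetchImpl(url);
  if (!res.ok) {
    throw new Error(`Failed to fetch ${url}: HTTP ${res.status}`);
  }
  return res.json();
}

// Download both files in parallel.
async function loadTokenizerFiles(modelId, fetchImpl = globalThis.fetch) {
  const urls = tokenizerFileUrls(modelId);
  const [tokenizerJson, tokenizerConfig] = await Promise.all([
    fetchJson(urls.tokenizerJson, fetchImpl),
    fetchJson(urls.tokenizerConfig, fetchImpl),
  ]);
  return { tokenizerJson, tokenizerConfig };
}

// Usage, inside an async context:
//   const { tokenizerJson, tokenizerConfig } = await loadTokenizerFiles("HuggingFaceTB/SmolLM3-3B");
//   const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
```

Checking `res.ok` before parsing turns a missing file into a readable error instead of a JSON parse failure.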
Tokenizer configs are authored for the Rust tokenizers crate, which compiles patterns with Oniguruma — a regex engine whose syntax and Unicode semantics differ from JavaScript RegExp. Tokenizers.js translates these patterns structurally, matching Oniguruma behavior for: line anchors (^/$ recognize only \n, unlike JavaScript's m flag), absolute anchors (\A, \z, \Z), . (which excludes only \n), word/digit/space shorthands (including Oniguruma's exact word-character set and its Latin-1 ctype quirks), \h/\H hex digits, \b/\B boundaries, inline case-insensitive groups (including ranges like (?i:[a-f])), possessive and stacked quantifiers (a++, X{3}+), atomic groups, POSIX bracket expressions, \x{...} code points, \p{Word}, identity escapes, and literal braces/brackets.
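To make a few of these differences concrete, the sketch below contrasts plain JavaScript RegExp behavior with hand-written emulations of the semantics described above. It illustrates the kind of rewrite involved and is not Tokenizers.js's actual translation code; the lookahead-plus-backreference trick for possessive quantifiers is a common general technique, and all variable names are made up for this example.

```javascript
// 1. Dot: JavaScript's "." (without the s flag) rejects \n, \r, \u2028 and \u2029.
//    A class that excludes only \n mimics the behavior described above.
const jsDot = /a.b/.test("a\rb");           // false
const onigLikeDot = /a[^\n]b/.test("a\rb"); // true

// 2. Line end: with the m flag, JavaScript's "$" also matches before \r.
//    A lookahead for \n or end of input treats only \n as a line break.
const END_OF_LINE = "(?=\\n|(?![\\s\\S]))";
const jsDollar = /a$/m.test("a\rb");                                 // true
const onigLikeDollar = new RegExp("a" + END_OF_LINE).test("a\rb");   // false
const onigLikeDollarLf = new RegExp("a" + END_OF_LINE).test("a\nb"); // true

// 3. Possessive quantifier: a++ never gives characters back, so a++a cannot
//    match "aaa". JavaScript has no possessive syntax, but a lookahead is
//    atomic, so capturing inside it and replaying the capture with \1
//    consumes the whole run without allowing backtracking into it.
const greedy = /a+a/.test("aaa");                 // true
const possessiveLike = /(?=(a+))\1a/.test("aaa"); // false
```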
A few constructs can't be fully reproduced — notably \G, full Unicode case folding (e.g. ß ~ ss), character-class intersection (&&), and the MergedWithPrevious/MergedWithNext/Contiguous split behaviors.
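The case-folding limitation is easy to observe in plain JavaScript. The snippet below is only an illustration of why ß ~ ss is out of reach for the `i` flag, which compares one character at a time; the manual alternation at the end is a hypothetical workaround for a single known case, not something the library is stated to do.

```javascript
// Full case folding maps ß to the two-character sequence "ss"...
const upper = "ß".toUpperCase();            // "SS"

// ...but case-insensitive matching never maps one character to two,
// so ß does not match "SS", with or without the u flag.
const matchesWithI = /ß/i.test("SS");       // false
const matchesWithIU = /ß/iu.test("SS");     // false

// A pattern can spell out the expansion by hand for a known case.
const manual = /(?:ß|ss)/i.test("SS");      // true
```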
Tokenizers.js supports Hugging Face tokenizer components:
Normalizers:
- NFD
- NFKC
- NFC
- NFKD
- Lowercase
- Strip
- StripAccents
- Replace
- BERT Normalizer
- Precompiled
- Sequence

Pre-tokenizers:
- BERT
- ByteLevel
- Whitespace
- WhitespaceSplit
- Metaspace
- CharDelimiterSplit
- Split
- Punctuation
- Digits

Models:
- BPE (Byte-Pair Encoding)
- WordPiece
- Unigram
- Legacy

Post-processors:
- ByteLevel
- TemplateProcessing
- RobertaProcessing
- BertProcessing
- Sequence

Decoders:
- ByteLevel
- WordPiece
- Metaspace
- BPE
- CTC
- Replace
- Fuse
- Strip
- ByteFallback
- Sequence
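To see which of these components a given model actually uses, you can inspect its tokenizer.json. The sketch below assumes the layout commonly found in Hub tokenizer.json files: top-level `normalizer`, `pre_tokenizer`, `model`, `post_processor`, and `decoder` entries, each with a `type` field, and Sequence components that hold their children under `normalizers`, `pretokenizers`, `processors`, or `decoders`. The helper names `componentTypes` and `describePipeline` are hypothetical and not part of the library.

```javascript
// Key under which each stage's "Sequence" component stores its children.
// (Assumed from tokenizer.json files published on the Hub.)
const SEQUENCE_CHILD_KEYS = {
  normalizer: "normalizers",
  pre_tokenizer: "pretokenizers",
  post_processor: "processors",
  decoder: "decoders",
};

// Return the component type names for one stage, flattening nested Sequences.
function componentTypes(stage, component) {
  if (!component) return []; // the stage is absent or null
  if (component.type === "Sequence") {
    const children = component[SEQUENCE_CHILD_KEYS[stage]] ?? [];
    return children.flatMap((child) => componentTypes(stage, child));
  }
  return [component.type ?? "unknown"];
}

// Summarize every stage of the pipeline described by a parsed tokenizer.json.
function describePipeline(tokenizerJson) {
  const stages = ["normalizer", "pre_tokenizer", "model", "post_processor", "decoder"];
  return Object.fromEntries(
    stages.map((stage) => [stage, componentTypes(stage, tokenizerJson[stage])]),
  );
}

// Hand-written sample for illustration; real files are much larger.
const sampleTokenizerJson = {
  normalizer: { type: "Sequence", normalizers: [{ type: "NFD" }, { type: "Lowercase" }] },
  pre_tokenizer: { type: "ByteLevel" },
  model: { type: "BPE" },
  post_processor: null,
  decoder: { type: "ByteLevel" },
};

console.log(describePipeline(sampleTokenizerJson));
// {
//   normalizer: [ 'NFD', 'Lowercase' ],
//   pre_tokenizer: [ 'ByteLevel' ],
//   model: [ 'BPE' ],
//   post_processor: [],
//   decoder: [ 'ByteLevel' ]
// }
```

Comparing the reported names against the lists above is a quick way to spot a component that may not be supported before constructing the tokenizer.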