GitHub - huggingface/tokenizers.js: 🤗 Tokenizers.js: A pure JS/TS implementation of today's most used tokenizers

A lightweight tokenizer for the Web

Run today's most used tokenizers directly in your browser or Node.js application. No heavy dependencies, no server required. Just fast, client-side tokenization compatible with thousands of models on the Hugging Face Hub. These tokenizers are also used in 🤗 Transformers.js.

Features

Lightweight (~8.3 kB gzipped)
Zero dependencies
Works in browsers and Node.js

Installation

npm install @huggingface/tokenizers

Alternatively, you can use it via a CDN as follows:

<script type="module">
  import { Tokenizer } from "https://cdn.jsdelivr.net/npm/@huggingface/tokenizers";
</script>

Usage

import { Tokenizer } from "@huggingface/tokenizers";

// Load files from the Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`).then((res) => res.json());
const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`).then((res) => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize("Hello World"); // ['Hello', 'ĠWorld']
const encoded = tokenizer.encode("Hello World"); // { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], attention_mask: [1, 1] }
const decoded = tokenizer.decode(encoded.ids); // 'Hello World'
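
The fetch calls above assume both requests succeed. A slightly more defensive loader could check the HTTP status before parsing. The sketch below is illustrative only: `loadTokenizerFiles`, its `fetchImpl` parameter, and its `revision` parameter are hypothetical names and are not part of the library.

```javascript
// Hypothetical helper (NOT part of @huggingface/tokenizers): downloads the two
// JSON files for a model and fails loudly on a non-2xx response.
// `fetchImpl` is injectable so the function can be exercised without a network.
async function loadTokenizerFiles(modelId, fetchImpl = fetch, revision = "main") {
  const base = `https://huggingface.co/${modelId}/resolve/${revision}`;

  const fetchJson = async (name) => {
    const res = await fetchImpl(`${base}/${name}`);
    if (!res.ok) {
      throw new Error(`Failed to fetch ${name} for ${modelId}: HTTP ${res.status}`);
    }
    return res.json();
  };

  // The two downloads are independent, so run them in parallel.
  const [tokenizerJson, tokenizerConfig] = await Promise.all([
    fetchJson("tokenizer.json"),
    fetchJson("tokenizer_config.json"),
  ]);
  return { tokenizerJson, tokenizerConfig };
}
```

The two returned objects can then be passed to the constructor in the same order as in the usage example: `new Tokenizer(tokenizerJson, tokenizerConfig)`.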

Requirements

This library expects two files from Hugging Face models:

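tokenizer.json - Contains the tokenizer definition: the vocabulary plus the normalizer, pre-tokenizer, model, post-processor, and decoder settings
tokenizer_config.json - Contains additional metadata, such as special-token and model-specific settings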
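
As a quick sanity check before constructing a tokenizer, it can help to see which pipeline components a downloaded tokenizer.json actually configures. The sketch below is an assumption-laden illustration: `summarizeTokenizerJson` is a hypothetical helper, and the key names reflect the layout commonly seen in tokenizer.json files exported by the Hugging Face tokenizers library.

```javascript
// Top-level sections commonly found in a tokenizer.json file.
// Every section except `model` may be null or absent.
const PIPELINE_KEYS = ["normalizer", "pre_tokenizer", "model", "post_processor", "decoder"];

// Hypothetical helper (not a library API): maps each pipeline section to the
// `type` it declares, or null when the section is not configured.
function summarizeTokenizerJson(tokenizerJson) {
  if (!tokenizerJson || typeof tokenizerJson.model !== "object" || tokenizerJson.model === null) {
    throw new Error("tokenizer.json is missing its `model` section");
  }
  const summary = {};
  for (const key of PIPELINE_KEYS) {
    const component = tokenizerJson[key];
    summary[key] = component ? (component.type ?? "(untyped)") : null;
  }
  return summary;
}
```

For a byte-level BPE tokenizer, the summary would typically list `ByteLevel` for the pre-tokenizer and decoder and `BPE` for the model.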

Regex compatibility

Tokenizer configs are authored for the Rust tokenizers crate, which compiles patterns with Oniguruma — a regex engine whose syntax and Unicode semantics differ from JavaScript RegExp. Tokenizers.js translates these patterns structurally, matching Oniguruma behavior for: line anchors (^/$ recognize only \n, unlike JavaScript's m flag), absolute anchors (\A, \z, \Z), . (which excludes only \n), word/digit/space shorthands (including Oniguruma's exact word-character set and its Latin-1 ctype quirks), \h/\H hex digits, \b/\B boundaries, inline case-insensitive groups (including ranges like (?i:[a-f])), possessive and stacked quantifiers (a++, X{3}+), atomic groups, POSIX bracket expressions, \x{...} code points, \p{Word}, identity escapes, and literal braces/brackets.
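
To make a few of these differences concrete, the sketch below hand-writes JavaScript equivalents for a handful of the constructs listed above. These are common emulation idioms chosen for illustration, not necessarily the translations Tokenizers.js emits; `ONIG_EQUIVALENTS` and `atomicGroup` are hypothetical names. It assumes an engine with lookbehind support (ES2018 or later).

```javascript
// Hand-written JavaScript equivalents for a few Oniguruma constructs.
// Illustrative idioms only; NOT taken from the Tokenizers.js source.
const ONIG_EQUIVALENTS = {
  "^": "(?<![^\\n])",     // start of input, or just after \n (and only \n)
  "$": "(?![^\\n])",      // end of input, or just before \n (and only \n)
  ".": "[^\\n]",          // any character except \n
  "\\h": "[0-9a-fA-F]",   // hex digit
  "\\A": "(?<![\\s\\S])", // absolute start of input
  "\\z": "(?![\\s\\S])",  // absolute end of input
};

// Atomic group (?>X): a lookahead captures X (JavaScript never backtracks into
// a completed lookahead), then a backreference consumes exactly that capture.
// Assumes `groupNumber` is the index the new capture group will receive.
function atomicGroup(source, groupNumber = 1) {
  return `(?=(${source}))\\${groupNumber}`;
}

// JavaScript's `m` flag treats \r as a line break; the emulation does not.
console.log(/^b/m.test("a\rb"));                                    // true
console.log(new RegExp(ONIG_EQUIVALENTS["^"] + "b").test("a\rb"));  // false
console.log(new RegExp(ONIG_EQUIVALENTS["^"] + "b").test("a\nb"));  // true

// A greedy a+ gives one "a" back; the atomic version cannot.
console.log(/a+a/.test("aaa"));                                     // true
console.log(new RegExp(atomicGroup("a+") + "a").test("aaa"));       // false
```

A possessive quantifier such as `a++` can be expressed the same way, since it behaves like the atomic group `(?>a+)`.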

A few constructs can't be fully reproduced — notably \G, full Unicode case folding (e.g. ß ~ ss), character-class intersection (&&), and the MergedWithPrevious/MergedWithNext/Contiguous split behaviors.
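
The case-folding limitation is easy to observe in plain JavaScript: string methods can expand ß to two characters, but a case-insensitive RegExp only compares characters one to one. The snippet below is a small demonstration of that engine behavior and does not use the library.

```javascript
// "straße", case-insensitive, in Unicode mode.
const pattern = /stra\u00dfe/iu;

// One-to-one case differences are handled by the `i` flag.
console.log(pattern.test("STRA\u00dfE")); // true

// Full case folding would treat "ß" and "ss" as equal; RegExp does not.
console.log(pattern.test("STRASSE"));     // false

// String-level case conversion, by contrast, does expand the character.
console.log("\u00df".toUpperCase());      // "SS"
```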

Components

Tokenizers.js supports the following Hugging Face tokenizer components:

Normalizers

NFD
NFKC
NFC
NFKD
Lowercase
Strip
StripAccents
Replace
BERT Normalizer
Precompiled
Sequence
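
To give a feel for what a Sequence of normalizers does, the sketch below chains rough plain-JavaScript approximations of NFD, StripAccents, Lowercase, and Strip. It is an illustration built on `String.prototype.normalize`, not the library's implementation, and `normalizeSketch` is a hypothetical name.

```javascript
// Rough stand-in for a Sequence of [NFD, StripAccents, Lowercase, Strip].
function normalizeSketch(text) {
  return text
    .normalize("NFD")        // NFD: decompose "é" into "e" + combining accent
    .replace(/\p{M}/gu, "")  // StripAccents: drop combining marks
    .toLowerCase()           // Lowercase
    .trim();                 // Strip: remove leading and trailing whitespace
}

console.log(normalizeSketch("  H\u00e9llo W\u00f6rld ")); // "hello world"
```

Order matters here: stripping accents only works after decomposition, because a precomposed "é" contains no separate combining mark to remove.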

Pre-tokenizers

BERT
ByteLevel
Whitespace
WhitespaceSplit
Metaspace
CharDelimiterSplit
Split
Punctuation
Digits
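
The `Ġ` in tokens such as `ĠWorld` comes from the byte-level scheme introduced with GPT-2, in which every byte is mapped to a visible Unicode character before the model runs. The sketch below reconstructs that byte-to-character table for illustration; `bytesToUnicode` and `byteLevelEncode` are hypothetical names, and this is not the library's code.

```javascript
// Builds the GPT-2 style byte -> character table. Printable bytes map to
// themselves; every other byte is assigned a code point starting at U+0100.
function bytesToUnicode() {
  const printable = new Set();
  const addRange = (from, to) => {
    for (let b = from; b <= to; b++) printable.add(b);
  };
  addRange(0x21, 0x7e); // "!" through "~"
  addRange(0xa1, 0xac); // "¡" through "¬"
  addRange(0xae, 0xff); // "®" through "ÿ"

  const table = new Array(256);
  let next = 0; // counts the non-printable bytes seen so far
  for (let b = 0; b < 256; b++) {
    table[b] = printable.has(b) ? String.fromCharCode(b) : String.fromCharCode(256 + next++);
  }
  return table;
}

// Encodes text as UTF-8, then replaces each byte with its table character.
function byteLevelEncode(text) {
  const table = bytesToUnicode();
  return Array.from(new TextEncoder().encode(text), (b) => table[b]).join("");
}

console.log(byteLevelEncode(" World")); // "ĠWorld"
console.log(byteLevelEncode("\n"));     // "Ċ"
```

The space byte (0x20) is the 33rd non-printable byte, so it lands on U+0120, which is `Ġ`.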

Models

BPE (Byte-Pair Encoding)
WordPiece
Unigram
Legacy
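
As a rough picture of what the BPE model does, the toy function below repeatedly merges the adjacent pair with the best (lowest) merge rank until no known pair remains. It is a teaching sketch with a made-up merge list; `bpeSketch` is a hypothetical name, and real implementations add caching, byte fallback, and other details omitted here.

```javascript
// Toy BPE: `merges` is an ordered list of [left, right] pairs, where an
// earlier position means a higher merge priority.
function bpeSketch(word, merges) {
  const rank = new Map(merges.map(([left, right], i) => [`${left} ${right}`, i]));
  const symbols = Array.from(word); // start from individual characters

  while (symbols.length > 1) {
    // Find the adjacent pair with the lowest rank.
    let bestIndex = -1;
    let bestRank = Infinity;
    for (let i = 0; i < symbols.length - 1; i++) {
      const r = rank.get(`${symbols[i]} ${symbols[i + 1]}`);
      if (r !== undefined && r < bestRank) {
        bestIndex = i;
        bestRank = r;
      }
    }
    if (bestIndex === -1) break; // no mergeable pair left
    symbols.splice(bestIndex, 2, symbols[bestIndex] + symbols[bestIndex + 1]);
  }
  return symbols;
}

// Made-up merge table for illustration.
const merges = [["l", "l"], ["h", "e"], ["ll", "o"], ["he", "llo"]];
console.log(bpeSketch("hello", merges)); // ["hello"]
console.log(bpeSketch("help", merges));  // ["he", "l", "p"]
```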

Post-processors

ByteLevel
TemplateProcessing
RobertaProcessing
BertProcessing
Sequence
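
Post-processors such as TemplateProcessing wrap the encoded sequences in special tokens. The sketch below applies a template string of the kind used in tokenizer.json files (for example `[CLS] $A [SEP]`) to already-tokenized input. It is a simplified illustration: `applyTemplate` is a hypothetical name, type IDs after the colon are discarded, and only the named `$A`/`$B` placeholders are handled.

```javascript
// Simplified template expansion. `sequences` maps placeholder names
// ("A", "B") to arrays of tokens.
function applyTemplate(template, sequences) {
  const output = [];
  for (const piece of template.split(" ")) {
    // Drop the optional ":<type id>" suffix, e.g. "$B:1" or "[SEP]:1".
    const name = piece.split(":")[0];
    if (name.startsWith("$")) {
      // A bare "$" is treated as "$A" in this sketch.
      const key = name.slice(1) || "A";
      output.push(...(sequences[key] ?? []));
    } else {
      output.push(name); // a literal special token
    }
  }
  return output;
}

console.log(applyTemplate("[CLS] $A [SEP]", { A: ["hello", "world"] }));
// ["[CLS]", "hello", "world", "[SEP]"]

console.log(applyTemplate("[CLS] $A [SEP] $B:1 [SEP]:1", { A: ["hello"], B: ["world"] }));
// ["[CLS]", "hello", "[SEP]", "world", "[SEP]"]
```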

Decoders

ByteLevel
WordPiece
Metaspace
BPE
CTC
Replace
Fuse
Strip
ByteFallback
Sequence
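
Decoders reverse the token-level conventions introduced earlier in the pipeline. As one example, the sketch below shows the core idea behind a WordPiece decoder: continuation tokens carry a `##` prefix and are glued to the previous token, while other tokens are separated by a space. `decodeWordPieceSketch` is a hypothetical name, and the sketch leaves out the punctuation cleanup that real decoders may apply.

```javascript
// Joins WordPiece tokens back into text. Tokens that start with `prefix`
// continue the previous word; all other tokens begin a new word.
function decodeWordPieceSketch(tokens, prefix = "##") {
  return tokens
    .map((token, i) => {
      if (i === 0) return token; // the first token is emitted unchanged
      return token.startsWith(prefix) ? token.slice(prefix.length) : " " + token;
    })
    .join("");
}

console.log(decodeWordPieceSketch(["un", "##believ", "##able", "story"]));
// "unbelievable story"
```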
