Conversation
```rust
    group.finish();
}
```
bench a full `encode_batch` please
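As a rough, std-only sketch of what timing a full batch could look like (the repo's actual criterion harness, which the `group.finish()` above belongs to, is the proper home for this; `encode_batch_stub` is a hypothetical stand-in for the real `Tokenizer::encode_batch`):

```rust
use std::time::Instant;

// Hypothetical stand-in for the real tokenizer's encode_batch; the
// actual API lives in the tokenizers crate and is not reproduced here.
fn encode_batch_stub(inputs: &[&str]) -> usize {
    // Count whitespace-separated tokens across the whole batch.
    inputs.iter().map(|s| s.split_whitespace().count()).sum()
}

fn main() {
    let batch: Vec<&str> = vec!["some text"; 1_000];
    let start = Instant::now();
    let total = encode_batch_stub(&batch);
    println!("{} tokens in {:?}", total, start.elapsed());
}
```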
ArthurZucker
left a comment
```
---- pre_tokenizers::whitespace::tests::assert_equivalent_xnli stdout ----
thread 'pre_tokenizers::whitespace::tests::assert_equivalent_xnli' (40277) panicked at src/pre_tokenizers/whitespace.rs:270:61:
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }

failures:
    pre_tokenizers::whitespace::tests::assert_equivalent_xnli

test result: FAILED. 199 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.08s
```
Probably because I refactored the big text test and related bits.
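The panic comes from `unwrap()` on a missing fixture file; a minimal sketch of a softer guard (`read_fixture` and its handling are hypothetical, not the test's actual code):

```rust
use std::fs;

// Hypothetical helper: return None instead of panicking when the
// hardcoded test fixture is absent.
fn read_fixture(path: &str) -> Option<String> {
    fs::read_to_string(path).ok()
}

fn main() {
    match read_fixture("data/xnli.txt") {
        Some(text) => println!("loaded {} bytes", text.len()),
        None => eprintln!("fixture missing, skipping equivalence check"),
    }
}
```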
e478098 to
298001b
Compare
Opened https://huggingface.co/datasets/hf-internal-testing/tokenizers-test-data/discussions/2 to have the
Building on the work in #1822 and #1841, I've implemented and tested a faster whitespace split pretokenizer. Tested on xnli, no regressions observed; the results are consistent with the regex split.

I've set up a benchmark to validate this and created a correctness test as well. We can probably refactor or remove these bits, as they depend on a hardcoded path to a `data/xnli.txt` file, but I'm leaving them there in case someone wants to reproduce the results. `regex` is the current impl and `manual` is the new implementation. It works by simply iterating over individual chars and handles unicode properly.

Closes #1821, #1825
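Not the PR's actual code, but the general idea of a char-based split can be sketched like this (`manual_whitespace_split` is a hypothetical name; offsets come from `char_indices`, so multi-byte UTF-8 chars are handled correctly):

```rust
// Sketch of a manual whitespace split: walk the string char by char and
// emit (start, end) byte offsets for each run of non-whitespace chars.
fn manual_whitespace_split(s: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in s.char_indices() {
        if c.is_whitespace() {
            // Close the current word span, if any.
            if let Some(st) = start.take() {
                spans.push((st, i));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Flush a trailing word that runs to the end of the string.
    if let Some(st) = start {
        spans.push((st, s.len()));
    }
    spans
}

fn main() {
    let s = "héllo  wörld";
    for (a, b) in manual_whitespace_split(s) {
        println!("{}", &s[a..b]);
    }
}
```

Because the spans are byte offsets produced by `char_indices`, slicing them back out of the input never lands inside a multi-byte character.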