Commit 4fcd3d8

feat(json): switch JSON backend to ujson with RapidJSON support
1 parent 1619cc2

44 files changed: +16,739 −59 lines

.github/workflows/build.yml

Lines changed: 40 additions & 9 deletions

```diff
@@ -7,32 +7,63 @@ on:
     branches: [ "main" ]
 
 jobs:
-  build:
+  generate-data:
     runs-on: ubuntu-latest
-
     steps:
     - uses: actions/checkout@v3
       with:
         submodules: recursive
 
+    - name: Set up Python
+      uses: actions/setup-python@v4
+      with:
+        python-version: '3.10'
+
     - name: Install dependencies
+      run: |
+        pip install modelscope transformers
+
+    - name: Generate Tests Data
+      run: |
+        cd tests
+        python generate_assets.py
+
+    - name: Upload models
+      uses: actions/upload-artifact@v4
+      with:
+        name: models
+        path: tests/models/
+
+  build-and-test:
+    needs: generate-data
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        ujson_use_rapidjson: ["OFF", "ON"]
+    name: build-test (RapidJSON=${{ matrix.ujson_use_rapidjson }})
+    steps:
+    - uses: actions/checkout@v3
+      with:
+        submodules: recursive
+
+    - name: Download models
+      uses: actions/download-artifact@v4
+      with:
+        name: models
+        path: tests/models/
+
+    - name: Install build dependencies
       run: |
         sudo apt-get update
         sudo apt-get install -y cmake g++
-        pip install modelscope transformers
 
     - name: Build
       run: |
         mkdir build
         cd build
-        cmake .. -DCMAKE_BUILD_TYPE=Release
+        cmake .. -DCMAKE_BUILD_TYPE=Release -DUJSON_USE_RAPIDJSON=${{ matrix.ujson_use_rapidjson }}
         make -j$(nproc)
 
-    - name: Generate Tests Data
-      run: |
-        cd tests
-        python generate_assets.py
-
     - name: Run Tests
       run: |
         cd build
```

README.md

Lines changed: 18 additions & 5 deletions

````diff
@@ -11,11 +11,9 @@ It provides a high-performance C++ implementation for modern LLM tokenization pi
 ## Features
 
 - **HuggingFace Compatible**: Loads directly from `tokenizer.json`.
-- **Comprehensive Support**: Supports BPE, WordPiece, and Unigram models.
-- **Complex Normalization**: Implements NFKC, Sequence, Prepend, Replace, and more.
-- **Advanced Pre-tokenization**: Supports ByteLevel, Digits, Split, and Regex-based patterns (GPT-2/4 style).
-- **Efficient**: Optimized C++ implementation using minimal dependencies.
-- **Self-Contained**: Includes pruned versions of optimizations like Oniguruma for minimal footprint.
+- **Dual JSON Backend**: Supports both `nlohmann/json` and `RapidJSON` via the `ujson` bridge.
+- **Efficient**: Optimized C++ implementation with nearly 2x faster loading using RapidJSON.
+- **Self-Contained**: Includes pruned Oniguruma for minimal footprint.
 
 ## Supported Models
 
@@ -42,7 +40,10 @@ The library allows easy loading and usage of tokenizers.
 ```bash
 mkdir build
 cd build
+# Default: uses nlohmann/json
 cmake ..
+# Optional: use RapidJSON for 2x faster loading
+cmake .. -DUJSON_USE_RAPIDJSON=ON
 make
 ```
 
@@ -82,6 +83,18 @@ int main() {
 }
 ```
 
+## Performance
+
+The library is optimized for loading speed, especially for large models. Using the `RapidJSON` backend provides a significant performance boost:
+
+| Metric (41 Models / 1691 Cases) | nlohmann/json | RapidJSON (via ujson) | Speedup |
+| :--- | :--- | :--- | :--- |
+| **Total Loading Time** | 92.40 s | 47.13 s | **1.96x** |
+| **Total Encode Time** | 0.25 s | 0.23 s | 1.07x |
+| **Total Time** | 92.65 s | 47.36 s | **1.95x** |
+
+*Benchmarks conducted on 41 different model architectures.*
+
 ## Documentation
 
 For deep technical details on the implementation and architecture, see [doc/implementation_details_CN.md](doc/implementation_details_CN.md).
````
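
The commit leaves the public API of the README's usage example unchanged; both JSON backends sit behind the same calls. For orientation, a minimal usage sketch built from the two entry points this diff touches in `src/tokenizer.cpp`, `tokenizer::AutoTokenizer::from_pretrained` and `apply_chat_template`; the model path and message contents are illustrative:

```cpp
#include <iostream>
#include "tokenizer.hpp"

int main() {
  // Illustrative path; from_pretrained reads tokenizer.json (and, if present,
  // tokenizer_config.json) from this directory and returns nullptr on failure.
  auto tok = tokenizer::AutoTokenizer::from_pretrained("tests/models/qwen");
  if (!tok) return 1;

  // ChatMessages entries are (role, content) pairs, as used in tokenizer.cpp.
  tokenizer::ChatMessages msgs = {{"user", "Hello!"}};
  std::cout << tok->apply_chat_template(msgs, /*add_gen=*/true) << std::endl;
  return 0;
}
```

Which backend parses the files underneath is decided entirely at build time by `-DUJSON_USE_RAPIDJSON`.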

README_CN.md

Lines changed: 19 additions & 6 deletions

````diff
@@ -10,12 +10,10 @@
 
 ## Key Features
 
-- **HuggingFace Compatible**: Loads standard `tokenizer.json` files directly, no conversion required.
-- **Comprehensive Support**: Supports the BPE (Byte-Pair Encoding), WordPiece, and Unigram algorithms.
-- **Complex Normalization**: Ships with NFKC, Sequence, Prepend, Replace, and other normalizers.
-- **Advanced Pre-tokenization**: Supports ByteLevel, Digits, Split, and complex Regex-based splitting (faithfully reproducing the GPT-2/4 style).
-- **Efficient and Lightweight**: Optimized C++ implementation with minimal dependencies.
-- **Self-Contained**: Bundles a heavily pruned Oniguruma regex engine, minimizing size while keeping strong Unicode support.
+- **HuggingFace Compatible**: Loads standard `tokenizer.json` files directly.
+- **Dual JSON Backend**: Supports both `nlohmann/json` and `RapidJSON` via the `ujson` bridge.
+- **High Performance**: Optimized C++ implementation; the RapidJSON backend loads roughly 2x faster.
+- **Lightweight and Self-Contained**: Bundles a pruned Oniguruma to minimize binary size.
 
 ## Supported Models
 
@@ -40,7 +38,10 @@
 ```bash
 mkdir build
 cd build
+# Default: uses nlohmann/json
 cmake ..
+# Optional: use RapidJSON for 2x faster loading
+cmake .. -DUJSON_USE_RAPIDJSON=ON
 make
 ```
 
@@ -80,6 +81,18 @@ int main() {
 }
 ```
 
+## Performance
+
+The library is heavily optimized for loading speed, especially for very large model configuration files. The `RapidJSON` backend delivers a significant speedup:
+
+| Metric (41 Models / 1691 Test Cases) | nlohmann/json | RapidJSON (via ujson) | Speedup |
+| :--- | :--- | :--- | :--- |
+| **Total Loading Time** | 92.40 s | 47.13 s | **1.96x** |
+| **Total Encode Time** | 0.25 s | 0.23 s | 1.07x |
+| **Total Time** | 92.65 s | 47.36 s | **1.95x** |
+
+*Benchmarks cover 41 different model architectures.*
+
 ## Documentation
 
 For an in-depth look at the project architecture and implementation, see [doc/implementation_details_CN.md](doc/implementation_details_CN.md).
````

doc/implementation_details_CN.md

Lines changed: 4 additions & 2 deletions

```diff
@@ -49,9 +49,11 @@
 * **Pruning**: Removed all non-UTF-8 encoding support (EUC-JP, SJIS, etc.) and the POSIX/GNU compatibility layers, keeping only the core regex engine.
 * **Size Optimization**: Adds very little to the final binary size, far less than pulling in ICU or the full Oniguruma.
 
-### 2. JSON Loading and Compatibility
+### 2. JSON Loading and Performance Optimization
 * Parses the standard HuggingFace `tokenizer.json` directly.
-* Uses `nlohmann::json` to handle complex nested configurations (e.g. a `Sequence` nested inside `normalizer`, itself nesting a `Replace`).
+* **ujson Bridge**: Introduces a `ujson` bridge layer that allows switching flexibly between `nlohmann/json` and `RapidJSON`.
+* **Backend Switching**: Developers can enable the RapidJSON backend via the `UJSON_USE_RAPIDJSON` macro.
+* **Performance Gain**: In large-scale tests across 40+ models, the RapidJSON backend cuts model loading time from ~92 s to ~47 s, a nearly **2x** loading speedup.
 * Implements factory-pattern loading for the polymorphic `pre_tokenizer` and `normalizer` types.
 
 ### 3. Unicode Handling
```
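
`ujson.hpp` itself is among the files hidden from this commit view, so the bridge's exact contents are not shown. Judging from the call sites in `src/tokenizer.cpp` below (`json::parse`, `dump()`, `value()`, `contains()`, `get<T>()`) and the `UJSON_USE_RAPIDJSON` option, a plausible minimal shape is a compile-time alias; the adapter header and namespace on the RapidJSON side are assumptions:

```cpp
// Hypothetical sketch of ujson.hpp (the real header is not shown in this
// commit view). It exposes one alias, ujson::json, with an nlohmann-style
// interface, and selects the backend at compile time.
#pragma once

#if defined(UJSON_USE_RAPIDJSON)
  // Assumed: a thin adapter giving rapidjson::Document an nlohmann-like
  // interface (parse/dump/value/contains/get<T>), since RapidJSON's native
  // API differs substantially from nlohmann's.
  #include "ujson_rapidjson_adapter.hpp"
  namespace ujson { using json = rapidjson_adapter::json; }
#else
  #include <nlohmann/json.hpp>
  namespace ujson { using json = nlohmann::json; }
#endif
```

With this shape, the single `using json = ujson::json;` added to `tokenizer.cpp` becomes the only coupling point, which is why the rest of the diff below is largely a mechanical `nlohmann::json` to `json` rename.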

src/tokenizer.cpp

Lines changed: 39 additions & 32 deletions

```diff
@@ -9,10 +9,14 @@
 #include <cmath>
 #include <oniguruma.h>
 #include <utf8proc/utf8proc.h>
+#include <iostream>
+#include "ujson.hpp"
 #include "jinja.hpp"
 
 namespace tokenizer {
 
+using json = ujson::json;
+
 // ==========================================
 // C++11 Polyfills
 // ==========================================
@@ -87,7 +91,7 @@ static std::string OnigurumaRegexEscape(const std::string& pattern) {
   return escaped;
 }
 
-static std::string get_token_content(const nlohmann::json& j) {
+static std::string get_token_content(const json& j) {
   if (j.is_string()) return j.get<std::string>();
   if (j.is_object() && j.contains("content")) return j["content"].get<std::string>();
   return "";
@@ -560,8 +564,8 @@ class BPEModel : public Model {
     return out;
   }
 
-  void load(const nlohmann::json& v, const nlohmann::json& m) {
-    for (auto it = v.begin(); it != v.end(); ++it) { vocab_[it.key()] = it.value(); id_to_token_[it.value()] = it.key(); }
+  void load(const json& v, const json& m) {
+    for (auto it = v.begin(); it != v.end(); ++it) { vocab_[it.key()] = it.value().get<int>(); id_to_token_[it.value().get<int>()] = it.key(); }
     int rank = 0;
     for (const auto& item : m) {
       std::string s1, s2;
@@ -585,10 +589,10 @@ class WordPieceModel : public Model {
   WordPieceModel(const std::string& unk = "[UNK]", const std::string& prefix = "##", int max_chars = 100)
     : unk_token_(unk), continuing_subword_prefix_(prefix), max_input_chars_per_word_(max_chars), unk_token_id_(-1) {}
 
-  void load(const nlohmann::json& v) {
+  void load(const json& v) {
     for (auto it = v.begin(); it != v.end(); ++it) {
-      vocab_[it.key()] = it.value();
-      id_to_token_[it.value()] = it.key();
+      vocab_[it.key()] = it.value().get<int>();
+      id_to_token_[it.value().get<int>()] = it.key();
     }
     auto it = vocab_.find(unk_token_);
     if (it != vocab_.end()) unk_token_id_ = it->second;
@@ -659,7 +663,7 @@ class UnigramModel : public Model {
   UnigramModel(int unk_id = 0, bool byte_fallback = false)
     : unk_token_id_(unk_id), byte_fallback_(byte_fallback) {}
 
-  void load(const nlohmann::json& v) {
+  void load(const json& v) {
     int idx = 0;
     for (const auto& item : v) {
       if (item.is_array() && item.size() >= 2) {
@@ -1059,7 +1063,7 @@ struct PreTrainedTokenizer::Impl {
     }
   }
 
-  bool load_from_json(PreTrainedTokenizer* public_api, const nlohmann::json& j) {
+  bool load_from_json(PreTrainedTokenizer* public_api, const json& j) {
     if (j.contains("model") && j["model"].is_object()) {
       std::string model_type = j["model"].value("type", "");
       // Auto-detect model type if not specified
@@ -1118,7 +1122,7 @@ struct PreTrainedTokenizer::Impl {
       if (j["model"].contains("byte_fallback")) byte_fallback = j["model"]["byte_fallback"].get<bool>();
 
       bool use_byte_level = false;
-      auto check_bl = [](const nlohmann::json& c) -> bool {
+      auto check_bl = [](const json& c) -> bool {
        if (!c.is_object()) return false;
        if (c.value("type", "") == "ByteLevel") return true;
        if (c.contains("pretokenizers")) {
@@ -1132,9 +1136,9 @@ struct PreTrainedTokenizer::Impl {
        }
        return false;
      };
-      if (check_bl(j.value("pre_tokenizer", nlohmann::json()))) use_byte_level = true;
-      if (check_bl(j.value("post_processor", nlohmann::json()))) use_byte_level = true;
-      if (check_bl(j.value("decoder", nlohmann::json()))) use_byte_level = true;
+      if (check_bl(j.value("pre_tokenizer", json()))) use_byte_level = true;
+      if (check_bl(j.value("post_processor", json()))) use_byte_level = true;
+      if (check_bl(j.value("decoder", json()))) use_byte_level = true;
 
       // If we have a ByteLevelPreTokenizer in the sequence, BPEModel should not do the mapping itself
       bool pt_has_byte_level = false;
@@ -1151,7 +1155,7 @@ struct PreTrainedTokenizer::Impl {
       }
     }
     if (j.contains("normalizer") && !j["normalizer"].is_null()) {
-      auto create_norm = [&](const nlohmann::json& s) -> std::shared_ptr<Normalizer> {
+      auto create_norm = [&](const json& s) -> std::shared_ptr<Normalizer> {
        std::string type = s.value("type", "");
        if (type == "NFKC") return std::make_shared<NFKCNormalizer>();
        if (type == "Precompiled") {
@@ -1199,7 +1203,7 @@ struct PreTrainedTokenizer::Impl {
      }
    }
    if (j.contains("decoder") && !j["decoder"].is_null()) {
-      auto create_dec = [&](const nlohmann::json& s) -> std::shared_ptr<Decoder> {
+      auto create_dec = [&](const json& s) -> std::shared_ptr<Decoder> {
        std::string type = s.value("type", "");
        if (type == "Replace") {
          std::string p;
@@ -1232,7 +1236,7 @@ struct PreTrainedTokenizer::Impl {
    }
    if (j.contains("pre_tokenizer") && !j["pre_tokenizer"].is_null()) {
      auto pt = j["pre_tokenizer"];
-      auto create_pt = [&](const nlohmann::json& s) -> std::shared_ptr<PreTokenizer> {
+      auto create_pt = [&](const json& s) -> std::shared_ptr<PreTokenizer> {
        std::string type = s.value("type", "");
        if (type == "Split") {
          std::string p;
@@ -1269,22 +1273,22 @@ struct PreTrainedTokenizer::Impl {
    }
    if (j.contains("post_processor") && !j["post_processor"].is_null()) {
      auto pp = j["post_processor"];
-      auto ptl = [&](const nlohmann::json& s) {
+      auto ptl = [&](const json& s) {
        std::vector<TemplateProcessing::Step> steps;
        if (s.contains("single")) {
-          for (auto& i : s["single"]) {
+          for (const auto& i : s["single"]) {
            if (i.contains("SpecialToken")) steps.push_back({true, public_api->token_to_id(i["SpecialToken"]["id"].get<std::string>())});
            else if (i.contains("Sequence")) steps.push_back({false, 0});
          }
          this->post_processor_ = std::make_shared<TemplateProcessing>(steps);
        }
      };
      if (pp.value("type", "") == "TemplateProcessing") ptl(pp);
-      else if (pp.value("type", "") == "Sequence" && pp.contains("processors")) { for (auto& s : pp["processors"]) if (s.value("type", "") == "TemplateProcessing") { ptl(s); break; } }
+      else if (pp.value("type", "") == "Sequence" && pp.contains("processors")) { for (const auto& s : pp["processors"]) if (s.value("type", "") == "TemplateProcessing") { ptl(s); break; } }
    }
    if (j.contains("added_tokens") && j["added_tokens"].is_array()) {
      std::vector<std::string> cs;
-      for (auto& item : j["added_tokens"]) {
+      for (const auto& item : j["added_tokens"]) {
        std::string c = item.value("content", ""); int id = item.value("id", -1);
        bool special = item.value("special", false);
        bool lstrip = item.value("lstrip", false);
@@ -1358,30 +1362,29 @@ void PreTrainedTokenizer::set_chat_template(const std::string& t) {
   impl_->chat_template_ = t;
   impl_->jinja_template_ = std::make_shared<jinja::Template>(t);
 }
-
 std::string PreTrainedTokenizer::apply_chat_template(const ChatMessages& msgs, bool add_gen) const {
   if (!impl_->jinja_template_) return "";
-  nlohmann::json j_msgs = nlohmann::json::array();
+  json j_msgs = json::array();
   for (const auto& m : msgs) j_msgs.push_back({{"role", m.first}, {"content", m.second}});
-  nlohmann::json extra;
+  json extra = json::object();
   extra["bos_token"] = id_to_token(impl_->special_tokens_.bos);
   extra["eos_token"] = id_to_token(impl_->special_tokens_.eos);
-  return impl_->jinja_template_->apply_chat_template(j_msgs, add_gen, nlohmann::json::array(), extra);
+  return impl_->jinja_template_->apply_chat_template(j_msgs, add_gen, json::array(), extra);
 }
 
 std::string PreTrainedTokenizer::apply_chat_template(const std::string& json_str, bool add_generation_prompt) const {
   if (!impl_->jinja_template_) return "";
-  auto j_msgs = nlohmann::json::parse(json_str, nullptr, false);
+  auto j_msgs = json::parse(json_str);
   if (!j_msgs.is_array()) return "";
-  nlohmann::json extra;
+  json extra = json::object();
   extra["bos_token"] = id_to_token(impl_->special_tokens_.bos);
   extra["eos_token"] = id_to_token(impl_->special_tokens_.eos);
-  return impl_->jinja_template_->apply_chat_template(j_msgs, add_generation_prompt, nlohmann::json::array(), extra);
+  return impl_->jinja_template_->apply_chat_template(j_msgs, add_generation_prompt, json::array(), extra);
 }
 
 bool PreTrainedTokenizer::load_from_json_str(const std::string& json_str) {
-  auto j = nlohmann::json::parse(json_str, nullptr, false);
-  if (j.is_discarded()) return false;
+  auto j = json::parse(json_str);
+  if (j.is_null()) return false;
   return impl_->load_from_json(this, j);
 }
 
@@ -1396,16 +1399,20 @@ void PreTrainedTokenizer::set_clean_up_tokenization_spaces(bool clean) {
 std::shared_ptr<PreTrainedTokenizer> AutoTokenizer::from_pretrained(const std::string& path) {
   auto tok = std::make_shared<PreTrainedTokenizer>();
   std::ifstream f(path + "/tokenizer.json"); if (!f.is_open()) return nullptr;
-  nlohmann::json j; f >> j;
+  std::stringstream ss_j; ss_j << f.rdbuf();
+  json j = json::parse(ss_j.str());
+  if (j.is_null()) return nullptr;
+
   std::ifstream fc(path + "/tokenizer_config.json");
   bool clean_up_spaces = false;
   if (fc.is_open()) {
-    nlohmann::json jc; fc >> jc; if (jc.contains("chat_template")) tok->set_chat_template(jc["chat_template"].get<std::string>());
+    std::stringstream ss_jc; ss_jc << fc.rdbuf();
+    json jc = json::parse(ss_jc.str());
+    if (jc.contains("chat_template")) tok->set_chat_template(jc["chat_template"].get<std::string>());
     clean_up_spaces = jc.value("clean_up_tokenization_spaces", false);
     j["config_overrides"] = jc;
   }
-  std::stringstream ss; ss << j;
-  if (!tok->load_from_json_str(ss.str())) return nullptr;
+  if (!tok->load_from_json_str(j.dump())) return nullptr;
   tok->set_clean_up_tokenization_spaces(clean_up_spaces);
   return tok;
 }
```
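
Two behavioral details of the bridge are visible above: vocabulary values now need an explicit `.get<int>()` (the bridge type evidently lacks nlohmann's implicit conversions), and a failed parse yields a null value checked with `is_null()`, replacing nlohmann's `parse(str, nullptr, false)` / `is_discarded()` idiom. A caller-side sketch of that contract, assuming `ujson::json::parse` returns null rather than throwing on malformed input:

```cpp
#include <iostream>
#include <string>
#include "ujson.hpp"

using json = ujson::json;

int main() {
  // Parse failure surfaces as a null value (as in load_from_json_str above),
  // not as nlohmann's discarded sentinel; non-throwing behavior is assumed.
  json bad = json::parse("{ not valid json ");
  std::cout << (bad.is_null() ? "parse failed" : "parsed") << std::endl;

  json ok = json::parse(R"({"model": {"type": "BPE"}})");
  if (!ok.is_null() && ok.contains("model") && ok["model"].is_object()) {
    // Values are read through explicit accessors, matching the diff's style.
    std::cout << "model type: " << ok["model"].value("type", "") << std::endl;
  }
  return 0;
}
```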

tests/test_main.cpp

Lines changed: 2 additions & 2 deletions

```diff
@@ -18,9 +18,9 @@
 #include "tokenizer.hpp"
 
 #include <utf8proc/utf8proc.h>
-#include <nlohmann/json.hpp>
+#include "ujson.hpp"
 
-using json = nlohmann::json;
+using json = ujson::json;
 
 // ==================== Color definitions ====================
 namespace Color {
```
