Commit 213c6c2

Enable jieba in wasm build and assets
1 parent dc223a8 commit 213c6c2

29 files changed, 958,909 additions and 5 deletions

wasm-lib/README.md

Lines changed: 5 additions & 2 deletions
@@ -78,10 +78,12 @@ const result = await converter("服务器软件"); // 伺服器軟體
 | Config | Description | Example |
 |--------|-------------|---------|
 | `s2twp` | Simplified → Taiwan Traditional (with regional phrases) | 软件 → 軟體 |
+| `s2twp_jieba` | Simplified → Taiwan Traditional (jieba segmentation) | 城堡的士兵 → 城堡的士兵 |
 | `s2tw` | Simplified → Taiwan Traditional | 心里 → 心裡 |
 | `s2hk` | Simplified → Hong Kong Traditional | 心里 → 心裏 |
 | `s2t` | Simplified → OpenCC Standard Traditional | 简体 → 簡體 |
 | `tw2sp` | Taiwan → Simplified (with regional phrases) | 滑鼠 → 鼠标 |
+| `tw2sp_jieba` | Taiwan → Simplified (jieba segmentation) | 慰藉著 → 慰藉着 |
 | `tw2s` | Taiwan → Simplified | 軟體 → 软件 |
 | `tw2t` | Taiwan → Traditional | 吃飯 → 喫飯 |
 | `hk2s` | Hong Kong → Simplified | 打印機 → 打印机 |
@@ -276,7 +278,7 @@ console.log(await t2s("繁體")); // 繁体
 ```typescript
 import OpenCC from 'opencc-wasm';
 
-type ConfigName = 's2t' | 's2tw' | 's2twp' | 't2s';
+type ConfigName = 's2t' | 's2tw' | 's2twp' | 's2twp_jieba' | 't2s' | 'tw2sp_jieba';
 
 async function convert(config: ConfigName, text: string): Promise<string> {
   const converter = OpenCC.Converter({ config });
@@ -342,7 +344,7 @@ wasm-lib/
 │ │ ├── index.cjs
 │ │ ├── opencc-wasm.cjs
 │ │ └── opencc-wasm.wasm
-│ └── data/ ← OpenCC configs + dicts
+│ └── data/ ← OpenCC configs + dicts (+ jieba files if enabled)
 ├── index.js ← Source API
 ├── index.d.ts ← TypeScript definitions
 └── scripts/
@@ -375,6 +377,7 @@ A: Initial load downloads configs + dicts (~1-2MB). Subsequent conversions are f
 
 - Uses persistent OpenCC handles to avoid reloading configs
 - Dictionaries stored in `/data/dict/` in virtual FS
+- Jieba assets stored in `/data/jieba_dict/` (dict, hmm_model, user dict, idf, stop_words)
 - Memory grows on demand (`ALLOW_MEMORY_GROWTH=1`)
 - Performance: Focuses on fidelity and compatibility with official OpenCC. May be slower than pure-JS implementations for raw throughput, but guarantees full OpenCC behavior.
 
wasm-lib/README.zh.md

Lines changed: 4 additions & 1 deletion
@@ -78,10 +78,12 @@ const result = await converter("服务器软件"); // 伺服器軟體
 | 設定檔 | 說明 | 範例 |
 |--------|------|------|
 | `s2twp` | 簡體 → 台灣正體(含地域用詞轉換) | 软件 → 軟體 |
+| `s2twp_jieba` | 簡體 → 台灣正體(jieba 分詞) | 城堡的士兵 → 城堡的士兵 |
 | `s2tw` | 簡體 → 台灣正體 | 心里 → 心裡 |
 | `s2hk` | 簡體 → 香港繁體 | 心里 → 心裏 |
 | `s2t` | 簡體 → OpenCC 標準繁體 | 简体 → 簡體 |
 | `tw2sp` | 台灣正體 → 簡體(含地域用詞轉換) | 滑鼠 → 鼠标 |
+| `tw2sp_jieba` | 台灣正體 → 簡體(jieba 分詞) | 慰藉著 → 慰藉着 |
 | `tw2s` | 台灣正體 → 簡體 | 軟體 → 软件 |
 | `tw2t` | 台灣正體 → OpenCC 標準繁體 | 吃飯 → 喫飯 |
 | `hk2s` | 香港繁體 → 簡體 | 打印機 → 打印机 |
@@ -276,7 +278,7 @@ console.log(await t2s("繁體")); // 繁体
 ```typescript
 import OpenCC from 'opencc-wasm';
 
-type ConfigName = 's2t' | 's2tw' | 's2twp' | 't2s';
+type ConfigName = 's2t' | 's2tw' | 's2twp' | 's2twp_jieba' | 't2s' | 'tw2sp_jieba';
 
 async function convert(config: ConfigName, text: string): Promise<string> {
   const converter = OpenCC.Converter({ config });
@@ -375,6 +377,7 @@ A:首次載入需要下載設定檔和字典檔(約 1-2MB)。後續轉換
 
 - 使用持久的 OpenCC 控制代碼避免重複載入設定
 - 字典儲存在虛擬檔案系統的 `/data/dict/`
+- Jieba 資產儲存在 `/data/jieba_dict/`(詞典、hmm_model、user dict、idf、stop_words)
 - 記憶體按需成長(`ALLOW_MEMORY_GROWTH=1`)
 - 效能:專注於精確度和與官方 OpenCC 的相容性。原始吞吐量可能比純 JavaScript 實作慢,但保證完整的 OpenCC 行為。
 
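The `ConfigName` union in the README hunks above is a hand-written list, so it can drift from the configs that actually ship. A minimal sketch of one way to keep a single source of truth and validate user input at runtime follows. The array only holds names that appear in this commit's README tables, the helper names `isConfigName` and `usesJieba` are hypothetical, and the `_jieba` suffix convention is inferred from the two new config names rather than documented anywhere.

```typescript
// Config names copied from the README tables in this commit.
// Illustrative only: this is not the package's authoritative list.
const CONFIG_NAMES = [
  's2t', 's2tw', 's2twp', 's2twp_jieba',
  't2s', 'tw2sp', 'tw2sp_jieba',
] as const;

// The union type is derived from the array, so the two cannot drift apart.
type ConfigName = (typeof CONFIG_NAMES)[number];

// Hypothetical helper: narrows an arbitrary string to ConfigName.
function isConfigName(value: string): value is ConfigName {
  return (CONFIG_NAMES as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper: true when the config name carries the "_jieba"
// suffix (assumed convention, based on the two names added here).
function usesJieba(config: ConfigName): boolean {
  return /_jieba$/.test(config);
}
```

A caller could check `isConfigName(input)` before building a converter, and use `usesJieba` to decide whether the larger jieba assets need to be present.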
wasm-lib/build.sh

Lines changed: 3 additions & 0 deletions
@@ -22,6 +22,7 @@ OPENCC_SRCS=(
   ${OPENCC_SRC_DIR}/src/Dict.cpp
   ${OPENCC_SRC_DIR}/src/DictEntry.cpp
   ${OPENCC_SRC_DIR}/src/DictGroup.cpp
+  ${OPENCC_SRC_DIR}/src/JiebaSegmentation.cpp
   ${OPENCC_SRC_DIR}/src/Lexicon.cpp
   ${OPENCC_SRC_DIR}/src/MarisaDict.cpp
   ${OPENCC_SRC_DIR}/src/MaxMatchSegmentation.cpp
@@ -49,6 +50,7 @@ MARISA_SRCS=(
 # Header search paths
 INCLUDE_FLAGS=(
   -I${OPENCC_SRC_DIR}/src
+  -I${OPENCC_SRC_DIR}/deps/libcppjieba/include
   -I${MARISA_DIR}/include
   -I${MARISA_DIR}/lib
   -I${OPENCC_SRC_DIR}/deps/rapidjson-1.1.0
@@ -63,6 +65,7 @@ INCLUDE_FLAGS=(
 # -O2: size/performance trade-off
 COMMON_FLAGS=(
   -DOPENCC_WASM_WITH_OPENCC
+  -DENABLE_JIEBA
   "${OPENCC_SRCS[@]}"
   "${MARISA_SRCS[@]}"
   src/main.cpp
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+{
+  "name": "Simplified Chinese to Traditional Chinese (Taiwan standard, with phrases, Jieba Segmentation - Experimental)",
+  "segmentation": {
+    "type": "jieba",
+    "dict_path": "jieba_dict/jieba.dict.utf8",
+    "model_path": "jieba_dict/hmm_model.utf8",
+    "user_dict_path": "jieba_dict/user.dict.utf8"
+  },
+  "conversion_chain": [{
+    "dict": {
+      "type": "group",
+      "dicts": [{
+        "type": "ocd2",
+        "file": "STPhrases.ocd2"
+      }, {
+        "type": "ocd2",
+        "file": "STCharacters.ocd2"
+      }]
+    }
+  }, {
+    "dict": {
+      "type": "ocd2",
+      "file": "TWPhrases.ocd2"
+    }
+  }, {
+    "dict": {
+      "type": "ocd2",
+      "file": "TWVariants.ocd2"
+    }
+  }]
+}
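The config above changes only the `segmentation` block; the dictionary chain is the usual phrase-then-variant sequence. The toy sketch below illustrates why the segmenter matters, using the README example 城堡的士兵. It is not OpenCC's implementation: the two-entry dictionary is made up for illustration, the word boundaries are supplied by hand rather than produced by jieba, and the function names are hypothetical.

```typescript
type Dict = Record<string, string>;

// Toy phrase dictionary (an assumption for illustration, not OpenCC data).
const TOY_PHRASES: Dict = { '的士': '計程車', '软件': '軟體' };

// Greedy forward longest match: at each position take the longest
// dictionary key that matches, otherwise copy one character through.
// Indexing by UTF-16 unit is fine here because all characters are in the BMP.
function convertGreedy(text: string, dict: Dict): string {
  const maxLen = Object.keys(dict).reduce((m, k) => Math.max(m, k.length), 1);
  let out = '';
  let i = 0;
  while (i < text.length) {
    let consumed = 1;
    let replacement = text.charAt(i);
    for (let len = Math.min(maxLen, text.length - i); len >= 1; len--) {
      const candidate = text.slice(i, i + len);
      if (Object.prototype.hasOwnProperty.call(dict, candidate)) {
        consumed = len;
        replacement = dict[candidate];
        break;
      }
    }
    out += replacement;
    i += consumed;
  }
  return out;
}

// Segment-first conversion: every pre-cut word is converted on its own,
// so a dictionary match can never straddle a word boundary.
function convertSegmented(words: string[], dict: Dict): string {
  return words.map((w) => convertGreedy(w, dict)).join('');
}

// Matching over the raw string picks up 的士 across the 的 / 士兵 boundary.
const unsegmented = convertGreedy('城堡的士兵', TOY_PHRASES);
// With hand-supplied word boundaries the text passes through unchanged.
const segmented = convertSegmented(['城堡', '的', '士兵'], TOY_PHRASES);
```

In this toy setup `unsegmented` comes out as 城堡計程車兵 while `segmented` stays 城堡的士兵, which is presumably the kind of case the experimental jieba configs are meant to handle better.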
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+{
+  "name": "Traditional Chinese (Taiwan standard) to Simplified Chinese (with phrases, Jieba Segmentation - Experimental)",
+  "segmentation": {
+    "type": "jieba",
+    "dict_path": "jieba_dict/jieba.dict.utf8",
+    "model_path": "jieba_dict/hmm_model.utf8",
+    "user_dict_path": "jieba_dict/user.dict.utf8"
+  },
+  "conversion_chain": [{
+    "dict": {
+      "type": "group",
+      "dicts": [{
+        "type": "ocd2",
+        "file": "TWPhrasesRev.ocd2"
+      }, {
+        "type": "ocd2",
+        "file": "TWVariantsRevPhrases.ocd2"
+      }, {
+        "type": "ocd2",
+        "file": "TWVariantsRev.ocd2"
+      }]
+    }
+  }, {
+    "dict": {
+      "type": "group",
+      "dicts": [{
+        "type": "ocd2",
+        "file": "TSPhrases.ocd2"
+      }, {
+        "type": "ocd2",
+        "file": "TSCharacters.ocd2"
+      }]
+    }
+  }]
+}
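Both new configs reference files in two places: the `segmentation` block and the nested `dict` nodes of `conversion_chain`. When preloading a virtual filesystem it can help to list every referenced path up front. The following is a minimal sketch under that assumption; the interfaces cover only the fields visible in this diff, not OpenCC's full config schema, and `listReferencedFiles` is a hypothetical helper.

```typescript
// Minimal shapes for the fields used in the configs above.
interface DictNode {
  type: string;
  file?: string;
  dicts?: DictNode[];
}

interface SegmentationNode {
  type: string;
  dict_path?: string;
  model_path?: string;
  user_dict_path?: string;
}

interface ConfigSketch {
  segmentation: SegmentationNode;
  conversion_chain: { dict: DictNode }[];
}

// Walk a dict node depth-first, appending every "file" it references.
function collectDictFiles(node: DictNode, out: string[]): void {
  if (node.file !== undefined) {
    out.push(node.file);
  }
  (node.dicts || []).forEach((child) => collectDictFiles(child, out));
}

// Hypothetical helper: every path a config mentions, in config order.
function listReferencedFiles(config: ConfigSketch): string[] {
  const files: string[] = [];
  const seg = config.segmentation;
  [seg.dict_path, seg.model_path, seg.user_dict_path].forEach((p) => {
    if (p !== undefined) {
      files.push(p);
    }
  });
  config.conversion_chain.forEach((step) => collectDictFiles(step.dict, files));
  return files;
}

// The Taiwan-to-Simplified jieba config from this commit, without "name".
const twToSimplifiedJieba: ConfigSketch = {
  segmentation: {
    type: 'jieba',
    dict_path: 'jieba_dict/jieba.dict.utf8',
    model_path: 'jieba_dict/hmm_model.utf8',
    user_dict_path: 'jieba_dict/user.dict.utf8',
  },
  conversion_chain: [
    { dict: { type: 'group', dicts: [
      { type: 'ocd2', file: 'TWPhrasesRev.ocd2' },
      { type: 'ocd2', file: 'TWVariantsRevPhrases.ocd2' },
      { type: 'ocd2', file: 'TWVariantsRev.ocd2' },
    ] } },
    { dict: { type: 'group', dicts: [
      { type: 'ocd2', file: 'TSPhrases.ocd2' },
      { type: 'ocd2', file: 'TSCharacters.ocd2' },
    ] } },
  ],
};

const referenced = listReferencedFiles(twToSimplifiedJieba);
```

Here `referenced` holds the three jieba paths followed by the five `.ocd2` files. How those relative paths are resolved inside the WASM filesystem is not shown in this diff, so the sketch deliberately leaves them as written.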
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+package(default_visibility = ["//visibility:public"])
+
+filegroup(
+    name = "jieba_dict",
+    srcs = glob(["*.utf8", "README.md"]),
+)

wasm-lib/data/jieba_dict/README.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+# Jieba segmentation dictionaries
+
+This directory contains the dictionary files required for Jieba Chinese word segmentation, taken from [libcppjieba](https://github.com/yanyiwu/libcppjieba).
+
+## Files
+
+- **jieba.dict.utf8** (4.9 MB) - main dictionary file, containing words and their frequencies
+- **hmm_model.utf8** (508 KB) - hidden Markov model (HMM) file, used to recognize out-of-vocabulary words
+- **user.dict.utf8** (33 B) - user-defined dictionary (optional)
+
+## License
+
+These dictionary files are inherited from the jieba project and are covered by the MIT license.
+
+## Usage
+
+Specify the paths to these dictionaries in the OpenCC config file. IDF and stop-word data
+are resolved automatically from `deps/libcppjieba/dict/` and do not need to be copied into this directory:
+
+```json
+{
+  "segmentation": {
+    "type": "jieba",
+    "dict_path": "jieba_dict/jieba.dict.utf8",
+    "model_path": "jieba_dict/hmm_model.utf8",
+    "user_dict_path": "jieba_dict/user.dict.utf8"
+  }
+}
+```
+
+## Custom user dictionary
+
+You can edit `user.dict.utf8` to add custom words, in the format:
+
+```
+word frequency part-of-speech
+```
+
+For example:
+```
+云计算 5 n
+机器学习 8 n
+```
+
+One word per line; the frequency and part-of-speech tag are optional.
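The user dictionary format described above (one word per line, optional frequency and part-of-speech tag) is simple enough to check before shipping a custom `user.dict.utf8`. Below is a minimal validation sketch. `parseUserDict` and `UserDictEntry` are hypothetical names, and the rule for two-field lines (a numeric second field is read as a frequency, anything else as a tag) is an assumption of this sketch; the README does not say how the real loader treats that case, so check libcppjieba before relying on it.

```typescript
interface UserDictEntry {
  word: string;
  freq?: number; // left unset when the line gives no frequency
  tag?: string;  // part-of-speech tag, left unset when absent
}

// Parse user-dictionary text into entries, throwing on malformed lines.
function parseUserDict(text: string): UserDictEntry[] {
  const entries: UserDictEntry[] = [];
  text.split(/\r?\n/).forEach((rawLine, index) => {
    const line = rawLine.trim();
    if (line === '') {
      return; // skip blank lines
    }
    const fields = line.split(/\s+/);
    if (fields.length > 3) {
      throw new Error(`line ${index + 1}: expected at most 3 fields`);
    }
    const entry: UserDictEntry = { word: fields[0] };
    if (fields.length === 3) {
      if (!/^\d+$/.test(fields[1])) {
        throw new Error(`line ${index + 1}: frequency must be an integer`);
      }
      entry.freq = Number(fields[1]);
      entry.tag = fields[2];
    } else if (fields.length === 2) {
      // Assumption of this sketch: digits mean frequency, otherwise a tag.
      if (/^\d+$/.test(fields[1])) {
        entry.freq = Number(fields[1]);
      } else {
        entry.tag = fields[1];
      }
    }
    entries.push(entry);
  });
  return entries;
}

// The two sample lines from the README plus a bare word.
const sampleEntries = parseUserDict('云计算 5 n\n机器学习 8 n\n\n区块链\n');
```

With the sample input, `sampleEntries` contains three entries; the last one has only `word` set because its line carries no frequency or tag.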

wasm-lib/data/jieba_dict/hmm_model.utf8

Lines changed: 34 additions & 0 deletions
Large diffs are not rendered by default.
