doc: Proposal for a dynamically loaded Jieba plugin architecture and rollout plan by frankslin · Pull Request #27 · frankslin/OpenCC

frankslin · 2026-03-07T14:37:35Z

Motivation

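Jieba is currently compiled statically into the core at build time through the ENABLE_JIEBA option, which couples the core to an optional segmentation capability and makes packaging difficult.
The goal is to turn Jieba into a plugin: when libopencc encounters "segmentation": { "type": "jieba" }, it provides segmentation by dynamically loading a plugin (for example libopencc-jieba.so). This lets distributions split the package into opencc and opencc-jieba.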

Description

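Add doc/JIEBA_PLUGIN_ARCHITECTURE_PLAN.md, which provides the complete plugin design and rollout plan: an architecture diagram, a draft C ABI function table (sketched as src/plugin/OpenCCPlugin.h), the PluginManager and PluginSegmentationAdapter designs, error semantics and the memory release strategy, the plugin search order and security controls, and CMake/Bazel and packaging recommendations.
Add a cross-reference link at the top of doc/JIEBA_USAGE.md that points users to the new plugin design document, to help with migration and packaging.
The document defines plugin ABI versioning, file naming and per-platform mapping, the environment variable OPENCC_SEGMENTATION_PLUGIN_PATH, and an optional OPENCC_DISABLE_PLUGINS security switch. It also proposes an incremental migration and compatibility strategy (the legacy ENABLE_JIEBA support is kept for the transition).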
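For illustration only, the ABI version check and the OPENCC_DISABLE_PLUGINS switch described above could look roughly like the C++ sketch below. The struct layout, the names OpenCCSegmentationPluginV1, kHostAbiVersion, LoadResult and ValidatePlugin, and the rule that any non-empty value other than "0" disables plugins are all assumptions made for this example; the actual draft lives in doc/JIEBA_PLUGIN_ARCHITECTURE_PLAN.md and may differ.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical ABI version the host was built against (assumption).
constexpr std::uint32_t kHostAbiVersion = 1;

// Hypothetical function table a segmentation plugin would hand to the host.
// In a real C ABI header the exported entry point would be declared extern "C".
struct OpenCCSegmentationPluginV1 {
  std::uint32_t abi_version;                 // must match kHostAbiVersion
  void* (*create)(const char* config_json);  // returns an opaque plugin handle
  void (*destroy)(void* handle);             // releases a handle from create()
};

enum class LoadResult {
  kOk,
  kDisabled,
  kNullTable,
  kAbiMismatch,
  kIncompleteTable
};

// Reads the OPENCC_DISABLE_PLUGINS switch. Treating "unset", "" and "0" as
// "plugins allowed" is an assumption of this sketch, not taken from the plan.
bool PluginsDisabled() {
  const char* value = std::getenv("OPENCC_DISABLE_PLUGINS");
  return value != nullptr && value[0] != '\0' && std::string(value) != "0";
}

// Decides whether a function table obtained from a plugin may be used.
LoadResult ValidatePlugin(const OpenCCSegmentationPluginV1* table,
                          bool plugins_disabled) {
  if (plugins_disabled) return LoadResult::kDisabled;
  if (table == nullptr) return LoadResult::kNullTable;
  if (table->abi_version != kHostAbiVersion) return LoadResult::kAbiMismatch;
  if (table->create == nullptr || table->destroy == nullptr) {
    return LoadResult::kIncompleteTable;
  }
  return LoadResult::kOk;
}
```

A host would call something like `ValidatePlugin(table, PluginsDisabled())` right after resolving the table from the shared library, and refuse to use the plugin on any result other than `kOk`. Checking the version before touching any function pointer keeps an incompatible plugin from being called at all.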

Testing

This PR only adds documentation and edits text. It does not change executable code or APIs, so no code-level automated tests were run.

* Add WASM demo scaffold and project notes
* Add OpenCC WASM demo with converter UI and test runner
  - Document how to use the compiled WASM output from front-end JS
* Polish WASM demo UI and paths, run tests, and streamline converter export
* Add wasm-based OpenCC package and update demo to consume it
* Add wasm-based OpenCC package, static demo bundle, and benchmarking page
* Add copyright notice and LICENSE

…eparation

This commit enhances the opencc-wasm library with TypeScript support and implements a cleaner build architecture with semantic separation between intermediate build artifacts and publishable distribution.

TypeScript Support:
- Add comprehensive type definitions (index.d.ts) with full JSDoc documentation
- Define interfaces: ConverterOptions, ConverterFunction, OpenCCNamespace, etc.
- Provide complete type safety for better IDE support and developer experience

Build Architecture Redesign (semantic separation):
- build/ - Intermediate WASM artifacts (gitignored, for tests/development)
  * build/opencc-wasm.esm.js - ESM WASM glue
  * build/opencc-wasm.cjs - CJS WASM glue
  * build/opencc-wasm.wasm - WASM binary
- dist/ - Publishable distribution (committed, for npm)
  * dist/esm/ - ESM package entry
  * dist/cjs/ - CJS package entry
  * dist/data/ - OpenCC config and dictionary files

Invariants and Semantics:
- Tests import source (index.js) → loads from build/
- Published package exports dist/ only
- build/ = internal intermediate artifacts
- dist/ = publishable artifacts
- Clear separation ensures tests validate actual build output

Enhanced .gitignore:
- Add build/ to gitignore (intermediate artifacts)
- Add node_modules/, logs, OS-specific files (.DS_Store, Thumbs.db)
- Exclude editor configurations (.vscode/, .idea/)
- Add cache and temporary file exclusions

Two-Stage Build Process:

Stage 1 (build.sh):
- Compiles C++ to WASM using Emscripten
- Outputs to build/ directory

Stage 2 (build-api.js):
- Copies WASM artifacts from build/ to dist/
- Transforms source paths for production
- Generates API wrappers for ESM and CJS
- Copies data files

Package Configuration (package.json):
- Add "types" field pointing to index.d.ts
- Update "main" and "module" to point to API wrappers in dist/
- Add comprehensive "exports" map:
  * "." - Main API (ESM/CJS wrappers)
  * "./wasm" - Direct access to WASM glue for advanced users
  * "./dist/*" - Wildcard for flexible file access
- Include LICENSE and NOTICE in published files

Documentation:
- Add comprehensive README section explaining build architecture
- Document project structure with invariants
- Explain semantic separation between build/ and dist/

Benefits:
- Better TypeScript integration and IDE autocomplete
- Cleaner, more maintainable directory structure
- Tests validate actual build output, not stale dist files
- Clear semantic separation between internal and publishable artifacts
- Professional project setup following modern npm best practices
- Long-term maintainability through clear invariants

…cases.json (#10)

- add refresh_assets.sh to rebuild/copy only config-referenced .ocd2 files and testcases.json
- convert wasm-lib tests to consume the new `{cases:[...]}` JSON format
- update bundled .ocd2 dictionaries and testcases.json fixtures

----

* wasm-lib: refresh assets script and switch tests to consolidated testcases.json
  - add refresh_assets.sh to rebuild/copy only config-referenced .ocd2 files and testcases.json
  - convert wasm-lib tests to consume the new `{cases:[...]}` JSON format
  - update bundled .ocd2 dictionaries and testcases.json fixtures
* Rebuild the wasm-lib and update the documentations

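1. Add a document analyzing the algorithm and its theoretical limitations
   - Explain the forward maximum matching segmentation algorithm in detail
   - Analyze the conversion chain mechanism and the dictionary system
   - Discuss theoretical limitations (one-to-many ambiguity, lack of contextual understanding, maintenance burden)
   - Compare with modern approaches (statistical models, neural networks)
2. Update AGENTS.md
   - Add a "Further reading" section
   - Link to the technical documents and the contribution guide
3. Add Claude Code configuration
   - .claude/hooks/session_start.sh - shows project information at session start
   - .claude/skills/opencc-dict-edit.md - dictionary editing skill
   - .claude/skills/opencc-algorithm-explain.md - algorithm explanation skill

These configurations help AI agents better understand the OpenCC project architecture and development workflow.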
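As background for the forward maximum matching algorithm that this commit documents, here is a minimal C++ sketch of the textbook technique. It is not OpenCC's implementation: it works on UTF-32 strings with a plain hash set, and the function name ForwardMaxMatch and the max_word_len parameter are hypothetical, chosen for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Greedy forward maximum matching over code points: at each position, take
// the longest dictionary word that starts there; if no word matches, emit a
// single character and move on.
std::vector<std::u32string> ForwardMaxMatch(
    const std::u32string& text,
    const std::unordered_set<std::u32string>& dict,
    std::size_t max_word_len) {
  std::vector<std::u32string> segments;
  std::size_t pos = 0;
  while (pos < text.size()) {
    // Start with the widest window allowed (at least one character).
    std::size_t len = std::min(std::max<std::size_t>(max_word_len, 1),
                               text.size() - pos);
    // Shrink the window until it is a dictionary word or one character long.
    while (len > 1 && dict.count(text.substr(pos, len)) == 0) {
      --len;
    }
    segments.push_back(text.substr(pos, len));
    pos += len;
  }
  return segments;
}
```

With a dictionary containing 生活, 着名 and 名为, the input 生活着名为 comes out as 生活 / 着名 / 为: the greedy step commits to 着名 and never reconsiders 着 / 名为. That is the kind of context-blind choice listed among the theoretical limitations above.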

🚨 BREAKING CHANGE: New distribution layout

The .wasm files have been moved to be co-located with their corresponding glue code files, fixing loading issues and enabling proper CDN usage.

New layout:
  dist/
    esm/
      opencc-wasm.js
      opencc-wasm.wasm   ← Now here (same directory)
    cjs/
      opencc-wasm.cjs
      opencc-wasm.wasm   ← Now here (same directory)
    opencc-wasm.wasm     ← Kept for legacy compatibility

Features:
- ✅ CDN support: Can now import directly from jsDelivr/unpkg
- ✅ Fixed WASM loading in various bundlers and environments
- ✅ Comprehensive test suite with CDN usage tests
- ✅ Complete documentation (CDN_USAGE.md, TESTING.md, CHANGELOG.md)

Test suite:
- npm test → Run all tests (core + CDN)
- npm run test:core → Run 56 core functionality tests
- npm run test:cdn → Run CDN usage tests

All 56 core tests + CDN tests pass successfully.

Usage example:
```js
import OpenCC from "https://cdn.jsdelivr.net/npm/opencc-wasm@0.3.0/dist/esm/index.js";
const converter = OpenCC.Converter({ from: "cn", to: "t" });
const result = await converter("简体中文");
```

Co-authored-by: Claude <claude@anthropic.com>

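- Add a "Project overview" section at the top, explaining that this project is a fork of BYVoid/OpenCC
- Describe its two main purposes: the WASM implementation and dictionary expansion
- Add a "Background" subsection describing the maintenance status of existing third-party implementations and where this project fits
- Keep the original README content in full below a divider, for reference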

This commit adds significant improvements to opencc-wasm:

**API Enhancements:**
- Add `config` parameter to Converter() as intuitive alternative to `from`/`to`
- Support direct OpenCC config file names (e.g., `{ config: "s2twp" }`)
- Expand CONFIG_MAP to support all conversion types and aliases
- Maintain backward compatibility with `from`/`to` parameters

**Documentation Improvements:**
- Consolidate all API documentation into comprehensive README.md
- Add Traditional Chinese README (README.zh-TW.md) with Taiwan localization
- Emphasize "zero configuration" and "3-line start" features
- Include practical examples for React, Vue, Node.js, and Web Workers
- Add best practices and FAQ sections
- Create interactive demo (test/demo-out-of-box.html)

**User Experience:**
- Clarify auto-loading of configs and dictionaries from CDN
- Show both API methods side-by-side for user choice
- Provide TypeScript usage examples

All 56 core tests + new config parameter tests passing.

…'方程式' See [ByVoid Issue BYVoid#714](BYVoid#714).

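Add Traditional/Simplified conversion modes based on the Table of General Standard Chinese Characters (《通用規範漢字表》, 2013), supporting conversion from various Traditional Chinese standards to the Traditional characters standardized by the Chinese government.

1. **t2cngov.json** - Traditional to government standard (full conversion)
   - Traditional variant normalization: 溼 → 濕
   - Simplified to standard Traditional: 湿 → 濕
   - Partial Traditional-to-Simplified conversion: 淨 → 净
2. **t2cngov_keep_simp.json** - Traditional to government standard (keep Simplified)
   - Preserves Simplified characters used intentionally in the source text
   - Converts only Traditional variant characters

Third-party dictionary source:
- Author: TerryTian-tech
- License: Apache License 2.0
- Reference standard: 《通用規範漢字表》 (2013)

Dictionary files:
- TGCharacters.txt (37KB → 45KB ocd2) - about 4000 character mappings
- TGCharacters_keep_simp.txt (13KB → 21KB ocd2) - keeps Simplified variants
- TGPhrases.txt (1.1MB → 911KB ocd2) - about 7000 phrase mappings

- data/CMakeLists.txt: build the cngov dictionaries (flat naming, layered installation)
- test/CMakeLists.txt: integrate the test cases
- data/dictionary/cngov/BUILD.bazel: cngov dictionary build rules
- data/config/BUILD.bazel: add cngov_validation_test
- test/testcases/BUILD.bazel: add the cngov_testcases filegroup
- test/CommandLineConvertTest.cpp: add the ConvertCNGovFromJson test function
- test/testcases/cngov_testcases.json: 5 dedicated test cases
- data/config/CNGovValidationTest.cpp: standalone Bazel test
- Test commands:
  * bazel test //data/config:cngov_validation_test
  * bazel test //data/...
- wasm-lib/data/dict/cngov/*.ocd2: compiled dictionaries
- wasm-lib/test/cngov_testcases.json: test cases
- wasm-lib/test/cngov.test.js: Node.js test code
- wasm-lib/scripts/refresh_assets.sh: updated to support subdirectories and cngov
- README.md: add CN Government Standard Mode usage notes
- wasm-lib/README.md & README.zh.md: add t2cngov entries to the config table
- data/dictionary/cngov/README.txt: dictionary source and copyright notice

```bash
echo "盫" | opencc -c t2cngov.json            # → 盦
echo "简体混杂繁體" | opencc -c t2cngov.json   # → 簡體混雜繁體
echo "潮溼的露臺" | opencc -c t2cngov.json     # → 潮濕的露臺
echo "一乾二淨" | opencc -c t2cngov.json       # → 一乾二净
```

- Subdirectory isolation: third-party dictionaries live in data/dictionary/cngov/
- Independent tests: avoid merge conflicts with the upstream testcases.json
- Dual build systems: support both CMake and Bazel
- Complete metadata: the JSON configs include author, license, and contributor information
- Dictionary compression: the ocd2 format reduces size by 70-80%

Based on the research of TerryTian-tech; the integration complies with Apache License 2.0. Contributors: TerryTian-tech, Yi Jianpeng, Hu Xinmei, Duan Yatong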

Ensures that the build is always run before publishing to npm, preventing the publication of stale build artifacts.

This commit adds detailed Chinese-language documentation analyzing the two critical security vulnerabilities fixed in the previous commit.

## Documentation Files

### 1. doc/ISSUE_997_ANALYSIS.md

Comprehensive analysis of the MaxMatchSegmentation buffer overflow (GitHub Issue BYVoid#997):
- Problem description and crash location
- Root cause analysis with step-by-step execution trace
- Detailed explanation of integer underflow mechanism
- Comparison: why normal text doesn't trigger vs. malicious input
- Solution design and correctness proof
- Test case documentation
- Security impact assessment (CVSS ~7.5)
- Best practices and lessons learned
- Prevention strategies for similar issues

Key sections:
- Actual demonstration of the bug with hex output
- Multi-layer defense architecture explanation
- Reference to related CVE/CWE entries

### 2. doc/CONVERSION_INFORMATION_DISCLOSURE.md

In-depth security analysis of the Conversion.cpp information disclosure vulnerability (more severe than BYVoid#997):
- Complete vulnerability description
- Attack scenario with memory layout diagrams
- Step-by-step exploit demonstration showing heap data leakage
- Direct comparison with Issue BYVoid#997 (why this is worse)
- Exploitability analysis with test results
- Information that could be leaked (keys, passwords, etc.)
- Security impact: CWE-125, CWE-200, CVSS ~8.6
- Detailed fix explanation with multi-layer defense
- Why normal usage was not affected
- CVE recommendation and scoring rationale

Key highlights:
- Demonstrates actual heap memory leakage (0xAA bytes, "ABC" strings)
- Shows that leaked data IS OUTPUT to conversion result
- Explains ASLR bypass potential
- Documents test cases that would fail with old code
- Provides defensive programming recommendations

## Documentation Quality

Both documents include:
- Complete technical analysis in Chinese
- Code snippets with annotations
- Before/after comparisons
- Security risk assessments
- Prevention recommendations
- References to standards (CWE, CVSS, OWASP)

These documents serve as:
- Security disclosure materials
- Educational resources for similar vulnerability patterns
- Reference for CVE submission
- Internal security audit documentation

Total additions: ~860 lines of detailed security analysis
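The commit message does not show the affected code, so the following is only a generic C++ illustration of the class of bug it names (an unsigned length that underflows while scanning input), not the OpenCC code or its fix. It assumes, purely for the example, a loop that advances by the byte length announced by a UTF-8 lead byte; the helper names are hypothetical.

```cpp
#include <cstddef>

// Byte length announced by a UTF-8 lead byte; invalid lead bytes count as 1.
std::size_t Utf8LengthFromLeadByte(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 1;
}

// The unsafe pattern is `remaining -= step` with step taken straight from the
// lead byte: on truncated input step can exceed remaining, and the unsigned
// subtraction wraps around to a huge value. Clamping the step prevents that.
std::size_t SafeStep(const char* p, std::size_t remaining) {
  if (remaining == 0) return 0;
  std::size_t step = Utf8LengthFromLeadByte(static_cast<unsigned char>(*p));
  return step > remaining ? remaining : step;
}

// Counts characters without ever reading or advancing past the buffer end.
std::size_t CountCharsBounded(const char* data, std::size_t length) {
  std::size_t count = 0;
  while (length > 0) {
    std::size_t step = SafeStep(data, length);
    data += step;
    length -= step;  // cannot underflow, because step <= length
    ++count;
  }
  return count;
}
```

For a two-byte buffer holding 'a' followed by a lone 0xE4 lead byte, the lead byte announces three bytes while only one remains, so the clamped step is 1 and the loop ends after two characters instead of running past the buffer.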

* Refresh wasm-lib assets before build
* Install Bazel before refreshing wasm assets

update package.json add --provenance to wasm-lib-publish.yml

…tion

This document explores integrating Jieba word segmentation algorithm alongside the existing mmseg (maximum match segmentation) in OpenCC through experimental configuration support.

Key findings:
- Analyzed two implementation approaches: cppjieba (C++ native) and Python embedding via pybind11
- Strongly recommends cppjieba integration for performance, deployment simplicity, and maintenance
- Designed extensible architecture using existing Segmentation interface
- Proposed experimental config format to enable jieba without affecting current functionality
- Outlined 4-phase implementation roadmap with risk mitigation strategies

The analysis includes technical details on:
- OpenCC's current segmentation architecture (Segmentation.hpp, Config.cpp)
- Jieba's algorithm principles (Trie, DAG, HMM with Viterbi)
- Detailed code examples for JiebaSegmentation class
- CMake integration approach with ENABLE_JIEBA option
- Comprehensive comparison matrix and implementation timeline

* Check in a complete copy of libcppjieba from https://github.com/yanyiwu/cppjieba (MIT License)

----

* Check in a copy of Jieba dictionary in data/jieba_dict/ for OpenCC:
  * jieba.dict.utf8 (4.9MB): Main dictionary with word frequencies
  * hmm_model.utf8 (508KB): HMM model for unknown word recognition
  * user.dict.utf8: User-defined custom dictionary
  * README.md: Dictionary documentation and customization guide

----

* Implement (experimental) Jieba segmentation support

----

* Add comprehensive test suite for Jieba segmentation

  Added unit tests and comparison tests following OpenCC testing patterns.

  1. Basic Unit Tests (src/JiebaSegmentationTest.cpp):
     - BasicSegmentation: Validates basic Chinese word segmentation
     - ComplexPhrase: Tests multi-word phrases and proper nouns
     - EmptyString, SingleCharacter: Edge case handling
     - EnglishAndChinese: Mixed language support
     - UnknownWords: HMM model's ability to recognize unknown words

  2. JSON-Driven Comparison Tests (src/JiebaComparisonTest.cpp):
     - Follows t2cngov test pattern with external JSON test cases
     - Loads test definitions from test/testcases/jieba_comparison_testcases.json
     - Compares mmseg vs Jieba segmentation outputs
     - Displays: Input, Jieba segments, Expected segments, Conversion outputs
     - Converter caching for performance optimization

  3. Test Cases Definition (test/testcases/jieba_comparison_testcases.json):
     - 15 comprehensive test cases covering:
       * Simplified to Traditional (10 cases): s2twp vs s2twp_jieba
       * Traditional to Simplified (5 cases): tw2sp vs tw2sp_jieba

     Key test scenarios:
     - jieba_s2t_001: 着名 ambiguity (wearing+name vs famous)
       "生活着名为正敏的少女" -> Expected: "生活/着/名为/正敏/的/少女"
     - jieba_s2t_002: Compound words (中学生, 中等身材)
       "一个中学生，一个中等身材的人"
     - jieba_t2s_001: Traditional 著名/為 conversion
       "生活著名為正敏的少女" -> Expected: "生活/著名/為/正敏/的/少女"
     - Other cases: Proper nouns, modern terms, mixed content, ambiguous structures, Taiwan-specific vocabulary, long compounds, classical Chinese

  4. Focused Individual Tests:
     - AmbiguousCase_ZhaoMing: Detailed output for "着名" ambiguity
     - TraditionalToSimplified_ZhuMing: Detailed output for "著名" conversion

  Output Format:
    === Test: jieba_s2t_001 ===
    Input: 生活着名为正敏的少女
    Jieba segments: 生活/着/名为/正敏/的/少女
    Expected segs: 生活/着/名为/正敏/的/少女
    s2twp: 生活著名爲正敏的少女
    s2twp_jieba: 生活著名為正敏的少女

  Benefits:
  - Visual comparison of segmentation algorithms
  - Easy to add new test cases (just edit JSON)
  - Documents expected behavior for ambiguous cases
  - Validates that Jieba improves segmentation accuracy
  - Test data can be reviewed independently from code

  Build System Integration:
  - Tests added to CMake UNITTESTS when ENABLE_JIEBA=ON
  - Automatically run with 'make test' or 'ctest'

  reorder

* Fix Jieba tests in Bazel and add more examples.

* Add comprehensive Jieba segmentation documentation

  Added two detailed documentation files:

  1. doc/JIEBA_SEGMENTATION_FEASIBILITY.md (559 lines)
     - Comprehensive feasibility analysis for integrating Jieba segmentation
     - Compares two implementation approaches:
       * cppjieba (C++ native) - RECOMMENDED
       * Python embedding via pybind11 - Not recommended
     - Technical analysis of Jieba's algorithm (Trie, DAG, HMM)
     - Detailed implementation plan with code examples
     - Performance, deployment, and maintenance comparison matrix
     - 4-phase implementation roadmap
     - Risk assessment and mitigation strategies

  2. doc/JIEBA_USAGE.md
     - Complete user guide for Jieba segmentation feature
     - Compilation instructions with CMake
     - Configuration file format and examples
     - C++/CLI/Python API usage examples
     - Custom user dictionary guide
     - Performance considerations and benchmarks
     - mmseg vs Jieba comparison table
     - Troubleshooting guide
     - Limitations and best practices

  Key recommendations:
  - Use cppjieba for production (performance, zero dependencies)
  - Enable via -DENABLE_JIEBA=ON compile flag
  - Experimental feature, opt-in only

----

* Fix C++ compiler compatibility
* Fix //python/tests:test_opencc

---------

Co-authored-by: Claude <noreply@anthropic.com>

…cters-according-to-Chinese-government-standards to the latest dictionaries and modes; add s2t_cngov.json and t2s_cngov.json

…yTian-tech/OpenCC-Traditional-Chinese-characters-according-to-Chinese-government-standards

…acters-according-to-Chinese-government-standards as a submodule under deps/cngov

Source: TerryTian-tech/OpenCC-Traditional-Chinese-characters-according-to-Chinese-government-standards
Upstream version: 1.2.4
Upstream commit: e7d3c9f8921ca682fd44ee1b117c6e59fee3ac8e

frankslin and others added 30 commits January 28, 2026 07:09

Bump opencc-wasm version to 0.2.0

86ec8f6

Copy both wasm binaries for opencc-wasm 0.2.1

4021fea

Delete the now unused wasm-lib/scripts/gen_testcases_json.py

d0d5e44

Fix(wasm): sync dictionary updates and add regression test for tw2sp …

a37ce36

…'方程式' See [ByVoid Issue BYVoid#714](BYVoid#714).

Add tests per code review comment

a76c802

Add missing data

c200b70

Ensure new tests are run in wasm-lib

55e6f86

chore: prepare 0.4.0 publish

e0146ef

chore: release wasm-lib 0.4.1

c42b799

Add prepack script to wasm-lib package.json

386d8cd


Propose a dictionary text file comment syntax and sorting rules

5ce3f7e

bazel: bump version and refresh module lockfile

ba66e19

Refresh wasm-lib assets before build in publish workflow (#21)

96d6348


Update github workflow for npm publish; fix Emscripten setup.

f0eaecb


Update wasm library documentation

e54a86b

Specify node version 24 in wasm-lib-publish.yml

b06c108

Bump opencc-wasm version to 0.4.2

c3a7e33

Add a vscode setting to insert a newline at the end of files.

b434aea

Sync upstream dictionary updates to version 17d6682

757a790

Prepare opencc-wasm v0.5.0 release

5168bbd

frankslin and others added 14 commits January 28, 2026 07:09

Enable jieba segmentation in WASM build and include assets

d32fe61

chore: update 0.6.0 changelog

55f3f53

Add Jieba benchmarks

274e4cc

Update https://github.com/TerryTian-tech/OpenCC-Traditional-Chinese-chara…

683fb4e

…cters-according-to-Chinese-government-standards to the latest dictionaries and modes; add s2t_cngov.json and t2s_cngov.json

correct filetype in data/config/s2t_cngov.json

4e59e4a

Update wasm-lib to include latest changes from BYVoid/OpenCC and Terr…

5c51976

…yTian-tech/OpenCC-Traditional-Chinese-characters-according-to-Chinese-government-standards

correct filetype in data/config/s2t_cngov.json

74f4408

increment opencc-wasm version to 0.6.1

0845b0f

Add https://github.com/TerryTian-tech/OpenCC-Traditional-Chinese-char…

574e924

…acters-according-to-Chinese-government-standards as a submodule under deps/cngov

Add cngov sync update script

c83dce5

Update cngov dictionaries from upstream v1.2.4

73b58dc


Bump opencc-wasm version to 0.6.2

533e817

doc: add jieba plugin architecture and rollout plan

7a4ac25

frankslin added the codex label Mar 7, 2026 — with ChatGPT Codex Connector

doc: add Windows and WinGet plan for jieba plugin

6e844ac

frankslin force-pushed the master branch from 533e817 to 8c83f64 March 9, 2026 03:53

frankslin closed this Mar 9, 2026

frankslin deleted the codex/-opencc branch March 9, 2026 03:59

frankslin wants to merge 45 commits into master from codex/-opencc
