Skip to content

doc: 提案 — Jieba 外掛化動態載入架構與執行方案#27

Closed
frankslin wants to merge 45 commits intomasterfrom
codex/-opencc
Closed

doc: 提案 — Jieba 外掛化動態載入架構與執行方案#27
frankslin wants to merge 45 commits intomasterfrom
codex/-opencc

Conversation

@frankslin
Copy link
Owner

Motivation

  • 目前 Jieba 是在編譯期透過 ENABLE_JIEBA 選項靜態納入核心,造成核心與可選分詞能力耦合與封包困難。
  • 目標是將 Jieba 外掛化,讓 libopencc 在遇到 "segmentation": { "type": "jieba" } 時以動態載入外掛(例如 libopencc-jieba.so)的方式提供分詞能力,方便發行版拆包成 openccopencc-jieba

Description

  • 新增 doc/JIEBA_PLUGIN_ARCHITECTURE_PLAN.md,提供完整外掛化設計與執行方案,包括架構圖、C ABI 函式表草案(src/plugin/OpenCCPlugin.h 示意)、PluginManagerPluginSegmentationAdapter 設計、錯誤語意與記憶體釋放策略、外掛搜尋順序與安全控管、以及 CMake/Bazel 與打包建議。
  • doc/JIEBA_USAGE.md 開頭加入交叉參考連結,提示使用者參考新的外掛化設計文件以利遷移與打包。
  • 文件中定義了外掛 ABI 版本化、檔名與平台對應、環境變數 OPENCC_SEGMENTATION_PLUGIN_PATH、以及可選的 OPENCC_DISABLE_PLUGINS 安全開關,並提出漸進式遷移與相容性策略(保留舊的 ENABLE_JIEBA 支援以便過渡)。

Testing

  • 此 PR 僅為文件新增與文字修改,未變更執行碼或 API,因此未執行程式碼層級的自動化測試。

Codex Task

frankslin and others added 30 commits January 28, 2026 07:09
* Add WASM demo scaffold and project notes
* Add OpenCC WASM demo with converter UI and test runner
  - 补充 WASM 编译结果在前端 JS 中的用法
* Polish WASM demo UI and paths, run tests, and streamline converter export
* Add wasm-based OpenCC package and update demo to consume it
* Add wasm-based OpenCC package, static demo bundle, and benchmarking page
* Add copyright notice and LICENSE
…eparation

This commit enhances the opencc-wasm library with TypeScript support and
implements a cleaner build architecture with semantic separation between
intermediate build artifacts and publishable distribution.

TypeScript Support:
- Add comprehensive type definitions (index.d.ts) with full JSDoc documentation
- Define interfaces: ConverterOptions, ConverterFunction, OpenCCNamespace, etc.
- Provide complete type safety for better IDE support and developer experience

Build Architecture Redesign (semantic separation):
- build/ - Intermediate WASM artifacts (gitignored, for tests/development)
  * build/opencc-wasm.esm.js - ESM WASM glue
  * build/opencc-wasm.cjs - CJS WASM glue
  * build/opencc-wasm.wasm - WASM binary
- dist/ - Publishable distribution (committed, for npm)
  * dist/esm/ - ESM package entry
  * dist/cjs/ - CJS package entry
  * dist/data/ - OpenCC config and dictionary files

Invariants and Semantics:
- Tests import source (index.js) → loads from build/
- Published package exports dist/ only
- build/ = internal intermediate artifacts
- dist/ = publishable artifacts
- Clear separation ensures tests validate actual build output

Enhanced .gitignore:
- Add build/ to gitignore (intermediate artifacts)
- Add node_modules/, logs, OS-specific files (.DS_Store, Thumbs.db)
- Exclude editor configurations (.vscode/, .idea/)
- Add cache and temporary file exclusions

Two-Stage Build Process:
Stage 1 (build.sh):
  - Compiles C++ to WASM using Emscripten
  - Outputs to build/ directory

Stage 2 (build-api.js):
  - Copies WASM artifacts from build/ to dist/
  - Transforms source paths for production
  - Generates API wrappers for ESM and CJS
  - Copies data files

Package Configuration (package.json):
- Add "types" field pointing to index.d.ts
- Update "main" and "module" to point to API wrappers in dist/
- Add comprehensive "exports" map:
  * "." - Main API (ESM/CJS wrappers)
  * "./wasm" - Direct access to WASM glue for advanced users
  * "./dist/*" - Wildcard for flexible file access
- Include LICENSE and NOTICE in published files

Documentation:
- Add comprehensive README section explaining build architecture
- Document project structure with invariants
- Explain semantic separation between build/ and dist/

Benefits:
- Better TypeScript integration and IDE autocomplete
- Cleaner, more maintainable directory structure
- Tests validate actual build output, not stale dist files
- Clear semantic separation between internal and publishable artifacts
- Professional project setup following modern npm best practices
- Long-term maintainability through clear invariants
…cases.json (#10)

- add refresh_assets.sh to rebuild/copy only config-referenced .ocd2 files and testcases.json
- convert wasm-lib tests to consume the new `{cases:[...]}` JSON format
- update bundled .ocd2 dictionaries and testcases.json fixtures

----

* wasm-lib: refresh assets script and switch tests to consolidated testcases.json
  - add refresh_assets.sh to rebuild/copy only config-referenced .ocd2 files and testcases.json
  - convert wasm-lib tests to consume the new `{cases:[...]}` JSON format
  - update bundled .ocd2 dictionaries and testcases.json fixtures
* Rebuild the wasm-lib and update the documentations
1. 新增演算法與理論局限性分析文件
   - 詳細說明最大正向匹配分詞演算法
   - 分析轉換鏈機制與詞典系統
   - 探討理論局限性(一對多歧義、缺乏上下文理解、維護負擔)
   - 與現代方法(統計模型、神經網路)的比較

2. 更新 AGENTS.md
   - 新增「延伸閱讀」章節
   - 連結到技術文件和貢獻指南

3. 新增 Claude Code 配置
   - .claude/hooks/session_start.sh - 會話啟動時顯示專案資訊
   - .claude/skills/opencc-dict-edit.md - 詞典編輯技能
   - .claude/skills/opencc-algorithm-explain.md - 演算法解釋技能

這些配置幫助 AI 代理更好地理解 OpenCC 專案架構與開發流程。
🚨 BREAKING CHANGE: New distribution layout

The .wasm files have been moved to be co-located with their corresponding
glue code files, fixing loading issues and enabling proper CDN usage.

New layout:
  dist/
    esm/
      opencc-wasm.js
      opencc-wasm.wasm      ← Now here (same directory)
    cjs/
      opencc-wasm.cjs
      opencc-wasm.wasm      ← Now here (same directory)
    opencc-wasm.wasm        ← Kept for legacy compatibility

Features:
- ✅ CDN support: Can now import directly from jsDelivr/unpkg
- ✅ Fixed WASM loading in various bundlers and environments
- ✅ Comprehensive test suite with CDN usage tests
- ✅ Complete documentation (CDN_USAGE.md, TESTING.md, CHANGELOG.md)

Test suite:
- npm test         → Run all tests (core + CDN)
- npm run test:core → Run 56 core functionality tests
- npm run test:cdn  → Run CDN usage tests

All 56 core tests + CDN tests pass successfully.

Usage example:
```js
import OpenCC from "https://cdn.jsdelivr.net/npm/opencc-wasm@0.3.0/dist/esm/index.js";
const converter = OpenCC.Converter({ from: "cn", to: "t" });
const result = await converter("简体中文");
```

Co-authored-by: Claude <claude@anthropic.com>
- 在頭部新增「專案說明」章節,說明本項目為 BYVoid/OpenCC 的 fork
- 闡述兩個主要目的:WASM 實現與詞表擴充
- 新增「背景」小節,說明現有第三方實作的維護狀況與本專案定位
- 原有 README 內容完整保留在分隔線下方作為參考
This commit adds significant improvements to opencc-wasm:

**API Enhancements:**
- Add `config` parameter to Converter() as intuitive alternative to `from`/`to`
- Support direct OpenCC config file names (e.g., `{ config: "s2twp" }`)
- Expand CONFIG_MAP to support all conversion types and aliases
- Maintain backward compatibility with `from`/`to` parameters

**Documentation Improvements:**
- Consolidate all API documentation into comprehensive README.md
- Add Traditional Chinese README (README.zh-TW.md) with Taiwan localization
- Emphasize "zero configuration" and "3-line start" features
- Include practical examples for React, Vue, Node.js, and Web Workers
- Add best practices and FAQ sections
- Create interactive demo (test/demo-out-of-box.html)

**User Experience:**
- Clarify auto-loading of configs and dictionaries from CDN
- Show both API methods side-by-side for user choice
- Provide TypeScript usage examples

All 56 core tests + new config parameter tests passing.
添加基於《通用規範漢字表》(2013) 的繁簡轉換模式,支持將各種繁體標準
轉換為中國政府規範繁體字。

1. **t2cngov.json** - 繁體到政府標準(全轉換)
   - 繁體異體標準化:溼 → 濕
   - 簡體轉標準繁體:湿 → 濕
   - 部分繁簡轉換:淨 → 净

2. **t2cngov_keep_simp.json** - 繁體到政府標準(保留簡體)
   - 保留原文中有意使用的簡體字
   - 僅轉換繁體異體字

第三方字典來源:
- 作者:TerryTian-tech
- 許可證:Apache License 2.0
- 參考標準:《通用規範漢字表》(2013)

字典文件:
- TGCharacters.txt (37KB → 45KB ocd2) - 約 4000 個字符映射
- TGCharacters_keep_simp.txt (13KB → 21KB ocd2) - 保留簡體變體
- TGPhrases.txt (1.1MB → 911KB ocd2) - 約 7000 個詞組映射

- data/CMakeLists.txt: 構建 cngov 字典(扁平命名,分層安裝)
- test/CMakeLists.txt: 整合測試用例

- data/dictionary/cngov/BUILD.bazel: cngov 字典構建規則
- data/config/BUILD.bazel: 新增 cngov_validation_test
- test/testcases/BUILD.bazel: 新增 cngov_testcases filegroup

- test/CommandLineConvertTest.cpp: 新增 ConvertCNGovFromJson 測試函數
- test/testcases/cngov_testcases.json: 5 個專屬測試用例

- data/config/CNGovValidationTest.cpp: 獨立的 Bazel 測試
- 測試命令:
  * bazel test //data/config:cngov_validation_test
  * bazel test //data/...

- wasm-lib/data/dict/cngov/*.ocd2: 編譯後的字典
- wasm-lib/test/cngov_testcases.json: 測試用例
- wasm-lib/test/cngov.test.js: Node.js 測試代碼
- wasm-lib/scripts/refresh_assets.sh: 更新以支持子目錄和 cngov

- README.md: 新增 CN Government Standard Mode 使用說明
- wasm-lib/README.md & README.zh.md: 配置表新增 t2cngov 條目
- data/dictionary/cngov/README.txt: 字典來源和版權聲明

```bash
echo "盫" | opencc -c t2cngov.json              # → 盦
echo "简体混杂繁體" | opencc -c t2cngov.json    # → 簡體混雜繁體
echo "潮溼的露臺" | opencc -c t2cngov.json      # → 潮濕的露臺
echo "一乾二淨" | opencc -c t2cngov.json        # → 一乾二净
```

- 子目錄隔離:第三方字典放在 data/dictionary/cngov/
- 獨立測試:避免與上游 testcases.json 合併衝突
- 雙構建系統:同時支持 CMake 和 Bazel
- 完整元數據:JSON 配置包含作者、許可證、貢獻者信息
- 字典壓縮:ocd2 格式體積減少 70-80%

基於 TerryTian-tech 的研究成果,整合時遵循 Apache License 2.0。
貢獻者:TerryTian-tech, Yi Jianpeng, Hu Xinmei, Duan Yatong
Ensures that the build is always run before publishing to npm,
preventing the publication of stale build artifacts.
This commit adds detailed Chinese-language documentation analyzing
the two critical security vulnerabilities fixed in the previous commit.

## Documentation Files

### 1. doc/ISSUE_997_ANALYSIS.md

Comprehensive analysis of the MaxMatchSegmentation buffer overflow
(GitHub Issue BYVoid#997):

- Problem description and crash location
- Root cause analysis with step-by-step execution trace
- Detailed explanation of integer underflow mechanism
- Comparison: why normal text doesn't trigger vs. malicious input
- Solution design and correctness proof
- Test case documentation
- Security impact assessment (CVSS ~7.5)
- Best practices and lessons learned
- Prevention strategies for similar issues

Key sections:
- Actual demonstration of the bug with hex output
- Multi-layer defense architecture explanation
- Reference to related CVE/CWE entries

### 2. doc/CONVERSION_INFORMATION_DISCLOSURE.md

In-depth security analysis of the Conversion.cpp information
disclosure vulnerability (more severe than BYVoid#997):

- Complete vulnerability description
- Attack scenario with memory layout diagrams
- Step-by-step exploit demonstration showing heap data leakage
- Direct comparison with Issue BYVoid#997 (why this is worse)
- Exploitability analysis with test results
- Information that could be leaked (keys, passwords, etc.)
- Security impact: CWE-125, CWE-200, CVSS ~8.6
- Detailed fix explanation with multi-layer defense
- Why normal usage was not affected
- CVE recommendation and scoring rationale

Key highlights:
- Demonstrates actual heap memory leakage (0xAA bytes, "ABC" strings)
- Shows that leaked data IS OUTPUT to conversion result
- Explains ASLR bypass potential
- Documents test cases that would fail with old code
- Provides defensive programming recommendations

## Documentation Quality

Both documents include:
- Complete technical analysis in Chinese
- Code snippets with annotations
- Before/after comparisons
- Security risk assessments
- Prevention recommendations
- References to standards (CWE, CVSS, OWASP)

These documents serve as:
- Security disclosure materials
- Educational resources for similar vulnerability patterns
- Reference for CVE submission
- Internal security audit documentation

Total additions: ~860 lines of detailed security analysis
* Refresh wasm-lib assets before build

* Install Bazel before refreshing wasm assets
update package.json

add --provenance to wasm-lib-publish.yml
…tion

This document explores integrating Jieba word segmentation algorithm alongside
the existing mmseg (maximum match segmentation) in OpenCC through experimental
configuration support.

Key findings:
- Analyzed two implementation approaches: cppjieba (C++ native) and Python
  embedding via pybind11
- Strongly recommends cppjieba integration for performance, deployment
  simplicity, and maintenance
- Designed extensible architecture using existing Segmentation interface
- Proposed experimental config format to enable jieba without affecting
  current functionality
- Outlined 4-phase implementation roadmap with risk mitigation strategies

The analysis includes technical details on:
- OpenCC's current segmentation architecture (Segmentation.hpp, Config.cpp)
- Jieba's algorithm principles (Trie, DAG, HMM with Viterbi)
- Detailed code examples for JiebaSegmentation class
- CMake integration approach with ENABLE_JIEBA option
- Comprehensive comparison matrix and implementation timeline
frankslin and others added 14 commits January 28, 2026 07:09
* Check in a complete copy of libcppjieba from https://github.com/yanyiwu/cppjieba (MIT License)

----
* Check in a copy of Jieba dictionary in  data/jieba_dict/ for OpenCC:

* jieba.dict.utf8 (4.9MB): Main dictionary with word frequencies
* hmm_model.utf8 (508KB): HMM model for unknown word recognition
* user.dict.utf8: User-defined custom dictionary
* README.md: Dictionary documentation and customization guide

----
* Implement (experimental) Jieba segmentation support

----
* Add comprehensive test suite for Jieba segmentation

Added unit tests and comparison tests following OpenCC testing patterns.

1. Basic Unit Tests (src/JiebaSegmentationTest.cpp):
   - BasicSegmentation: Validates basic Chinese word segmentation
   - ComplexPhrase: Tests multi-word phrases and proper nouns
   - EmptyString, SingleCharacter: Edge case handling
   - EnglishAndChinese: Mixed language support
   - UnknownWords: HMM model's ability to recognize unknown words

2. JSON-Driven Comparison Tests (src/JiebaComparisonTest.cpp):
   - Follows t2cngov test pattern with external JSON test cases
   - Loads test definitions from test/testcases/jieba_comparison_testcases.json
   - Compares mmseg vs Jieba segmentation outputs
   - Displays: Input, Jieba segments, Expected segments, Conversion outputs
   - Converter caching for performance optimization

3. Test Cases Definition (test/testcases/jieba_comparison_testcases.json):
   - 15 comprehensive test cases covering:
     * Simplified to Traditional (10 cases): s2twp vs s2twp_jieba
     * Traditional to Simplified (5 cases): tw2sp vs tw2sp_jieba

   Key test scenarios:
   - jieba_s2t_001: 着名 ambiguity (wearing+name vs famous)
     "生活着名为正敏的少女" -> Expected: "生活/着/名为/正敏/的/少女"

   - jieba_s2t_002: Compound words (中学生, 中等身材)
     "一个中学生,一个中等身材的人"

   - jieba_t2s_001: Traditional 著名/為 conversion
     "生活著名為正敏的少女" -> Expected: "生活/著名/為/正敏/的/少女"

   - Other cases: Proper nouns, modern terms, mixed content,
     ambiguous structures, Taiwan-specific vocabulary,
     long compounds, classical Chinese

4. Focused Individual Tests:
   - AmbiguousCase_ZhaoMing: Detailed output for "着名" ambiguity
   - TraditionalToSimplified_ZhuMing: Detailed output for "著名" conversion

Output Format:
  === Test: jieba_s2t_001 ===
  Input:          生活着名为正敏的少女
  Jieba segments: 生活/着/名为/正敏/的/少女
  Expected segs:  生活/着/名为/正敏/的/少女
  s2twp:          生活著名爲正敏的少女
  s2twp_jieba:    生活著名為正敏的少女

Benefits:
- Visual comparison of segmentation algorithms
- Easy to add new test cases (just edit JSON)
- Documents expected behavior for ambiguous cases
- Validates that Jieba improves segmentation accuracy
- Test data can be reviewed independently from code

Build System Integration:
- Tests added to CMake UNITTESTS when ENABLE_JIEBA=ON
- Automatically run with 'make test' or 'ctest'

reorder

* Fix Jieba tests in Bazel and add more examples.

* Add comprehensive Jieba segmentation documentation

Added two detailed documentation files:

1. doc/JIEBA_SEGMENTATION_FEASIBILITY.md (559 lines)
   - Comprehensive feasibility analysis for integrating Jieba segmentation
   - Compares two implementation approaches:
     * cppjieba (C++ native) - RECOMMENDED
     * Python embedding via pybind11 - Not recommended
   - Technical analysis of Jieba's algorithm (Trie, DAG, HMM)
   - Detailed implementation plan with code examples
   - Performance, deployment, and maintenance comparison matrix
   - 4-phase implementation roadmap
   - Risk assessment and mitigation strategies

2. doc/JIEBA_USAGE.md
   - Complete user guide for Jieba segmentation feature
   - Compilation instructions with CMake
   - Configuration file format and examples
   - C++/CLI/Python API usage examples
   - Custom user dictionary guide
   - Performance considerations and benchmarks
   - mmseg vs Jieba comparison table
   - Troubleshooting guide
   - Limitations and best practices

Key recommendations:
- Use cppjieba for production (performance, zero dependencies)
- Enable via -DENABLE_JIEBA=ON compile flag
- Experimental feature, opt-in only

----
* Fix C++ compiler compatibility
* Fix //python/tests:test_opencc

---------

Co-authored-by: Claude <noreply@anthropic.com>
…yTian-tech/OpenCC-Traditional-Chinese-characters-according-to-Chinese-government-standards
Source: TerryTian-tech/OpenCC-Traditional-Chinese-characters-according-to-Chinese-government-standards

Upstream version: 1.2.4

Upstream commit: e7d3c9f8921ca682fd44ee1b117c6e59fee3ac8e
@frankslin frankslin closed this Mar 9, 2026
@frankslin frankslin deleted the codex/-opencc branch March 9, 2026 03:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants