* Add WASM demo scaffold and project notes
* Add OpenCC WASM demo with converter UI and test runner
  - Add notes on using the compiled WASM output from front-end JS
* Polish WASM demo UI and paths, run tests, and streamline converter export
* Add wasm-based OpenCC package and update demo to consume it
* Add wasm-based OpenCC package, static demo bundle, and benchmarking page
* Add copyright notice and LICENSE
…eparation

This commit enhances the opencc-wasm library with TypeScript support and implements a cleaner build architecture with semantic separation between intermediate build artifacts and publishable distribution.

TypeScript Support:
- Add comprehensive type definitions (index.d.ts) with full JSDoc documentation
- Define interfaces: ConverterOptions, ConverterFunction, OpenCCNamespace, etc.
- Provide complete type safety for better IDE support and developer experience

Build Architecture Redesign (semantic separation):
- build/ - Intermediate WASM artifacts (gitignored, for tests/development)
  * build/opencc-wasm.esm.js - ESM WASM glue
  * build/opencc-wasm.cjs - CJS WASM glue
  * build/opencc-wasm.wasm - WASM binary
- dist/ - Publishable distribution (committed, for npm)
  * dist/esm/ - ESM package entry
  * dist/cjs/ - CJS package entry
  * dist/data/ - OpenCC config and dictionary files

Invariants and Semantics:
- Tests import source (index.js) → loads from build/
- Published package exports dist/ only
- build/ = internal intermediate artifacts
- dist/ = publishable artifacts
- Clear separation ensures tests validate actual build output

Enhanced .gitignore:
- Add build/ to gitignore (intermediate artifacts)
- Add node_modules/, logs, OS-specific files (.DS_Store, Thumbs.db)
- Exclude editor configurations (.vscode/, .idea/)
- Add cache and temporary file exclusions

Two-Stage Build Process:
Stage 1 (build.sh):
- Compiles C++ to WASM using Emscripten
- Outputs to build/ directory
Stage 2 (build-api.js):
- Copies WASM artifacts from build/ to dist/
- Transforms source paths for production
- Generates API wrappers for ESM and CJS
- Copies data files

Package Configuration (package.json):
- Add "types" field pointing to index.d.ts
- Update "main" and "module" to point to API wrappers in dist/
- Add comprehensive "exports" map:
  * "." - Main API (ESM/CJS wrappers)
  * "./wasm" - Direct access to WASM glue for advanced users
  * "./dist/*" - Wildcard for flexible file access
- Include LICENSE and NOTICE in published files

Documentation:
- Add comprehensive README section explaining build architecture
- Document project structure with invariants
- Explain semantic separation between build/ and dist/

Benefits:
- Better TypeScript integration and IDE autocomplete
- Cleaner, more maintainable directory structure
- Tests validate actual build output, not stale dist files
- Clear semantic separation between internal and publishable artifacts
- Professional project setup following modern npm best practices
- Long-term maintainability through clear invariants
…cases.json (#10)

- add refresh_assets.sh to rebuild/copy only config-referenced .ocd2 files and testcases.json
- convert wasm-lib tests to consume the new `{cases:[...]}` JSON format
- update bundled .ocd2 dictionaries and testcases.json fixtures

----

* wasm-lib: refresh assets script and switch tests to consolidated testcases.json
  - add refresh_assets.sh to rebuild/copy only config-referenced .ocd2 files and testcases.json
  - convert wasm-lib tests to consume the new `{cases:[...]}` JSON format
  - update bundled .ocd2 dictionaries and testcases.json fixtures
* Rebuild the wasm-lib and update the documentation
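A test runner for the consolidated format could be sketched roughly as follows. The per-case field names (`id`, `input`, and `expected` as a map from config name to expected output) are assumptions about the `{cases:[...]}` shape, and `getConverter` is a hypothetical stand-in for however the test harness builds a converter. Real converters may be asynchronous; the sketch stays synchronous for brevity.

```js
// Minimal sketch: run every case in a consolidated {cases:[...]} document.
// ASSUMPTION: each case looks like { id, input, expected: { <config>: <output> } }.
// ASSUMPTION: getConverter(config) returns a function (text) => converted text.
function runCases(doc, getConverter) {
  const failures = [];
  for (const testCase of doc.cases) {
    for (const [config, expected] of Object.entries(testCase.expected)) {
      const actual = getConverter(config)(testCase.input);
      if (actual !== expected) {
        failures.push({ id: testCase.id, config, expected, actual });
      }
    }
  }
  return failures;
}

// Usage with stub converters, so the sketch runs without the WASM module.
const sampleDoc = {
  cases: [
    { id: "case_001", input: "abc", expected: { upper: "ABC", same: "abc" } },
    { id: "case_002", input: "xyz", expected: { upper: "XYZ!" } },
  ],
};
const stubConverters = {
  upper: (text) => text.toUpperCase(),
  same: (text) => text,
};
const failures = runCases(sampleDoc, (config) => stubConverters[config]);
console.log(failures); // one failure, for case_002
```

Keeping one JSON document for all configs means a new case only needs a new entry in `cases`, with no change to the runner.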
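1. Add a document analyzing the algorithm and its theoretical limitations
   - Describe the forward maximum matching segmentation algorithm in detail
   - Analyze the conversion chain mechanism and the dictionary system
   - Discuss theoretical limitations (one-to-many ambiguity, lack of contextual understanding, maintenance burden)
   - Compare with modern approaches (statistical models, neural networks)
2. Update AGENTS.md
   - Add a "Further Reading" section
   - Link to the technical documents and the contribution guide
3. Add Claude Code configuration
   - .claude/hooks/session_start.sh - show project information at session start
   - .claude/skills/opencc-dict-edit.md - dictionary editing skill
   - .claude/skills/opencc-algorithm-explain.md - algorithm explanation skill

These configurations help AI agents better understand the OpenCC project architecture and development workflow.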
🚨 BREAKING CHANGE: New distribution layout
The .wasm files have been moved to be co-located with their corresponding
glue code files, fixing loading issues and enabling proper CDN usage.
New layout:
dist/
  esm/
    opencc-wasm.js
    opencc-wasm.wasm      ← Now here (same directory)
  cjs/
    opencc-wasm.cjs
    opencc-wasm.wasm      ← Now here (same directory)
  opencc-wasm.wasm        ← Kept for legacy compatibility
Features:
- ✅ CDN support: Can now import directly from jsDelivr/unpkg
- ✅ Fixed WASM loading in various bundlers and environments
- ✅ Comprehensive test suite with CDN usage tests
- ✅ Complete documentation (CDN_USAGE.md, TESTING.md, CHANGELOG.md)
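Co-locating the binary with its glue code matters because a loader can then derive the `.wasm` URL from the glue file's own URL, wherever the package is hosted. The following is a minimal sketch of that idea, not the library's actual loader; the helper name `siblingWasmUrl` and the example glue URL (assembled from the layout above and the jsDelivr base) are illustrative assumptions.

```js
// Minimal sketch: derive the .wasm URL from the URL of the glue file itself,
// which works whenever both files sit in the same directory.
// ASSUMPTION: file names mirror the layout above; the real loader may differ.
function siblingWasmUrl(glueUrl, wasmFileName = "opencc-wasm.wasm") {
  // new URL(relative, base) resolves against the directory of `base`.
  return new URL(wasmFileName, glueUrl).href;
}

const esmGlue =
  "https://cdn.jsdelivr.net/npm/opencc-wasm@0.3.0/dist/esm/opencc-wasm.js";
console.log(siblingWasmUrl(esmGlue));
// → https://cdn.jsdelivr.net/npm/opencc-wasm@0.3.0/dist/esm/opencc-wasm.wasm
```

In an ES module the base would typically be `import.meta.url`, so the same resolution works from a CDN, a bundler output directory, or `node_modules`.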
Test suite:
- npm test → Run all tests (core + CDN)
- npm run test:core → Run 56 core functionality tests
- npm run test:cdn → Run CDN usage tests
All 56 core tests + CDN tests pass successfully.
Usage example:
```js
import OpenCC from "https://cdn.jsdelivr.net/npm/opencc-wasm@0.3.0/dist/esm/index.js";
const converter = OpenCC.Converter({ from: "cn", to: "t" });
const result = await converter("简体中文");
```
Co-authored-by: Claude <claude@anthropic.com>
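- Add a "Project Overview" section at the top, explaining that this project is a fork of BYVoid/OpenCC
- State the two main goals: a WASM implementation and dictionary expansion
- Add a "Background" subsection describing the maintenance status of existing third-party implementations and this project's positioning
- Keep the original README content intact below a divider for reference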
This commit adds significant improvements to opencc-wasm:
**API Enhancements:**
- Add `config` parameter to Converter() as intuitive alternative to `from`/`to`
- Support direct OpenCC config file names (e.g., `{ config: "s2twp" }`)
- Expand CONFIG_MAP to support all conversion types and aliases
- Maintain backward compatibility with `from`/`to` parameters
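The two option styles can be reconciled with a small resolver. The sketch below is only an illustration under stated assumptions: the `CONFIG_MAP` shown is a tiny hypothetical subset rather than the library's actual table, and `resolveConfigFile` is a made-up helper name. `s2t`, `s2twp`, and `t2s` are standard OpenCC config names.

```js
// Minimal sketch of accepting either { config } or { from, to }.
// ASSUMPTION: this CONFIG_MAP is a small illustrative subset, not the
// library's real table; the locale keys are hypothetical.
const CONFIG_MAP = {
  cn: { t: "s2t", twp: "s2twp" },
  t: { cn: "t2s" },
};

function resolveConfigFile(options) {
  if (options.config) {
    // Accept both "s2twp" and "s2twp.json".
    return options.config.endsWith(".json")
      ? options.config
      : `${options.config}.json`;
  }
  const name = CONFIG_MAP[options.from]?.[options.to];
  if (!name) {
    throw new Error(`Unsupported conversion: ${options.from} -> ${options.to}`);
  }
  return `${name}.json`;
}

console.log(resolveConfigFile({ config: "s2twp" }));     // → s2twp.json
console.log(resolveConfigFile({ from: "cn", to: "t" })); // → s2t.json
```

Checking `config` first keeps existing `from`/`to` callers working unchanged while letting new callers name an OpenCC config directly.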
**Documentation Improvements:**
- Consolidate all API documentation into comprehensive README.md
- Add Traditional Chinese README (README.zh-TW.md) with Taiwan localization
- Emphasize "zero configuration" and "3-line start" features
- Include practical examples for React, Vue, Node.js, and Web Workers
- Add best practices and FAQ sections
- Create interactive demo (test/demo-out-of-box.html)
**User Experience:**
- Clarify auto-loading of configs and dictionaries from CDN
- Show both API methods side-by-side for user choice
- Provide TypeScript usage examples
All 56 core tests + new config parameter tests passing.
…'方程式' See [ByVoid Issue BYVoid#714](BYVoid#714).
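Add a Traditional/Simplified conversion mode based on the Table of General Standard Chinese Characters (《通用規範漢字表》, 2013), supporting conversion from the various Traditional Chinese standards to the Traditional characters standardized by the Chinese government.

1. **t2cngov.json** - Traditional to government standard (full conversion)
   - Normalize Traditional variant characters: 溼 → 濕
   - Convert Simplified to standard Traditional: 湿 → 濕
   - Partial Traditional-to-Simplified conversion: 淨 → 净
2. **t2cngov_keep_simp.json** - Traditional to government standard (keep Simplified)
   - Preserve Simplified characters used intentionally in the source text
   - Convert only Traditional variant characters

Third-party dictionary source:
- Author: TerryTian-tech
- License: Apache License 2.0
- Reference standard: Table of General Standard Chinese Characters (2013)

Dictionary files:
- TGCharacters.txt (37KB → 45KB ocd2) - about 4,000 character mappings
- TGCharacters_keep_simp.txt (13KB → 21KB ocd2) - keeps Simplified variants
- TGPhrases.txt (1.1MB → 911KB ocd2) - about 7,000 phrase mappings

- data/CMakeLists.txt: build the cngov dictionaries (flat naming, hierarchical install)
- test/CMakeLists.txt: integrate the test cases

- data/dictionary/cngov/BUILD.bazel: build rules for the cngov dictionaries
- data/config/BUILD.bazel: add cngov_validation_test
- test/testcases/BUILD.bazel: add the cngov_testcases filegroup

- test/CommandLineConvertTest.cpp: add the ConvertCNGovFromJson test function
- test/testcases/cngov_testcases.json: 5 dedicated test cases
- data/config/CNGovValidationTest.cpp: standalone Bazel test
- Test commands:
  * bazel test //data/config:cngov_validation_test
  * bazel test //data/...

- wasm-lib/data/dict/cngov/*.ocd2: compiled dictionaries
- wasm-lib/test/cngov_testcases.json: test cases
- wasm-lib/test/cngov.test.js: Node.js test code
- wasm-lib/scripts/refresh_assets.sh: updated to support subdirectories and cngov

- README.md: add usage notes for CN Government Standard Mode
- wasm-lib/README.md & README.zh.md: add a t2cngov entry to the config table
- data/dictionary/cngov/README.txt: dictionary source and copyright notice

```bash
echo "盫" | opencc -c t2cngov.json            # → 盦
echo "简体混杂繁體" | opencc -c t2cngov.json   # → 簡體混雜繁體
echo "潮溼的露臺" | opencc -c t2cngov.json     # → 潮濕的露臺
echo "一乾二淨" | opencc -c t2cngov.json       # → 一乾二净
```

- Subdirectory isolation: third-party dictionaries live in data/dictionary/cngov/
- Independent tests: avoid merge conflicts with the upstream testcases.json
- Dual build systems: both CMake and Bazel are supported
- Complete metadata: the JSON configs include author, license, and contributor information
- Dictionary compression: the ocd2 format reduces size by 70-80%

Based on the work of TerryTian-tech; integrated in compliance with Apache License 2.0.

Contributors: TerryTian-tech, Yi Jianpeng, Hu Xinmei, Duan Yatong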
Ensures that the build is always run before publishing to npm, preventing the publication of stale build artifacts.
This commit adds detailed Chinese-language documentation analyzing the two critical security vulnerabilities fixed in the previous commit.

## Documentation Files

### 1. doc/ISSUE_997_ANALYSIS.md

Comprehensive analysis of the MaxMatchSegmentation buffer overflow (GitHub Issue BYVoid#997):
- Problem description and crash location
- Root cause analysis with step-by-step execution trace
- Detailed explanation of integer underflow mechanism
- Comparison: why normal text doesn't trigger vs. malicious input
- Solution design and correctness proof
- Test case documentation
- Security impact assessment (CVSS ~7.5)
- Best practices and lessons learned
- Prevention strategies for similar issues

Key sections:
- Actual demonstration of the bug with hex output
- Multi-layer defense architecture explanation
- Reference to related CVE/CWE entries

### 2. doc/CONVERSION_INFORMATION_DISCLOSURE.md

In-depth security analysis of the Conversion.cpp information disclosure vulnerability (more severe than BYVoid#997):
- Complete vulnerability description
- Attack scenario with memory layout diagrams
- Step-by-step exploit demonstration showing heap data leakage
- Direct comparison with Issue BYVoid#997 (why this is worse)
- Exploitability analysis with test results
- Information that could be leaked (keys, passwords, etc.)
- Security impact: CWE-125, CWE-200, CVSS ~8.6
- Detailed fix explanation with multi-layer defense
- Why normal usage was not affected
- CVE recommendation and scoring rationale

Key highlights:
- Demonstrates actual heap memory leakage (0xAA bytes, "ABC" strings)
- Shows that leaked data IS OUTPUT to conversion result
- Explains ASLR bypass potential
- Documents test cases that would fail with old code
- Provides defensive programming recommendations

## Documentation Quality

Both documents include:
- Complete technical analysis in Chinese
- Code snippets with annotations
- Before/after comparisons
- Security risk assessments
- Prevention recommendations
- References to standards (CWE, CVSS, OWASP)

These documents serve as:
- Security disclosure materials
- Educational resources for similar vulnerability patterns
- Reference for CVE submission
- Internal security audit documentation

Total additions: ~860 lines of detailed security analysis
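The analysis documents themselves are not reproduced in this message. As a generic illustration of how an unsigned length counter can underflow on malformed input, the sketch below assumes (hypothetically) that the trigger is a truncated UTF-8 sequence whose lead byte claims more bytes than remain. It models the general bug class only and is not OpenCC's code; `>>> 0` emulates a 32-bit unsigned counter for brevity.

```js
// Illustration only: emulate a 32-bit unsigned counter with `>>> 0`.
// ASSUMPTION: this models the general bug class, not OpenCC's actual code.

// Number of bytes a UTF-8 lead byte claims for its character.
function claimedLength(leadByte) {
  if (leadByte < 0x80) return 1;
  if ((leadByte & 0xe0) === 0xc0) return 2;
  if ((leadByte & 0xf0) === 0xe0) return 3;
  if ((leadByte & 0xf8) === 0xf0) return 4;
  return 1; // invalid lead byte: treat as a single byte
}

// Unchecked: trusts the lead byte even if the buffer ends early.
function remainingUnchecked(remaining, leadByte) {
  return (remaining - claimedLength(leadByte)) >>> 0; // wraps like size_t
}

// Checked: never consumes more than is actually left.
function remainingChecked(remaining, leadByte) {
  const step = Math.min(claimedLength(leadByte), remaining);
  return remaining - step;
}

// 0xE4 starts a 3-byte character, but only 1 byte is left in the buffer.
console.log(remainingUnchecked(1, 0xe4)); // → 4294967294
console.log(remainingChecked(1, 0xe4));   // → 0
```

Once the counter wraps to a huge value, any loop bounded by it keeps reading past the end of the buffer, which is why clamping the step to the bytes actually available is the usual first line of defense.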
* Refresh wasm-lib assets before build
* Install Bazel before refreshing wasm assets
update package.json add --provenance to wasm-lib-publish.yml
…tion

This document explores integrating Jieba word segmentation algorithm alongside the existing mmseg (maximum match segmentation) in OpenCC through experimental configuration support.

Key findings:
- Analyzed two implementation approaches: cppjieba (C++ native) and Python embedding via pybind11
- Strongly recommends cppjieba integration for performance, deployment simplicity, and maintenance
- Designed extensible architecture using existing Segmentation interface
- Proposed experimental config format to enable jieba without affecting current functionality
- Outlined 4-phase implementation roadmap with risk mitigation strategies

The analysis includes technical details on:
- OpenCC's current segmentation architecture (Segmentation.hpp, Config.cpp)
- Jieba's algorithm principles (Trie, DAG, HMM with Viterbi)
- Detailed code examples for JiebaSegmentation class
- CMake integration approach with ENABLE_JIEBA option
- Comprehensive comparison matrix and implementation timeline
* Check in a complete copy of libcppjieba from https://github.com/yanyiwu/cppjieba (MIT License)

----

* Check in a copy of Jieba dictionary in data/jieba_dict/ for OpenCC:
  * jieba.dict.utf8 (4.9MB): Main dictionary with word frequencies
  * hmm_model.utf8 (508KB): HMM model for unknown word recognition
  * user.dict.utf8: User-defined custom dictionary
  * README.md: Dictionary documentation and customization guide

----

* Implement (experimental) Jieba segmentation support

----

* Add comprehensive test suite for Jieba segmentation

Added unit tests and comparison tests following OpenCC testing patterns.

1. Basic Unit Tests (src/JiebaSegmentationTest.cpp):
   - BasicSegmentation: Validates basic Chinese word segmentation
   - ComplexPhrase: Tests multi-word phrases and proper nouns
   - EmptyString, SingleCharacter: Edge case handling
   - EnglishAndChinese: Mixed language support
   - UnknownWords: HMM model's ability to recognize unknown words

2. JSON-Driven Comparison Tests (src/JiebaComparisonTest.cpp):
   - Follows t2cngov test pattern with external JSON test cases
   - Loads test definitions from test/testcases/jieba_comparison_testcases.json
   - Compares mmseg vs Jieba segmentation outputs
   - Displays: Input, Jieba segments, Expected segments, Conversion outputs
   - Converter caching for performance optimization

3. Test Cases Definition (test/testcases/jieba_comparison_testcases.json):
   - 15 comprehensive test cases covering:
     * Simplified to Traditional (10 cases): s2twp vs s2twp_jieba
     * Traditional to Simplified (5 cases): tw2sp vs tw2sp_jieba

   Key test scenarios:
   - jieba_s2t_001: 着名 ambiguity (wearing+name vs famous)
     "生活着名为正敏的少女" -> Expected: "生活/着/名为/正敏/的/少女"
   - jieba_s2t_002: Compound words (中学生, 中等身材)
     "一个中学生,一个中等身材的人"
   - jieba_t2s_001: Traditional 著名/為 conversion
     "生活著名為正敏的少女" -> Expected: "生活/著名/為/正敏/的/少女"
   - Other cases: Proper nouns, modern terms, mixed content, ambiguous structures, Taiwan-specific vocabulary, long compounds, classical Chinese

4. Focused Individual Tests:
   - AmbiguousCase_ZhaoMing: Detailed output for "着名" ambiguity
   - TraditionalToSimplified_ZhuMing: Detailed output for "著名" conversion

Output Format:
    === Test: jieba_s2t_001 ===
    Input: 生活着名为正敏的少女
    Jieba segments: 生活/着/名为/正敏/的/少女
    Expected segs: 生活/着/名为/正敏/的/少女
    s2twp: 生活著名爲正敏的少女
    s2twp_jieba: 生活著名為正敏的少女

Benefits:
- Visual comparison of segmentation algorithms
- Easy to add new test cases (just edit JSON)
- Documents expected behavior for ambiguous cases
- Validates that Jieba improves segmentation accuracy
- Test data can be reviewed independently from code

Build System Integration:
- Tests added to CMake UNITTESTS when ENABLE_JIEBA=ON
- Automatically run with 'make test' or 'ctest'

reorder

* Fix Jieba tests in Bazel and add more examples.

* Add comprehensive Jieba segmentation documentation

Added two detailed documentation files:

1. doc/JIEBA_SEGMENTATION_FEASIBILITY.md (559 lines)
   - Comprehensive feasibility analysis for integrating Jieba segmentation
   - Compares two implementation approaches:
     * cppjieba (C++ native) - RECOMMENDED
     * Python embedding via pybind11 - Not recommended
   - Technical analysis of Jieba's algorithm (Trie, DAG, HMM)
   - Detailed implementation plan with code examples
   - Performance, deployment, and maintenance comparison matrix
   - 4-phase implementation roadmap
   - Risk assessment and mitigation strategies

2. doc/JIEBA_USAGE.md
   - Complete user guide for Jieba segmentation feature
   - Compilation instructions with CMake
   - Configuration file format and examples
   - C++/CLI/Python API usage examples
   - Custom user dictionary guide
   - Performance considerations and benchmarks
   - mmseg vs Jieba comparison table
   - Troubleshooting guide
   - Limitations and best practices

Key recommendations:
- Use cppjieba for production (performance, zero dependencies)
- Enable via -DENABLE_JIEBA=ON compile flag
- Experimental feature, opt-in only

----

* Fix C++ compiler compatibility
* Fix //python/tests:test_opencc

---------

Co-authored-by: Claude <noreply@anthropic.com>
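To see why the "着名" case is ambiguous for a greedy segmenter, here is a minimal sketch of forward maximum matching, the idea behind mmseg. It is a simplified illustration, not OpenCC's implementation, and the dictionary is a toy set chosen for the example.

```js
// Minimal sketch of forward maximum matching (the idea behind mmseg).
// ASSUMPTION: the dictionary is a toy set chosen for illustration.
function forwardMaxMatch(text, dictionary) {
  const chars = Array.from(text); // split by code point
  const maxLen = Math.max(
    1,
    ...Array.from(dictionary, (word) => Array.from(word).length)
  );
  const segments = [];
  let i = 0;
  while (i < chars.length) {
    let matched = chars[i]; // fall back to a single character
    // Try the longest candidate first, shrinking until a word matches.
    for (let len = Math.min(maxLen, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (dictionary.has(candidate)) {
        matched = candidate;
        break;
      }
    }
    segments.push(matched);
    i += Array.from(matched).length;
  }
  return segments;
}

const toyDictionary = new Set(["生活", "着名", "名为", "少女"]);
console.log(forwardMaxMatch("生活着名为正敏的少女", toyDictionary).join("/"));
// → 生活/着名/为/正/敏/的/少女
```

Because the scan is greedy and left to right, it commits to "着名" before it can consider "着" + "名为", which is the segmentation the test case expects from a frequency-aware segmenter.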
…cters-according-to-Chinese-government-standards Latest dictionaries and modes; add s2t_cngov.json and t2s_cngov.json
…yTian-tech/OpenCC-Traditional-Chinese-characters-according-to-Chinese-government-standards
…acters-according-to-Chinese-government-standards as a submodule under deps/cngov
Source: TerryTian-tech/OpenCC-Traditional-Chinese-characters-according-to-Chinese-government-standards
Upstream version: 1.2.4
Upstream commit: e7d3c9f8921ca682fd44ee1b117c6e59fee3ac8e
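Motivation
- The `ENABLE_JIEBA` option is compiled statically into the core, which couples the core to an optional segmentation capability and makes packaging difficult.
- When `libopencc` encounters `"segmentation": { "type": "jieba" }`, it provides segmentation by dynamically loading a plugin (for example `libopencc-jieba.so`), which makes it easy for distributions to split the package into `opencc` and `opencc-jieba`.
Description
- `doc/JIEBA_PLUGIN_ARCHITECTURE_PLAN.md` provides a complete plugin design and execution plan, including an architecture diagram, a draft C ABI function table (sketched as `src/plugin/OpenCCPlugin.h`), the `PluginManager` and `PluginSegmentationAdapter` designs, error semantics and memory release strategy, plugin search order and security controls, and CMake/Bazel and packaging recommendations.
- `doc/JIEBA_USAGE.md` gains a cross-reference link at the top that points users to the new plugin design document for migration and packaging.
- `OPENCC_SEGMENTATION_PLUGIN_PATH`, plus an optional `OPENCC_DISABLE_PLUGINS` safety switch, together with an incremental migration and compatibility strategy (the legacy `ENABLE_JIEBA` support is retained for the transition).
Testing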
Codex Task