Add comprehensive Unicode handling guide and UTF-8 encoding configuration by Copilot · Pull Request #1 · NotASecretOrganzation/TypeAlias

Copilot · 2025-10-19T18:57:37Z

Overview

This PR adds comprehensive documentation and configuration for Unicode handling across multiple programming languages, addressing the need for consistent UTF-8 encoding standards and multi-language string processing best practices.

Problem Statement

The repository needed guidance on:

Source Code Encoding: Standardizing on UTF-8 to prevent mojibake (garbled text)
String & Unicode: Understanding language-specific differences in string handling
Comparison & Normalization: Proper Unicode equivalence comparison techniques
Data Structures: Multi-language localization patterns

Solution

1. Unicode Handling Guide (UNICODE_HANDLING_GUIDE.md)

Created a comprehensive 662-line guide covering:

Source Code Encoding

UTF-8 as project standard
Editor configuration examples for VS Code, JetBrains IDEs, and Vim

Language-Specific String Handling

Python 3: str as Unicode with normalization examples
Rust: String/&str as guaranteed valid UTF-8
Go: string as byte sequence, conventionally UTF-8
JavaScript/TypeScript: string as UTF-16 with surrogate pair warnings
C# / .NET: string as UTF-16 with StringInfo usage
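
To make the differences listed above concrete, here is a minimal Python sketch (not taken from the guide itself) that measures one sample string in the three units these languages expose: code points, UTF-8 bytes, and UTF-16 code units. The variable names are illustrative only.

```python
# One string, three different "lengths", depending on the unit being counted.
text = "Hello 世界 🌍"

# Python 3 str: len() counts Unicode code points.
code_points = len(text)

# Rust String / Go string: len() counts UTF-8 bytes.
utf8_bytes = len(text.encode("utf-8"))

# JavaScript .length / C# .Length: count UTF-16 code units.
# Each code unit is 2 bytes, so divide the encoded byte length by 2.
utf16_units = len(text.encode("utf-16-le")) // 2

print(code_points, utf8_bytes, utf16_units)  # 10 17 11
```

The emoji accounts for the gap: it is one code point, four UTF-8 bytes, and two UTF-16 code units.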

Unicode Normalization

Detailed explanation of NFC, NFD, NFKC, NFKD forms
Safe comparison examples for all languages
Practical demonstrations with composed/decomposed characters (café written with precomposed U+00E9 vs. café written as "e" plus combining U+0301)
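
As a rough illustration of the points above, the following Python sketch compares a composed and a decomposed spelling, then contrasts canonical (NFC) with compatibility (NFKC) normalization. The helper name `nfc_equal` is hypothetical and not part of the guide.

```python
import unicodedata

composed = "caf\u00e9"     # "é" as a single precomposed code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings for canonical equivalence via NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_equal = composed == decomposed                 # False: code points differ
canonical_equal = nfc_equal(composed, decomposed)  # True: canonically equivalent

# NFD splits the precomposed form into base letter + combining mark.
nfd_length = len(unicodedata.normalize("NFD", composed))  # 5

# Compatibility forms also fold presentation variants such as ligatures.
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
nfc_result = unicodedata.normalize("NFC", ligature)    # unchanged
nfkc_result = unicodedata.normalize("NFKC", ligature)  # "fi"

print(raw_equal, canonical_equal, nfd_length, nfkc_result)
```

NFC or NFD is usually the safer choice for equality checks, since NFKC and NFKD discard distinctions (ligatures, width variants) that some applications need to keep.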

Multi-Language Data Structures

Dictionary<Locale, TokenMap> pattern implementation
Complete examples in TypeScript, Python, Rust, Go, and C#
Includes Chinese (你好), Japanese (こんにちは), and English translations
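
A minimal Python sketch of the locale -> (token -> localized text) pattern might look like the following. The locale tags, the `greeting` token, and the `localize` helper with its English fallback are assumptions made for illustration; the guide's own implementations may differ.

```python
from typing import Dict

# TokenMap: token -> localized text
TokenMap = Dict[str, str]

# Outer dictionary: locale -> TokenMap (locale tags here are hypothetical).
translations: Dict[str, TokenMap] = {
    "en": {"greeting": "Hello"},
    "zh": {"greeting": "你好"},
    "ja": {"greeting": "こんにちは"},
}


def localize(locale: str, token: str, fallback: str = "en") -> str:
    """Look up a token for a locale, then the fallback locale, then give up."""
    for candidate in (locale, fallback):
        text = translations.get(candidate, {}).get(token)
        if text is not None:
            return text
    # No translation anywhere: return the token itself so the gap is visible.
    return token


print(localize("ja", "greeting"))  # こんにちは
print(localize("fr", "greeting"))  # Hello (falls back to "en")
```

Returning the raw token on a miss is one possible design choice; raising an error instead would surface missing translations earlier.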

2. Project Configuration

.gitattributes

Enforces UTF-8 encoding for source files
Sets proper line endings and text normalization
Ensures consistent handling across platforms

.editorconfig

Enforces UTF-8 charset across all file types
Configures consistent indentation per language
Provides uniform coding style across different editors

3. Enhanced README

Updated with:

Quick reference table comparing string handling across languages
Key considerations for Unicode processing
Direct link to comprehensive guide

Examples

The guide includes working code examples that demonstrate proper Unicode handling:

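# Python: Unicode normalization for comparison
from unicodedata import normalize
composed = "caf\u00e9"     # "é" as one precomposed code point (U+00E9)
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent (U+0301)
print(composed == decomposed)  # False
print(normalize('NFC', composed) == normalize('NFC', decomposed))  # True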

// JavaScript: Handling emojis and surrogate pairs
const emoji = "🌍";
console.log(emoji.length);        // 2 (surrogate pair)
console.log([...emoji].length);   // 1 (one code point)

// Rust: Safe UTF-8 handling
fn main() {
    let text = "Hello 世界 🌍";
    println!("Byte length: {}", text.len());           // 17 (UTF-8 bytes)
    println!("Char count: {}", text.chars().count());  // 10 (Unicode scalar values)
}

Testing

All files verified to:

Use UTF-8 encoding (file -i confirms charset=utf-8)
Contain multi-language Unicode characters (Chinese, Japanese, emojis)
Render correctly across different platforms
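
The encoding check could also be scripted rather than run by hand. The sketch below is one possible approach, not the procedure used in this PR: a hypothetical `is_valid_utf8` helper attempts a strict UTF-8 decode, demonstrated on two temporary files.

```python
import tempfile
from pathlib import Path


def is_valid_utf8(path: Path) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.txt"
    bad = Path(tmp) / "bad.txt"

    good.write_bytes("你好 こんにちは 🌍".encode("utf-8"))
    # Latin-1 encodes "é" as the lone byte 0xE9, which is not valid UTF-8.
    bad.write_bytes("caf\u00e9".encode("latin-1"))

    good_ok = is_valid_utf8(good)
    bad_ok = is_valid_utf8(bad)

print(good_ok, bad_ok)  # True False
```

A strict decode only shows that the bytes are well-formed UTF-8; it cannot tell whether the text was originally intended to be in some other encoding that happens to decode cleanly.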

Benefits

✅ Prevents encoding issues: UTF-8 configuration prevents mojibake
✅ Cross-platform consistency: Works across Windows, macOS, and Linux
✅ Language coverage: Supports Python, Rust, Go, JavaScript/TypeScript, and C#
✅ Practical guidance: Copy-paste ready examples for immediate use
✅ Best practices: Comprehensive recommendations for Unicode handling

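ChatGPT: "Source code encoding: standardize the project on UTF-8; saving files as UTF-8 avoids garbled text (most modern compilers and tools support it by default).

Strings and Unicode (languages differ considerably, so take care):

Python 3's str is Unicode;

Rust's String/&str are UTF-8;

Go's string is a byte sequence that conventionally holds UTF-8;

JS/TS and .NET/C# strings operate on UTF-16 code units (with emoji and combined characters, mind the difference between code point and code unit).

Comparison and normalization: for equivalence comparison (such as the composed and decomposed forms of "é"), consider using ICU or the language's built-in Unicode Normalization API.

Data structures: for multiple locales, a common structure is Dictionary<string, TokenMap>, for example locale -> (token -> localized text)."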

Initial plan

37c1ccc

Copilot AI assigned Copilot and ewdlop Oct 19, 2025

Copilot started work on behalf of ewdlop October 19, 2025 18:58 View session

Copilot AI and others added 3 commits October 19, 2025 19:02

Add comprehensive Unicode handling guide and UTF-8 encoding configura…

c24c829

…tion Co-authored-by: ewdlop <25368970+ewdlop@users.noreply.github.com>

Address code review: clarify TypeScript type definition is for illust…

dfe63ad

…ration Co-authored-by: ewdlop <25368970+ewdlop@users.noreply.github.com>

Improve TypeScript comment for better clarity on normalize() compatib…

675f2ab

…ility Co-authored-by: ewdlop <25368970+ewdlop@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Update encoding guideline for Unicode and strings~~ Add comprehensive Unicode handling guide and UTF-8 encoding configuration Oct 19, 2025

Copilot finished work on behalf of ewdlop October 19, 2025 19:07

Copilot AI requested a review from ewdlop October 19, 2025 19:07

Copilot wants to merge 4 commits into main from copilot/update-unicode-handling-guide
