Skip to content

Add comprehensive Unicode handling guide and UTF-8 encoding configuration#1

Draft
Copilot wants to merge 4 commits intomainfrom
copilot/update-unicode-handling-guide
Draft

Add comprehensive Unicode handling guide and UTF-8 encoding configuration#1
Copilot wants to merge 4 commits intomainfrom
copilot/update-unicode-handling-guide

Conversation

Copy link
Copy Markdown

Copilot AI commented Oct 19, 2025

Overview

This PR adds comprehensive documentation and configuration for Unicode handling across multiple programming languages, addressing the need for consistent UTF-8 encoding standards and multi-language string processing best practices.

Problem Statement

The repository needed guidance on:

  • 原始碼編碼 (Source Code Encoding): Standardizing on UTF-8 to prevent mojibake (乱码)
  • 字串與 Unicode (String & Unicode): Understanding language-specific differences in string handling
  • 比對與正規化 (Comparison & Normalization): Proper Unicode equivalence comparison techniques
  • 資料結構 (Data Structures): Multi-language localization patterns

Solution

1. Unicode Handling Guide (UNICODE_HANDLING_GUIDE.md)

Created a comprehensive 662-line guide covering:

Source Code Encoding

  • UTF-8 as project standard
  • Editor configuration examples for VS Code, JetBrains IDEs, and Vim

Language-Specific String Handling

  • Python 3: str as Unicode with normalization examples
  • Rust: String/&str as guaranteed valid UTF-8
  • Go: string as byte sequence, conventionally UTF-8
  • JavaScript/TypeScript: string as UTF-16 with surrogate pair warnings
  • C# / .NET: string as UTF-16 with StringInfo usage

Unicode Normalization

  • Detailed explanation of NFC, NFD, NFKC, NFKD forms
  • Safe comparison examples for all languages
  • Practical demonstrations with composed/decomposed characters (café vs café)

Multi-Language Data Structures

  • Dictionary<Locale, TokenMap> pattern implementation
  • Complete examples in TypeScript, Python, Rust, Go, and C#
  • Includes Chinese (你好), Japanese (こんにちは), and English translations

2. Project Configuration

.gitattributes

  • Enforces UTF-8 encoding for source files
  • Sets proper line endings and text normalization
  • Ensures consistent handling across platforms

.editorconfig

  • Enforces UTF-8 charset across all file types
  • Configures consistent indentation per language
  • Provides uniform coding style across different editors

3. Enhanced README

Updated with:

  • Quick reference table comparing string handling across languages
  • Key considerations for Unicode processing
  • Direct link to comprehensive guide

Examples

The guide includes working code examples that demonstrate proper Unicode handling:

# Python: Unicode normalization for comparison
from unicodedata import normalize
composed = "café"
decomposed = "café"
print(normalize('NFC', composed) == normalize('NFC', decomposed))  # True
// JavaScript: Handling emojis and surrogate pairs
const emoji = "🌍";
console.log(emoji.length);        // 2 (surrogate pair)
console.log([...emoji].length);   // 1 (one code point)
// Rust: Safe UTF-8 handling
let text = "Hello 世界 🌍";
println!("Byte length: {}", text.len());
println!("Char count: {}", text.chars().count());

Testing

All files verified to:

  • Use UTF-8 encoding (file -i confirms charset=utf-8)
  • Contain multi-language Unicode characters (Chinese, Japanese, emojis)
  • Render correctly across different platforms

Benefits

Prevents encoding issues: UTF-8 configuration prevents mojibake
Cross-platform consistency: Works across Windows, macOS, and Linux
Language coverage: Supports Python, Rust, Go, JavaScript/TypeScript, and C#
Practical guidance: Copy-paste ready examples for immediate use
Best practices: Comprehensive recommendations for Unicode handling

Related

Addresses requirements from issue regarding Unicode handling best practices across different programming languages.

Original prompt

ChatGPT: " * 原始碼編碼:專案統一 UTF-8;檔案另存為 UTF-8 可避免亂碼(多數現代編譯器/工具預設支援)。

  • 字串與 Unicode(語言差異大要留意):

    • Python 3 的 str 為 Unicode;
    • Rust String/&str 為 UTF-8;
    • Go 的 string 是位元組序列,慣例存放 UTF-8;
    • JS/TS 與 .NET/C# 的 string 以 UTF-16 編碼單位運作(遇到表情符號/合字需留心 code point vs. code unit)。
  • 比對與正規化:若要做「等價比對」(如「é」的合成/分解形),請考慮使用 ICU 或語言內建的 Unicode Normalization API。

  • 資料結構:若多語系,常見結構是 Dictionary<string, TokenMap>,例如 locale -> (token -> localized text)。"


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI and others added 3 commits October 19, 2025 19:02
…tion

Co-authored-by: ewdlop <25368970+ewdlop@users.noreply.github.com>
…ration

Co-authored-by: ewdlop <25368970+ewdlop@users.noreply.github.com>
…ility

Co-authored-by: ewdlop <25368970+ewdlop@users.noreply.github.com>
Copilot AI changed the title [WIP] Update encoding guideline for Unicode and strings Add comprehensive Unicode handling guide and UTF-8 encoding configuration Oct 19, 2025
Copilot AI requested a review from ewdlop October 19, 2025 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants