| 1 | +# Unicode String Length Handling |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This document explains how string length validation works in `pubky-app-specs` and the important differences between JavaScript's native string length and Rust's character counting. |
| 6 | + |
| 7 | +## The Problem |
| 8 | + |
| 9 | +JavaScript and Rust count string length differently for certain Unicode characters: |
| 10 | + |
| 11 | +| Character | Type | Rust `.chars().count()` | JS `.length` | |
| 12 | +|-----------|------|-------------------------|--------------| |
| 13 | +| `"Hello"` | ASCII | 5 | 5 | |
| 14 | +| `"中文"` | Chinese | 2 | 2 | |
| 15 | +| `"café"` | Accented | 4 | 4 | |
| 16 | +| `"🔥"` | Emoji | **1** | **2** | |
| 17 | +| `"𒅃"` | Cuneiform | **1** | **2** | |
| 18 | +| `"𓀀"` | Hieroglyph | **1** | **2** | |
| 19 | + |
| 20 | +### Why the Difference? |
| 21 | + |
| 22 | +- **JavaScript** uses **UTF-16** encoding internally. The `.length` property counts **UTF-16 code units**. |
| 23 | +- **Rust** `.chars().count()` counts **Unicode code points** (scalar values). |
| 24 | + |
| 25 | +Characters in the **Basic Multilingual Plane (BMP)** (U+0000 to U+FFFF) use 1 UTF-16 code unit. |
| 26 | +Characters **outside the BMP** (U+10000 and above) require a **surrogate pair** (2 UTF-16 code units). |
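To make the surrogate pair visible, a short JavaScript sketch can compare the UTF-16 code units of a string with its code point. The helper name `isOutsideBmp` is illustrative and not part of `pubky-app-specs`.

```javascript
// Illustrative helper (not part of pubky-app-specs): true when the first
// code point of `ch` lies above U+FFFF and therefore needs a surrogate pair.
function isOutsideBmp(ch) {
  return ch.codePointAt(0) > 0xffff;
}

const fire = "🔥"; // U+1F525

// .length counts UTF-16 code units: the high and the low surrogate.
console.log(fire.length); // 2
console.log(fire.charCodeAt(0).toString(16)); // d83d (high surrogate)
console.log(fire.charCodeAt(1).toString(16)); // dd25 (low surrogate)

// codePointAt(0) reads the pair back as a single code point.
console.log(fire.codePointAt(0).toString(16)); // 1f525

console.log(isOutsideBmp(fire)); // true
console.log(isOutsideBmp("中")); // false (U+4E2D is in the BMP)
```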
| 27 | + |
| 28 | +### Characters Outside BMP (Affected by This Difference) |
| 29 | + |
| 30 | +| Category | Examples | UTF-16 Units per Char | |
| 31 | +|----------|----------|----------------------| |
| 32 | +| Emoji | 🔥 🚀 😀 👋 🌍 | 2 | |
| 33 | +| Cuneiform (Sumerian) | 𒅃 𒀀 𒁀 | 2 | |
| 34 | +| Egyptian Hieroglyphs | 𓀀 𓆉 𓍄 | 2 | |
| 35 | +| Musical Symbols | 𝄞 𝄢 | 2 | |
| 36 | +| Mathematical Alphanumeric | 𝔸 𝕏 | 2 | |
| 37 | +| Historic Scripts | Various | 2 | |
| 38 | + |
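| 39 | +**Note**: Most characters in everyday use (ASCII, the common Chinese, Japanese, and Korean characters, Arabic, Hebrew, Cyrillic, Greek, Thai, etc.) are in the BMP, use 1 UTF-16 code unit, and are **unaffected** by this difference. The split is by code point, not by script: rare CJK ideographs such as 𠮷 (U+20BB7) lie outside the BMP, while some emoji such as ❤ (U+2764) lie inside it. |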
| 40 | + |
| 41 | +## Our Solution: WASM-Based Validation |
| 42 | + |
| 43 | +All validation in `pubky-app-specs` happens **inside the WASM module** (Rust), not in JavaScript. |
| 44 | + |
| 45 | +### Architecture |
| 46 | + |
| 47 | +``` |
| 48 | +┌─────────────────────────────────────────────────────────┐ |
| 49 | +│ JavaScript Client │ |
| 50 | +│ │ |
| 51 | +│ const user = PubkyAppUser.fromJson({ │ |
| 52 | +│ name: "Alice🔥", │ |
| 53 | +│ bio: "Hello 𓀀" │ |
| 54 | +│ }); │ |
| 55 | +└─────────────────────┬───────────────────────────────────┘ |
| 56 | + │ |
| 57 | + ▼ |
| 58 | +┌─────────────────────────────────────────────────────────┐ |
| 59 | +│ WASM Module (Rust) │ |
| 60 | +│ │ |
| 61 | +│ 1. Deserialize JSON │ |
| 62 | +│ 2. Sanitize (trim whitespace, normalize) │ |
| 63 | +│ 3. Validate (using .chars().count()) ◄── Single │ |
| 64 | +│ 4. Return Result Source │ |
| 65 | +│ of Truth │ |
| 66 | +└─────────────────────────────────────────────────────────┘ |
| 67 | +``` |
| 68 | + |
| 69 | +### Why This Works |
| 70 | + |
| 71 | +1. **Single Source of Truth**: All validation uses Rust's `.chars().count()` (Unicode code points) |
| 72 | +2. **No JS Validation Needed**: JavaScript delegates entirely to WASM |
| 73 | +3. **Consistent Results**: Same behavior for emoji, Chinese, cuneiform, etc. |
| 74 | + |
| 75 | +### Example: Username Validation |
| 76 | + |
| 77 | +```rust |
| 78 | +// In Rust (WASM) |
| 79 | +const MAX_USERNAME_LENGTH: usize = 50; |
| 80 | + |
| 81 | +fn validate(&self, _id: Option<&str>) -> Result<(), String> { |
| 82 | + let name_length = self.name.chars().count(); // Unicode code points |
| 83 | + if name_length > MAX_USERNAME_LENGTH { |
| 84 | + return Err("Validation Error: Invalid name length".into()); |
| 85 | + } |
| 86 | + Ok(()) |
| 87 | +} |
| 88 | +``` |
| 89 | + |
| 90 | +| Input | `.chars().count()` | Valid? (max 50) | |
| 91 | +|-------|-------------------|-----------------| |
| 92 | +| `"Alice"` | 5 | ✅ | |
| 93 | +| `"🔥".repeat(50)` | 50 | ✅ | |
| 94 | +| `"🔥".repeat(51)` | 51 | ❌ | |
| 95 | +| `"𓀀".repeat(50)` | 50 | ✅ | |
| 96 | + |
| 97 | +## Client-Side Validation |
| 98 | + |
| 99 | +For client-side validation (e.g., UX feedback before submitting), we recommend relying on the validation that `pubky-app-specs` already performs in the WASM module rather than reimplementing the rules in JavaScript. |
| 100 | + |
| 101 | +### How to Validate in Your Application |
| 102 | + |
| 103 | +The WASM module automatically validates all objects when you create them or parse them from JSON. Use these methods for validation: |
| 104 | + |
| 105 | +```javascript |
| 106 | +import { PubkySpecsBuilder, PubkyAppUser } from "pubky-app-specs"; |
| 107 | + |
| 108 | +// Method 1: Using builder |
| 109 | +try { |
| 110 | + const builder = new PubkySpecsBuilder(userId); |
| 111 | + const { user } = builder.createUser( |
| 112 | + "Alice🔥", // Emoji counts as 1 character |
| 113 | + "Bio with 𓀀", // Hieroglyph counts as 1 character |
| 114 | + null, null, null |
| 115 | + ); |
| 116 | + console.log("User is valid!"); |
| 117 | +} catch (error) { |
| 118 | + showError(error.message); // Validation failed |
| 119 | +} |
| 120 | + |
| 121 | +// Method 2: From JSON |
| 122 | +try { |
| 123 | + const user = PubkyAppUser.fromJson({ |
| 124 | + name: "Alice🔥", |
| 125 | + bio: "Bio with 𓀀", |
| 126 | + image: null, |
| 127 | + links: null, |
| 128 | + status: null |
| 129 | + }); |
| 130 | + console.log("User is valid!"); |
| 131 | +} catch (error) { |
| 132 | + showError(error.message); // Validation failed |
| 133 | +} |
| 134 | + |
| 135 | +// Both methods throw on validation failure - no manual checks needed! |
| 136 | +``` |
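How a failure surfaces in JavaScript can depend on how the Rust error is converted by the bindings: the thrown value may be an `Error` object or a plain string, and in the latter case `error.message` would be `undefined`. The sketch below is one defensive way to handle both. `tryValidate` is a hypothetical helper, the thunks stand in for calls such as `PubkyAppUser.fromJson(...)`, and the shape of the values thrown by `pubky-app-specs` is an assumption here, not something this document confirms.

```javascript
// Hypothetical helper: runs a constructor/parser and turns a thrown
// validation failure into a plain result object for form handling.
function tryValidate(build) {
  try {
    return { ok: true, value: build(), error: null };
  } catch (thrown) {
    // Accept both Error objects and plain strings.
    const message = thrown instanceof Error ? thrown.message : String(thrown);
    return { ok: false, value: null, error: message };
  }
}

// Stand-ins for a WASM call; replace with e.g. () => PubkyAppUser.fromJson(data).
const okResult = tryValidate(() => ({ name: "Alice🔥" }));
const errResult = tryValidate(() => {
  // A string is thrown here to mimic a Rust `String` error (assumption).
  throw "Validation Error: Invalid name length";
});

console.log(okResult.ok); // true
console.log(errResult.error); // Validation Error: Invalid name length
```

A form handler could then branch on `result.ok` and pass `result.error` to its own error display without caring which kind of value was thrown.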
| 137 | + |
| 138 | +### JavaScript Length Methods Comparison |
| 139 | + |
| 140 | +If you need client-side length validation for real-time input feedback (e.g., character counters) or custom validation, you should use methods that count Unicode code points to match Rust's `.chars().count()` behavior: |
| 141 | + |
| 142 | +```javascript |
| 143 | +const str = "Hi🔥"; |
| 144 | + |
| 145 | +// ❌ WRONG - counts UTF-16 code units, not Unicode code points |
| 146 | +str.length // 4 (will reject valid input) |
| 147 | +if (username.length > MAX_USERNAME_LENGTH) { |
| 148 | + showError("Username too long"); |
| 149 | +} |
| 150 | +// With MAX_USERNAME_LENGTH = 50, this would incorrectly reject "🔥".repeat(26) |
| 151 | +// because JS sees 52 code units (> 50), but Rust sees 26 code points (valid!) |
| 152 | + |
| 153 | +// ✅ CORRECT - counts Unicode code points (matches Rust) |
| 154 | +// These methods correctly handle characters outside BMP (emoji, etc.) |
| 155 | +[...str].length // 3 (Unicode code points) - counts 🔥 as 1 |
| 156 | +Array.from(str).length // 3 (also works) |
| 157 | +``` |
| 158 | + |
| 159 | +### When to Validate |
| 160 | + |
| 161 | +- **On form submit**: Always - catch errors before network calls |
| 162 | +- **Real-time feedback**: Optional - use `[...str].length` for input counters |
| 163 | +- **On input change**: Usually not needed - showing errors on every keystroke can disrupt typing, e.g. while an emoji picker or autocomplete is still composing input |
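
For an input counter, the code-point count can be wrapped in a small helper. This is a sketch, not part of `pubky-app-specs`: `countCodePoints` and `remainingChars` are illustrative names, the limit of 50 mirrors `MAX_USERNAME_LENGTH` from the Rust example, and the `trim()` call only approximates the sanitization step, whose exact rules are not described here. The WASM validation remains authoritative.

```javascript
const MAX_USERNAME_LENGTH = 50; // mirrors the Rust constant shown earlier

// Counts Unicode code points; string iteration walks code points, not UTF-16 units.
function countCodePoints(str) {
  let count = 0;
  for (const _ of str) count += 1;
  return count;
}

// Characters left before the limit; negative when the input is too long.
function remainingChars(input, max = MAX_USERNAME_LENGTH) {
  return max - countCodePoints(input.trim());
}

console.log(remainingChars("Alice🔥")); // 44
console.log(remainingChars("🔥".repeat(50))); // 0
console.log(remainingChars("🔥".repeat(51))); // -1
```

A counter built this way agrees with the Rust check for emoji and other characters outside the BMP, so the UI does not flag input that the WASM module would accept.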
| 164 | + |
| 165 | +### Edge Cases: Grapheme Clusters (Advanced) |
| 166 | + |
| 167 | +⚠️ **This is informational** - current validation doesn't handle grapheme clusters, and that's acceptable for most use cases. |
| 168 | + |
| 169 | +Even `.chars().count()` doesn't handle complex **grapheme clusters** (what users perceive as single characters): |
| 170 | + |
| 171 | +| String | Visual | Code Points | User Perception | |
| 172 | +|--------|--------|-------------|----------------| |
| 173 | +| `"👨‍👩‍👧‍👦"` | family emoji (4 emoji joined by 3 zero-width joiners) | 7 | 1 |
| 174 | +| `"🇺🇸"` | flag | 2 | 1 | |
| 175 | +| `"é"` (e + ◌́) | accented e | 2 | 1 | |
| 176 | + |
| 177 | +**Impact**: A username with 50 flag emojis would actually be 100 code points and fail validation. |
| 178 | + |
| 179 | +**Decision**: For usernames, tags, and bios, code point counting is sufficient. True grapheme counting would add complexity and dependencies without significant benefit for this use case. |
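
If a UI ever needs to show the count a user perceives, JavaScript's built-in `Intl.Segmenter` can split a string into grapheme clusters. The sketch below only illustrates the gap between the two counts; it is not how `pubky-app-specs` validates, and `countGraphemes` is an illustrative name. Exact grapheme results depend on the Unicode data shipped with the runtime.

```javascript
// Illustrative helper: counts user-perceived characters (grapheme clusters).
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

// Escapes are used so the invisible joiners and combining marks are explicit.
const samples = {
  flag: "\u{1F1FA}\u{1F1F8}", // two regional indicator symbols
  accentedE: "e\u0301", // e + combining acute accent
  family: "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}", // 4 emoji + 3 ZWJ
};

for (const [name, str] of Object.entries(samples)) {
  console.log(name, "code points:", [...str].length, "graphemes:", countGraphemes(str));
}
```

Any move to grapheme-based limits would need to happen in the Rust validation as well, so that the WASM module stays the single source of truth.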
| 180 | + |
| 181 | +## Summary |
| 182 | + |
| 183 | +| Aspect | Approach | |
| 184 | +|--------|----------| |
| 185 | +| **Validation Location** | WASM (Rust) only | |
| 186 | +| **Length Method** | `.chars().count()` (Unicode code points) | |
| 187 | +| **JS Client** | Use `[...str].length` if local validation needed | |
| 188 | +| **Affected Characters** | Emoji, ancient scripts, musical symbols | |
| 189 | +| **Unaffected Characters** | ASCII, Chinese, Japanese, Arabic, etc. | |
| 190 | +| **Performance** | <1ms for typical inputs | |
| 191 | + |
| 192 | +## References |
| 193 | + |
| 194 | +- [Unicode Standard](https://unicode.org/) |
| 195 | +- [UTF-16 on Wikipedia](https://en.wikipedia.org/wiki/UTF-16) |
| 196 | +- [JavaScript String length](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length) |
| 197 | +- [Rust chars() documentation](https://doc.rust-lang.org/std/primitive.str.html#method.chars) |