Commit a46e4ba

chore: add new validation rules (#84)
* draft of validation
* more test coverage and small fixes
* add JS coverage
* fmt fixes
* final touches
* decrease allowed protocols
* improve docs
* review, request fixes
* ready to publish v0.4.2
1 parent 33a52eb commit a46e4ba

20 files changed: +2154 -432 lines changed

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (c) 2024
+Copyright (c) 2024-2026 Synonym
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

docs/UNICODE_NOTES.md

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
# Unicode String Length Handling

## Overview

This document explains how string length validation works in `pubky-app-specs` and the important differences between JavaScript's native string length and Rust's character counting.

## The Problem

JavaScript and Rust count string length differently for certain Unicode characters:

| String | Type | Rust `.chars().count()` | JS `.length` |
|--------|------|-------------------------|--------------|
| `"Hello"` | ASCII | 5 | 5 |
| `"中文"` | Chinese | 2 | 2 |
| `"café"` | Accented | 4 | 4 |
| `"🔥"` | Emoji | **1** | **2** |
| `"𒅃"` | Cuneiform | **1** | **2** |
| `"𓀀"` | Hieroglyph | **1** | **2** |

### Why the Difference?

- **JavaScript** represents strings as sequences of **UTF-16** code units. The `.length` property counts **UTF-16 code units**.
- **Rust** `.chars().count()` counts **Unicode code points** (scalar values).

Characters in the **Basic Multilingual Plane (BMP)** (U+0000 to U+FFFF) use 1 UTF-16 code unit.
Characters **outside the BMP** (U+10000 and above) require a **surrogate pair** (2 UTF-16 code units).
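
The surrogate-pair behaviour is easy to observe in any JavaScript engine. The following is a minimal sketch; the helper name `measure` is illustrative and not part of `pubky-app-specs`:

```javascript
// Illustrative helper (hypothetical name): measures a string both ways.
function measure(str) {
  return {
    utf16Units: str.length,      // what JS `.length` counts
    codePoints: [...str].length, // string iteration yields code points; for
                                 // well-formed strings this matches Rust's
                                 // `.chars().count()`
  };
}

const fire = "🔥"; // U+1F525, outside the BMP

console.log(measure("中文")); // { utf16Units: 2, codePoints: 2 }
console.log(measure(fire));   // { utf16Units: 2, codePoints: 1 }

// The two code units are a high surrogate followed by a low surrogate.
console.log(fire.charCodeAt(0).toString(16));  // "d83d"
console.log(fire.charCodeAt(1).toString(16));  // "dd25"
console.log(fire.codePointAt(0).toString(16)); // "1f525"
```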

### Characters Outside BMP (Affected by This Difference)

| Category | Examples | UTF-16 Units per Char |
|----------|----------|----------------------|
| Emoji | 🔥 🚀 😀 👋 🌍 | 2 |
| Cuneiform (Sumerian) | 𒅃 𒀀 𒁀 | 2 |
| Egyptian Hieroglyphs | 𓀀 𓆉 𓍄 | 2 |
| Musical Symbols | 𝄞 𝄢 | 2 |
| Mathematical Alphanumeric | 𝔸 𝕏 | 2 |
| Historic Scripts | Various | 2 |

**Note**: Characters in the BMP (ASCII and the commonly used characters of Chinese, Japanese, Korean, Arabic, Hebrew, Cyrillic, Greek, Thai, etc.) all use 1 UTF-16 unit and are **unaffected** by this difference. The categories are not perfectly clean: rare CJK ideographs live outside the BMP, and a few older emoji such as `☀` (U+2600) live inside it.
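
To check whether a given input is affected at all, one option is to test for any code point above U+FFFF. This is a sketch only; `hasSupplementaryChars` is a hypothetical helper name, not an export of `pubky-app-specs`:

```javascript
// Hypothetical helper: true if the string contains at least one code point
// outside the BMP. The `u` flag makes the regex match whole code points
// rather than individual UTF-16 code units.
function hasSupplementaryChars(str) {
  return /[\u{10000}-\u{10FFFF}]/u.test(str);
}

console.log(hasSupplementaryChars("Hello 中文 café")); // false
console.log(hasSupplementaryChars("Alice🔥"));         // true
console.log(hasSupplementaryChars("𝄞"));               // true
console.log(hasSupplementaryChars("\u2600"));          // false (☀ is in the BMP)
```

For strings where this returns `false`, `.length` and the code point count agree.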

## Our Solution: WASM-Based Validation

All validation in `pubky-app-specs` happens **inside the WASM module** (Rust), not in JavaScript.

### Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    JavaScript Client                    │
│                                                         │
│   const user = PubkyAppUser.fromJson({                  │
│     name: "Alice🔥",                                    │
│     bio: "Hello 𓀀"                                      │
│   });                                                   │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│                   WASM Module (Rust)                    │
│                                                         │
│   1. Deserialize JSON                                   │
│   2. Sanitize (trim whitespace, normalize)              │
│   3. Validate (using .chars().count())  ◄── Single      │
│   4. Return Result                          Source      │
│                                             of Truth    │
└─────────────────────────────────────────────────────────┘
```

### Why This Works

1. **Single Source of Truth**: All validation uses Rust's `.chars().count()` (Unicode code points)
2. **No JS Validation Needed**: JavaScript delegates entirely to WASM
3. **Consistent Results**: Same behavior for emoji, Chinese, cuneiform, etc.

### Example: Username Validation

```rust
// In Rust (WASM)
const MAX_USERNAME_LENGTH: usize = 50;

fn validate(&self, _id: Option<&str>) -> Result<(), String> {
    let name_length = self.name.chars().count(); // Unicode code points
    if name_length > MAX_USERNAME_LENGTH {
        return Err("Validation Error: Invalid name length".into());
    }
    Ok(())
}
```

| Input | `.chars().count()` | Valid? (max 50) |
|-------|-------------------|-----------------|
| `"Alice"` | 5 | ✅ |
| `"🔥".repeat(50)` | 50 | ✅ |
| `"🔥".repeat(51)` | 51 | ❌ |
| `"𓀀".repeat(50)` | 50 | ✅ |
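
For comparison, the same length rule can be approximated in JavaScript. The sketch below is a hypothetical pre-check, not the library's API: the constant is copied by hand from the Rust snippet above, and the function ignores the sanitization step that the WASM module runs before validating.

```javascript
// Assumption: mirrors MAX_USERNAME_LENGTH from the Rust snippet above.
const MAX_USERNAME_LENGTH = 50;

// Hypothetical pre-check; the WASM module remains the source of truth.
function isNameLengthValid(name) {
  // Spreading a string iterates by code point, like Rust's `.chars()`.
  return [...name].length <= MAX_USERNAME_LENGTH;
}

console.log(isNameLengthValid("Alice"));         // true
console.log(isNameLengthValid("🔥".repeat(50))); // true  (100 UTF-16 units, 50 code points)
console.log(isNameLengthValid("🔥".repeat(51))); // false
console.log(isNameLengthValid("𓀀".repeat(50))); // true
```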

## Client-Side Validation

For client-side validation (e.g. for UX feedback), we recommend relying on the validation already built into the `pubky-app-specs` WASM module.

### How to Validate in Your Application

The WASM module automatically validates all objects when you create them or parse them from JSON. Use these methods for validation:

```javascript
import { PubkySpecsBuilder, PubkyAppUser } from "pubky-app-specs";

// Method 1: Using builder
try {
  const builder = new PubkySpecsBuilder(userId);
  const { user } = builder.createUser(
    "Alice🔥",     // Emoji counts as 1 character
    "Bio with 𓀀", // Hieroglyph counts as 1 character
    null, null, null
  );
  console.log("User is valid!");
} catch (error) {
  showError(error.message); // Validation failed
}

// Method 2: From JSON
try {
  const user = PubkyAppUser.fromJson({
    name: "Alice🔥",
    bio: "Bio with 𓀀",
    image: null,
    links: null,
    status: null
  });
  console.log("User is valid!");
} catch (error) {
  showError(error.message); // Validation failed
}

// Both methods throw on validation failure - no manual checks needed!
```

### JavaScript Length Methods Comparison

If you need client-side length validation for real-time input feedback (e.g., character counters) or custom validation, you should use methods that count Unicode code points to match Rust's `.chars().count()` behavior:

```javascript
const str = "Hi🔥";

// ❌ WRONG - counts UTF-16 code units, not Unicode code points
str.length              // 4 (can reject valid input)
if (username.length > MAX_USERNAME_LENGTH) {
  showError("Username too long");
}
// This would incorrectly reject "🔥".repeat(26)
// because JS sees 52 code units, but Rust sees 26 code points (valid!)

// ✅ CORRECT - counts Unicode code points (matches Rust)
// These methods correctly handle characters outside BMP (emoji, etc.)
[...str].length         // 3 (Unicode code points) - counts 🔥 as 1
Array.from(str).length  // 3 (also works)
```
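
The same distinction matters when building an input counter or truncating text to a limit. The helpers below are hypothetical names used for illustration; they are not part of `pubky-app-specs`:

```javascript
// Hypothetical helper: text for a "used/max" counter, in code points.
function formatCounter(input, max) {
  return `${[...input].length}/${max}`;
}

// Hypothetical helper: keeps at most `max` code points. Unlike
// `String.prototype.slice`, it never cuts a surrogate pair in half.
function truncateToCodePoints(input, max) {
  return [...input].slice(0, max).join("");
}

console.log(formatCounter("Hi🔥", 50));         // "3/50"
console.log(truncateToCodePoints("🔥🔥", 1));   // "🔥"
console.log("🔥🔥".slice(0, 1) === "\uD83D");   // true: a lone high surrogate
```

A lone surrogate left behind by `slice` is not valid Unicode text, so truncating by code point is the safer choice before passing input on for validation.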

### When to Validate

- **On form submit**: Always - catch errors before network calls
- **Real-time feedback**: Optional - use `[...str].length` for input counters
- **On input change**: Usually not needed - can impact UX with emoji autocomplete
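
For the form-submit case, one way to keep the handler tidy is to turn a throwing constructor call into a plain result object. This is a sketch under assumptions: `tryValidate` is a hypothetical helper, and `fakeFromJson` is a stand-in stub used here so the example runs without the WASM module. Because the shape of the thrown value depends on the bindings, the helper accepts both `Error` objects and plain strings.

```javascript
// Hypothetical helper: runs `build` and reports success or the error text.
function tryValidate(build) {
  try {
    return { ok: true, value: build() };
  } catch (error) {
    // Assumption: the thrown value may be an Error or a plain string.
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, message };
  }
}

// Stand-in stub for a validating constructor such as `PubkyAppUser.fromJson`.
function fakeFromJson(json) {
  if ([...json.name].length > 50) {
    throw new Error("Validation Error: Invalid name length");
  }
  return json;
}

const good = tryValidate(() => fakeFromJson({ name: "Alice🔥" }));
const bad = tryValidate(() => fakeFromJson({ name: "🔥".repeat(51) }));

console.log(good.ok);     // true
console.log(bad.ok);      // false
console.log(bad.message); // "Validation Error: Invalid name length"
```

In a real submit handler, the callback would wrap the actual `pubky-app-specs` call, and the network request would be sent only when `ok` is `true`.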

### Edge Cases: Grapheme Clusters (Advanced)

⚠️ **This is informational** - current validation doesn't handle grapheme clusters, and that's acceptable for most use cases.

Even `.chars().count()` doesn't handle complex **grapheme clusters** (what users perceive as single characters):

| String | Visual | Code Points | User Perception |
|--------|--------|-------------|----------------|
| `"👨‍👩‍👧‍👦"` | family emoji | 7 | 1 |
| `"🇺🇸"` | flag | 2 | 1 |
| `"é"` (e + ◌́) | accented e | 2 | 1 |

**Impact**: A username with 50 flag emojis would actually be 100 code points and fail validation.

**Decision**: For usernames, tags, and bios, code point counting is sufficient. True grapheme counting would add complexity and dependencies without significant benefit for this use case.
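
To make the gap concrete, the rows of the table above can be reproduced in JavaScript with the standard `Intl.Segmenter` API. This is purely illustrative and assumes a runtime that ships `Intl.Segmenter`; the validation in `pubky-app-specs` counts code points, and this sketch does not suggest otherwise. `countGraphemes` is a hypothetical helper name.

```javascript
// Hypothetical helper: counts user-perceived characters (grapheme clusters).
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

// Escapes are used so the exact code point sequences are unambiguous.
const samples = {
  family: "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}", // 4 emoji joined by 3 ZWJ
  flag: "\u{1F1FA}\u{1F1F8}",                                       // two regional indicators
  accentedE: "e\u0301",                                             // e + combining acute
};

for (const [label, s] of Object.entries(samples)) {
  console.log(label, { codePoints: [...s].length, graphemes: countGraphemes(s) });
}
// family { codePoints: 7, graphemes: 1 }
// flag { codePoints: 2, graphemes: 1 }
// accentedE { codePoints: 2, graphemes: 1 }
```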

## Summary

| Aspect | Approach |
|--------|----------|
| **Validation Location** | WASM (Rust) only |
| **Length Method** | `.chars().count()` (Unicode code points) |
| **JS Client** | Use `[...str].length` if local validation needed |
| **Affected Characters** | Characters outside the BMP: most emoji, ancient scripts, musical symbols |
| **Unaffected Characters** | Characters in the BMP: ASCII, common Chinese, Japanese, Arabic, etc. |
| **Performance** | <1ms for typical inputs |

## References

- [Unicode Standard](https://unicode.org/)
- [UTF-16 on Wikipedia](https://en.wikipedia.org/wiki/UTF-16)
- [JavaScript String length](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length)
- [Rust chars() documentation](https://doc.rust-lang.org/std/primitive.str.html#method.chars)
