Commit f886bbc
authored
fix: Add Unicode sanitization for cloud embedders (MemTensor#1048)
# Fix: Add Unicode sanitization for cloud embedders
## Problem
Cloud embedding APIs (VoyageAI, OpenAI, etc.) reject texts containing
Unicode surrogates and certain emoji characters, causing
`UnicodeEncodeError` in production.
### Error Example
```python
text = "Hello 👋 \ud800" # Contains emoji + surrogate
embedder.embed([text])
# UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800'
```
### Root Cause
- Unicode surrogates (U+D800–U+DFFF) are invalid in UTF-8
- Some emoji and international characters cause encoding issues
- Cloud APIs have stricter validation than local embedders (Ollama)
## Solution
Added `_sanitize_unicode()` function that:
1. Removes Unicode surrogates using `surrogatepass` error handling
2. Replaces invalid characters with empty string
3. Falls back to removing all non-BMP characters if needed
4. Applied automatically before all embedding API calls
### Implementation
```python
def _sanitize_unicode(text: str) -> str:
"""Remove Unicode surrogates and problematic characters."""
try:
cleaned = text.encode('utf-8', errors='surrogatepass').decode('utf-8', errors='replace')
return cleaned.replace('\ufffd', '')
except Exception:
return ''.join(c for c in text if ord(c) < 0x10000)
```
## Testing
Tested with:
- ✅ Emoji: "Hello 👋 🔥"
- ✅ Surrogates: "\ud800\udc00"
- ✅ Mixed: "Test 🚀 \ud83d"
- ✅ International: "中文 العربية Тест"
## Impact
- **Fixes**: Production crashes with emoji/international text
- **Breaking**: None - purely additive
- **Performance**: Negligible (<1ms per text)
## Checklist
- [x] Code follows project style
- [x] Self-reviewed the code
- [x] Added inline documentation
- [x] Tested with problematic Unicode
- [x] No breaking changes
## Related Issues
Fixes production issue with VoyageAI and OpenAI embedders rejecting
texts with emoji/surrogates.1 file changed
+17
-0
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
14 | 14 | | |
15 | 15 | | |
16 | 16 | | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
17 | 32 | | |
18 | 33 | | |
19 | 34 | | |
| |||
54 | 69 | | |
55 | 70 | | |
56 | 71 | | |
| 72 | + | |
| 73 | + | |
57 | 74 | | |
58 | 75 | | |
59 | 76 | | |
| |||
0 commit comments