Skip to content

Commit f886bbc

Browse files
authored
fix: Add Unicode sanitization for cloud embedders (MemTensor#1048)
# Fix: Add Unicode sanitization for cloud embedders ## Problem Cloud embedding APIs (VoyageAI, OpenAI, etc.) reject texts containing Unicode surrogates and certain emoji characters, causing `UnicodeEncodeError` in production. ### Error Example ```python text = "Hello 👋 \ud800" # Contains emoji + surrogate embedder.embed([text]) # UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' ``` ### Root Cause - Unicode surrogates (U+D800–U+DFFF) are invalid in UTF-8 - Some emoji and international characters cause encoding issues - Cloud APIs have stricter validation than local embedders (Ollama) ## Solution Added `_sanitize_unicode()` function that: 1. Removes Unicode surrogates using `surrogatepass` error handling 2. Replaces invalid characters with empty string 3. Falls back to removing all non-BMP characters if needed 4. Applied automatically before all embedding API calls ### Implementation ```python def _sanitize_unicode(text: str) -> str: """Remove Unicode surrogates and problematic characters.""" try: cleaned = text.encode('utf-8', errors='surrogatepass').decode('utf-8', errors='replace') return cleaned.replace('\ufffd', '') except Exception: return ''.join(c for c in text if ord(c) < 0x10000) ``` ## Testing Tested with: - ✅ Emoji: "Hello 👋 🔥" - ✅ Surrogates: "\ud800\udc00" - ✅ Mixed: "Test 🚀 \ud83d" - ✅ International: "中文 العربية Тест" ## Impact - **Fixes**: Production crashes with emoji/international text - **Breaking**: None - purely additive - **Performance**: Negligible (<1ms per text) ## Checklist - [x] Code follows project style - [x] Self-reviewed the code - [x] Added inline documentation - [x] Tested with problematic Unicode - [x] No breaking changes ## Related Issues Fixes production issue with VoyageAI and OpenAI embedders rejecting texts with emoji/surrogates.
2 parents fec9978 + c514f60 commit f886bbc

File tree

1 file changed

+17
-0
lines changed

1 file changed

+17
-0
lines changed

src/memos/embedders/universal_api.py

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,21 @@
1414
logger = get_logger(__name__)
1515

1616

17+
def _sanitize_unicode(text: str) -> str:
18+
"""
19+
Remove Unicode surrogates and other problematic characters.
20+
Surrogates (U+D800-U+DFFF) cause UnicodeEncodeError with some APIs.
21+
"""
22+
try:
23+
# Encode with 'surrogatepass' then decode, replacing invalid chars
24+
cleaned = text.encode("utf-8", errors="surrogatepass").decode("utf-8", errors="replace")
25+
# Replace replacement char with empty string for cleaner output
26+
return cleaned.replace("\ufffd", "")
27+
except Exception:
28+
# Fallback: remove all non-BMP characters
29+
return "".join(c for c in text if ord(c) < 0x10000)
30+
31+
1732
class UniversalAPIEmbedder(BaseEmbedder):
1833
def __init__(self, config: UniversalAPIEmbedderConfig):
1934
self.provider = config.provider
@@ -54,6 +69,8 @@ def __init__(self, config: UniversalAPIEmbedderConfig):
5469
def embed(self, texts: list[str]) -> list[list[float]]:
5570
if isinstance(texts, str):
5671
texts = [texts]
72+
# Sanitize Unicode to prevent encoding errors with emoji/surrogates
73+
texts = [_sanitize_unicode(t) for t in texts]
5774
# Truncate texts if max_tokens is configured
5875
texts = self._truncate_texts(texts)
5976
logger.info(f"Embeddings request with input: {texts}")

0 commit comments

Comments
 (0)