fix: prevent infinite recursion in embedSingle() for CJK text #215
RooikeCAO wants to merge 1 commit into CortexReach:master from
Conversation
When a large CJK text (14KB+ Chinese .md file) is processed by auto-recall, embedSingle() enters an infinite recursion loop because:

1. smartChunk() treats token limits as character limits, but CJK characters use 2-3x more tokens than ASCII characters
2. Chunks of 5740 chars (70% of the 8192-token limit) still exceed the model's token context for CJK text
3. smartChunk() returns 1 chunk identical to its input → embedSingle() recurses with the same text → infinite loop

This produced ~50,000 embedding errors in 12 minutes, blocking the entire Node.js event loop and making all agents unresponsive.

Fixes:

- Add a recursion depth limit (max 3) to embedSingle(), with forced truncation as fallback
- Detect single-chunk output (same size as input) and truncate instead of recursing
- Add CJK-aware chunk sizing in smartChunk() (divide the char limit by 2.5 when the CJK ratio is > 30%)
- Truncate the auto-recall query to 1000 chars before embedding
- Add a 10s global timeout on embedPassage()/embedQuery()

Closes CortexReach#214

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
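The CJK-aware sizing fix can be sketched as follows. This is an illustration only: `cjkRatio` and `adjustedCharLimit` are hypothetical names, not the PR's actual identifiers, and the code-point ranges are a common approximation of "CJK", not the exact check the PR uses.

```typescript
/** Fraction of characters in common CJK code-point ranges (approximation). */
function cjkRatio(text: string): number {
  const chars = [...text];
  if (chars.length === 0) return 0;
  let cjk = 0;
  for (const ch of chars) {
    const cp = ch.codePointAt(0)!;
    if (
      (cp >= 0x4e00 && cp <= 0x9fff) || // CJK Unified Ideographs
      (cp >= 0x3040 && cp <= 0x30ff) || // Hiragana + Katakana
      (cp >= 0xac00 && cp <= 0xd7af)    // Hangul syllables
    ) {
      cjk++;
    }
  }
  return cjk / chars.length;
}

/** Shrink the character budget when text is CJK-heavy (2-3x tokens per char vs ASCII). */
function adjustedCharLimit(text: string, baseLimit: number): number {
  return cjkRatio(text) > 0.3 ? Math.floor(baseLimit / 2.5) : baseLimit;
}
```

With a 5734-char base limit, a pure-Chinese input drops to a 2293-char budget, which keeps the chunk under the model's token limit even at ~2.5 tokens per character.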
Thanks for chasing this. I agree the bug is real, and the overall direction makes sense. I was able to reproduce the current failure mode on … That said, I don't think we should merge this as-is yet. I see two blocking issues:
Non-blocking note:
Before merge, I’d want regression coverage for:
So my view is: good bug to fix, good direction, but this needs another revision before merge.
…e (PR CortexReach#215 follow-up)

This commit addresses the two blocking issues raised in PR CortexReach#215:

1. Timeout now uses AbortController for TRUE request cancellation
   - Timer is properly cleaned up in .finally()
   - AbortSignal is passed through to embedWithRetry
2. Recursion now guarantees monotonic convergence
   - Introduced STRICT_REDUCTION_FACTOR = 0.5
   - Each recursion level must reduce input by 50%
   - Works regardless of model context size

Modified by AI assistant (not human code) based on PR CortexReach#215. Thanks to the original author and maintainers.

Co-authored-by: Hi-Jiajun <Hi-Jiajun@users.noreply.github.com>
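The AbortController pattern in point 1 might look like the following sketch. `embedWithRetry` is passed in as a typed callback so the snippet is self-contained; its real signature in the repo may differ.

```typescript
const EMBED_TIMEOUT_MS = 10_000;

type EmbedFn = (text: string, opts: { signal: AbortSignal }) => Promise<number[]>;

async function embedWithTimeout(text: string, embedWithRetry: EmbedFn): Promise<number[]> {
  const controller = new AbortController();
  // Aborting the signal cancels the underlying HTTP request,
  // not just the wrapping promise (as a Promise.race timeout would).
  const timer = setTimeout(() => controller.abort(), EMBED_TIMEOUT_MS);
  return embedWithRetry(text, { signal: controller.signal }).finally(() => {
    clearTimeout(timer); // always clean up the timer, on success or failure
  });
}
```

The `.finally()` cleanup matters: without it, a fast response still leaves a live timer that later fires `abort()` on a controller nobody is listening to, keeping the event loop busy.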
AliceLJY left a comment
Thanks for the fix — the defense-in-depth approach (CJK-aware chunking + single-chunk detection + depth limit) is well-designed.
However, since PR #238 includes this PR's full content as its first commit plus additional improvements (AbortController timeout, convergence guarantee), we'll review and merge #238 as the replacement. This PR will be closed when #238 lands.
No action needed on your side for this PR.
…ortexReach#238)

- Test single-chunk detection (force-reduce when chunk >= 90% of original)
- Test depth limit termination (depth >= MAX_EMBED_DEPTH throws)
- Test CJK-aware chunk sizing (>30% CJK -> smaller chunks)
- Test strict reduction factor (50% per recursion level)
- Test batch embedding works correctly
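The single-chunk detection rule those tests cover (force-reduce when the one chunk is >= 90% of the original) can be expressed as a small predicate. `shouldForceReduce` is a hypothetical name for illustration; the PR's actual helper may be structured differently.

```typescript
// Force-reduce when chunking made no real progress: exactly one chunk
// that is still at least 90% the size of the original input.
function shouldForceReduce(chunks: string[], original: string): boolean {
  return chunks.length === 1 && chunks[0].length >= original.length * 0.9;
}
```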
fix: prevent infinite recursion in embedSingle() for CJK text (replaces PR #215)
fix: prevent infinite recursion in embedSingle() for CJK text (replaces PR CortexReach#215)
Summary
Fix infinite recursion in embedSingle() when smartChunk() returns a single chunk equal to input size.

Closes #214
Problem
When a large CJK text file (14KB+ Chinese .md) is processed by auto-recall, embedSingle() enters an infinite recursion loop producing ~50,000 embedding errors in 12 minutes, blocking the Node.js event loop and making all agents unresponsive.

Root cause: smartChunk() treats the token limit (8192) as a character limit, but CJK characters use 2-3 tokens each. A chunk of 5740 Chinese characters ≈ 14,000 tokens, still exceeding the 8192-token limit. Since the chunk is already ≤ maxChunkSize in characters, the chunker returns 1 chunk identical to the input → embedSingle() recurses infinitely.

Changes (3 files, 125 insertions)

src/embedder.ts

- Add a depth parameter to embedSingle(); at max depth 3, force-truncate
- When smartChunk() returns 1 chunk ≈ the same size as the input, truncate instead of recursing
- Promise.race timeout wrapper on embedPassage()/embedQuery()
- getSafeCharLimit(): returns a safe character count per model (2300 for nomic-embed-text, accounting for CJK)

src/chunker.ts

- Divide maxChunkSize by 2.5 when the text is CJK-heavy
- nomic-embed-text: 5734 chars → 2293 chars (≈ 5732 tokens, safely under the 8192 limit)

index.ts

- Truncate event.prompt to 1000 chars before embedding (only the intent is needed for memory search, not the full attachment text)

Test plan

- Chinese .md file embedding completes in <10s (was: infinite loop)
- No "Document exceeded context limit" log flooding after the fix
- embedder-error-hints.test.mjs passes

🤖 Generated with Claude Code
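Taken together, the guarded recursion described in the changes above might look like this sketch. The real code lives in src/embedder.ts; here the chunker and the low-level embedder are passed in as parameters so the snippet is self-contained, and those signatures are assumptions, not the repo's actual API.

```typescript
const MAX_EMBED_DEPTH = 3;            // depth limit from the PR description
const STRICT_REDUCTION_FACTOR = 0.5;  // each level sees at most half the previous input

async function embedSingle(
  text: string,
  maxChars: number,
  smartChunk: (text: string, maxChars: number) => string[],
  embedChunk: (text: string) => Promise<number[]>,
  depth = 0,
): Promise<number[]> {
  if (text.length <= maxChars) return embedChunk(text);
  if (depth >= MAX_EMBED_DEPTH) {
    // Depth limit reached: force-truncate instead of recursing again.
    return embedChunk(text.slice(0, maxChars));
  }
  const chunks = smartChunk(text, maxChars);
  if (chunks.length === 1 && chunks[0].length >= text.length * 0.9) {
    // Chunker made no progress (the CJK failure mode): halve deterministically
    // rather than recursing on the same text, guaranteeing convergence.
    const reduced = text.slice(0, Math.floor(text.length * STRICT_REDUCTION_FACTOR));
    return embedSingle(reduced, maxChars, smartChunk, embedChunk, depth + 1);
  }
  // Recurse on the first chunk; how the PR aggregates multi-chunk
  // embeddings is not shown here.
  return embedSingle(chunks[0], maxChars, smartChunk, embedChunk, depth + 1);
}
```

Even against a pathological chunker that always returns its input unchanged, the 50% reduction plus the depth cap bounds the recursion at MAX_EMBED_DEPTH levels before the forced truncation kicks in.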