fix: improve adaptive retrieval and noise filter accuracy for CJK and edge cases #401
ssyn0813 wants to merge 6 commits into CortexReach:master
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use \p{Extended_Pictographic} instead of \p{Emoji} to avoid matching digits
- Narrow slash command regex to /word format, no longer matches file paths
- Add CJK-aware hard threshold (length < 2 for CJK, < 5 for non-CJK)
- Exempt digit-containing strings (port numbers, issue IDs) from length thresholds
- Lower CJK defaultMinLength from 6 to 3 for short meaningful CJK queries
- Lower non-CJK defaultMinLength from 15 to 13 to allow file path queries
- Prevent FORCE_RETRIEVE from hijacking slash commands like /recall
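The first bullet can be checked directly in any engine that supports Unicode property escapes. This is a minimal sketch, not the PR's code: the names `emojiOnly` and `pictographicOnly` and the exact character classes are assumptions made for illustration.

```typescript
// Sketch only: these patterns are illustrative, not copied from adaptive-retrieval.ts.
// The RegExp constructor is used so the escapes are evaluated at runtime.
const emojiOnly = new RegExp("^[\\p{Emoji}\\s]+$", "u");
const pictographicOnly = new RegExp("^[\\p{Extended_Pictographic}\\s]+$", "u");

// \p{Emoji} covers the keycap bases 0-9, # and *, so plain numbers look like "pure emoji".
const samples = ["12345", "8080", "#", "👍", "🎉🎉"];

const report = samples.map((text) => ({
  text: text,
  matchesEmoji: emojiOnly.test(text),
  matchesPictographic: pictographicOnly.test(text),
}));

for (const row of report) {
  console.log(row.text, "emoji:", row.matchesEmoji, "pictographic:", row.matchesPictographic);
}
// "12345", "8080" and "#" match the Emoji pattern but not the Extended_Pictographic one;
// "👍" and "🎉🎉" match both.
```

With the narrower property, a port number such as `8080` is no longer mistaken for an emoji-only message.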
Short CJK text (2+ chars) is no longer falsely marked as noise. Uses same CJK detection pattern as adaptive-retrieval. Also tightens boilerplate greeting regex to only match standalone greetings (≤1 trailing word), so real memories starting with "hello" are not incorrectly filtered.
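To illustrate the "standalone greeting, at most one trailing word" rule from this commit message, here is a rough sketch. The pattern `STANDALONE_GREETING`, its greeting list, and the helper `isBoilerplateGreeting` are hypothetical; the actual regex in `noise-filter.ts` is not shown in this thread and may differ.

```typescript
// Hypothetical pattern: a greeting word, optional punctuation, then at most one more word.
// The \b keeps words like "history" from matching on the "hi" prefix.
const STANDALONE_GREETING = /^(hi|hello|hey)\b[\s,!.]*(\w+[\s!.]*)?$/i;

// Hypothetical helper name, used only for this example.
function isBoilerplateGreeting(text: string): boolean {
  return STANDALONE_GREETING.test(text.trim());
}

console.log(isBoilerplateGreeting("hello"));        // true
console.log(isBoilerplateGreeting("Hello there!")); // true
console.log(isBoilerplateGreeting("hello world is my favorite test string")); // false
console.log(isBoilerplateGreeting("history of the project")); // false
```

Anchoring the pattern at both ends is what keeps a real memory that merely starts with "hello" out of the noise bucket.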
AliceLJY left a comment:
Clean fixes with solid test coverage. The three changes are all well-targeted:
- \p{Emoji} → \p{Extended_Pictographic}: correct fix for the digits-matching-as-emoji bug
- Slash command regex tightened to avoid matching file paths
- CJK thresholds lowered + digit exemption for port/issue numbers
196 lines of new tests covering edge cases. LGTM.
Review: REQUEST-CHANGES
The four bugs you're fixing are real.
Must fix:
Worth considering (not blocking):
AliceLJY left a comment:
LGTM — changes are clean, on-topic, and well-tested. Approving.
Closing this PR for now due to inactivity. The core CJK / emoji / file-path fixes make sense, but the current head still regresses argument-bearing slash commands in That is a real behavior change relative to the documented custom-command flow in the README, so I don’t want to merge it as-is. If you want to revive this later, please rebase and update the slash-command handling so it still skips commands with arguments while continuing to avoid false positives for file paths like
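The regression described in this comment can be reproduced with the narrowed pattern quoted in the PR description. The sketch below uses that pattern as `NARROWED`; the second pattern, `WITH_ARGUMENTS`, is a hypothetical variant written for this example and is not something the maintainers or the author proposed.

```typescript
// Pattern quoted in the PR description: a bare /command with optional trailing whitespace.
const NARROWED = /^\/[a-z][\w-]*\s*$/i;

// Hypothetical variant (an assumption, not from the PR): also accept arguments after
// whitespace, while still requiring the command token itself to contain no further "/".
const WITH_ARGUMENTS = /^\/[a-z][\w-]*(\s+.*)?$/i;

const inputs = ["/recall", "/recall last week", "/usr/bin/node", "/usr/bin/node --version"];

for (const input of inputs) {
  console.log(input, "narrowed:", NARROWED.test(input), "with arguments:", WITH_ARGUMENTS.test(input));
}
// NARROWED matches only "/recall", so "/recall last week" is no longer treated as a command.
// WITH_ARGUMENTS matches both "/recall" forms and still rejects both file paths.
```

Whether the variant fits the documented custom-command flow in the README would need to be checked against the project's own tests.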
Summary
Four bugs in the filtering/retrieval pipeline silently drop valid content:
1. `\p{Emoji}` regex matches digits: `"12345"` and `"8080"` are treated as pure emoji and skipped
2. Paths such as `/usr/bin/node` are misidentified as slash commands
3. `length < 5` hard threshold fires before the CJK-aware branch; Chinese queries like `"他喜欢猫"` (4 chars) are skipped
4. `length < 5` threshold with no CJK awareness; Chinese memories are rejected at write time
Bugs 3 and 4 together create a double loss: short CJK content can neither be stored nor retrieved.
Root Cause
- `\p{Emoji}` includes ASCII digits `0-9`, `#`, and `*` as keycap bases
- `^\//` matches any `/`-prefixed text, not just the `/command` format
- `adaptive-retrieval.ts:78` and `noise-filter.ts:76` use `length < 5` without considering that CJK characters carry far more semantic information per character than Latin characters
Changes
- `src/adaptive-retrieval.ts`: Replace `\p{Emoji}` with `\p{Extended_Pictographic}`, narrow slash regex to `/^\/[a-z][\w-]*\s*$/i`, add CJK-aware hard threshold, add digit exemption for port/issue numbers, add slash command guard on FORCE_RETRIEVE to prevent `/recall` bypassing skip logic
- `src/noise-filter.ts`: Add CJK-aware hard threshold (CJK: `length < 2`, non-CJK: `length < 5`), tighten boilerplate regex to avoid false positives on longer sentences starting with greetings
- `test/adaptive-retrieval.test.mjs`: New. 20 test cases covering emoji, slash, CJK, and existing behavior preservation
- `test/noise-filter.test.mjs`: New. 15 test cases covering CJK, English, patterns, options, and the `filterNoise` generic
- `package.json`: Register both new test files in the npm test script
Test Plan
- `node --test test/adaptive-retrieval.test.mjs`: 20/20 pass
- `node --test test/noise-filter.test.mjs`: 15/15 pass
- (`reflection-bypass-hook.test.mjs` has pre-existing failures on master, unrelated to this PR)
Related: #127
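For readers who want to see how bugs 3 and 4 and the digit exemption fit together, the following is a minimal sketch of a CJK-aware hard threshold. The thresholds (2 for CJK, 5 otherwise) come from the description above; the `CJK` character ranges and the function name `isBelowHardThreshold` are assumptions, since the PR's actual detection pattern is not shown here.

```typescript
// Assumed CJK ranges (kana, CJK ideographs incl. Extension A, Hangul syllables);
// the pattern used in the PR may cover a different set.
const CJK = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/;
const HAS_DIGIT = /\d/;

// Hypothetical helper: true when the text is too short to be worth storing or retrieving.
function isBelowHardThreshold(input: string): boolean {
  const text = input.trim();
  // Digit-containing strings (port numbers, issue IDs) are exempt from the length check.
  if (HAS_DIGIT.test(text)) return false;
  // Pick the threshold before comparing, so CJK text never hits the Latin limit.
  const hardMin = CJK.test(text) ? 2 : 5;
  return text.length < hardMin;
}

console.log(isBelowHardThreshold("他喜欢猫")); // false: 4 CJK chars, threshold is 2
console.log(isBelowHardThreshold("猫"));       // true: 1 CJK char
console.log(isBelowHardThreshold("8080"));     // false: digits are exempt
console.log(isBelowHardThreshold("ok"));       // true: 2 Latin chars, threshold is 5
```

Selecting the threshold first, rather than running the `length < 5` check ahead of the CJK branch, is the ordering change the root-cause section describes.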